Tuesday, September 26, 2006

Notes on "spiders"

Different types of requests:
A spider can simply request a pre-defined list of URLs, or follow a subset of the links found in the content of a starting URL. On a larger scale, a spider can request all the links on a single site, or even request anything it can possibly find.

The following issues need to be considered when writing a spider:
1. the scope of the links, whether on the same host or not
2. the file type of the content to request (html, imgs, documents etc.)
3. maximum size of the document to fetch
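
The three checks above can be sketched as small filter functions that a spider calls before fetching a link. This is only a minimal sketch: the function names, the allowed extension set, and the 1 MB limit are assumptions for illustration, not values from any particular spider.

```python
import posixpath
from urllib.parse import urlparse

# Assumed defaults for illustration; "" covers paths such as "/" or "/about".
ALLOWED_EXTENSIONS = {"", ".html", ".htm"}
MAX_BYTES = 1_000_000


def same_host(url, start_url):
    """Scope check: is the link on the same host as the starting URL?"""
    return urlparse(url).hostname == urlparse(start_url).hostname


def allowed_type(url, allowed=ALLOWED_EXTENSIONS):
    """File type check, guessed from the extension of the URL path."""
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    return ext in allowed


def within_size(content_length, max_bytes=MAX_BYTES):
    """Size check on a Content-Length header value (None if the header is missing)."""
    return content_length is None or int(content_length) <= max_bytes


def should_fetch(url, start_url):
    """Combine the scope and file type checks for a candidate link."""
    return same_host(url, start_url) and allowed_type(url)
```

The size check is kept separate because the length is only known after the response headers arrive, while scope and type can be decided from the URL alone.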

Reference:
Robot Rules: http://www.robotstxt.org/
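
A polite spider also checks the site's robots.txt before requesting a page. The sketch below uses Python's standard `urllib.robotparser`; the rules text, the `example-spider` user agent, and the `allowed_by_robots` helper are made up for illustration. A real spider would load the rules from the site's own /robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory instead of fetched.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)


def allowed_by_robots(url, user_agent="example-spider"):
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)
```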

Wednesday, September 06, 2006

Summary for User Profiling papers

1. Non-violative User Profiling Approach for Website Design Improvement
Target:
Collect user profiles in order to improve the design of websites
Problems:
(a) how to define the user's role while profiling
(b) how to construct an effective dynamic user model to realize data mining and information update
(c) how to reuse the profile
Solutions:
(a) "Double-identifier" --- to protect user's privacy
(b) mathematical modelling --- to learn user's behavior

2. Ontology Based Personalized Search
Target:
Incorporate users' interests into search process to improve the results
Problems:
(a) how to define user profile
(b) how to improve search results in a most efficient way
Solutions:
(a) Use a hierarchy of concepts in which each node is associated with superdocuments. The superdocuments are represented as weighted keyword vectors using the vector space model.
For each surfed page, a keyword vector is calculated and compared with each node for similarity.
The strengths of the top 5 categories are combined with the time spent on the page and the length of the page in order to create the user profile.
(b) Filter or re-rank the search results
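
A rough sketch of solution (a), to make the vector space comparison concrete. Keyword vectors are plain term counts and similarity is the cosine between vectors. The concept names and weights in the usage lines are invented, and the way time spent and page length are combined (similarity multiplied by seconds per word) is purely an assumption, since these notes do not record the paper's formula.

```python
import math
from collections import Counter


def keyword_vector(text):
    """Term-frequency vector for a page (whitespace tokenization, lowercased)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse keyword vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_categories(page_text, concepts, k=5):
    """Return the k concept nodes most similar to the page, best first."""
    page = keyword_vector(page_text)
    scored = [(name, cosine(page, vec)) for name, vec in concepts.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def update_profile(profile, page_text, concepts, seconds_spent, k=5):
    """Add the top-k category strengths to the profile.

    ASSUMPTION: strength = similarity * (seconds spent / page length in words).
    """
    length = max(len(page_text.split()), 1)
    interest = seconds_spent / length
    for name, sim in top_categories(page_text, concepts, k):
        profile[name] = profile.get(name, 0.0) + sim * interest
    return profile


# Hypothetical concept hierarchy, flattened to two nodes for brevity.
concepts = {
    "sports": {"football": 1.0, "goal": 0.5},
    "cooking": {"recipe": 1.0, "oven": 0.5},
}
profile = update_profile({}, "football goal goal", concepts, seconds_spent=30)
```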

3. Personalized Search Based on User Search Results
Target:
Explore a less-invasive way to gather user information for personalized search
Problem:
how user profiles are represented
Solutions:
(a) user profiles are represented as weighted concept hierarchies; the search history is categorized to increase the concept weights
(b) each returned search result is compared to the user's interests and re-ranked by its similarity
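
The two steps can be sketched as follows. The sketch assumes each search result has already been classified into concepts with weights (the classifier itself is out of scope here), and it uses a simple dot product between the result's concepts and the profile as the similarity. Both the data layout and the scoring are illustrative assumptions, not the paper's exact method.

```python
def record_search(profile, result_concepts, step=1.0):
    """Step (a): raise concept weights for a result from the search history."""
    for concept, weight in result_concepts.items():
        profile[concept] = profile.get(concept, 0.0) + step * weight
    return profile


def rerank(results, profile):
    """Step (b): order results by similarity to the profile, best first.

    results is a list of (title, {concept: weight}) pairs in the engine's
    original order; sorted() is stable, so ties keep that order.
    """
    def score(item):
        _, concepts = item
        return sum(profile.get(c, 0.0) * w for c, w in concepts.items())

    return sorted(results, key=score, reverse=True)


# Hypothetical data: the user's history leans towards programming.
profile = record_search({}, {"programming": 1.0})
results = [
    ("Python snakes", {"animals": 0.9}),
    ("Python tutorial", {"programming": 0.8}),
]
ranked = rerank(results, profile)
```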


Friday, June 16, 2006

ContactInfo

Name: Kelan
Student id: 03994449d
Mobile: +852 91743197
email: kelanyu@gmail.com