Notes on "spiders"
Different types of requests:
A spider can simply fetch a pre-defined list of URLs, or start from one URL and follow a subset of the links found in its content. On a larger scale, a single-site spider can follow every link on the site, or even fetch anything it can possibly find.
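As a rough illustration of following links from a starting URL, the sketch below collects the href targets of anchor tags using only the Python standard library. The names LinkCollector and extract_links are made up for this note, and the HTML is assumed to be already downloaded as a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every anchor target in html as an absolute URL, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

A spider that follows only a subset of links would filter this list before queueing anything; one that crawls a whole site would queue every link not yet visited.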
The following issues need to be considered when writing a spider:
1. the scope of the links: whether to stay on the same host or not
2. the file types of the content to request (HTML, images, documents, etc.)
3. the maximum size of a document to fetch
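The three checks above could be sketched as small predicates, as below. This is only one possible arrangement: ALLOWED_TYPES, MAX_BYTES and the function names are hypothetical values picked for illustration, and the header values are assumed to come from whatever HTTP client the spider uses.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical policy values, chosen only for illustration.
ALLOWED_TYPES = {"text/html"}
MAX_BYTES = 1_000_000


def in_scope(link, start_url, same_host_only=True):
    """Resolve link against start_url and check that it stays in scope."""
    parts = urlparse(urljoin(start_url, link))
    if parts.scheme not in ("http", "https"):
        return False  # skips mailto:, javascript:, and similar links
    if same_host_only:
        return parts.hostname == urlparse(start_url).hostname
    return True


def wanted_type(content_type_header):
    """Compare the media type (parameters stripped) against ALLOWED_TYPES."""
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in ALLOWED_TYPES


def within_size(content_length_header):
    """Accept a response whose declared length is at most MAX_BYTES.

    A missing header returns True, so the caller should still cap the
    number of bytes it actually reads.
    """
    if content_length_header is None:
        return True
    try:
        return int(content_length_header) <= MAX_BYTES
    except ValueError:
        return False
```

Checking the Content-Type and Content-Length headers before reading the body lets the spider skip unwanted documents cheaply, although servers do not always send an accurate length.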
Reference:
Robot Rules: http://www.robotstxt.org/
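A spider that honours these rules can check each URL with the standard library's urllib.robotparser before requesting it. In the sketch below the robots.txt content, the "example-spider" user agent and the allowed helper are all made up for illustration, and the rules are inlined so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="example-spider"):
    """Return True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)
```

Against a live site, the rules would instead be loaded with parser.set_url(...) pointing at the site's /robots.txt, followed by parser.read().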
