How Spiders Work
(Google: What it does)
Web search tools have several parts.
- Download the page.
- Index the page.
- Run the query.
- Show the advertisement.
- Fight enemies.
See http://www.zdnet.com.au/insight/software/0,39023769,39168647,00.htm for a summary.
Downloading the page
It's easy to find pages to download.
- Every downloaded page may contain links to other pages. Those pages become candidates for downloading.
- Users can suggest pages to download. See http://www.google.com/addurl.html.
- Pages already in the database should be re-downloaded occasionally in case they have changed.
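As a rough illustration of the first two sources, the sketch below pulls links out of a downloaded page and puts them in a queue of candidate URLs. The names LinkExtractor, Frontier and enqueue_links are made up for this example, and a real spider would also schedule re-downloads of pages it already has, which is not shown here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


class Frontier:
    """Queue of URLs waiting to be downloaded; each URL is queued once."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def add(self, url):
        # Used both for links found in pages and for user-suggested URLs.
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        return self.queue.popleft() if self.queue else None


def enqueue_links(frontier, page_url, html):
    """Add every link found in one downloaded page to the frontier."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    for link in parser.links:
        frontier.add(link)
```

A user suggestion is just another call to frontier.add(url); the seen set keeps the same page from being queued twice.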
The spider starts by getting the text of some page, using ordinary HTTP to do this. Spiders have to be careful to download your site slowly, or they will overwhelm your server. Spiders should also respect /robots.txt files.
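Both rules can be sketched with Python's standard library. The robots.txt check uses urllib.robotparser; the PoliteScheduler class, the one-second default delay and the user-agent name "ExampleSpider" are assumptions made for this example, not anything a particular search engine is known to use.

```python
from urllib import robotparser


def make_robot_rules(robots_txt):
    """Parse the text of a /robots.txt file into a rule object."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


class PoliteScheduler:
    """Tracks the earliest time each host may be contacted again."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds  # assumed gap between hits on one host
        self.next_allowed = {}

    def wait_time(self, host, now):
        """Seconds the spider should still wait before hitting this host."""
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_fetch(self, host, now):
        """Note that a page was just fetched from this host."""
        self.next_allowed[host] = now + self.delay


rules = make_robot_rules("User-agent: *\nDisallow: /private/\n")
allowed = rules.can_fetch("ExampleSpider", "http://example.com/index.html")
blocked = rules.can_fetch("ExampleSpider", "http://example.com/private/x.html")
```

In a real spider the value passed as now would come from time.monotonic(), and the spider would sleep for wait_time(...) seconds before each request.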
Index the page
The page should first be converted into an understandable format. Modern spiders can handle *.pdf, *.doc and so on. Google will even download images to get any metadata in them like subject, title, size, etc.
Probably one should not index items that do not make up the appearance of the page. Therefore 'keyword'-like items do not work unless they appear as actual text. One should give more weight to items in bold, titles, headings, etc. If the whole page is in a large bold font, none of its words get extra weight.
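One way these weighting rules might look in code is sketched below. The weights in TAG_WEIGHTS and the 50% cutoff for "the whole page is emphasized" are invented for illustration; the notes do not say what values a real engine uses. Only visible text is counted, so words that appear only in tag attributes (such as a keywords list) never reach the index.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed weights, chosen only for illustration.
TAG_WEIGHTS = {"title": 5, "h1": 4, "h2": 3, "h3": 2, "b": 2, "strong": 2}


class WeightedTextParser(HTMLParser):
    """Counts words in visible text, weighting emphasized text more."""

    def __init__(self):
        super().__init__()
        self.open_counts = Counter()   # how many of each weighted tag are open
        self.hidden_depth = 0          # inside <script> or <style>
        self.plain = Counter()         # every word counts 1
        self.weighted = Counter()      # every word counts its tag weight
        self.emphasized_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.hidden_depth += 1
        elif tag in TAG_WEIGHTS:
            self.open_counts[tag] += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.hidden_depth = max(0, self.hidden_depth - 1)
        elif self.open_counts[tag] > 0:
            self.open_counts[tag] -= 1

    def handle_data(self, data):
        if self.hidden_depth:
            return
        weight = max(
            (TAG_WEIGHTS[t] for t, n in self.open_counts.items() if n > 0),
            default=1,
        )
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.plain[word] += 1
            self.weighted[word] += weight
            if weight > 1:
                self.emphasized_words += 1


def score_page(html, max_emphasized_fraction=0.5):
    """Return {word: score}; ignore emphasis if most of the page uses it."""
    parser = WeightedTextParser()
    parser.feed(html)
    parser.close()
    total = sum(parser.plain.values())
    if total and parser.emphasized_words / total > max_emphasized_fraction:
        return dict(parser.plain)
    return dict(parser.weighted)
```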
It is appropriate to drop unimportant words like "the". But you do need to be able to search phrases like "oscar the cat".
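A common way to get both behaviors is to keep every word in the index together with its position, and to drop the unimportant words only when running an ordinary keyword query. The sketch below assumes that design; the STOP_WORDS list and the class name PositionalIndex are made up for the example.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "of", "the"}  # assumed, deliberately tiny


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class PositionalIndex:
    def __init__(self):
        # word -> {doc_id -> [positions of the word in that document]}
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, text):
        for position, word in enumerate(tokenize(text)):
            self.postings[word][doc_id].append(position)

    def keywords(self, query):
        """Documents containing every non-stop word of the query."""
        words = [w for w in tokenize(query) if w not in STOP_WORDS]
        doc_sets = [
            set(self.postings[w]) if w in self.postings else set()
            for w in words
        ]
        return set.intersection(*doc_sets) if doc_sets else set()

    def phrase(self, query):
        """Documents containing the query words next to each other, in order."""
        words = tokenize(query)
        if not words or any(w not in self.postings for w in words):
            return set()
        hits = set()
        for doc_id, starts in self.postings[words[0]].items():
            for start in starts:
                # Word i of the phrase must sit at position start + i.
                if all(
                    start + i in self.postings[w].get(doc_id, ())
                    for i, w in enumerate(words)
                ):
                    hits.add(doc_id)
                    break
        return hits
```

With this layout "the" costs index space but is never used to rank a plain keyword query, while a phrase query can still require it to sit between "oscar" and "cat".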
Searching
When the user searches for pages on the Cessna 150, you search the index for every page with the word 'Cessna' and every page with the word '150'. Any page with both words is a good candidate. Any page with the words repeated is a good candidate (but limit this). Any page with synonyms of either 'cessna' or '150' is a candidate. Finally, any page that contains those words and links to a second page increases the score for that second page. This is the classic PageRank technology.
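The word-matching part of this can be sketched with a small inverted index. The scoring numbers here (a cap of 3 on repeats, half credit for synonyms, a bonus for pages containing every query word) are arbitrary assumptions for the example, and the link-based part of the score is left out.

```python
import re
from collections import Counter, defaultdict

MAX_REPEATS = 3        # assumed cap on credit for repeating a word
SYNONYM_WEIGHT = 0.5   # assumed: a synonym counts half as much


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """pages: {url: text}. Returns {word: {url: occurrences}}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word, count in Counter(tokenize(text)).items():
            index[word][url] = count
    return index


def search(index, query, synonyms=None):
    """Return matching URLs, best first."""
    synonyms = synonyms or {}
    words = set(tokenize(query))
    scores = Counter()
    for word in words:
        terms = [(word, 1.0)]
        terms += [(s, SYNONYM_WEIGHT) for s in synonyms.get(word, [])]
        for term, weight in terms:
            for url, count in index.get(term, {}).items():
                # Repeats help, but only up to MAX_REPEATS.
                scores[url] += weight * min(count, MAX_REPEATS)
    for url in scores:
        # Bonus for pages that contain every word of the query.
        if all(url in index.get(w, {}) for w in words):
            scores[url] += len(words)
    return sorted(scores, key=lambda u: (-scores[u], u))


pages = {
    "a": "Cessna 150 for sale. The Cessna 150 is a trainer.",
    "b": "Cessna 172 manual",
    "c": "150 recipes for dinner",
    "d": "cessna " * 50,
}
results = search(build_index(pages), "Cessna 150")
```

Page "a" comes first because it has both words; page "d" repeats 'cessna' fifty times but only gets credit for three of them.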
Please remember that you need to search about 8 billion pages in one tenth of a second.
Sometimes this results in surprising answers. Search for 'jew'.
Show the Advertisement
The basic idea is that the searched terms tell you what advertisements to show. When someone searches for 'cessna 150' we have a very good idea what product to show them. Using this idea gives a HUGE increase in revenue per search. Google in particular only charges when they show your ad and someone then clicks on it. You can submit any ad and any keyword and any payment rate. They will show your ad when its PayRate * ClickThroughRate exceeds that of the other candidate ads. It's like evolution for ads. Only successful ads get shown. It's absolutely revolutionary!! You can make your own ad at https://adwords.google.com/select/main?cmd=Login.
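The selection rule is easy to sketch. The Ad class, its field names and the default click-through rate for an ad that has never been shown are assumptions for this example; only the PayRate * ClickThroughRate comparison comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    name: str
    keyword: str
    pay_rate: float       # what the advertiser pays per click
    impressions: int = 0  # times the ad has been shown
    clicks: int = 0       # times it was clicked

    def click_through_rate(self, default_ctr=0.01):
        # A brand-new ad has no history, so assume a small default rate.
        if self.impressions == 0:
            return default_ctr
        return self.clicks / self.impressions

    def expected_revenue(self):
        """PayRate * ClickThroughRate: expected income per showing."""
        return self.pay_rate * self.click_through_rate()


def pick_ad(ads, query):
    """Return the best ad whose keyword appears in the query, or None."""
    words = set(query.lower().split())
    candidates = [ad for ad in ads if ad.keyword.lower() in words]
    return max(candidates, key=Ad.expected_revenue, default=None)


ads = [
    Ad("high bid, rarely clicked", "cessna", 2.00, impressions=1000, clicks=5),
    Ad("low bid, often clicked", "cessna", 0.50, impressions=1000, clicks=60),
    Ad("unrelated", "boats", 9.00, impressions=1000, clicks=500),
]
winner = pick_ad(ads, "cessna 150")
```

Here the cheaper ad wins: 0.50 * 0.06 is more expected revenue per showing than 2.00 * 0.005, which is the sense in which only successful ads survive.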
Fight the Enemy
It would be easy to make a bunch of sites all with the same keyword. All of these sites then point to the same site. That site then looks very important.
The PageRank algorithm is known to be vulnerable to this flaw. It's called a Google Bomb (see http://en.wikipedia.org/wiki/Googlebomb).
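To see why a pile of pages all linking to one site helps it, here is a simplified link-scoring sketch in the style of the published PageRank formulation. It looks only at link structure and ignores keywords and link text, the damping value of 0.85 is the commonly quoted one, and the page names are invented, so treat it as an illustration rather than a description of what Google actually runs.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: score}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no links spreads its score over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


honest = {
    "home": ["news", "target"],
    "news": ["home"],
    "target": ["home"],
}
# The same small web, plus 20 throwaway pages that all link to "target".
spammed = dict(honest)
for i in range(20):
    spammed["farm%d" % i] = ["target"]

honest_rank = pagerank(honest)
spammed_rank = pagerank(spammed)
```

In the honest web "news" and "target" score the same; once the farm pages are added, "target" pulls ahead of "news" even though nothing about either page changed.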
Some pages put commonly searched-for terms (e.g. "President Bush" or "terrorism") that are not relevant to the site at the bottom of the page in the background color. The hope is to capture some traffic from people searching for these common terms.
Cloaking is the act of offering one page when the search engine asks, and another page when a browser asks. It's widely considered unethical.
Google in particular will remove you from the index if you engage in some of these games and they catch you. Interestingly, they will not remove you because you are a hate-mongering evil jerk.