Search Engines
Not directories -- that's different
What are they?
Robots gather pages and index them into a database
Sometimes other robots collect metadata about the data
Have an interface to the database to allow searches of the index
How does a robot/spider/crawler work?
While there are pages to go visit
Get some URL from the stack to go visit
Mark the URL as visited in the database
Read the page from the URL
If the page has additional links that have not yet been visited, add them
to the stack to go visit
Index all the words from the page into the database
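The loop above can be sketched in Python. This is only a minimal sketch, not any particular engine's crawler. The fetch callable, the in-memory PAGES table, and the names crawl and LinkParser are assumptions made for illustration, so that the example runs without touching the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import re


class LinkParser(HTMLParser):
    """Collects the href targets of <a> tags and the text of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)


def crawl(start_url, fetch):
    """Visit every page reachable from start_url; return (visited, index).

    fetch is a hypothetical callable mapping a URL to HTML text, or None.
    """
    stack = [start_url]      # URLs still to go visit
    visited = set()          # stands in for "marked as visited in the database"
    index = {}               # word -> set of URLs whose text contains it
    while stack:             # while there are pages to go visit
        url = stack.pop()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)        # resolve relative links
            if link not in visited:
                stack.append(link)
        for word in re.findall(r"[a-z0-9]+", " ".join(parser.text).lower()):
            index.setdefault(word, set()).add(url)
    return visited, index


# Hypothetical three-page site held in memory instead of on the network.
PAGES = {
    "http://example.test/": '<a href="a.html">A</a> <a href="b.html">B</a> home',
    "http://example.test/a.html": '<a href="/">back</a> cheap car',
    "http://example.test/b.html": "cheap rental",
}

visited, index = crawl("http://example.test/", PAGES.get)
```

A real robot would replace PAGES.get with a function that downloads the page, and would keep the visited set and the index in a database rather than in memory.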
Good robots honor the file http://servername/robots.txt
Can describe what not to index
Can keep a robot out entirely
Up to robot to honor the request
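A robot written in Python could check robots.txt with the standard library's urllib.robotparser module. The sketch below assumes made-up file contents and made-up robot names (BadBot, GoodBot). It parses the rules from a string so it runs offline; a real robot would fetch the file from the server first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real robot would fetch them from
# http://servername/robots.txt before crawling the site.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url):
    """Return True if the parsed rules permit this robot to fetch the URL."""
    return rules.can_fetch(agent, url)
```

Here BadBot is kept out entirely, while every other robot is only asked to stay out of /private/. Nothing enforces this: the robot has to call allowed() itself and respect the answer.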
Good robots do not flood one site, but access the site slowly
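One way to avoid flooding a site is to enforce a minimum delay between requests to the same host. The class below is a sketch under that assumption. The name PoliteFetcher, the min_delay parameter, and the injectable clock and sleep arguments are all hypothetical, and the right delay is a judgment call the notes do not specify.

```python
import time


class PoliteFetcher:
    """Waits at least min_delay seconds between requests to the same host."""

    def __init__(self, min_delay, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.clock = clock           # replaceable so the logic can be tested
        self.sleep = sleep
        self.last_access = {}        # host -> time of the most recent request

    def wait_turn(self, host):
        """Pause until it is polite to contact host; return seconds waited."""
        now = self.clock()
        last = self.last_access.get(host)
        waited = 0.0
        if last is not None:
            waited = max(0.0, self.min_delay - (now - last))
            if waited:
                self.sleep(waited)
        self.last_access[host] = now + waited
        return waited
```

A crawler would call wait_turn(host) just before each download. Requests to different hosts do not delay each other.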
Good robots revisit a site frequently to notice changes. For example,
AltaVista visits once per month. Lycos probably less often.
Robots have more work than they can handle. They sometimes only do
the top of a site, and not the leaves.
The best indexes associate each word with an importance.
The databases are HUGE. At AltaVista, 'randy' showed 697,758, 'Linux' 2,235,947,
'linux' 4,949,183, and 'bill gates' about 200,000.
Titles and headings get more weight than normal text
All indexes should honor the keyword tag
There are legal controversies over the meta tag
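Giving titles, headings, and the keyword tag more weight than normal text could look like the sketch below. The weights are invented for illustration (no engine's real values are known here), and the name WeightedIndexer is hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re

# Illustrative weights only; these numbers are assumptions.
TAG_WEIGHTS = {"title": 5, "h1": 3, "h2": 2}
KEYWORDS_WEIGHT = 4      # words listed in <meta name="keywords" content="...">
BODY_WEIGHT = 1          # normal text


class WeightedIndexer(HTMLParser):
    """Scores each word of one page by the element it appears in."""

    def __init__(self):
        super().__init__()
        self.scores = defaultdict(int)   # word -> accumulated importance
        self.open_tags = []              # weighted tags currently open

    def add_words(self, text, weight):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self.scores[word] += weight

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            self.add_words(attrs.get("content") or "", KEYWORDS_WEIGHT)
        elif tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            self.open_tags.remove(tag)

    def handle_data(self, data):
        if self.open_tags:
            weight = TAG_WEIGHTS[self.open_tags[-1]]
        else:
            weight = BODY_WEIGHT
        self.add_words(data, weight)


indexer = WeightedIndexer()
indexer.feed(
    "<html><head><title>Cheap cars</title>"
    '<meta name="keywords" content="cars, rental"></head>'
    "<body><h1>Cars</h1><p>Cheap rental cars here.</p></body></html>"
)
scores = dict(indexer.scores)
```

In this example "cars" ends up with the highest score because it appears in the title, the keyword tag, a heading, and the body.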
Some pages are bogus, and just designed to get listed in the search database.
They are the SPAM of the web
People want hits.
Make pages with hot topics of current interest (Monica Lewinsli Monika Lewinski
Monika Lewinski)
Make pages with hot topics of permanent interest, typically with a list of
sexual words.
Have a link at the top or bottom to try to get you to the page they wish
you would access.
Searches are vital
Should allow for Boolean logic (AND, OR, NOT)
Don't forget NOT (try searching for cheap car, and then cheap car NOT rental).
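With an index that maps each word to the set of documents containing it, Boolean logic reduces to set operations. The index contents and document numbers below are made up for illustration.

```python
# Hypothetical inverted index: word -> set of document ids containing it.
INDEX = {
    "cheap": {1, 2, 3},
    "car": {1, 2, 4},
    "rental": {2, 5},
}
ALL_DOCS = {1, 2, 3, 4, 5}


def docs(word):
    """Documents containing the word (empty set if the word is unknown)."""
    return INDEX.get(word, set())


cheap_and_car = docs("cheap") & docs("car")             # AND is intersection
cheap_or_car = docs("cheap") | docs("car")              # OR is union
not_rental = ALL_DOCS - docs("rental")                  # NOT is complement
cheap_car_not_rental = cheap_and_car - docs("rental")   # cheap car NOT rental
```

Adding NOT rental removes document 2 from the cheap car results, which is the effect the rental example is meant to show.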
Should allow for adjacency ("cheap near car" as opposed to "cheap car")
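A "near" search needs to know where in the document each word occurs, not just whether it occurs. The sketch below records word positions and treats two words as near when they are at most max_gap words apart. The function names and the default gap of 3 are assumptions, since each engine defines "near" in its own way.

```python
import re


def positions(text):
    """Map each word to the list of positions where it occurs in text."""
    table = {}
    for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        table.setdefault(word, []).append(pos)
    return table


def near(text, first, second, max_gap=3):
    """True if the two words occur within max_gap positions of each other."""
    table = positions(text)
    return any(
        abs(a - b) <= max_gap
        for a in table.get(first, [])
        for b in table.get(second, [])
    )


close = near("a cheap used car for sale", "cheap", "car")
far = near(
    "cheap hotels are listed here and further down the page a car",
    "cheap",
    "car",
)
```

Both sample texts contain cheap and car, so a plain search would return both, but only the first has the two words close together.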
Should weight each word by importance
Should weight each word by frequency in the document: a document
with the keyword repeated is a better match than a document with the keyword only once.
Should weight each word inversely by frequency in the entire database.
Rarely used words are more important.
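The two frequency weightings above are often combined into a single score, commonly called TF-IDF: the count of the word in the document, multiplied by the logarithm of how rare the word is across the database. The notes do not name a formula, so the one below is a common textbook choice, not necessarily what any engine uses, and the sample documents are made up.

```python
import math
from collections import Counter

# Hypothetical three-document database.
DOCS = {
    "d1": "cheap car cheap car cheap",
    "d2": "cheap rental",
    "d3": "vintage car rental",
}


def tf_idf(word, doc_id):
    """Frequency in the document times inverse frequency in the database."""
    tf = Counter(DOCS[doc_id].split())[word]
    containing = sum(1 for text in DOCS.values() if word in text.split())
    if containing == 0:
        return 0.0
    return tf * math.log(len(DOCS) / containing)


def search(word):
    """Ids of documents containing the word, best score first."""
    hits = [d for d in DOCS if word in DOCS[d].split()]
    return sorted(hits, key=lambda d: tf_idf(word, d), reverse=True)


ranking = search("cheap")
```

Document d1 ranks first for cheap because it repeats the word. Within d3, vintage outscores car because vintage appears in only one document of the database.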
Can also use associated words (something very like synonyms)
Are words that mean the same thing (cheap and inexpensive) or are
closely related in concept (Star Trek and Captain Kirk)
Can get associations from the database, or from a human built list
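Using associated words amounts to expanding each query word before looking it up. The sketch below assumes a small human-built association list and a made-up index. Deriving the associations from the database itself is not shown.

```python
# Hypothetical human-built association list: word -> related words.
ASSOCIATIONS = {
    "cheap": {"inexpensive"},
    "inexpensive": {"cheap"},
}

# Hypothetical inverted index: word -> set of document ids.
INDEX = {
    "cheap": {1},
    "inexpensive": {2, 3},
    "car": {1, 2, 4},
}


def expand(word):
    """The word itself plus any associated words."""
    return {word} | ASSOCIATIONS.get(word, set())


def search_all(words):
    """Documents matching every query word, where an associated word counts."""
    result = None
    for word in words:
        matches = set()
        for variant in expand(word):
            matches |= INDEX.get(variant, set())
        result = matches if result is None else result & matches
    return result or set()


found = search_all(["cheap", "car"])
```

Without expansion, cheap car would match only document 1. With it, document 2 (inexpensive car) is found as well.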
Search engines should be fast and thorough. Yeah, right!
Links that appear in pages that are results of searches are also good
results. Infoseek calls these "Best bets", Lycos "related topics".
Dynamic web pages
Are almost impossible for a robot to search, since it does not know how
to access them
Can be impossibly big. Imagine a web page telling the current
time, with a reload button.
Best bet is to include appropriate keywords in the search form.
That's a weak solution; I know of none better.
Getting listed
All major crawlers allow you to submit a URL, which then goes onto the
stack of pages to visit.
There are free placement services that offer to do all the major search
engines at once.
There are pricey placement services that promise more.
A comparison of web search features can be found at
A good collection of pages can be found at