Search Engines
-
Not directories -- that's different
-
What are they?
-
Robots gather pages and index them into a database
-
Sometimes other robots collect metadata about the data
-
Have an interface to the database to allow searches of the index
-
How does a robot/spider/crawler work? (A code sketch follows this list.)
-
While there are pages to go visit
-
Get some URL from the stack to go visit
-
Mark the URL as visited in the database
-
Read the page from the URL
-
If the page has additional links that have not yet been visited, add them
to the stack of pages to visit
-
Index all the words from the page into the database
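-
A minimal sketch of this loop in Python, assuming an in-memory stack, visited
set, and index (a real robot would keep these in a database); the regular
expressions for links and words are deliberately crude:

    import re
    import urllib.request

    def crawl(seed_url, max_pages=100):
        to_visit = [seed_url]      # stack of URLs still to visit
        visited = set()            # URLs already visited
        index = {}                 # word -> set of URLs containing it

        while to_visit and len(visited) < max_pages:
            url = to_visit.pop()             # get some URL from the stack
            if url in visited:
                continue
            visited.add(url)                 # mark the URL as visited
            try:
                page = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
            except OSError:
                continue
            # add links that have not yet been visited to the stack
            for link in re.findall(r'href="(http[^"]+)"', page):
                if link not in visited:
                    to_visit.append(link)
            # index all the words from the page
            for word in re.findall(r"[a-z]+", page.lower()):
                index.setdefault(word, set()).add(url)
        return index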
-
Good robots honor the file http://servername/robots.txt (a sketch of the
check follows this list)
-
Can describe what not to index
-
Can keep a robot out entirely
-
Up to robot to honor the request
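-
A sketch of the check in Python, using the standard urllib.robotparser
module (the robot name and URLs are placeholders):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://servername/robots.txt")
    rp.read()

    if rp.can_fetch("MyRobot", "http://servername/private/page.html"):
        pass   # the site allows this page to be fetched and indexed
    else:
        pass   # the site asked robots to stay out; a good robot honors that

    delay = rp.crawl_delay("MyRobot")   # some sites also suggest a crawl delay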
-
Good robots do not flood one site, but access the site slowly
-
Good robots revisit a site periodically to notice changes. For example,
AltaVista visits about once per month; Lycos probably less often
-
Robots have more work than they can handle. They sometimes only index
the top of a site, and not the leaves.
-
The best indexes associate each word with an importance.
-
The databases are HUGE. At AltaVista, 'randy' showed 697,758 hits; 'Linux'
2,235,947; 'linux' 4,949,183; 'bill gates' about 200,000
-
Titles and headings get more weight than normal text
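-
One way to realize both ideas is to weight each occurrence of a word by
where it appears. A sketch in Python (the weight values are made up):

    import re

    WEIGHTS = {"title": 10, "h1": 5, "body": 1}   # illustrative values only

    def index_page(url, html, index):
        """index maps word -> {url: accumulated importance}."""
        for tag, weight in WEIGHTS.items():
            pattern = rf"<{tag}[^>]*>(.*?)</{tag}>"
            for block in re.findall(pattern, html, re.S | re.I):
                # note: words inside <h1> also fall inside <body>,
                # so they simply accumulate more weight
                for word in re.findall(r"[a-z]+", block.lower()):
                    scores = index.setdefault(word, {})
                    scores[url] = scores.get(url, 0) + weight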
-
All indexes should honor the keywords meta tag
-
There are legal controversies over the meta tag
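-
A sketch of how an indexer might read the keywords meta tag, using Python's
standard html.parser module (the example tag is made up):

    from html.parser import HTMLParser

    class KeywordExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.keywords = []
        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and attrs.get("name", "").lower() == "keywords":
                self.keywords += [k.strip() for k in attrs.get("content", "").split(",")]

    p = KeywordExtractor()
    p.feed('<meta name="keywords" content="linux, kernel, operating systems">')
    print(p.keywords)    # ['linux', 'kernel', 'operating systems']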
-
Some pages are bogus, and just designed to get listed in the search database.
They are the SPAM of the web
-
People want hits.
-
Make pages on hot topics of current interest (Monica Lewinsky, Monica
Lewinski, Monika Lewinski)
-
Make pages on hot topics of permanent interest, typically with a list of
sexual words.
-
Have a link at the top or bottom to try to get you to the page they wish
you would access.
-
Searches are vital
-
Should allow for boolean logic (AND, OR, NOT)
-
Don't forget NOT (try searching for cheap car, and then cheap car NOT rental)
-
Should allow for adjacency ("cheap near car" as opposed to "cheap car")
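-
A sketch of how these operators can be evaluated over a positional index in
Python (the layout of the postings is an assumption, not any particular
engine's):

    # postings: word -> {url: sorted list of positions of the word in that page}

    def docs(postings, word):
        return set(postings.get(word, {}))

    def AND(a, b): return a & b
    def OR(a, b):  return a | b
    def NOT(a, b): return a - b     # e.g. docs for "cheap car" minus docs for "rental"

    def NEAR(postings, w1, w2, k=3):
        """Pages where w1 and w2 occur within k words of each other."""
        hits = set()
        for url in docs(postings, w1) & docs(postings, w2):
            pos1, pos2 = postings[w1][url], postings[w2][url]
            if any(abs(p - q) <= k for p in pos1 for q in pos2):
                hits.add(url)
        return hits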
-
Should weight each word by importance
-
Should weight each word by its frequency in the document; a document that
repeats a keyword is likely a better match than a document containing the
keyword only once
-
Should weight each word inversely by its frequency in the entire database;
rarely used words are more important
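-
Together, these last two rules are essentially the classic tf-idf weighting;
a sketch in Python:

    import math

    def score(word, doc_words, all_docs):
        """doc_words: list of words in one document; all_docs: list of such lists."""
        tf = doc_words.count(word)                     # frequency in this document
        df = sum(1 for d in all_docs if word in d)     # documents containing the word
        idf = math.log(len(all_docs) / (1 + df))       # rarely used words score higher
        return tf * idf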
-
Can also use associated words (something very like synonyms)
-
These are words that mean the same thing (cheap and inexpensive) or are
closely related in concept (Star Trek and Captain Kirk)
-
Can get associations from the database, or from a human-built list
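-
A sketch of expanding a query with an association list in Python (the table
here is hand-built; the same code works with associations mined from the
database):

    ASSOCIATIONS = {
        "cheap": ["inexpensive"],
        "star trek": ["captain kirk"],
    }

    def expand(query_terms):
        expanded = list(query_terms)
        for term in query_terms:
            expanded.extend(ASSOCIATIONS.get(term, []))
        return expanded

    print(expand(["cheap", "car"]))    # ['cheap', 'car', 'inexpensive']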
-
Search engines should be fast and thorough. Yeah, right!
-
Links that appear in pages that are themselves results of searches are also
good results. Infoseek calls these "Best bets"; Lycos, "related topics"
-
Dynamic web pages
-
Are almost impossible for a robot to search, since it does not know how
to access them
-
Can be impossibly big. Imagine a web page telling the current
time, with a reload button.
-
Best bet is to include appropriate keywords in the search form.
-
That's a weak solution; I know of none better.
-
Getting listed
-
All major crawlers allow you to submit a URL, which then goes onto the
queue.
-
There are free placement services that offer to do all the major search
engines at once.
-
There are pricey placement services that promise more.
-
A comparison of web search features can be found at http://www.kcpl.lib.mo.us/search/chart.htm.
-
A good collection of pages can be found at http://www.searchenginewatch.com.