Search Engines
-
Not directories -- that's different
-
What are they?
-
Robots gather pages and index them into a database
-
Sometimes other robots collect metadata about the data
-
Have an interface to the database to allow searches of the index
-
How does a robot/spider/crawler work? (A code sketch follows this list.)
-
While there are pages to go visit
-
Get some URL from the stack to go visit
-
Mark the URL as visited in the database
-
Read the page from the URL
-
If the page has additional links that have not yet been visited, add them
to the stack of pages to visit
-
Index all the words from the page into the database
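-
A minimal sketch of this loop in Python, assuming an in-memory stack, visited
set, and index (a real robot would keep these in a database); the regular
expressions for links and words are deliberately crude:

    import re
    import urllib.request

    def crawl(seed_url, max_pages=100):
        to_visit = [seed_url]      # stack of URLs still to visit
        visited = set()            # URLs already visited
        index = {}                 # word -> set of URLs containing it

        while to_visit and len(visited) < max_pages:
            url = to_visit.pop()             # get some URL from the stack
            if url in visited:
                continue
            visited.add(url)                 # mark the URL as visited
            try:
                page = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
            except OSError:
                continue
            # add links that have not yet been visited to the stack
            for link in re.findall(r'href="(http[^"]+)"', page):
                if link not in visited:
                    to_visit.append(link)
            # index all the words from the page
            for word in re.findall(r"[a-z]+", page.lower()):
                index.setdefault(word, set()).add(url)
        return index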
-
Good robots honor the file http://servername/robots.txt (a sketch of the
check follows this list)
-
Can describe what not to index
-
Can keep a robot out entirely
-
Up to robot to honor the request
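-
A sketch of the check in Python, using the standard urllib.robotparser
module (the robot name and URLs are placeholders):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://servername/robots.txt")
    rp.read()

    if rp.can_fetch("MyRobot", "http://servername/private/page.html"):
        pass   # the site allows this page to be fetched and indexed
    else:
        pass   # the site asked robots to stay out; a good robot honors that

    delay = rp.crawl_delay("MyRobot")   # some sites also suggest a crawl delay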
-
Good robots do not flood one site, but access the site slowly
-
Good robots revisit a site periodically to notice changes. For example,
AltaVista visits about once per month; Lycos probably less often
-
Robots have more work than they can handle. They sometimes only index
the top of a site, and not the leaves.
-
The best indexes associate each word with an importance.
-
The databases are HUGE. At AltaVista, 'randy' showed 697,758 hits; 'Linux'
2,235,947; 'linux' 4,949,183; 'bill gates' about 200,000
-
Titles and headings get more weight than normal text
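-
One way to realize both ideas is to weight each occurrence of a word by
where it appears. A sketch in Python (the weight values are made up):

    import re

    WEIGHTS = {"title": 10, "h1": 5, "body": 1}   # illustrative values only

    def index_page(url, html, index):
        """index maps word -> {url: accumulated importance}."""
        for tag, weight in WEIGHTS.items():
            pattern = rf"<{tag}[^>]*>(.*?)</{tag}>"
            for block in re.findall(pattern, html, re.S | re.I):
                # note: words inside <h1> also fall inside <body>,
                # so they simply accumulate more weight
                for word in re.findall(r"[a-z]+", block.lower()):
                    scores = index.setdefault(word, {})
                    scores[url] = scores.get(url, 0) + weight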
-
All indexes should honor the keywords meta tag
-
There are legal controversies over the meta tag
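-
A sketch of how an indexer might read the keywords meta tag, using Python's
standard html.parser module (the example tag is made up):

    from html.parser import HTMLParser

    class KeywordExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.keywords = []
        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and attrs.get("name", "").lower() == "keywords":
                self.keywords += [k.strip() for k in attrs.get("content", "").split(",")]

    p = KeywordExtractor()
    p.feed('<meta name="keywords" content="linux, kernel, operating systems">')
    print(p.keywords)    # ['linux', 'kernel', 'operating systems']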
-
Some pages are bogus, and just designed to get listed in the search database.
They are the SPAM of the web
-
People want hits.
-
Make pages on hot topics of current interest (Monica Lewinsky, Monica
Lewinski, Monika Lewinski)
-
Make pages on hot topics of permanent interest, typically with a list of
sexual words.
-
Have a link at the top or bottom to try to get you to the page they wish
you would access.
-
Searches are vital
-
Should allow for boolean logic (AND, OR, NOT)
-
Don't forget NOT (try searching for cheap car, and then cheap car NOT rental)
-
Should allow for adjacency ("cheap near car" as opposed to "cheap car")
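-
A sketch of how these operators can be evaluated over a positional index in
Python (the layout of the postings is an assumption, not any particular
engine's):

    # postings: word -> {url: sorted list of positions of the word in that page}

    def docs(postings, word):
        return set(postings.get(word, {}))

    def AND(a, b): return a & b
    def OR(a, b):  return a | b
    def NOT(a, b): return a - b     # e.g. docs for "cheap car" minus docs for "rental"

    def NEAR(postings, w1, w2, k=3):
        """Pages where w1 and w2 occur within k words of each other."""
        hits = set()
        for url in docs(postings, w1) & docs(postings, w2):
            pos1, pos2 = postings[w1][url], postings[w2][url]
            if any(abs(p - q) <= k for p in pos1 for q in pos2):
                hits.add(url)
        return hits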
-
Should weight each word by importance
-
Should weight each word by its frequency in the document; a document that
repeats a keyword is likely a better match than a document containing the
keyword only once
-
Should weight each word inversely by its frequency in the entire database;
rarely used words are more important
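-
Together, these last two rules are essentially the classic tf-idf weighting;
a sketch in Python:

    import math

    def score(word, doc_words, all_docs):
        """doc_words: list of words in one document; all_docs: list of such lists."""
        tf = doc_words.count(word)                     # frequency in this document
        df = sum(1 for d in all_docs if word in d)     # documents containing the word
        idf = math.log(len(all_docs) / (1 + df))       # rarely used words score higher
        return tf * idf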
-
Can also use associated words (something very like synonyms)
-
These are words that mean the same thing (cheap and inexpensive) or are
closely related in concept (Star Trek and Captain Kirk)
-
Can get associations from the database, or from a human-built list
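-
A sketch of expanding a query with an association list in Python (the table
here is hand-built; the same code works with associations mined from the
database):

    ASSOCIATIONS = {
        "cheap": ["inexpensive"],
        "star trek": ["captain kirk"],
    }

    def expand(query_terms):
        expanded = list(query_terms)
        for term in query_terms:
            expanded.extend(ASSOCIATIONS.get(term, []))
        return expanded

    print(expand(["cheap", "car"]))    # ['cheap', 'car', 'inexpensive']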
-
Search engines should be fast and thorough. Yeah, right!
-
Links that appear in pages that are themselves results of searches are also
good results. Infoseek calls these "Best bets"; Lycos, "related topics"
-
Dynamic web pages
-
Are almost impossible for a robot to search, since it does not know how
to access them
-
Can be impossibly big. Imagine a web page telling the current
time, with a reload button.
-
Best bet is to include appropriate keywords in the search form.
-
That's a weak solution; I know of none better.
-
Getting listed
-
All major crawlers allow you to submit a URL, which then goes onto the
queue.
-
There are free placement services that offer to do all the major search
engines at once.
-
There are pricey placement services that promise more.
-
A comparison of web search features can be found at http://www.kcpl.lib.mo.us/search/chart.htm.
-
A good collection of pages can be found at http://www.searchenginewatch.com.