This post is part of an ongoing series: How Search Really Works.
Last week: The Compressed Index.
While human beings can scan a page and see if the whole phrase "a grandiloquent dictionary" appears on it, a search engine can't.
A search engine needs to:
- Look up the occurrences of each word in the phrase
- See if the positions of words in the document fit the phrase
A search engine isn't smart, so it needs to work smart.
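The two steps above can be sketched in a few lines of Python. This is only a toy illustration, assuming an in-memory positional index that maps each word to the documents and positions where it occurs; the names `build_positional_index` and `phrase_match` and the sample documents are made up for this example, and a real engine stores this data very differently.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index


def phrase_match(index, phrase):
    """Return the ids of documents that contain the exact phrase."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return set()

    # Step 1: look up the occurrences of each word in the phrase.
    postings = [index[word] for word in words]

    # Only documents that contain every word can contain the phrase.
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)

    # Step 2: check that the positions line up: word i must sit
    # exactly i places after the first word.
    matches = set()
    for doc_id in candidates:
        for start in postings[0][doc_id]:
            if all(start + i in postings[i][doc_id]
                   for i in range(1, len(words))):
                matches.add(doc_id)
                break
    return matches


# Hypothetical sample documents, for illustration only.
docs = {
    1: "a grandiloquent dictionary of rare words",
    2: "a dictionary for the grandiloquent",
    3: "a plain dictionary",
}
index = build_positional_index(docs)
result = phrase_match(index, "a grandiloquent dictionary")
```

Document 2 contains all three words but not in the right order, so only document 1 ends up in `result`.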
Leverage Keyword Frequency
By storing how often each word appears across the whole index, we can start straight away from the smallest set of candidate documents.
Instead of selecting the 15,570,000,000 documents in which "a" occurs and then checking which of them also contain "grandiloquent" and "dictionary", we can immediately limit the set to the 222,000 documents that contain the relatively rare "grandiloquent".
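Here is a minimal sketch of that idea, assuming each word's posting list is a plain set of document ids and that the frequency is stored alongside it. The names `query_order` and `candidate_documents` and the tiny data set are hypothetical, chosen only to show the ordering; they are not taken from any real engine.

```python
def query_order(words, doc_frequency):
    """Order the query words from rarest to most common."""
    return sorted(words, key=lambda word: doc_frequency.get(word, 0))


def candidate_documents(words, postings, doc_frequency):
    """Intersect posting lists, starting with the rarest word."""
    if not words or any(word not in postings for word in words):
        return set()
    ordered = query_order(words, doc_frequency)
    # Start from the smallest set so every later step can only shrink it.
    candidates = set(postings[ordered[0]])
    for word in ordered[1:]:
        candidates &= postings[word]
        if not candidates:
            break  # nothing left to check
    return candidates


# Toy posting lists: word -> ids of documents containing it (made up).
postings = {
    "a": {1, 2, 3, 4, 5, 6},
    "dictionary": {1, 2, 3},
    "grandiloquent": {1, 2},
}
# Frequencies stored at indexing time, so no posting list has to be
# read just to find out how big it is.
doc_frequency = {word: len(ids) for word, ids in postings.items()}

query = ["a", "grandiloquent", "dictionary"]
order = query_order(query, doc_frequency)
candidates = candidate_documents(query, postings, doc_frequency)
```

With this data the rare word is handled first, so the common "a" is only ever checked against the two documents that survived the earlier steps.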