This post is part of an ongoing series: How Search Really Works.
Last week: The Compressed Index.
While human beings can scan a page and see if the whole phrase "a grandiloquent dictionary" appears on it, a search engine can't.
A search engine needs to:
- Look up the occurrences of each word in the phrase
- See if the positions of words in the document fit the phrase
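Those two steps can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular search engine does it: the function names `build_positional_index` and `phrase_match` are made up for this example, and the "index" is a plain dictionary that records the position of every word in every document.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def phrase_match(index, phrase):
    """Return the ids of documents that contain the exact phrase."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return set()

    # Step 1: look up the occurrences of each word in the phrase.
    postings = [index[w] for w in words]

    # Only documents that contain every word can contain the phrase.
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)

    # Step 2: see if the positions of the words fit the phrase, i.e.
    # word i sits exactly i places after the first word.
    hits = set()
    for doc_id in candidates:
        for start in postings[0][doc_id]:
            if all(start + i in postings[i][doc_id]
                   for i in range(1, len(words))):
                hits.add(doc_id)
                break
    return hits


# Hypothetical toy documents, for illustration only.
docs = {
    1: "a grandiloquent dictionary of words",
    2: "a dictionary for the grandiloquent",
    3: "a plain dictionary",
}
index = build_positional_index(docs)
matches = phrase_match(index, "a grandiloquent dictionary")
```

Document 2 contains all three words, but not next to each other in the right order, so only the position check can rule it out.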
As a search engine isn't smart, it needs to work smart.
Leverage Keyword Frequency
By storing how often each word appears across the whole index, we can immediately cut down to the smallest set from which to draw results.
Instead of selecting the 15,570,000,000 documents in which "a" occurs and then checking which of them also contain "grandiloquent" and "dictionary", we can immediately limit the set to the 222,000 documents that contain the relatively rare "grandiloquent".
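Here is a rough sketch of the "rarest word first" idea. It assumes a toy index in which each word maps to the set of documents containing it, so the size of that set stands in for the stored frequency. The function name `candidate_documents` and the tiny document sets are invented for the example and have nothing to do with the real figures above.

```python
def candidate_documents(postings, phrase):
    """Order the phrase's words from rarest to most common, then
    narrow the candidate set starting with the rarest word."""
    words = set(phrase.lower().split())
    if not words or any(w not in postings for w in words):
        return [], set()

    # The number of documents per word acts as the stored frequency.
    # Ties are broken alphabetically to keep the order predictable.
    ordered = sorted(words, key=lambda w: (len(postings[w]), w))

    # Start from the smallest set, so every later step only has to
    # check a handful of documents instead of the whole index.
    candidates = set(postings[ordered[0]])
    for word in ordered[1:]:
        candidates &= postings[word]
    return ordered, candidates


# Hypothetical toy index: word -> ids of documents containing it.
postings = {
    "a": {1, 2, 3, 4, 5, 6},
    "dictionary": {1, 2, 3},
    "grandiloquent": {1, 2},
}
order, candidates = candidate_documents(postings, "a grandiloquent dictionary")
```

With this toy data the rare word comes first in `order` and the very common "a" comes last, by which point the candidate set is already down to two documents. Those candidates still need the position check to confirm the phrase itself.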
I wish I wrote this! Nice work explaining a techie concept Ruud!
Next up, I’d love to see you explain query-dependent and query-independent stuff, because I don’t understand that very well, personally. My understanding is limited to some factors being processed ahead of time (prior to the search occurring, and being general relevance factors like PR and domain trust/age) and others being calculated on the fly (intitle, inanchor etc.)
Oh, and – Sphunn!
Thanks Gab. I can’t promise your topic is “next up” but I do have a whole slew of posts still to go!
I haven’t been able to catch up on your posts for a while but I’m doing so now. I hope you keep up this series for a while.
Happy to see you like it Jordan. Thanks for adding me on Twitter, by the way!
I hope to keep the series going for a while, yes.