This post is part of an ongoing series: How Search Really Works.
Last week: "The" Index (1).
Last week we saw how an inverted index (where a list of words points to a list of documents in which they appear) is insanely useful for doing AND queries.
But what if you're not looking for any document that has the words search AND people AND engine but you're looking for Search Engine People?
Well, if document 42 in our example reads "the engine was found after a search by some people" or "people use a search engine such as Google" than a traditional inverted index would think it's spot-on for your search. Ai....
(Extended) Biword Inverted Index
One way to go about this would be to somehow generate an inverted list of phrases.
Problem: if you index phrases of 2-words length, 3+ word phrase searches become just another AND query, combining the parts of the phrase. "Long John" AND "Silver".
Indexing phrases of 3-words length simply moves the problem to 4+ word phrase searches.... etc. etc.
Problem: the inverted index becomes huge, listing every word in every document and every 2 (3? 4?) word phrase in every document....
Positional Inverted Index
The only real solution is to store not only the incidence of occurrence of a word in a document but to store the exact position(s) of the word in that document.
In this example document 42 is identified for "search engine people" because the words appear in that order: they appear in positions 1, 2 and 3.
Advantage: because the positional index is similar in construction as the traditional inverted index it inherits the same advantage. That is, when doing an AND query it can jump ahead whenever one of the words doesn't occur in the document it is looking at.
Advantage: simply by looking at words occurring in the right order, any phrase of any length can be found even though it isn't indexed as such.
Advantage: by having precise position information we can do proximity queries.
Advantage: phrase matching and query word proximity can also be used to rank search results.
The Winner
Although a positional index is at least 2-4 times (or up to 50%) larger than a traditional inverted index the payoff is so large that this is the type of index in use by commercial search engines -- for phrases. ... In general....
Frequently searched phrases are still better stored in a biwords index; less frequently searched phrases are better processed with a positional inverted index.
Index Type & SEO
The fun (warning: geek talking!) is of course that knowing this kind of stuff implicitly explains you things.
For example, knowing that in order for a positional inverted index to really work all words, including so-called "stop words", need to be indexed makes it less surprising that stop words are dead.
Positional indexing and retrieval also makes it not only logic but expected that shop in new york and shop new york give different results.
Different results = different thinking, different SEO... different opportunities.
It's all in the index 🙂
Wow, this is advanced search engine stuff. Thanks for the info
Another great post Ruud! Stop words are definitely not dead. They are such a large part of natural language search queries.
There’s a trick, you know.