How Search Really Works: The Compressed Index

This post is part of an ongoing series: How Search Really Works.
Last week: Recognize this index?

Memory is much faster than looking things up.

In order for a search engine in high demand to serve its users efficiently it should keep things in memory instead of looking it up on a disk.

Traditionally large scale search engines will keep their complete dictionary in memory and the posting list on disk.

Inefficient Storage

Obviously the more you can keep in memory and the more information can be read back with one disk action, the better.

Unfortunately computer information gets inefficiently stored in boxes with fixed dimension: if a box is 10 characters wide a 4 character word still takes up 1 box of 10 characters.

Compression

The solution is to squeeze information together so the least amount of space contains the maximum amount of information.

Big String Compressed Dictionary

could become:

By adding the length of each word to every entry we can make the list of words hundreds of characters shorter.

Gap Compressed Posting List

could become:

By storing the difference between the document ID's (the gaps) we can save hundreds of characters.

The same could be done for storing the gaps between the index numbers for the occurrence positions in each document.

This "compressed representation encodes occurrences of a term as a pointer to the next occurrence of the term to facilitate rapid enumeration of the occurrences of the term".

You could search "occurrences of the terms in the set of documents by following pointers through the compressed representation".

Partial Decompression

An index stored this way is completely lossless: it retains all information from document identifier to document positional identifier.

By starting with the least frequently used term in the search it is very easy to unravel to do a partial decompression of this index by

"identifying occurrences of the terms in the set of documents"

(the dictionary) - to then use;

"the corresponding term identifiers for the terms in the search request to look up a term offset table for a pointer to a first occurrence of the terms in the compressed representation of the set of documents"

(the very first posting document ID);

"and following a chain of pointers starting at the first occurrence to identify other occurrences of the terms in the compressed representation of the set of documents"

(the gap compressed document ID list)

Recommended reading:

New Google Approach to Indexing and Stopwords

About the Author: Ruud Hein

I love helping to make web sites make it. From the ground up if needed. CSS challenges, server-side scripting, user and device friendly JavaScript tricks search engines have no problems with. Tracking how the sites perform and then figuring out how to make that performance and the tracking better. I'm passionate about information. No matter how often I trim my feeds in my feed readers (yes, I use more than one), I always have a couple of hundred in there covering topics ranging from design to usability, from SEO to SEM, from life hacks to productivity blogs, from.... Well, you get the idea, I guess. Knowledge and information management is close to my heart. Has to be with the amount of information I track. My "trusted system" is usually in flux but always at hand and fully searchable. My ~~paid passion~~ job at Search Engine People sees me applying my passions and knowledge to a wide array of problems, ones I usually experience as challenges. It's good to have you here: pleased to meet you!

8 Comments

Benny March 15, 2008 at 6:19 pm

Nice post you got going here. Keep up the good work, you have really nice articles.
spostareduro March 16, 2008 at 3:08 pm

Ruud, I kinda got lost on this one. Right about here:
“compressed representation encodes occurrences of a term as a pointer to the next occurrence of the term to facilitate rapid enumeration of the occurrences of the term”…I hate being blond! 🙂
Gab "SEO ROI" Goldenberg March 18, 2008 at 8:29 pm

Nice technical post Ruud :).

Comments are closed.

How Search Really Works: The Compressed Index

Inefficient Storage

Compression

Big String Compressed Dictionary

Gap Compressed Posting List

Partial Decompression

About the Author: Ruud Hein

Related Posts

Why You Should A/B Test Landing Pages When Running Digital Ads

Why I Still Do SEO After 20 Years — And Why You Might, Too

GEO vs AEO vs SEO: What’s the Difference and How to Optimize for All Three

8 Comments

Subscribe to Our Newsletter

Related Articles

Why You Should A/B Test Landing Pages When Running Digital Ads

Why I Still Do SEO After 20 Years — And Why You Might, Too

GEO vs AEO vs SEO: What’s the Difference and How to Optimize for All Three

Google Search Impressions Drop 12.3% YoY What’s Behind It?