This post is part of an ongoing series: How Search Really Works.
Previously: Relevance (1)
Another way we can assess the relevance of a document is by term weighting.
From the keyword density myth we know that true term weighting is done collection-wide.
By looking at the number of documents in the index that a term appears in, we can measure information: how good, how special... how meaningful is this word?
The word "the" would not be special at all, appearing in way too many documents. Its worth would be close to zero.
But Klebenleiben ("the reluctance to stop talking about a certain subject"...) would be very special indeed! Because it appears in only 18 documents among millions, its worth, its weight, would automatically be very high.
The measure is called inverse document frequency.
This measure is our weight; it is what we use to judge the relevance of a document.
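The exact formula varies per engine, but the classic form uses a logarithm to dampen the scale. Here is a minimal sketch in Python; the function name and the toy collection are just illustrations, not what any particular engine runs:

```python
import math

def inverse_document_frequency(term, documents):
    """Classic IDF: log(N / df), where N is the collection size and
    df is the number of documents the term appears in. Ubiquitous
    terms score near zero; rare terms score high."""
    n_docs = len(documents)
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        return 0.0  # term not in the collection at all
    return math.log(n_docs / df)

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "end"]]
inverse_document_frequency("the", docs)  # 0.0  -- appears everywhere
inverse_document_frequency("cat", docs)  # ~1.1 -- rare, so it carries weight
```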
Term Frequency Times
We do so by counting the number of times a word appears in a document. We normalize that count; we adjust it so that the length of a document doesn't matter that much anymore.
We then multiply it by our weight measurement: TF x IDF. Term Frequency times Inverse Document Frequency.
In other words, a high count of a rare word = a high score for that document, for that word. But... a high count of a common word = not so high score for that document, for that word.
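As a rough sketch of the whole score, building on the IDF function above. Here the normalization is simply dividing the raw count by the document length; real systems use more refined schemes:

```python
def tf_idf(term, doc, documents):
    """Length-normalized term frequency times inverse document
    frequency. `doc` is a list of tokens."""
    tf = doc.count(term) / len(doc)  # normalize: long docs don't win by default
    return tf * inverse_document_frequency(term, documents)
```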
Vectors
A vector is a line of a certain length in a certain direction.
Both the length and the direction of the line represent important information.
Vectors enable us to represent, to talk about, size and direction when position is irrelevant. Wind speed, velocity, force, acceleration: all these are good candidates to be represented as vectors.
TFxIDF scores are perfectly suited to be represented as vectors.
Vector Space
Think of the words that make up our index as axes of a space.
Of course in a real index this space would consist of thousands upon thousands of axes...
Documents as Vectors
For each term in our document we can draw a line (vector) whose length shows the document's TFxIDF score for that term.
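A sketch of that idea, reusing the tf_idf function above: a document becomes a single vector with a TFxIDF component on every vocabulary axis (most components will be zero):

```python
def document_vector(doc, vocabulary, documents):
    """One component per vocabulary term (axis); each component is
    this document's TFxIDF score for that term."""
    return [tf_idf(term, doc, documents) for term in vocabulary]
```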
Queries as Vectors
Every word in a query can also be shown as a vector.
By looking at documents that are "near" our query we can rank (sort) documents in our result set.
TFxIDF Vector Space Ranking
If a document is close to our query, it answers our query.
But better yet: documents close to ours are similar documents. They're talking about roughly the same thing.
This makes TFxIDF vector space ranking extremely useful for finding sets of similar documents through "closeness".
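"Closeness" is usually measured by the angle between the vectors (cosine similarity) rather than by raw distance, so that long documents don't dominate short ones. A sketch of ranking this way, building on the helpers above (the names are illustrative):

```python
def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means the same
    direction, 0.0 means nothing in common."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_closeness(query_tokens, documents, vocabulary):
    """Score every document against the query vector and sort the
    result set, closest first."""
    query_vec = document_vector(query_tokens, vocabulary, documents)
    scored = [(cosine_similarity(query_vec,
                                 document_vector(d, vocabulary, documents)), d)
              for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```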
Hi Ruud,
Excellent post as usual. It is important to mention that the vector space model for ranking is not currently practical for the top search engines due to the size of their index (and the corresponding size of the document vectors). While they use huge matrices for computing the importance of links (PageRank), the process is done offline and is query-independent. Computing such vectors at query time would be prohibitively expensive in time and resources.
Cheers
Good indeed to point that out. Doing any of this at run time is extremely costly. There are cost-reducing procedures, such as working with the top N documents or with leader/follower samples.
Yet I too think that this isn’t used at run time (read: query time) because the TFxIDF vector space model is geared towards words. The IDF of a word is computed, not of phrases. All in all it doesn’t deliver enough bang for its buck.
Worse: it’s typically a model for a clean index. Boosting TF for a high-IDF word is too easy when you have search access to the whole collection.
It’s interesting though to see how this model can find related documents.
As usual Ruud this is a great post. It’s always interesting to learn the inner workings of an SE 🙂
An excellent analysis of how to weight terms by their frequency. But I doubt that a two-dimensional space is enough to represent the complexity needed to maintain an index of millions of documents.