
NLP - Getting started [Part 4]

Vector Space Model

The Search Task

Given a query and a corpus, find relevant items.

query: a textual description of the user’s information need

corpus: a repository of textual documents

relevance: satisfaction of the user’s information need

Retrieval Model

A formal method that predicts the degree of relevance of a document to a query.

Basic Retrieval Process

[Figure: the basic retrieval process]

Vector Space Model

The Vector Space Model (VSM) is a way of representing documents through the words that they contain.

It is a standard technique in Information Retrieval.

The VSM allows decisions to be made about which documents are similar to each other and to keyword queries.

Formally, a vector space is defined by a set of linearly independent basis vectors.

The basis vectors correspond to the dimensions or directions of the vector space.

[Figure: basis vectors defining the dimensions of a vector space]

The Vector Space Model represents documents and terms as vectors in a multi-dimensional space. Each dimension corresponds to a unique term in the entire corpus of documents.

[Figure: documents represented as vectors in term space]

How it works:

  • Each document is broken down into a word frequency table.

  • The tables are called vectors and can be stored as arrays.

  • A vocabulary is built from all the words in all documents in the system.

  • Each document is represented as a vector over the vocabulary.

Example:

[Figure: example word frequency vectors]

Queries can be represented as vectors in the same way as documents:

Dog = [0, 0, 0, 1, 0]
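The steps above can be sketched in a few lines of Python. The toy corpus and helper names here are illustrative, not from the post:

```python
from collections import Counter

# Hypothetical toy corpus; any small set of documents works the same way.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
}

# Build the vocabulary from all the words in all documents.
vocab = sorted({w for text in docs.values() for w in text.split()})

def to_vector(text):
    """Represent a text as a term-frequency vector over the vocabulary."""
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

doc_vectors = {name: to_vector(text) for name, text in docs.items()}

# A query is vectorised in exactly the same way as a document.
query_vector = to_vector("dog")
```

With this corpus the vocabulary has seven terms, and the query "dog" becomes a vector that is zero everywhere except in the "dog" dimension.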

Document-Term Matrix

[Figure: a document-term matrix, with one row per document and one column per vocabulary term]

Similarity measures

There are many different ways to measure how similar two documents are, or how similar a document is to a query.

The cosine measure is a very common similarity measure.

Using a similarity measure, a set of documents can be compared to a query and the most similar document returned.

The cosine measure

For two vectors d and d’ the cosine similarity between d and d’ is given by:

sim(d, d′) = (d · d′) / (|d| |d′|)

Here d · d′ is the dot (inner) product of d and d′, calculated by multiplying corresponding frequencies together and summing the results; |d| and |d′| are the magnitudes of the vectors.

The cosine measure calculates the cosine of the angle between the vectors in a high-dimensional space.
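The formula above can be implemented directly; this sketch uses plain Python lists as the vectors:

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two term-frequency vectors."""
    # Dot product: multiply corresponding frequencies and sum.
    dot = sum(a * b for a, b in zip(d1, d2))
    # Magnitudes of the two vectors.
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention: an all-zero vector matches nothing
    return dot / (norm1 * norm2)
```

Identical vectors score 1.0, and vectors sharing no terms score 0.0.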

Ranking documents

A user enters a query.

The query is compared to all documents using a similarity measure.

The user is shown the documents in decreasing order of similarity to the query.
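A minimal ranking loop, assuming the query and documents have already been converted to vectors over a shared vocabulary (the three-term vocabulary here is illustrative):

```python
import math

def cosine(d1, d2):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = math.sqrt(sum(a * a for a in d1))
    n2 = math.sqrt(sum(b * b for b in d2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def rank(query_vec, doc_vectors):
    """Return (name, score) pairs in decreasing order of similarity."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Illustrative vocabulary: [cat, dog, mat]
docs = {"d1": [2, 0, 1], "d2": [0, 3, 0]}
results = rank([0, 1, 0], docs)  # query containing only "dog"
```

Here d2 is returned first because it is the only document containing the query term.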

Cosine Similarity

[Figure: cosine similarity between document vectors]

VSM Variations

Vocabulary

Stopword lists

Commonly occurring words are unlikely to give useful information and may be removed from the vocabulary to speed processing.

Stopword lists contain frequent words to be excluded

Stopword lists need to be used carefully

  • E.g. “to be or not to be”
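The caution is easy to demonstrate. With a small illustrative stopword list (real systems use larger curated lists), naive filtering destroys the query above entirely:

```python
# Illustrative stopword list; an assumption for this sketch.
stopwords = {"the", "a", "to", "be", "or", "not", "of", "and"}

def remove_stopwords(text):
    """Drop stopwords from a text, keeping the remaining words in order."""
    return [w for w in text.lower().split() if w not in stopwords]

# Every word of the Hamlet query is a stopword, so nothing survives.
remove_stopwords("to be or not to be")  # -> []
```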

Term weighting

Not all words are equally useful

A word is most likely to be highly relevant to document A if it is:

  • Infrequent in other documents
  • Frequent in document A

The cosine measure needs to be modified to reflect this

Normalised term frequency (tf)

A normalised measure of the importance of a word to a document is its frequency, divided by the maximum frequency of any term in the document

This is known as the tf factor.

This stops large documents from scoring higher simply because they contain more words.
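The tf factor as defined above can be sketched as:

```python
from collections import Counter

def tf(word, text):
    """Normalised term frequency: the word's count divided by the
    maximum count of any term in the document."""
    counts = Counter(text.split())
    if not counts:
        return 0.0
    return counts[word] / max(counts.values())

# "the" is the most frequent term (2 occurrences), so its tf is 1.0;
# "cat" occurs once, giving 1/2.
tf("the", "the cat sat on the mat")  # -> 1.0
```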

Inverse document frequency (idf)

A calculation designed to make rare words more important than common words

The idf of word i is given by

idf_i = log(N / n_i)

Where N is the number of documents and n_i is the number that contain word i
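A direct implementation of the definition, with a small illustrative corpus:

```python
import math

def idf(word, documents):
    """idf_i = log(N / n_i), where N is the number of documents
    and n_i is the number that contain the word."""
    n = sum(1 for doc in documents if word in doc.split())
    if n == 0:
        return 0.0  # convention for unseen words; some variants add smoothing
    return math.log(len(documents) / n)

docs = ["the cat sat", "the dog ran", "the cat ran"]
idf("the", docs)  # appears in every document -> log(3/3) = 0.0
```

A word that appears in every document gets idf 0 (it carries no discriminating information), while rarer words get larger weights.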

tf-idf

The tf-idf weighting scheme assigns each word in each document a weight equal to the product of its tf factor and its idf factor.

Different schemes are usually used for query vectors.

Different variants of tf-idf are also used.

w_{i,d} = tf_{i,d} × idf_i
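Putting the pieces together, one common variant of tf-idf weighting can be sketched as follows (the function and corpus are illustrative):

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Weight each term in each document by tf * idf (one common variant:
    max-normalised tf, unsmoothed idf)."""
    token_lists = [t.split() for t in texts]
    vocab = sorted({w for tokens in token_lists for w in tokens})
    n_docs = len(texts)
    # Document frequency: in how many documents each term occurs.
    df = {w: sum(1 for tokens in token_lists if w in tokens) for w in vocab}
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        max_count = max(counts.values())
        vec = [
            (counts[w] / max_count) * math.log(n_docs / df[w]) if counts[w] else 0.0
            for w in vocab
        ]
        vectors.append(vec)
    return vocab, vectors

vocab, vecs = tfidf_vectors(["cat cat dog", "dog bird"])
```

In this toy corpus "dog" appears in both documents, so its weight is zero everywhere, while "cat" and "bird" each get a positive weight in the one document that contains them.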

This post is licensed under CC BY 4.0 by the author.