Introduction

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method widely used in information retrieval and text mining to evaluate how important a word is to a document within a collection. It combines two key concepts: term frequency (TF) and inverse document frequency (IDF).

Term Frequency (TF)

Definition: Term frequency is the number of times a word appears in a document. To reduce the bias introduced by differing document lengths, term frequency is usually normalized by the total number of words in the document.

Formula:
$$ \text{TF}(t, d) = \frac{\text{Number of occurrences of term } t \text{ in document } d}{\text{Total number of words in document } d} $$

For example, if the word apple appears 5 times in document $d$, and document $d$ has a total of 100 words, then the TF value of apple is:
$$ \text{TF}(apple, d) = \frac{5}{100} = 0.05 $$
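This calculation can be sketched in Python (the function name and toy document below are illustrative, not part of any standard library):

```python
from collections import Counter

def term_frequency(term, document_tokens):
    """Relative frequency of `term` in a tokenized document."""
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)

# A toy document of 100 tokens in which "apple" appears 5 times.
doc = ["apple"] * 5 + ["filler"] * 95
print(term_frequency("apple", doc))  # 0.05
```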

Inverse Document Frequency (IDF)

Definition: Inverse document frequency measures the importance of a word across the entire document collection. If a word appears in many documents, its IDF value will be lower; conversely, if a word appears in few documents, its IDF value will be higher.

Formula:
$$ \text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t + 1}\right) $$

Adding 1 to the denominator prevents division by zero when a term appears in no document in the collection.

For example, if the collection contains 1000 documents and the word apple appears in 100 of them, then (using a base-10 logarithm) the IDF value of apple is:
$$ \text{IDF}(apple) = \log\left(\frac{1000}{100 + 1}\right) = \log(9.90099) \approx 0.996 $$
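The same number can be reproduced in Python; the function name and the toy corpus are made up for the example, and the base-10 logarithm matches the value above:

```python
import math

def inverse_document_frequency(term, documents):
    """Base-10 IDF with the +1 smoothing used in the formula above."""
    containing = sum(1 for tokens in documents if term in tokens)
    return math.log10(len(documents) / (containing + 1))

# A toy collection: 100 of 1000 documents contain "apple".
corpus = [["apple", "pie"]] * 100 + [["pear", "tart"]] * 900
print(round(inverse_document_frequency("apple", corpus), 3))  # 0.996
```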

TF-IDF

Definition: TF-IDF is the product of term frequency and inverse document frequency, used to represent the importance of a word in a document.

Standard Formula:
$$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) $$

Formula Derivation

Combining the above two parts, the complete formula for TF-IDF is as follows:

$$ \text{TF-IDF}(t, d) = \left( \frac{\text{Number of occurrences of term } t \text{ in document } d}{\text{Total number of words in document } d} \right) \times \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t + 1}\right) $$

Example

Assume we have a document collection with 1000 documents. Consider the following scenario:

  • Document $d$ has 100 words.
  • The word apple appears 5 times in document $d$.
  • The word apple appears in 100 documents.

Calculate the TF-IDF value of apple in document $d$:

  • Calculate TF
    $$ \text{TF}(apple, d) = \frac{5}{100} = 0.05 $$
  • Calculate IDF
    $$ \text{IDF}(apple) = \log\left(\frac{1000}{100 + 1}\right) = \log(9.90099) \approx 0.996 $$
  • Calculate TF-IDF
    $$ \text{TF-IDF}(apple, d) = 0.05 \times 0.996 \approx 0.0498 $$
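The three steps above can be checked with a few lines of Python (base-10 logarithm, matching the IDF value used earlier):

```python
import math

tf = 5 / 100                        # term frequency of "apple" in d
idf = math.log10(1000 / (100 + 1))  # IDF with +1 in the denominator
tfidf = tf * idf
print(round(tfidf, 4))  # 0.0498
```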

Variants and Optimizations

Although the above is the standard TF-IDF formula, practical implementations often modify it:

  • Smoothing: To prevent IDF values from being too high, a small constant $k$ is sometimes added to the IDF formula:
    $$ \text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t + k}\right) $$
    A common value for $k$ is 1.
  • Logarithm base: Some implementations use natural logarithms (base $e$), while others use base 10 or base 2. The choice of base scales all IDF values by a constant factor but does not change the relative ordering of terms.
  • Term frequency normalization: Besides simple frequency, other methods can be used to normalize term frequency, such as square root normalization:
    $$ \text{TF}(t, d) = \sqrt{\frac{\text{Number of occurrences of term } t \text{ in document } d}{\text{Total number of words in document } d}} $$
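Two of these variants, the smoothing constant $k$ and square-root normalization, can be sketched as small helper functions (the names are illustrative):

```python
import math

def idf_smoothed(num_docs, docs_with_term, k=1):
    """IDF with a smoothing constant k added to the denominator."""
    return math.log10(num_docs / (docs_with_term + k))

def tf_sqrt(count, doc_length):
    """Square-root-normalized term frequency, which dampens high counts."""
    return math.sqrt(count / doc_length)

print(round(idf_smoothed(1000, 100), 3))  # 0.996
print(round(tf_sqrt(5, 100), 4))          # 0.2236
```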

Applications

TF-IDF is widely applied in various text analysis tasks, including but not limited to:

  • Information Retrieval: Improve the relevance of search results.
  • Text Classification: Identify the topic or category of a document.
  • Keyword Extraction: Extract important words from a document.
  • Document Similarity Calculation: Compare the similarity between different documents.
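As an illustration of the last application, here is a minimal pure-Python sketch of document similarity: each document is turned into a TF-IDF vector over a shared vocabulary, and vectors are compared with cosine similarity. The toy corpus and helper names are made up for the example:

```python
import math
from collections import Counter

def tfidf_vector(tokens, vocab, idf):
    """TF-IDF vector for one tokenized document over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[t] / len(tokens) * idf[t] for t in vocab]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "apple pie with apple sauce".split(),
    "apple tart and pear pie".split(),
    "stock market report today".split(),
    "daily stock report".split(),
]
vocab = sorted({t for d in docs for t in d})
n = len(docs)
idf = {t: math.log10(n / (sum(t in d for d in docs) + 1)) for t in vocab}
vecs = [tfidf_vector(d, vocab, idf) for d in docs]

# The two food-related documents score higher than an unrelated pair.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```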