Introduction
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method widely used in information retrieval and text mining to evaluate how important a word is to a document within a collection. It combines two key concepts: term frequency (TF) and inverse document frequency (IDF).
Term Frequency (TF)
Definition: Term frequency measures how often a word appears in a document. To prevent document length from skewing the result, the raw count is usually normalized by the total number of words in the document.
Formula:
$$ \text{TF}(t, d) = \frac{\text{Number of occurrences of term } t \text{ in document } d}{\text{Total number of words in document } d} $$
For example, if the word apple appears 5 times in document $d$, and document $d$ has a total of 100 words, then the TF value of apple is:
$$ \text{TF}(apple, d) = \frac{5}{100} = 0.05 $$
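As a quick illustration, here is a minimal Python sketch of this normalized term frequency (the pre-tokenized toy document is an assumption for the example, not part of the formula):

```python
from collections import Counter

def tf(term: str, document: list[str]) -> float:
    """Normalized term frequency: count of `term` divided by document length."""
    return Counter(document)[term] / len(document)

# Toy document: 100 words, 5 of which are "apple"
doc = ["apple"] * 5 + ["word"] * 95
print(tf("apple", doc))  # 0.05
```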
Inverse Document Frequency (IDF)
Definition: Inverse document frequency measures the importance of a word across the entire document collection. If a word appears in many documents, its IDF value will be lower; conversely, if a word appears in few documents, its IDF value will be higher.
Formula:
$$ \text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t + 1}\right) $$
Adding 1 to the denominator prevents division by zero when a term appears in no documents in the collection.
For example, if the total number of documents is 1000 and the word apple appears in 100 documents, then the IDF value of apple is:
$$ \text{IDF}(apple) = \log\left(\frac{1000}{100 + 1}\right) = \log(9.90099) \approx 0.996 $$
(Here the base-10 logarithm is used; see the note on logarithm base under Variants and Optimizations.)
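The same arithmetic can be checked with a short Python sketch (base-10 logarithm to match the example; the corpus counts are passed in directly):

```python
import math

def idf(total_docs: int, docs_containing_term: int) -> float:
    """Smoothed inverse document frequency with a base-10 logarithm."""
    return math.log10(total_docs / (docs_containing_term + 1))

print(idf(1000, 100))  # ≈ 0.9957
```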
TF-IDF
Definition: TF-IDF is the product of term frequency and inverse document frequency, used to represent the importance of a word in a document.
Standard Formula:
$$ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) $$
Formula Derivation
Combining the above two parts, the complete formula for TF-IDF is as follows:
$$ \text{TF-IDF}(t, d) = \left( \frac{\text{Number of occurrences of term } t \text{ in document } d}{\text{Total number of words in document } d} \right) \times \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t + 1}\right) $$
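Putting the two pieces together, a self-contained sketch might look like the following (the tiny corpus and pre-tokenized documents are illustrative assumptions):

```python
import math
from collections import Counter

def tf_idf(term: str, document: list[str], corpus: list[list[str]]) -> float:
    """TF-IDF: normalized term frequency times smoothed, base-10 IDF."""
    tf = Counter(document)[term] / len(document)
    df = sum(1 for doc in corpus if term in doc)
    return tf * math.log10(len(corpus) / (df + 1))

corpus = [
    ["apple", "pie"],
    ["banana", "bread"],
    ["cherry", "tart"],
    ["plum", "jam"],
]
# TF = 1/2, IDF = log10(4 / (1 + 1)) ≈ 0.301, so TF-IDF ≈ 0.1505
print(tf_idf("apple", corpus[0], corpus))
```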
Example
Assume we have a document collection with 1000 documents. Consider the following scenario:
- Document $d$ has 100 words.
- The word apple appears 5 times in document $d$.
- The word apple appears in 100 documents.
Calculate the TF-IDF value of apple in document $d$:
- Calculate TF:
$$ \text{TF}(apple, d) = \frac{5}{100} = 0.05 $$
- Calculate IDF:
$$ \text{IDF}(apple) = \log\left(\frac{1000}{100 + 1}\right) = \log(9.90099) \approx 0.996 $$
- Calculate TF-IDF:
$$ \text{TF-IDF}(apple, d) = 0.05 \times 0.996 \approx 0.0498 $$
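The three steps above reduce to a couple of lines of Python, which reproduce the result:

```python
import math

tf = 5 / 100                        # 0.05
idf = math.log10(1000 / (100 + 1))  # ≈ 0.9957
print(tf * idf)                     # ≈ 0.0498
```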
Variants and Optimizations
Although the above is the standard TF-IDF formula, several variants and optimizations are common in practice:
- Smoothing: To prevent IDF values from being too high, a small constant $k$ is sometimes added to the denominator of the IDF formula:
$$ \text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t + k}\right) $$
A common value for $k$ is 1.
- Logarithm base: Some implementations use the natural logarithm (base $e$), while others use base 10 or base 2. The choice of base rescales the IDF values but does not change the relative ordering of terms.
- Term frequency normalization: Besides the simple relative frequency, other normalizations can be applied, such as square-root normalization (all three variants appear in the sketch after this list):
$$ \text{TF}(t, d) = \sqrt{\frac{\text{Number of occurrences of term } t \text{ in document } d}{\text{Total number of words in document } d}} $$
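These variants are easy to expose as parameters. The sketch below folds all three into one function; the parameter names and defaults are illustrative assumptions, not a standard API:

```python
import math
from collections import Counter

def tf_idf_variant(
    term: str,
    document: list[str],
    corpus: list[list[str]],
    k: float = 1.0,         # smoothing constant added to the document frequency
    base: float = 10.0,     # logarithm base: 10, 2, or math.e
    sqrt_tf: bool = False,  # apply square-root normalization to the term frequency
) -> float:
    tf = Counter(document)[term] / len(document)
    if sqrt_tf:
        tf = math.sqrt(tf)
    df = sum(1 for doc in corpus if term in doc)
    return tf * math.log(len(corpus) / (df + k), base)

corpus = [["apple", "pie"], ["banana", "bread"], ["cherry", "tart"], ["plum", "jam"]]
print(tf_idf_variant("apple", corpus[0], corpus))                # base-10, k = 1
print(tf_idf_variant("apple", corpus[0], corpus, base=math.e))   # natural logarithm
print(tf_idf_variant("apple", corpus[0], corpus, sqrt_tf=True))  # square-root TF
```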
Applications
TF-IDF is widely applied in various text analysis tasks, including but not limited to:
- Information Retrieval: Improve the relevance of search results.
- Text Classification: Identify the topic or category of a document.
- Keyword Extraction: Extract important words from a document.
- Document Similarity Calculation: Compare the similarity between different documents (see the sketch below).
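For document similarity in particular, a common approach is to build TF-IDF vectors and compare them with cosine similarity. The sketch below uses scikit-learn; note that its TfidfVectorizer implements a slightly different variant than the formulas above (natural logarithm, +1 smoothing on both IDF terms, L2-normalized rows), so the absolute values differ even though the rankings are usually similar:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the apple pie recipe uses fresh apple",
    "banana bread is easy to bake at home",
    "this apple tart pairs apple with cinnamon",
]

# Build L2-normalized TF-IDF vectors, one row per document
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# Pairwise cosine similarity; the two "apple" documents score highest together
print(cosine_similarity(tfidf_matrix))
```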
