Conclusion First
The TFIDF value is not only influenced by the frequency of a word in the current document (i.e., term frequency TF), but also by its distribution across the entire document collection (i.e., inverse document frequency IDF). Specifically, the TFIDF value consists of two parts:
- Term Frequency (TF): The frequency of a word in the current document.
- Inverse Document Frequency (IDF): The scarcity of a word across the entire document collection.
Term Frequency (TF)
Definition: The ratio of the number of times a word appears in the document to the total number of words in the document.
Formula:
$$
\mathrm{TF}(t) = \frac{\text{Occurrences of term } t \text{ in the document}}{\text{Total number of words in the document}}
$$
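As a minimal sketch of this ratio, assuming the document has already been split into a list of tokens (the function name `term_frequency` and the sample sentence are illustrative, not part of the original text):

```python
from collections import Counter

def term_frequency(tokens):
    """Return {term: occurrences / total tokens} for one tokenized document."""
    total = len(tokens)
    if total == 0:
        return {}  # an empty document has no term frequencies
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}

# Hypothetical sample document, split on whitespace.
doc = "the cat sat on the mat".split()
tf = term_frequency(doc)
print(tf["the"])  # "the" accounts for 2 of the 6 tokens
```

Because every token is counted exactly once, the TF values of a document sum to 1.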
Inverse Document Frequency (IDF)
Definition: Measures the importance of a word across the entire document collection. If a word appears in many documents, its IDF value will be low; conversely, if a word appears in few documents, its IDF value will be high.
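Formula (the +1 in the denominator avoids division by zero for a term that appears in no document; the worked example below uses the base-10 logarithm):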
$$
\mathrm{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t + 1}\right)
$$
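The formula can be sketched as follows, assuming each document is represented as a set of its distinct terms and that the logarithm is base 10. The function name `inverse_document_frequency` and the four-document corpus are made up for illustration:

```python
import math

def inverse_document_frequency(term, documents):
    """log10(N / (df + 1)), where df counts the documents containing term."""
    n_docs = len(documents)
    if n_docs == 0:
        raise ValueError("IDF is undefined for an empty collection")
    df = sum(1 for doc in documents if term in doc)
    return math.log10(n_docs / (df + 1))

# Hypothetical collection: each document is a set of its distinct terms.
corpus = [
    {"the", "cat", "sat"},
    {"the", "dog", "ran"},
    {"the", "bird", "flew"},
    {"a", "cat", "slept"},
]
idf_the = inverse_document_frequency("the", corpus)  # in 3 of 4 documents
idf_dog = inverse_document_frequency("dog", corpus)  # in 1 of 4 documents
print(idf_the, idf_dog)
```

Under this particular formula, a term that appears in every document gets a slightly negative IDF, because the denominator then exceeds the number of documents.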
TFIDF Value
Definition: Combines the term frequency and inverse document frequency to calculate the importance of a word in the current document.
Formula:
$$
\mathrm{TFIDF}(t) = \mathrm{TF}(t) \times \mathrm{IDF}(t)
$$
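Putting the two parts together, the sketch below scores every term of one document against a small collection and ranks the terms. It assumes whitespace tokenization and a base-10 logarithm; `tfidf_scores` and the toy corpus are hypothetical names and data chosen only for illustration:

```python
import math
from collections import Counter

def tfidf_scores(doc_tokens, corpus_tokens):
    """Return {term: TF * IDF} for each term of doc_tokens."""
    n_docs = len(corpus_tokens)
    doc_sets = [set(tokens) for tokens in corpus_tokens]
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    scores = {}
    for term, count in counts.items():
        tf = count / total                            # share of the document
        df = sum(1 for s in doc_sets if term in s)    # documents containing term
        idf = math.log10(n_docs / (df + 1))           # scarcity in the collection
        scores[term] = tf * idf
    return scores

# Hypothetical four-document collection.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the bird flew over the house".split(),
    "the fish swam in the pond".split(),
]
scores = tfidf_scores(corpus[0], corpus)
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for term, score in ranked:
    print(f"{term}: {score:.4f}")
```

In this toy collection "the" is the most frequent word of the first document, yet it ranks last because it occurs in every document, while "cat" and "mat" rank first.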
Example
Assume we have two terms a and b, with their occurrence counts in the current document as follows:
- Term a appears 10 times.
- Term b appears 1000 times.
However, their distribution across the entire document collection differs:
- Term a appears in most documents (e.g., 80% of the documents).
- Term b appears in few documents (e.g., 5% of the documents).
Calculation Process
Assume there are 1000 total documents:
IDF for term a:
$$
\mathrm{IDF}(a) = \log\left(\frac{1000}{0.8 \times 1000 + 1}\right) \approx \log(1.25) \approx 0.0969
$$
IDF for term b:
$$
\mathrm{IDF}(b) = \log\left(\frac{1000}{0.05 \times 1000 + 1}\right) \approx \log(19.6078) \approx 1.292
$$
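The two IDF values can be checked numerically. This sketch assumes the base-10 logarithm, which is what the figures above correspond to. The variable names are illustrative:

```python
import math

n_docs = 1000
df_a = 800  # term a: 80% of 1000 documents
df_b = 50   # term b: 5% of 1000 documents

idf_a = math.log10(n_docs / (df_a + 1))
idf_b = math.log10(n_docs / (df_b + 1))

print(round(idf_a, 4), round(idf_b, 4))
```

Without rounding 1000/801 to 1.25, IDF(a) comes out at about 0.0964, marginally below the 0.0969 obtained by hand. IDF(b) matches at about 1.292.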
Assume the current document has 10,000 words:
TF for term a:
$$
\mathrm{TF}(a) = \frac{10}{10000} = 0.001
$$
TF for term b:
$$
\mathrm{TF}(b) = \frac{1000}{10000} = 0.1
$$
Calculating TFIDF Values
TFIDF for term a:
$$
\mathrm{TFIDF}(a) = 0.001 \times 0.0969 \approx 0.0000969
$$
TFIDF for term b:
$$
\mathrm{TFIDF}(b) = 0.1 \times 1.292 \approx 0.1292
$$
Conclusion
Term a appears only 10 times in the current document and also occurs in most documents of the collection, so both its TF and its IDF are low, and its TFIDF value is very small. Term b appears many times in the current document (1000 times) and occurs in few documents, so its high TF is multiplied by a high IDF, giving a much higher TFIDF value.
This demonstrates that the TFIDF value depends not only on the frequency of a word in the current document but also on its scarcity across the entire document collection. A word appearing frequently in the current document does not necessarily have a high TFIDF value; its distribution across the document collection must also be considered.
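Therefore, the choice of corpus matters a great deal when training models: the same term receives a different IDF, and hence a different TFIDF value, when the document collection changes.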
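The point that a frequent word does not automatically score highly can be illustrated with two further, purely hypothetical terms (they are not the terms a and b of the example). The sketch assumes a 1000-word document, a collection of 1000 documents, and the base-10 logarithm. The helper `tfidf` and all counts are made up for illustration:

```python
import math

def tfidf(count_in_doc, doc_length, docs_with_term, n_docs):
    """TF * IDF with IDF = log10(N / (df + 1))."""
    tf = count_in_doc / doc_length
    idf = math.log10(n_docs / (docs_with_term + 1))
    return tf * idf

# Hypothetical frequent word that occurs in 95% of the collection.
common = tfidf(count_in_doc=50, doc_length=1000, docs_with_term=950, n_docs=1000)
# Hypothetical word used ten times less often, but found in only 9 documents.
rare = tfidf(count_in_doc=5, doc_length=1000, docs_with_term=9, n_docs=1000)

print(f"common: {common:.5f}  rare: {rare:.5f}")
```

Here the rarer word ends up with the higher score even though it appears ten times less often in the document, because its IDF is far larger.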
