In the current document, word A appears 1000 times and word B appears 10 times. Is the confidence level of word A necessarily higher than that of word B?

Conclusion First

Not necessarily. The TFIDF value depends not only on how often a word occurs in the current document (term frequency, TF) but also on how the word is distributed across the entire document collection (inverse document frequency, IDF). Specifically, the TFIDF value is the product of two parts:

  1. Term Frequency (TF): The frequency of a word in the current document.
  2. Inverse Document Frequency (IDF): The scarcity of a word across the entire document collection.

Term Frequency (TF)

Definition: The ratio of the number of times term \(t\) appears in the document to the total number of words in the document.

Formula:

$$\mathrm{TF}(t) = \frac{\text{Number of occurrences of term } t \text{ in the document}}{\text{Total number of words in the document}}$$

Inverse Document Frequency (IDF)

Definition: Measures the importance of a word across the entire document collection. If a word appears in many documents, its IDF value will be low; conversely, if a word appears in few documents, its IDF value will be high.

Formula:

$$\mathrm{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t + 1}\right)$$

TFIDF Value

Definition: Combines term frequency and inverse document frequency to calculate the importance of a word in the current document.

Formula:

$$\mathrm{TFIDF}(t) = \mathrm{TF}(t) \times \mathrm{IDF}(t)$$
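
The three formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the toy corpus, the function names, and the choice of base-10 logarithm are all assumptions made for the example, and documents are assumed to be non-empty lists of tokens.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Occurrences of `term` divided by the total number of words in the document."""
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    """log10(N / (df + 1)), where df is the number of documents containing `term`."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / (df + 1))


def tfidf(term, doc_tokens, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
    "a quantum computer".split(),
]
doc = corpus[0]

for term in ("the", "mat"):
    print(term, tf(term, doc), idf(term, corpus), tfidf(term, doc, corpus))
```

In this toy corpus "the" occurs twice in the first document but is found in three of the four documents, so its IDF is log10(4/4) = 0 and its TFIDF is 0. "mat" occurs only once but is found in a single document, so it ends up with the higher TFIDF value.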

Example

Assume we have two terms \(a\) and \(b\), with their occurrence counts in the current document as follows:

  • Term \(a\) appears 10 times.
  • Term \(b\) appears 1000 times.

However, their distribution across the entire document collection differs:

  • Term \(a\) appears in most documents (e.g., 80% of the documents).
  • Term \(b\) appears in few documents (e.g., 5% of the documents).

Calculation Process

Assume there are 1000 documents in total and that \(\log\) is the base-10 logarithm:

IDF for term \(a\):

$$\mathrm{IDF}(a) = \log\left(\frac{1000}{0.8 \times 1000 + 1}\right) = \log\left(\frac{1000}{801}\right) \approx \log(1.2484) \approx 0.0964$$

IDF for term \(b\):

$$\mathrm{IDF}(b) = \log\left(\frac{1000}{0.05 \times 1000 + 1}\right) = \log\left(\frac{1000}{51}\right) \approx \log(19.6078) \approx 1.292$$

Assume the current document has 10,000 words, so that both terms fit within it:

TF for term \(a\):

$$\mathrm{TF}(a) = \frac{10}{10000} = 0.001$$

TF for term \(b\):

$$\mathrm{TF}(b) = \frac{1000}{10000} = 0.1$$

Calculating TFIDF Values

TFIDF for term \(a\):

$$\mathrm{TFIDF}(a) = 0.001 \times 0.0964 \approx 0.0000964$$

TFIDF for term \(b\):

$$\mathrm{TFIDF}(b) = 0.1 \times 1.292 \approx 0.1292$$
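
The arithmetic of this example can be checked with a short script. It is a sketch under stated assumptions: logarithms are base 10 (which matches the IDF values above), the document length of 10,000 words is a hypothetical figure chosen so that both term counts fit in one document, and all names are illustrative.

```python
import math


def tf(count, doc_length):
    """Term frequency: occurrences divided by document length."""
    return count / doc_length


def idf(total_docs, docs_with_term):
    """Inverse document frequency with +1 in the denominator, base-10 log."""
    return math.log10(total_docs / (docs_with_term + 1))


TOTAL_DOCS = 1000
DOC_LENGTH = 10_000  # assumption: long enough to contain both terms

# Counts in the current document and number of documents containing each term.
terms = {
    "a": {"count": 10, "docs_with_term": 800},   # 80% of 1000 documents
    "b": {"count": 1000, "docs_with_term": 50},  # 5% of 1000 documents
}

scores = {}
for name, t in terms.items():
    term_tf = tf(t["count"], DOC_LENGTH)
    term_idf = idf(TOTAL_DOCS, t["docs_with_term"])
    scores[name] = term_tf * term_idf
    print(f"{name}: TF={term_tf:.4f} IDF={term_idf:.4f} TFIDF={scores[name]:.6f}")
```

Changing `DOC_LENGTH` rescales both TFIDF values by the same factor, so the ratio between them depends only on the counts and the IDF values.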

Conclusion

  • Term \(a\) appears in most documents, so its IDF value is low; combined with its low count in the current document (10 times), this gives a very small TFIDF value. Term \(b\) appears many times (1000 times) and is also scarce across the document collection, so its high IDF value raises its TFIDF value further. As a result, the two TFIDF values differ by a factor of more than 1,000, while the raw counts differ only by a factor of 100.

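  • This shows that the TFIDF value depends not only on the frequency of a word in the current document but also on its scarcity across the entire document collection. A word appearing frequently in the current document does not necessarily have a high TFIDF value; its distribution across the document collection is equally important. For example, under the IDF formula above, a word found in 999 of 1000 documents has \(\mathrm{IDF} = \log(1000/1000) = 0\), so its TFIDF value is 0 however often it occurs in the current document.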

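  • Therefore, the corpus used for training models is crucial: the same word in the same document can receive a very different weight depending on the document collection used to compute IDF.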
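
To make the dependence on the corpus concrete, the sketch below scores word A (1000 occurrences) and word B (10 occurrences) from the opening question against two hypothetical corpora. The document length, the corpus size, and the document counts are all assumed values chosen for illustration, and base-10 logarithms are used.

```python
import math


def tfidf(count, doc_length, total_docs, docs_with_term):
    """TF * IDF, with IDF = log10(total_docs / (docs_with_term + 1))."""
    tf = count / doc_length
    idf = math.log10(total_docs / (docs_with_term + 1))
    return tf * idf


DOC_LENGTH = 10_000  # hypothetical length of the current document
TOTAL_DOCS = 1000    # hypothetical corpus size


def score_pair(docs_with_a, docs_with_b):
    """TFIDF of word A (1000 occurrences) and word B (10 occurrences)."""
    return (
        tfidf(1000, DOC_LENGTH, TOTAL_DOCS, docs_with_a),
        tfidf(10, DOC_LENGTH, TOTAL_DOCS, docs_with_b),
    )


# Corpus 1: A is rare (50 documents), B is common (800 documents).
a1, b1 = score_pair(50, 800)

# Corpus 2: A is found in 999 of 1000 documents, B in only 50.
# A's IDF is log10(1000 / 1000) = 0, so its TFIDF is 0.
a2, b2 = score_pair(999, 50)

print("corpus 1: A > B?", a1 > b1)
print("corpus 2: A > B?", a2 > b2)
```

The counts in the current document are identical in both runs; only the corpus changes, and with it the ranking of the two words.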
