Explanation of term frequency-inverse document frequency (tf-idf).

For feature extraction we used scikit-learn's tf-idf vectorizer. It is a count vectorizer combined with idf. The count vectorizer measures term frequency (tf), i.e. how often a word appears in a title. If we do this for the following sentences, we produce the matrix below.

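Title 1: The dog jumped over the fence
Title 2: The cat chased the dog
Title 3: The white cat chased the brown cat who jumped over the orange cat

| | the | dog | jumped | over | fence | cat | chased | white | brown | who | orange |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Title 1 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Title 2 | 2 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| Title 3 | 3 | 0 | 1 | 1 | 0 | 3 | 1 | 1 | 1 | 1 | 1 |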
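The counting step described above can be sketched in plain Python. This is only an illustration, not how scikit-learn implements its vectorizer: the lowercase-and-split tokenizer and the names `titles`, `vocabulary`, and `count_matrix` are assumptions made for this example.

```python
from collections import Counter

titles = [
    "The dog jumped over the fence",
    "The cat chased the dog",
    "The white cat chased the brown cat who jumped over the orange cat",
]

# Assumed tokenizer: lowercase and split on whitespace.
# A real vectorizer tokenizes more carefully (punctuation, short tokens, etc.).
tokenized = [title.lower().split() for title in titles]

# Vocabulary in order of first appearance across the titles.
vocabulary = list(dict.fromkeys(word for words in tokenized for word in words))

# One row of raw term counts per title, one column per vocabulary word.
count_matrix = []
for words in tokenized:
    counts = Counter(words)
    count_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each printed row corresponds to one title, and each position in a row corresponds to the word at the same position in `vocabulary`.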

The downside of using tf alone is that the words that appear most often tend to dominate the vector. To overcome this, we combine term frequency with inverse document frequency (tf-idf). Idf is a measure of whether a term is common or rare across all documents [Side note 2]. It is the log of one plus the number of documents (N) divided by the number of documents in which the term t appears (n_t). The one is present so that the idf of a term that appears in every document does not evaluate to zero.

\begin{equation*} idf(t, D) = \log\left(1 + \frac{N}{n_t}\right) \end{equation*}
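As a small sketch of the idf formula above, the function below evaluates it directly. The function name `idf` and its parameter names are made up for this example, and the base-10 logarithm is an assumption (the text does not state the base). It follows the formula as written here, which is not necessarily the exact variant a given library applies.

```python
import math


def idf(n_documents: int, n_containing: int) -> float:
    """Inverse document frequency as log10(1 + N / n_t).

    n_documents  -- N, the total number of documents
    n_containing -- n_t, the number of documents that contain the term
    """
    if n_containing <= 0:
        raise ValueError("the term must appear in at least one document")
    # Assumption: base-10 logarithm.
    return math.log10(1 + n_documents / n_containing)


# A term found in every document still gets a positive weight,
# because the argument of the log is 2 rather than 1.
print(idf(3, 3))

# A rarer term (one document out of three) gets a larger weight.
print(idf(3, 1))
```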

Essentially, tf-idf creates a word vector in which each word is weighted by its occurrence not only in the title it came from but also across the entire group of titles (the corpus). Tf-idf is calculated with the following formula:

t = term, d = single title, D = all titles

\begin{equation*} tfidf(t,d,D) = tf(t,d)\cdot idf(t, D) \end{equation*}

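Below is the workflow for calculating tf-idf for the term "cat" in the above titles. Here tf is the number of times the term appears in a title divided by the total number of words in that title, and the log is taken to base 10.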

\begin{equation*} tf(\text{"cat"}, d_1) = \frac{0}{6} = 0 \end{equation*}
\begin{equation*} tf(\text{"cat"}, d_2) = \frac{1}{5} = 0.200 \end{equation*}
\begin{equation*} tf(\text{"cat"}, d_3) = \frac{3}{13} \approx 0.231 \end{equation*}
\begin{equation*} idf(\text{"cat"}, D) = \log\left(1 + \frac{3}{2}\right) \approx 0.4 \end{equation*}
\begin{equation*} tfidf(\text{"cat"}, d_1) = tf(\text{"cat"}, d_1) \times idf(\text{"cat"}, D) = 0 \times 0.4 = 0 \end{equation*}
\begin{equation*} tfidf(\text{"cat"}, d_2) = tf(\text{"cat"}, d_2) \times idf(\text{"cat"}, D) \approx 0.200 \times 0.4 = 0.08 \end{equation*}
\begin{equation*} tfidf(\text{"cat"}, d_3) = tf(\text{"cat"}, d_3) \times idf(\text{"cat"}, D) \approx 0.231 \times 0.4 \approx 0.092 \end{equation*}
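The same workflow can be reproduced with a short script. This is a sketch under stated assumptions rather than a reference implementation: tf is taken as the term count divided by the title length, the logarithm is base 10 (which is what gives idf ≈ 0.4 for "cat"), titles are tokenized by lowercasing and splitting on whitespace, and the helper names `tf`, `idf`, and `tfidf` are chosen for this example only.

```python
import math

titles = [
    "The dog jumped over the fence",
    "The cat chased the dog",
    "The white cat chased the brown cat who jumped over the orange cat",
]

# Assumed tokenizer: lowercase and split on whitespace.
corpus = [title.lower().split() for title in titles]


def tf(term, words):
    """Term count divided by the number of words in the title."""
    return words.count(term) / len(words)


def idf(term, corpus):
    """log10(1 + N / n_t); returns 0.0 if the term appears in no title."""
    n_t = sum(term in words for words in corpus)
    if n_t == 0:
        return 0.0
    return math.log10(1 + len(corpus) / n_t)


def tfidf(term, words, corpus):
    """tf-idf of a term in one title, relative to the whole corpus."""
    return tf(term, words) * idf(term, corpus)


for number, words in enumerate(corpus, start=1):
    print(f"Title {number}: tf-idf('cat') = {tfidf('cat', words, corpus):.4f}")
```

Because the script does not round tf and idf before multiplying, its output can differ in the last digits from values computed by hand with rounded intermediates.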
