## Explanation of term frequency-inverse document frequency (tf-idf)

For feature extraction we used scikit-learn's tf-idf vectorizer, which combines a count vectorizer with inverse document frequency (idf) weighting. The count vectorizer measures term frequency (tf), i.e. how often a word appears in a title. Applying it to the following three titles produces the matrix below.

##### Title 1: The dog jumped over the fence

##### Title 2: The cat chased the dog

##### Title 3: The white cat chased the brown cat who jumped over the orange cat

| | the | dog | jumped | over | fence | cat | chased | white | brown | who | orange |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Title 1 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Title 2 | 2 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| Title 3 | 3 | 0 | 1 | 1 | 0 | 3 | 1 | 1 | 1 | 1 | 1 |
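As a rough illustration of what a count vectorizer does, the sketch below rebuilds a count matrix for the three titles using only the Python standard library. It is not scikit-learn's implementation: the whitespace tokenizer and the names `titles`, `tokenized`, `vocabulary`, and `count_matrix` are assumptions made for this example, and the columns are kept in order of first appearance so they line up with the table.

```python
from collections import Counter

titles = [
    "The dog jumped over the fence",
    "The cat chased the dog",
    "The white cat chased the brown cat who jumped over the orange cat",
]

# Lowercase and split on whitespace (a simplification of a real tokenizer).
tokenized = [title.lower().split() for title in titles]

# Collect the vocabulary in order of first appearance across the titles.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# One row of raw term counts per title, one column per vocabulary word.
count_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    count_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each printed row is the raw term-frequency vector for one title.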

The downside of using tf alone is that the words that appear most often (such as "the") tend to dominate the vector. To overcome this we use term frequency-inverse document frequency (tf-idf). Idf is a measure of whether a term is common or rare across all documents [Side note 2]. It is the log of one plus the number of documents (N) divided by the number of documents that contain the term (n_t). The one is present so that the expression does not evaluate to zero for a term that appears in every document.

\begin{equation*} idf(t, D) = \log\left(1 + \frac{N}{n_t}\right) \end{equation*}

Essentially, tf-idf creates a word vector in which each word is weighted by its occurrence not only in the title it came from but also across the entire group of titles (the corpus). Tf-idf is calculated by the following formula:

t = term, d = single title, D = all titles

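\begin{equation*} tfidf(t,d,D) = tf(t,d)\cdot idf(t, D) \end{equation*}

Below is the workflow for calculating tf-idf for the term "cat" in the titles above. Here tf is the number of times the term appears in a title divided by the total number of words in that title.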

\begin{equation*} tf(\text{"cat"}, d_1) = \frac{0}{6} = 0 \end{equation*}

\begin{equation*} tf(\text{"cat"}, d_2) = \frac{1}{5} = 0.200 \end{equation*}

\begin{equation*} tf(\text{"cat"}, d_3) = \frac{3}{13} \approx 0.231 \end{equation*}

\begin{equation*} idf(\text{"cat"}, D) = \log_{10}\left(1 + \frac{3}{2}\right) \approx 0.4 \end{equation*}

\begin{equation*} tfidf(\text{"cat"}, d_1) = tf(\text{"cat"}, d_1) \times idf(\text{"cat"}, D) = 0 \times 0.4 = 0 \end{equation*}

\begin{equation*} tfidf(\text{"cat"}, d_2) = tf(\text{"cat"}, d_2) \times idf(\text{"cat"}, D) = 0.200 \times 0.4 = 0.08 \end{equation*}

\begin{equation*} tfidf(\text{"cat"}, d_3) = tf(\text{"cat"}, d_3) \times idf(\text{"cat"}, D) = 0.231 \times 0.4 \approx 0.092 \end{equation*}
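The calculation for "cat" can also be sketched in a few lines of Python, working directly from the title text. This is a minimal illustration of the formulas in this section, assuming whitespace tokenization, tf as count divided by title length, and a base-10 logarithm; the helper names `tf`, `idf`, and `tfidf` are made up for this example and are not scikit-learn functions.

```python
import math

titles = [
    "The dog jumped over the fence",
    "The cat chased the dog",
    "The white cat chased the brown cat who jumped over the orange cat",
]

# Lowercase and split on whitespace (a simplification of a real tokenizer).
tokenized = [title.lower().split() for title in titles]


def tf(term, tokens):
    """Term count divided by the number of words in the title."""
    return tokens.count(term) / len(tokens)


def idf(term, docs):
    """log10(1 + N / n_t); assumes the term occurs in at least one title."""
    n_t = sum(1 for tokens in docs if term in tokens)
    return math.log10(1 + len(docs) / n_t)


def tfidf(term, tokens, docs):
    """Product of the title-level tf and the corpus-level idf."""
    return tf(term, tokens) * idf(term, docs)


for i, tokens in enumerate(tokenized, start=1):
    score = tfidf("cat", tokens, tokenized)
    print(f"Title {i}: tf-idf('cat') = {score:.3f}")
```

scikit-learn's `TfidfVectorizer` applies its own idf smoothing and vector normalization by default, so its output will not match these hand-computed values exactly.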