Concept Frequency-Inverse Concept Document Frequency: Analyzing Concepts in Text

16 minute read | Updated:

* [Intro](#Intro)
* [Word Embeddings and Semantic Similarity](#Word-Embeddings-and-Semantic-Similarity)
* [Calculating CFIDF](#Calculating-CFIDF)
* [Visualizing CFIDF](#Visualizing-CFIDF)
* [What it's not](#What-it's-not)
* [What's Next](#What's-Next)
# Intro

<b>C</b>oncept <b>F</b>requency-<b>I</b>nverse Concept <b>D</b>ocument <b>F</b>requency (or CFIDF) is a measure I've created for exploring and visualizing text data, and it works surprisingly well. Unlike its statistical parent TF-IDF, it isn't based strictly on word counts but on *entire concepts*, assuming that your word embedding layer is trained appropriately. First, it may help to explain what TF-IDF accomplishes.

Much of NLP is about figuring out what a body of text is about. TF-IDF helps answer this question by measuring how unique a word is to a document compared to the other documents in the corpus. If you can measure how important a set of words is to a document, then you're one step closer to answering what the document is about. One problem is that synonyms, commonly misspelled words, and conceptually similar words or phrases are all counted separately, which dilutes the impact any one of them has on a document. This is where CFIDF comes in.

# Word Embeddings and Semantic Similarity

Word2Vec, which is the word embedding algorithm I use here (but you can use any word embedding algorithm), produces measures of semantic similarity. This allows us to determine whether two words are semantically (or conceptually) similar, regardless of whether they are synonyms, common misspellings, or different parts of speech entirely. Thus, with CFIDF, we can calculate how *frequently a concept* shows up in a document compared to how frequently it shows up in the other documents in a corpus.
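
To make "semantic similarity" concrete, here is a minimal sketch of the cosine similarity that word-embedding libraries typically report (gensim, for example, exposes it as `model.wv.similarity(w1, w2)`). The three-dimensional vectors below are invented for illustration; they are not taken from my trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings; a real Word2Vec model would usually
# have a hundred or more dimensions per word.
embeddings = {
    "money": [0.9, 0.1, 0.0],
    "cash":  [0.8, 0.2, 0.1],
    "mom":   [0.0, 0.2, 0.9],
}

sim_cash = cosine_similarity(embeddings["money"], embeddings["cash"])
sim_mom = cosine_similarity(embeddings["money"], embeddings["mom"])

# "cash" points in nearly the same direction as "money"; "mom" does not.
print(f"money ~ cash: {sim_cash:.2f}")
print(f"money ~ mom:  {sim_mom:.2f}")
```

With vectors like these, *cash* would count toward the `money` concept under any reasonable threshold, while *mom* would not.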

# Calculating CFIDF

Calculating CFIDF is pretty straightforward and almost exactly like TF-IDF, but instead of term frequency you use *concept frequency*. To do this, you set a user-defined threshold on the similarity queries you make against your word embeddings. If a word in the document has a similarity to the concept's target word above this threshold, we count it as conceptually similar, which increases the concept frequency (it may help to set the threshold for each concept/target word individually).

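Note that I compute concept frequency by dividing the number of times a concept appears by the total number of *terms* in the document. Since we can't easily count the number of concepts present, this will have to do as a way to account for document length. In the formula below, $$f_{c,d}$$ is the number of words in document $$d$$ that match concept $$c$$, and the denominator sums the counts of all terms $$t'$$ in $$d$$.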

$$
\mathrm{cf}(c, d) = \frac{f_{c,d}}{\sum_{t' \in d} f_{t',d}}
$$
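
The thresholding and normalization described above can be sketched as follows. This is an illustration under assumptions rather than my exact code: the toy vectors, the `concept_frequency` helper, and the 0.7 threshold are all made up here, and words missing from the embedding vocabulary are assumed to count toward document length but never as concept hits.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def concept_frequency(tokens, concept, embeddings, threshold):
    """cf(c, d): concept hits divided by the total number of terms in d."""
    if not tokens:
        return 0.0
    target = embeddings[concept]
    hits = sum(
        1
        for token in tokens
        # Out-of-vocabulary tokens are skipped as hits (an assumption).
        if token in embeddings
        and cosine_similarity(embeddings[token], target) > threshold
    )
    return hits / len(tokens)

# Hypothetical toy embeddings, for illustration only.
embeddings = {
    "money": [0.9, 0.1, 0.0],
    "cash":  [0.8, 0.2, 0.1],
    "mom":   [0.0, 0.2, 0.9],
}

document = ["cash", "money", "mom", "the", "cash"]
cf_money = concept_frequency(document, "money", embeddings, threshold=0.7)
print(cf_money)  # 3 of the 5 tokens match the `money` concept
```

Passing a different `threshold` per concept is one way to handle the per-concept tuning mentioned earlier.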

I calculate inverse concept document frequency by using the following formula:
$$
 \mathrm{idf}(c, D) =  \log \frac{N}{1+|\{d \in D: c \in d\}|}
$$

Here $$N = |D|$$ is the total number of documents in the corpus, and $$|\{d \in D: c \in d\}|$$ is the number of documents in which the concept $$c$$ appears. I add the constant 1 to the denominator to avoid division by zero.
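
Putting the two pieces together, here is a minimal sketch of the full score. It assumes, by analogy with TF-IDF, that CFIDF is the product of concept frequency and inverse concept document frequency, and that the logarithm is natural; the function names and the four-document corpus are hypothetical.

```python
import math

def inverse_concept_document_frequency(concept_counts):
    """idf(c, D) = log(N / (1 + number of documents containing the concept)).

    concept_counts holds one entry per document: the number of concept hits.
    """
    n_docs = len(concept_counts)
    if n_docs == 0:
        raise ValueError("the corpus must contain at least one document")
    docs_with_concept = sum(1 for count in concept_counts if count > 0)
    return math.log(n_docs / (1 + docs_with_concept))

def cfidf_scores(concept_counts, doc_lengths):
    """One CFIDF score per document: cf(c, d) * idf(c, D)."""
    idf = inverse_concept_document_frequency(concept_counts)
    return [
        (count / length if length else 0.0) * idf
        for count, length in zip(concept_counts, doc_lengths)
    ]

# Hypothetical corpus: the concept shows up 3 times in the first
# document (10 terms long) and nowhere else.
concept_counts = [3, 0, 0, 0]
doc_lengths = [10, 12, 8, 20]
scores = cfidf_scores(concept_counts, doc_lengths)
print(scores)
```

One consequence of the added 1 is worth knowing: a concept that appears in every document gets a slightly negative idf, since $$\log \frac{N}{N+1} < 0$$.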

# Visualizing CFIDF

CFIDF is great for exploring text that you are conceptually familiar with or where you roughly know what concepts are mentioned. It's also great for comparing text. Take the following ternary plot, for example, where we look at song lyrics. (It should be obvious <a href="https://tmthyjames.github.io/2018/january/Cypher/" target="_blank">now</a> <a href="https://tmthyjames.github.io/2018/january/Analyzing-Rap-Lyrics-Using-Word-Vectors/" target="_blank">that</a> <a href="https://tmthyjames.github.io/2018/february/Predicting-Musical-Genres/" target="_blank">I</a> <a href="https://tmthyjames.github.io/2018/june/Expose-Word2vec-Model-with-a-RESTful-API-Using-Only-a-Jupyter-Notebook/" target="_blank">love</a> <a href="https://tmthyjames.github.io/2018/august/Using-Bigram-Paragraph-Vectors/" target="_blank">rap</a> <a href="https://www.youtube.com/watch?v=L4Nd7lZgp4o" target="_blank">music</a>).

We specify three concepts and place each artist according to how strongly their lyrics lean toward each one. If an artist lands in the middle, then they likely write about all three concepts equally. If they land where Jackie Wilson does in this example (left edge), then they write about `family` and `money` equally but shy away from `politics`. This example shows that metal and punk artists tend to write about `politics` more than `family` or `money`, and that rap artists in general tend to rap about `money` over `family` and `politics`. Anecdotally, we can confirm this by looking at metal and punk artists like Fear Factory and Rage Against the Machine, two artists known for political lyrics. Also notice the position of the rapper Immortal Technique, who is known for political lyricism. It should be no surprise that country artists tend to lean towards `family` concepts.

<img src="29.png"></img>
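
For readers unfamiliar with ternary plots, here is a rough sketch of how three scores can be turned into a position inside a triangle. It assumes the three scores are normalized to sum to 1 and then used as barycentric weights; the corner layout and the `ternary_point` helper are my choices for this sketch and do not necessarily match the orientation of the plot above. The two sets of scores are taken from the CFIDF table further down.

```python
import math

def ternary_point(a, b, c):
    """Map three non-negative scores to (x, y) in a unit-side triangle.

    Assumed corners: a -> (0, 0), b -> (1, 0), c -> (0.5, sqrt(3)/2).
    """
    total = a + b + c
    if total == 0:
        raise ValueError("at least one score must be positive")
    # Normalize so the three shares sum to 1, then mix the corners.
    b_share, c_share = b / total, c / total
    x = b_share + 0.5 * c_share
    y = (math.sqrt(3) / 2) * c_share
    return x, y

# (family, money, politics) scores copied from the table below.
barry_white = ternary_point(1.000000, 0.745141, 0.000000)
bad_religion = ternary_point(0.378958, 0.138150, 0.380938)

print(barry_white)   # y is 0: on the family-money edge, away from politics
print(bad_religion)  # pulled well up toward the politics corner
```

An artist with three equal scores lands at the centroid of the triangle, which is the "middle" described above.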

When the concepts of interest are `politics`, `drugs`, and `sex`, we see pretty definite segmentation: rap leans towards drugs, heavy metal leans toward politics, and R&B dominates the sex category, as one would expect, right?

<img src="19.png"></img>

Here are all the CFIDF scores by artist and genre:

```python
import pandas as pd

# Load the precomputed CFIDF scores (one row per artist) and display them
cfidf = pd.read_csv('cf-idf.csv')
cfidf
```

| | drugs | family | money | partying | politics | religion | sex | Group | Genre |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.151377 | 0.427702 | 0.375632 | 0.301393 | 0.048299 | 0.133169 | 0.129391 | 2Pac | Hip Hop |
| 1 | 0.239066 | 0.417061 | 0.454616 | 0.370648 | 0.039125 | 0.121863 | 0.135851 | 50_Cent | Hip Hop |
| 2 | 0.174737 | 0.467548 | 0.179775 | 0.276611 | 0.103990 | 0.132259 | 0.117274 | 7 Seconds | punk rock |
| 3 | 0.660889 | 0.344094 | 0.455429 | 0.364400 | 0.096142 | 0.125103 | 0.131203 | A$AP_Rocky | Hip Hop |
| 4 | 0.034425 | 0.476945 | 0.337638 | 0.417414 | 0.077265 | 0.357589 | 0.231473 | AC/DC | heavy metal |
| 5 | 0.249775 | 0.346488 | 0.260931 | 0.248761 | 0.102499 | 0.090355 | 0.099653 | AZ | Hip Hop |
| 6 | 0.032446 | 0.483345 | 0.203328 | 0.316998 | 0.148683 | 0.202916 | 0.160069 | Accept | heavy metal |
| 7 | 0.000000 | 0.685787 | 0.522545 | 0.436630 | 0.051919 | 0.498870 | 0.404920 | Adele | rhythm and blues |
| 8 | 0.000000 | 0.900886 | 0.560054 | 0.697584 | 0.046892 | 0.695580 | 0.627945 | Al Green | rhythm and blues |
| 9 | 0.053640 | 0.647162 | 0.509189 | 0.414272 | 0.119802 | 0.484224 | 0.273535 | Alan Jackson | Country |
| 10 | 0.046781 | 0.526274 | 0.384845 | 0.353584 | 0.153123 | 0.268308 | 0.244421 | Alice Cooper | heavy metal |
| 11 | 0.000000 | 0.479292 | 0.178801 | 0.292639 | 0.000000 | 0.218498 | 0.137328 | Alice in Chains | heavy metal |
| 12 | 0.035913 | 0.596436 | 0.392878 | 0.366707 | 0.070529 | 0.406749 | 0.253452 | Alison Krauss | Country |
| 13 | 0.030683 | 0.297049 | 0.028043 | 0.125625 | 0.388327 | 0.140677 | 0.035558 | Amon Amarth | heavy metal |
| 14 | 0.257458 | 0.641693 | 0.439047 | 0.418453 | 0.028090 | 0.364027 | 0.269258 | Amy Winehouse | rhythm and blues |
| 15 | 0.020131 | 0.937717 | 0.728164 | 0.644628 | 0.000000 | 0.750571 | 0.685267 | Anita Baker | rhythm and blues |
| 16 | 0.000000 | 0.656679 | 0.391062 | 0.401276 | 0.053439 | 0.426590 | 0.298280 | Anne Murray | Country |
| 17 | 0.042059 | 0.456947 | 0.168757 | 0.235924 | 0.119311 | 0.155751 | 0.081028 | Anthrax | heavy metal |
| 18 | 0.058171 | 0.503801 | 0.246655 | 0.323619 | 0.247526 | 0.212399 | 0.137969 | Anti-Nowhere League | punk rock |
| 19 | 0.062821 | 0.397296 | 0.151164 | 0.269749 | 0.511776 | 0.118351 | 0.104213 | Anti‐Flag | punk rock |
| 20 | 0.028504 | 0.820412 | 0.606775 | 0.593966 | 0.054282 | 0.700646 | 0.616238 | Aretha Franklin | rhythm and blues |
| 21 | 0.022174 | 0.525510 | 0.147532 | 0.261213 | 0.145159 | 0.130319 | 0.106520 | Avenged Sevenfold | heavy metal |
| 22 | 0.000000 | 0.905792 | 0.458609 | 0.630811 | 0.038014 | 0.573877 | 0.607313 | B.B. King | rhythm and blues |
| 23 | 0.159230 | 0.462897 | 0.354108 | 0.386592 | 0.099894 | 0.113994 | 0.184632 | B.o.B | Hip Hop |
| 24 | 0.014794 | 0.378958 | 0.138150 | 0.217477 | 0.380938 | 0.129712 | 0.074173 | Bad Religion | punk rock |
| 25 | 0.171185 | 0.370992 | 0.292913 | 0.273246 | 0.037355 | 0.065371 | 0.089830 | Bad_Meets_Evil | Hip Hop |
| 26 | 0.032547 | 0.775277 | 0.555606 | 0.495916 | 0.042612 | 0.525926 | 0.386995 | Barbara Mandrell | Country |
| 27 | 0.000000 | 1.000000 | 0.745141 | 0.808837 | 0.000000 | 0.836872 | 0.943167 | Barry White | rhythm and blues |
| 28 | 0.017855 | 0.772672 | 0.545288 | 0.629006 | 0.000000 | 0.474196 | 0.525784 | Beyoncé Knowles | rhythm and blues |
| 29 | 0.245185 | 0.413275 | 0.331191 | 0.358828 | 0.015286 | 0.065510 | 0.146883 | Big_Boi | Hip Hop |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 362 | 0.000000 | 0.632911 | 0.441297 | 0.420136 | 0.040967 | 0.450969 | 0.357437 | Trisha Yearwood | Country |
| 363 | 0.320308 | 0.403973 | 0.401622 | 0.357915 | 0.054575 | 0.116402 | 0.144865 | Twista | Hip Hop |
| 364 | 0.010390 | 0.658518 | 0.299632 | 0.429956 | 0.027208 | 0.416839 | 0.212274 | Twisted Sister | heavy metal |
| 365 | 0.164294 | 0.415106 | 0.326170 | 0.316618 | 0.112228 | 0.104804 | 0.128864 | Tyler,_The_Creator | Hip Hop |
| 366 | 0.037013 | 0.457210 | 0.265354 | 0.285825 | 0.072690 | 0.329173 | 0.172931 | Type O Negative | heavy metal |
| 367 | 0.007734 | 0.730533 | 0.603650 | 0.657350 | 0.030378 | 0.315966 | 0.594983 | Tyrese Gibson | Hip Hop |
| 368 | 0.014057 | 0.649171 | 0.404000 | 0.424277 | 0.079751 | 0.396563 | 0.284272 | UFO | heavy metal |
| 369 | 0.007463 | 0.624370 | 0.373314 | 0.370615 | 0.097705 | 0.360790 | 0.242315 | Uriah Heep | heavy metal |
| 370 | 0.016748 | 0.758233 | 0.523127 | 0.666694 | 0.005482 | 0.343238 | 0.544873 | Usher | rhythm and blues |
| 371 | 0.006117 | 0.664322 | 0.476895 | 0.535035 | 0.072075 | 0.457920 | 0.404772 | Van Halen | heavy metal |
| 372 | 0.081082 | 0.275381 | 0.088967 | 0.149084 | 0.329089 | 0.225467 | 0.076623 | Venom | heavy metal |
| 373 | 0.000000 | 0.687720 | 0.361110 | 0.427076 | 0.020911 | 0.394013 | 0.310108 | Vince Gill | Country |
| 374 | 0.035617 | 0.608620 | 0.344285 | 0.398376 | 0.233158 | 0.348415 | 0.246486 | Violent Femmes | punk rock |
| 375 | 0.032590 | 0.547653 | 0.294348 | 0.316193 | 0.192013 | 0.462534 | 0.257751 | W.A.S.P. | heavy metal |
| 376 | 0.205116 | 0.394842 | 0.389623 | 0.302427 | 0.033569 | 0.116834 | 0.133046 | Warren_G | Hip Hop |
| 377 | 0.030052 | 0.613280 | 0.346167 | 0.373449 | 0.102906 | 0.313079 | 0.258100 | Waylon Jennings | Country |
| 378 | 0.000000 | 0.379652 | 0.234339 | 0.368102 | 0.139130 | 0.271915 | 0.235558 | White Zombie | heavy metal |
| 379 | 0.009354 | 0.805518 | 0.764349 | 0.591093 | 0.067360 | 0.630351 | 0.594585 | Whitesnake | heavy metal |
| 380 | 0.000000 | 0.812481 | 0.629915 | 0.574522 | 0.019814 | 0.612128 | 0.594845 | Whitney Houston | rhythm and blues |
| 381 | 0.050114 | 0.478771 | 0.227129 | 0.350105 | 0.082015 | 0.130546 | 0.208611 | Will_Smith | Hip Hop |
| 382 | 0.052394 | 0.638656 | 0.343738 | 0.357677 | 0.135236 | 0.357768 | 0.233339 | Willie Nelson | Country |
| 383 | 0.135084 | 0.231607 | 0.124481 | 0.160269 | 0.141489 | 0.153111 | 0.076997 | Wire | punk rock |
| 384 | 0.776054 | 0.420987 | 0.546276 | 0.376374 | 0.069092 | 0.102349 | 0.145945 | Wiz_Khalifa | Hip Hop |
| 385 | 0.189808 | 0.464809 | 0.269780 | 0.359945 | 0.045183 | 0.185374 | 0.187468 | X | punk rock |
| 386 | 0.204148 | 0.369473 | 0.307254 | 0.298508 | 0.092521 | 0.062660 | 0.100177 | Xzibit | Hip Hop |
| 387 | 0.173154 | 0.414595 | 0.325445 | 0.331869 | 0.110114 | 0.140262 | 0.139033 | Yelawolf | Hip Hop |
| 388 | 0.011097 | 0.470948 | 0.201246 | 0.289181 | 0.217937 | 0.289195 | 0.153054 | Yngwie Malmsteen | heavy metal |
| 389 | 0.284804 | 0.417470 | 0.415205 | 0.379049 | 0.050196 | 0.061686 | 0.086339 | Young_Jeezy | Hip Hop |
| 390 | 0.026043 | 0.679013 | 0.399591 | 0.385002 | 0.187538 | 0.324088 | 0.203139 | Zac Brown Band | Country |
| 391 | 0.057894 | 0.590785 | 0.230635 | 0.304427 | 0.081213 | 0.191311 | 0.142993 | motörhead | heavy metal |

392 rows × 9 columns
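
One way to check the genre-level tendencies against these numbers is to average the scores per genre. The sketch below uses only four rows copied from the table above so that it runs on its own; with the real file you would call the same `groupby` on the full `cfidf` frame. The column subset is my choice for illustration, and four rows are far too few to draw conclusions from.

```python
import pandas as pd

# Four rows copied from the table above (two Hip Hop, two punk rock).
sample = pd.DataFrame(
    [
        {"Group": "2Pac", "Genre": "Hip Hop", "family": 0.427702, "money": 0.375632, "politics": 0.048299},
        {"Group": "50_Cent", "Genre": "Hip Hop", "family": 0.417061, "money": 0.454616, "politics": 0.039125},
        {"Group": "7 Seconds", "Genre": "punk rock", "family": 0.467548, "money": 0.179775, "politics": 0.103990},
        {"Group": "Bad Religion", "Genre": "punk rock", "family": 0.378958, "money": 0.138150, "politics": 0.380938},
    ]
)

# Mean score per genre for the three concepts used in the first plot.
genre_means = sample.groupby("Genre")[["family", "money", "politics"]].mean()
print(genre_means.round(3))
```

Even in this tiny sample, the punk rows average higher on `politics` and the Hip Hop rows higher on `money`, which is the same direction as the pattern in the ternary plot.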

# What it's not

* Although possible, CFIDF is not meant to be used as a predictive algorithm. You're much better off using your word embeddings directly as features instead of the CFIDF features *derived* from them.
* CFIDF is not designed for topic modeling, as it doesn't "discover" topics. CFIDF assumes you've specified your topics (or concepts) of interest a priori. It may be helpful for a priori topic/concept analysis though.

On the other hand, it works great for texts where you have a set of concepts you're interested in analyzing, where you may have a lot of overlapping language (money vs cash), and/or where you already know (roughly) what concepts exist in the text.

# What's Next

This is a good start for determining which *concepts*, as opposed to terms, are unique to a document. But it's not perfect. One flaw with this specific example is my use of <a href="https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf">Word2Vec</a> to train the word embeddings that produce the CFIDF scores. Word2Vec produces a single vector representation for each word, regardless of the context in which the word appears. This poses two problems. One, it doesn't take into account the complex characteristics of word use (word-sense ambiguity); two, it doesn't model <a href="https://en.wikipedia.org/wiki/Polysemy">polysemic</a> words whose meanings vary across linguistic contexts. J. R. Firth, a pioneer of distributional semantics, said:

>The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously.

Although Word2Vec takes context words into account during training, it assigns one vector representation to each word, regardless of how the word is used. So although it does a great job of learning semantic similarity (*hot-cold*, *computer-tv*, *hand-foot*), it isn't as adept at learning semantic relatedness (*hot-sun*, *computer-database*, *hand-ring*) because of its one-embedding-per-word architecture.

This distinction is important. Take the example *cup-bottle* vs *necklace-neck*. The first pair of words is semantically *similar*: the two are roughly interchangeable and similar in function. The second pair is merely semantically *related*: the words share a common theme but aren't interchangeable or functionally similar. If, for example, I'm interested in finding conceptual similarities for `partying`, then *cup* and *bottle* would likely both make the cut. But if I'm interested in finding concepts similar to `money`, then although *necklace* may contribute, *neck* likely will not because it's not semantically similar. Choosing a word-embedding architecture that supports multiple vector representations per word, depending on context, could likely improve the results of CFIDF.

Asr et al., in <a href="https://aclweb.org/anthology/N18-1062">Querying Word Embeddings for Similarity and Relatedness</a>, said:

> It may be unrealistic to expect a single vector representation to account for qualitatively distinct similarity and relatedness data.

They went on to show empirically that although word embeddings are best for finding semantic similarities, contextualized word embeddings are better indicators of semantic relatedness.

Much work has been done to differentiate semantic relatedness from semantic similarity and to solve the problems of word-sense ambiguity and polysemy modeling (see the <a href="http://papers.nips.cc/paper/7209-learned-in-translation-contextualized-word-vectors.pdf">CoVe paper</a> and the <a href="https://arxiv.org/pdf/1802.05365.pdf">ELMo paper</a>).

ELMo, as opposed to Word2Vec, uses a language model (predicting the next word in a sequence of words; more on language models <a href="https://machinelearningmastery.com/statistical-language-modeling-and-neural-language-models/">here</a>) to solve our two problems: word context (word-sense disambiguation) and linguistic context (polysemy modeling). It does this by producing an embedding for a word *based on the context in which it appears*, so each occurrence of a word gets a slightly different embedding. This better disambiguates each word and provides more context for our text data!

Therefore, for better CFIDF performance, try <a href="https://arxiv.org/pdf/1802.05365.pdf">deep *contextualized* word representations</a> (my next step is to use ELMo embeddings to produce CFIDF scores).
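
To illustrate why one vector per occurrence could matter for CFIDF, here is a toy sketch. Nothing in it comes from ELMo or from my Word2Vec model: the two-dimensional vectors, the word *bank*, and the 0.7 threshold are all invented for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

threshold = 0.7
money_concept = [1.0, 0.0]  # hypothetical vector for the `money` concept

# Static embedding: a single vector that blends every sense of "bank".
static_vectors = {"bank": [0.6, 0.8]}

# Contextual embeddings: one hypothetical vector per occurrence.
occurrences = [
    ("bank", [0.95, 0.30]),  # as in "money in the bank"
    ("bank", [0.10, 0.99]),  # as in "the river bank"
]

# With a static vector, both occurrences get the same verdict.
static_hits = sum(
    1 for word, _ in occurrences
    if cosine_similarity(static_vectors[word], money_concept) > threshold
)

# With per-occurrence vectors, each use of the word is judged on its own.
contextual_hits = sum(
    1 for _, vector in occurrences
    if cosine_similarity(vector, money_concept) > threshold
)

print(static_hits, contextual_hits)
```

In this made-up case the blended static vector falls below the threshold, so neither occurrence counts, while the per-occurrence vectors let the financial use count toward `money` and leave the river sense out.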

HT to <a href="https://bl.ocks.org/tomgp">Tom Pearson</a> and <a href="http://bl.ocks.org/tomgp/7674234">this</a> block for the <a href="https://d3js.org/">d3.js</a> inspiration.
