* [Intro](#Intro)
* [Word Embeddings and Semantic Similarity](#Word-Embeddings-and-Semantic-Similarity)
* [Calculating CFIDF](#Calculating-CFIDF)
* [Visualizing CFIDF](#Visualizing-CFIDF)
* [What it's not](#What-it's-not)
* [What's Next](#What's-Next)

# Intro

<b>C</b>oncept <b>F</b>requency-<b>I</b>nverse Concept <b>D</b>ocument <b>F</b>requency (or CFIDF) is a measure I've created to explore text data, and it works surprisingly well for exploring and visualizing it. Unlike its statistical parent TF-IDF, it isn't based strictly on word counts but on *entire concepts*, assuming that your word embedding layer is trained appropriately. First, it may help to explain what TF-IDF accomplishes.

Much of NLP is trying to figure out what a body of text is about. TF-IDF is central to answering this question because it measures how distinctive a word is to a document compared to the other documents in the corpus. If you can measure how important a set of words is to a document, then you're one step closer to answering what the document is about. One problem here is synonyms, commonly misspelled words, and conceptually similar words or phrases: each variant is counted separately, which dilutes the impact a word has on a document. This is where CFIDF comes in.

# Word Embeddings and Semantic Similarity

Word2Vec, the word embedding algorithm I use here (though any word embedding algorithm will do), produces measures of semantic similarity. This lets us determine whether two words are semantically (or conceptually) similar, regardless of whether they are synonyms, common misspellings, or different parts of speech entirely. Thus, with CFIDF, we can calculate how *frequently a concept* shows up in a document compared to how frequently it shows up in the other documents in a corpus.
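As a rough illustration of what "semantic similarity" means numerically, here is a minimal sketch using cosine similarity, the measure word-embedding libraries typically report. The three-dimensional vectors in `toy_vectors` are invented for this example and are not taken from any trained model; a real Word2Vec model would supply learned vectors with far more dimensions.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# A trained embedding model would supply the real vectors.
toy_vectors = {
    "money": [0.9, 0.1, 0.0],
    "cash": [0.8, 0.2, 0.1],
    "guitar": [0.0, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_cash = cosine_similarity(toy_vectors["money"], toy_vectors["cash"])
sim_guitar = cosine_similarity(toy_vectors["money"], toy_vectors["guitar"])

print(f"money vs cash:   {sim_cash:.3f}")
print(f"money vs guitar: {sim_guitar:.3f}")
```

With these made-up vectors, *cash* scores much closer to `money` than *guitar* does, which is the kind of signal CFIDF relies on.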

# Calculating CFIDF

Calculating CFIDF is pretty straightforward and almost exactly like TF-IDF, but instead of term frequency you use *concept frequency*. To do this, you set a user-defined similarity threshold on the similarity queries against your word embeddings. If a word in the document scores above this threshold against the target concept word, we count it as conceptually similar, which increases the concept frequency (it may help to set thresholds for each concept/target word individually).

You may notice that I compute concept frequency by dividing the number of times a concept appears by the total number of *terms* in the document. Since we can't easily detect the number of concepts present, this will have to do as a way to account for document length.

$$\mathrm{cf}(c,d) = \frac{f_{c,d}}{\sum_{t' \in d} f_{t',d}}$$
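The concept-frequency step can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact code behind the plots: the function name `concept_frequency`, the toy vectors, and the threshold of 0.8 are all hypothetical, and words missing from the vocabulary are simply treated as non-matches while still counting toward document length.

```python
import math

# Toy vectors invented for illustration; a trained model would supply these.
toy_vectors = {
    "money": [0.9, 0.1, 0.0],
    "cash": [0.8, 0.2, 0.1],
    "guitar": [0.0, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def concept_frequency(tokens, concept, vectors, threshold):
    """Share of tokens whose similarity to `concept` is above `threshold`.

    The denominator is the total number of terms in the document,
    matching the cf(c, d) formula above.
    """
    if not tokens:
        return 0.0
    concept_vec = vectors[concept]
    hits = 0
    for token in tokens:
        vec = vectors.get(token)  # out-of-vocabulary tokens never match
        if vec is not None and cosine_similarity(vec, concept_vec) > threshold:
            hits += 1
    return hits / len(tokens)


lyrics = ["cash", "money", "guitar", "cash"]
cf_money = concept_frequency(lyrics, "money", toy_vectors, threshold=0.8)
print(cf_money)
```

Passing `threshold` per call makes it easy to tune a separate threshold for each concept word, as suggested above.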

I calculate inverse concept document frequency using the following formula:

$$\mathrm{idf}(c, D) = \log \frac{N}{1+|\{d \in D: c \in d\}|}$$

Here $N = |D|$ is the total number of documents in the corpus, and $|\{d \in D: c \in d\}|$ is the number of documents in which concept $c$ appears. I add the constant 1 to the denominator to avoid division by zero.
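Putting the two pieces together might look like the sketch below. Two things here are assumptions rather than statements from the text: the final score is taken to be the product of concept frequency and inverse concept document frequency (by analogy with TF-IDF), and the logarithm is taken to be natural. The per-document hit counts and lengths are made-up numbers, and the function names are hypothetical.

```python
import math


def inverse_concept_document_frequency(concept_counts):
    """log(N / (1 + number of documents in which the concept appears)).

    `concept_counts` holds one concept-hit count per document.
    Natural log is an assumption; the formula only says "log".
    """
    n_docs = len(concept_counts)
    docs_with_concept = sum(1 for count in concept_counts if count > 0)
    return math.log(n_docs / (1 + docs_with_concept))


def cfidf(concept_count, doc_length, idf):
    """Concept frequency times idf (the product is assumed, as in TF-IDF)."""
    return (concept_count / doc_length) * idf


# Made-up example: concept hits and total term counts for four documents.
hits_per_doc = [3, 0, 0, 1]
doc_lengths = [10, 8, 12, 20]

idf = inverse_concept_document_frequency(hits_per_doc)
scores = [cfidf(h, n, idf) for h, n in zip(hits_per_doc, doc_lengths)]
print(idf, scores)
```

One consequence of adding 1 to the denominator: when a concept appears in every document, the ratio drops below 1 and the logarithm goes negative, so very common concepts may need clipping or rescaling before plotting.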

# Visualizing CFIDF

CFIDF is great for exploring text that you are conceptually familiar with or where you roughly know which concepts are mentioned. It's also great for comparing texts. Take the following ternary plot, for example, where we look at song lyrics. (It should be obvious <a href="https://tmthyjames.github.io/2018/january/Cypher/" target="_blank">now</a> <a href="https://tmthyjames.github.io/2018/january/Analyzing-Rap-Lyrics-Using-Word-Vectors/" target="_blank">that</a> <a href="https://tmthyjames.github.io/2018/february/Predicting-Musical-Genres/" target="_blank">I</a> <a href="https://tmthyjames.github.io/2018/june/Expose-Word2vec-Model-with-a-RESTful-API-Using-Only-a-Jupyter-Notebook/" target="_blank">love</a> <a href="https://tmthyjames.github.io/2018/august/Using-Bigram-Paragraph-Vectors/" target="_blank">rap</a> <a href="https://www.youtube.com/watch?v=L4Nd7lZgp4o" target="_blank">music</a>.)

We have specified three concepts and map each artist according to how strongly their lyrics lean toward each one. If an artist lands in the middle, then they likely write about all three concepts equally. If they land where Jackie Wilson does in this example (left edge), then they write about `family` and `money` equally but shy away from `politics`. This example shows that metal and punk artists tend to write about `politics` more than `family` or `money`, and rap artists tend to rap about `money` over `family` and `politics` in general. Anecdotally, we can confirm this by looking at metal and punk artists like Fear Factory and Rage Against the Machine, two artists known for political lyrics. Also notice the position of the rapper Immortal Technique, who is known for his political lyricism. It should be no surprise that country artists tend to lean towards `family` concepts.

<img src="29.png"></img>
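For readers who want to place their own scores on a ternary plot, the sketch below shows one common way to do it. It is an assumption about the mapping, not a description of how the figure above was drawn: the three scores are normalized to shares that sum to 1 and then converted to 2D coordinates inside an equilateral triangle. Which concept sits at which corner is arbitrary here, and the function name `ternary_point` is hypothetical. The example scores are the `family`, `money`, and `politics` values for Anti-Flag from the score table in this post.

```python
import math


def ternary_point(a, b, c):
    """Map three non-negative scores to (x, y) inside a unit triangle.

    Corners (an arbitrary choice): a -> (0, 0), b -> (1, 0),
    c -> (0.5, sqrt(3) / 2).
    """
    total = a + b + c
    if total == 0:
        raise ValueError("at least one score must be positive")
    # Normalize so the three shares sum to 1 (barycentric coordinates).
    a, b, c = a / total, b / total, c / total
    x = b + 0.5 * c
    y = (math.sqrt(3) / 2) * c
    return x, y


# family, money, politics scores for Anti-Flag, copied from the table.
x, y = ternary_point(0.397296, 0.151164, 0.511776)
print(f"x={x:.3f}, y={y:.3f}")
```

Under this mapping, equal scores land on the centroid of the triangle, and a zero score for one concept puts the point on the opposite edge, which matches the reading of the plot given above.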

When the concepts of interest are `politics`, `drugs`, and `sex`, we see pretty definite segmentation: rap leans toward `drugs`, heavy metal leans toward `politics`, and R&B dominates the `sex` category, as one would expect, right?

<img src="19.png"></img>

Here are all the CFIDF scores by artist and genre:

In [9]:

    import pandas as pd
    cfidf = pd.read_csv('cf-idf.csv')

    cfidf

Out[9]:

	drugs	family	money	partying	politics	religion	sex	Group	Genre
0	0.151377	0.427702	0.375632	0.301393	0.048299	0.133169	0.129391	2Pac	Hip Hop
1	0.239066	0.417061	0.454616	0.370648	0.039125	0.121863	0.135851	50_Cent	Hip Hop
2	0.174737	0.467548	0.179775	0.276611	0.103990	0.132259	0.117274	7 Seconds	punk rock
3	0.660889	0.344094	0.455429	0.364400	0.096142	0.125103	0.131203	A$AP_Rocky	Hip Hop
4	0.034425	0.476945	0.337638	0.417414	0.077265	0.357589	0.231473	AC/DC	heavy metal
5	0.249775	0.346488	0.260931	0.248761	0.102499	0.090355	0.099653	AZ	Hip Hop
6	0.032446	0.483345	0.203328	0.316998	0.148683	0.202916	0.160069	Accept	heavy metal
7	0.000000	0.685787	0.522545	0.436630	0.051919	0.498870	0.404920	Adele	rhythm and blues
8	0.000000	0.900886	0.560054	0.697584	0.046892	0.695580	0.627945	Al Green	rhythm and blues
9	0.053640	0.647162	0.509189	0.414272	0.119802	0.484224	0.273535	Alan Jackson	Country
10	0.046781	0.526274	0.384845	0.353584	0.153123	0.268308	0.244421	Alice Cooper	heavy metal
11	0.000000	0.479292	0.178801	0.292639	0.000000	0.218498	0.137328	Alice in Chains	heavy metal
12	0.035913	0.596436	0.392878	0.366707	0.070529	0.406749	0.253452	Alison Krauss	Country
13	0.030683	0.297049	0.028043	0.125625	0.388327	0.140677	0.035558	Amon Amarth	heavy metal
14	0.257458	0.641693	0.439047	0.418453	0.028090	0.364027	0.269258	Amy Winehouse	rhythm and blues
15	0.020131	0.937717	0.728164	0.644628	0.000000	0.750571	0.685267	Anita Baker	rhythm and blues
16	0.000000	0.656679	0.391062	0.401276	0.053439	0.426590	0.298280	Anne Murray	Country
17	0.042059	0.456947	0.168757	0.235924	0.119311	0.155751	0.081028	Anthrax	heavy metal
18	0.058171	0.503801	0.246655	0.323619	0.247526	0.212399	0.137969	Anti-Nowhere League	punk rock
19	0.062821	0.397296	0.151164	0.269749	0.511776	0.118351	0.104213	Anti‐Flag	punk rock
20	0.028504	0.820412	0.606775	0.593966	0.054282	0.700646	0.616238	Aretha Franklin	rhythm and blues
21	0.022174	0.525510	0.147532	0.261213	0.145159	0.130319	0.106520	Avenged Sevenfold	heavy metal
22	0.000000	0.905792	0.458609	0.630811	0.038014	0.573877	0.607313	B.B. King	rhythm and blues
23	0.159230	0.462897	0.354108	0.386592	0.099894	0.113994	0.184632	B.o.B	Hip Hop
24	0.014794	0.378958	0.138150	0.217477	0.380938	0.129712	0.074173	Bad Religion	punk rock
25	0.171185	0.370992	0.292913	0.273246	0.037355	0.065371	0.089830	Bad_Meets_Evil	Hip Hop
26	0.032547	0.775277	0.555606	0.495916	0.042612	0.525926	0.386995	Barbara Mandrell	Country
27	0.000000	1.000000	0.745141	0.808837	0.000000	0.836872	0.943167	Barry White	rhythm and blues
28	0.017855	0.772672	0.545288	0.629006	0.000000	0.474196	0.525784	Beyoncé Knowles	rhythm and blues
29	0.245185	0.413275	0.331191	0.358828	0.015286	0.065510	0.146883	Big_Boi	Hip Hop
...	...	...	...	...	...	...	...	...	...
362	0.000000	0.632911	0.441297	0.420136	0.040967	0.450969	0.357437	Trisha Yearwood	Country
363	0.320308	0.403973	0.401622	0.357915	0.054575	0.116402	0.144865	Twista	Hip Hop
364	0.010390	0.658518	0.299632	0.429956	0.027208	0.416839	0.212274	Twisted Sister	heavy metal
365	0.164294	0.415106	0.326170	0.316618	0.112228	0.104804	0.128864	Tyler,_The_Creator	Hip Hop
366	0.037013	0.457210	0.265354	0.285825	0.072690	0.329173	0.172931	Type O Negative	heavy metal
367	0.007734	0.730533	0.603650	0.657350	0.030378	0.315966	0.594983	Tyrese Gibson	Hip Hop
368	0.014057	0.649171	0.404000	0.424277	0.079751	0.396563	0.284272	UFO	heavy metal
369	0.007463	0.624370	0.373314	0.370615	0.097705	0.360790	0.242315	Uriah Heep	heavy metal
370	0.016748	0.758233	0.523127	0.666694	0.005482	0.343238	0.544873	Usher	rhythm and blues
371	0.006117	0.664322	0.476895	0.535035	0.072075	0.457920	0.404772	Van Halen	heavy metal
372	0.081082	0.275381	0.088967	0.149084	0.329089	0.225467	0.076623	Venom	heavy metal
373	0.000000	0.687720	0.361110	0.427076	0.020911	0.394013	0.310108	Vince Gill	Country
374	0.035617	0.608620	0.344285	0.398376	0.233158	0.348415	0.246486	Violent Femmes	punk rock
375	0.032590	0.547653	0.294348	0.316193	0.192013	0.462534	0.257751	W.A.S.P.	heavy metal
376	0.205116	0.394842	0.389623	0.302427	0.033569	0.116834	0.133046	Warren_G	Hip Hop
377	0.030052	0.613280	0.346167	0.373449	0.102906	0.313079	0.258100	Waylon Jennings	Country
378	0.000000	0.379652	0.234339	0.368102	0.139130	0.271915	0.235558	White Zombie	heavy metal
379	0.009354	0.805518	0.764349	0.591093	0.067360	0.630351	0.594585	Whitesnake	heavy metal
380	0.000000	0.812481	0.629915	0.574522	0.019814	0.612128	0.594845	Whitney Houston	rhythm and blues
381	0.050114	0.478771	0.227129	0.350105	0.082015	0.130546	0.208611	Will_Smith	Hip Hop
382	0.052394	0.638656	0.343738	0.357677	0.135236	0.357768	0.233339	Willie Nelson	Country
383	0.135084	0.231607	0.124481	0.160269	0.141489	0.153111	0.076997	Wire	punk rock
384	0.776054	0.420987	0.546276	0.376374	0.069092	0.102349	0.145945	Wiz_Khalifa	Hip Hop
385	0.189808	0.464809	0.269780	0.359945	0.045183	0.185374	0.187468	X	punk rock
386	0.204148	0.369473	0.307254	0.298508	0.092521	0.062660	0.100177	Xzibit	Hip Hop
387	0.173154	0.414595	0.325445	0.331869	0.110114	0.140262	0.139033	Yelawolf	Hip Hop
388	0.011097	0.470948	0.201246	0.289181	0.217937	0.289195	0.153054	Yngwie Malmsteen	heavy metal
389	0.284804	0.417470	0.415205	0.379049	0.050196	0.061686	0.086339	Young_Jeezy	Hip Hop
390	0.026043	0.679013	0.399591	0.385002	0.187538	0.324088	0.203139	Zac Brown Band	Country
391	0.057894	0.590785	0.230635	0.304427	0.081213	0.191311	0.142993	motörhead	heavy metal

392 rows × 9 columns
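One simple way to read a table like this is to average a concept's score by genre. The sketch below does that in plain Python for a handful of rows copied by hand from the output above (the `politics` column only). It is a tiny, non-representative sample chosen for illustration, so the resulting averages say nothing reliable about the full 392-row dataset.

```python
from collections import defaultdict
from statistics import mean

# (Group, Genre, politics score) copied from the table above.
sample_rows = [
    ("2Pac", "Hip Hop", 0.048299),
    ("50_Cent", "Hip Hop", 0.039125),
    ("7 Seconds", "punk rock", 0.103990),
    ("Bad Religion", "punk rock", 0.380938),
    ("AC/DC", "heavy metal", 0.077265),
    ("Amon Amarth", "heavy metal", 0.388327),
    ("Alan Jackson", "Country", 0.119802),
    ("Alison Krauss", "Country", 0.070529),
]

# Collect the politics scores for each genre.
scores_by_genre = defaultdict(list)
for _group, genre, politics in sample_rows:
    scores_by_genre[genre].append(politics)

# Average within each genre.
genre_means = {genre: mean(vals) for genre, vals in scores_by_genre.items()}

for genre, value in sorted(genre_means.items(), key=lambda kv: -kv[1]):
    print(f"{genre:12s} {value:.4f}")
```

With the full DataFrame loaded as `cfidf`, the equivalent pandas call would be `cfidf.groupby('Genre')['politics'].mean()`.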

# What it's not

* Although possible, CFIDF is not meant to be used as a predictive algorithm. You're much better off using your word embeddings directly as features instead of the features *produced* from your word embeddings.
* CFIDF is not designed for topic modeling, as it doesn't "discover" topics. CFIDF assumes you've specified your topics (or concepts) of interest a priori. It may be helpful for a priori topic/concept analysis, though.

On the other hand, it works great for texts where you have a set of concepts you're interested in analyzing, where you may have a lot of overlapping language (*money* vs *cash*), and/or where you already know (roughly) what concepts exist in the text.

# What's Next

This is a good start for determining which *concepts*, as opposed to terms, are unique to a document. But it's not perfect. One flaw with this specific example is my use of <a href="https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf">Word2Vec</a> to train the word embeddings that produce the CFIDF scores. Word2Vec produces a single vector representation for each word, regardless of the word's context. This poses two problems. One, it doesn't take into account the complex characteristics of word use (word-sense ambiguity); two, it doesn't model <a href="https://en.wikipedia.org/wiki/Polysemy">polysemic</a> words whose meanings vary across linguistic contexts. J.R. Firth, a pioneer of distributional semantics, said:

> The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously.

Although Word2Vec takes context words into account during training, it assigns one vector representation to each word, regardless of how the word is used. So although it does a great job of learning semantic similarity (*hot-cold*, *computer-tv*, *hand-foot*), it isn't as adept at learning semantic relatedness (*hot-sun*, *computer-database*, *hand-ring*) because of its one-embedding-per-word architecture.

This distinction is important. Take the example *cup-bottle* vs *necklace-neck*. The first pair of words is semantically *similar*: they are roughly interchangeable and similar in function. The second pair, however, is merely semantically *related*: the words share a common theme but aren't interchangeable or functionally similar. If, for example, I'm interested in finding conceptual similarities for `partying`, then *cup* and *bottle* would likely both make the cut. But if I'm interested in finding concepts similar to `money`, then although *necklace* may contribute, *neck* likely will not because it's not semantically similar. Choosing a word-embedding architecture that supports multiple vector representations per word depending on context would likely improve the results of CFIDF.

Asr et al., in <a href="https://aclweb.org/anthology/N18-1062">Querying Word Embeddings for Similarity and Relatedness</a>, said:

> It may be unrealistic to expect a single vector representation to account for qualitatively distinct similarity and relatedness data.

They went on to show empirically that although word embeddings are best for finding semantic similarities, contextualized word embeddings are better indicators of semantic relatedness.

Much work has been done to differentiate semantic relatedness from semantic similarity and to solve the problems of word-sense ambiguity and polysemy modeling (see the <a href="http://papers.nips.cc/paper/7209-learned-in-translation-contextualized-word-vectors.pdf">CoVe paper</a> and the <a href="https://arxiv.org/pdf/1802.05365.pdf">ELMo paper</a>).

ELMo, as opposed to Word2Vec, uses a language model (predicting the next word in a sequence of words; more on language models <a href="https://machinelearningmastery.com/statistical-language-modeling-and-neural-language-models/">here</a>) to solve our two problems: word context (word-sense disambiguation) and linguistic context (polysemy modeling). It does this by producing an embedding for a word *based on the context in which it appears*, so each occurrence of a word gets a slightly different embedding, which better disambiguates each word and provides more context to our text data!

Therefore, for better CFIDF performance, try <a href="https://arxiv.org/pdf/1802.05365.pdf">deep *contextualized* word representations</a> (my next step is to use ELMo embeddings to produce CFIDF scores).

HT to <a href="https://bl.ocks.org/tomgp">Tom Pearson</a> and <a href="http://bl.ocks.org/tomgp/7674234">this</a> block for the <a href="https://d3js.org/">d3.js</a> inspiration.


Concept Frequency-Inverse Concept Document Frequency: Analyzing Concepts in Text
