Concept Frequency-Inverse Concept Document Frequency (CFIDF) is a measure I've created for exploring and visualizing text data, and it works surprisingly well. As opposed to its statistical parent TF-IDF, it isn't based strictly on word counts but on entire concepts, assuming that your word embedding layer is trained appropriately. First, it may help to explain what TF-IDF accomplishes.
Much of NLP boils down to figuring out what a body of text is about. TF-IDF is crucial to answering this question: it calculates how unique a word is to a document compared to the other documents in the corpus. If you can measure how important a set of words is to a document, then you're one step closer to knowing what the document is about. One problem here is synonyms, commonly misspelled words, and even conceptually similar words or phrases; these tend to dilute the impact a word has on a document. This is where CFIDF comes in.
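As a quick refresher, plain TF-IDF can be sketched in a few lines. This is a minimal, illustrative variant (the corpus, the tokenization, and the smoothing constant in the denominator are my own assumptions; real implementations like scikit-learn's add normalization and other options):

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """Relative frequency of `term` in `doc`, weighted by how rare it is across `corpus`."""
    tf = Counter(doc)[term] / len(doc)            # term frequency, normalized by document length
    df = sum(1 for d in corpus if term in d)      # number of documents containing the term
    idf = math.log(len(corpus) / (1 + df))        # the +1 avoids division by zero
    return tf * idf

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stocks rallied on earnings news".split(),
]

# "the" appears in most documents, so it scores low;
# "stocks" is unique to one document, so it scores higher.
print(tf_idf("stocks", corpus[2], corpus))
print(tf_idf("the", corpus[0], corpus))
```

A word that shows up everywhere contributes little to distinguishing documents, which is exactly the behavior CFIDF inherits, just lifted from terms to concepts.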
Word2Vec, the word embedding algorithm I use here (though any word embedding algorithm will do), produces measures of semantic similarity. This allows us to determine whether two words are semantically (or conceptually) similar, regardless of whether they are synonyms, commonly misspelled words, or different parts of speech entirely. Thus, with CFIDF, we can calculate how frequently a concept shows up in a document compared to how frequently it shows up in the other documents in a corpus.
Calculating CFIDF is pretty straightforward and almost identical to TF-IDF, except that instead of term frequency you use concept frequency. To do this, establish a user-defined similarity threshold for the similarity queries against your word embeddings. If a word in the document scores above this threshold, we count it as conceptually similar, which increases the concept frequency (it may help to set a threshold for each concept/target word individually).
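This counting step can be sketched as follows. I use a handful of toy vectors and cosine similarity in place of a trained Word2Vec model; the embedding values and the 0.7 threshold are illustrative assumptions, not values from the analysis above:

```python
import math

# Toy vectors standing in for a trained Word2Vec model (illustrative values only)
embeddings = {
    "money":   [0.9, 0.1, 0.0],
    "cash":    [0.85, 0.15, 0.05],
    "dollars": [0.8, 0.2, 0.1],
    "guitar":  [0.0, 0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def concept_frequency(concept, doc_tokens, threshold=0.7):
    """Fraction of tokens whose similarity to `concept` exceeds the threshold."""
    hits = 0
    for token in doc_tokens:
        if token in embeddings and cosine(embeddings[concept], embeddings[token]) >= threshold:
            hits += 1
    return hits / len(doc_tokens)   # normalize by document length

doc = "cash rules everything around me dollars dollars guitar".split()
print(concept_frequency("money", doc))  # "cash" and both "dollars" count; "guitar" does not
```

With a real model you would swap the `embeddings` lookup and `cosine` call for a similarity query against your trained vectors; tokens missing from the vocabulary are simply skipped here.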
One may notice that I compute concept frequency by dividing the number of times a concept appears by the total number of terms in the document. Since we can't easily detect the total number of concepts present, this will have to do as a way to account for document length.
I calculate inverse concept document frequency with the following formula:

icdf(c) = log( N / (1 + df(c)) )

where N is the total number of documents in the corpus, and the denominator term df(c) is the number of documents in which the concept c appears; I add the constant 1 to avoid division by zero.
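Putting the two pieces together, the full score is just concept frequency times inverse concept document frequency. In this sketch I replace the embedding similarity query with hand-made similarity sets so the example stays small; the `SIMILAR` sets and the corpus are illustrative assumptions:

```python
import math

# Hand-made stand-ins for "tokens above the similarity threshold" for each concept.
# In practice these come from thresholded queries against your word embeddings.
SIMILAR = {
    "money":    {"money", "cash", "dollars"},
    "politics": {"politics", "senate", "vote"},
}

def concept_in(concept, doc):
    """True if any token in the document is conceptually similar to the concept."""
    return any(tok in SIMILAR[concept] for tok in doc)

def cfidf(concept, doc, corpus):
    cf = sum(tok in SIMILAR[concept] for tok in doc) / len(doc)   # concept frequency
    df = sum(1 for d in corpus if concept_in(concept, d))          # concept document frequency
    icdf = math.log(len(corpus) / (1 + df))                        # +1 avoids division by zero
    return cf * icdf

corpus = [
    "cash cash money flow".split(),
    "senate vote passes".split(),
    "love and heartbreak tonight".split(),
    "dancing all night long".split(),
]
print(cfidf("money", corpus[0], corpus))
```

Scoring every (concept, document) pair this way yields a matrix like the one in the table below, one row per artist and one column per concept.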
CFIDF is great for exploring text that you are conceptually familiar with, or where you roughly know what concepts are mentioned. It's also great for comparing texts. Take the following ternary plot of song lyrics, for example. (It should be obvious by now that I love rap music.)
We have specified three concepts to which we want to map artists to a certain degree. If an artist lands in the middle, then they likely write about all three concepts equally. If they land where Jackie Wilson does in this example (left edge), then they write about money equally but shy away from politics. This example shows that metal and punk artists tend to write about politics more so than money, and rap artists tend to rap about politics in general. Anecdotally, we can confirm this by looking at metal and punk artists like Fear Factory and Rage Against the Machine, two artists known to politic. Also notice the position of the rapper Immortal Technique, who is known to engage in political lyricism. It should be no surprise which concept country artists lean towards.
When the concepts of interest are drugs, politics, and sex, we see pretty definite segmentation: rap leans towards drugs, heavy metal leans toward politics, and R&B dominates the sex category, as one would expect, right?
Here's all the CFIDF scores by artist and genre:
| | concept 1 | concept 2 | concept 3 | concept 4 | concept 5 | concept 6 | concept 7 | artist | genre |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.174737 | 0.467548 | 0.179775 | 0.276611 | 0.103990 | 0.132259 | 0.117274 | 7 Seconds | punk rock |
| 7 | 0.000000 | 0.685787 | 0.522545 | 0.436630 | 0.051919 | 0.498870 | 0.404920 | Adele | rhythm and blues |
| 8 | 0.000000 | 0.900886 | 0.560054 | 0.697584 | 0.046892 | 0.695580 | 0.627945 | Al Green | rhythm and blues |
| 10 | 0.046781 | 0.526274 | 0.384845 | 0.353584 | 0.153123 | 0.268308 | 0.244421 | Alice Cooper | heavy metal |
| 11 | 0.000000 | 0.479292 | 0.178801 | 0.292639 | 0.000000 | 0.218498 | 0.137328 | Alice in Chains | heavy metal |
| 13 | 0.030683 | 0.297049 | 0.028043 | 0.125625 | 0.388327 | 0.140677 | 0.035558 | Amon Amarth | heavy metal |
| 14 | 0.257458 | 0.641693 | 0.439047 | 0.418453 | 0.028090 | 0.364027 | 0.269258 | Amy Winehouse | rhythm and blues |
| 15 | 0.020131 | 0.937717 | 0.728164 | 0.644628 | 0.000000 | 0.750571 | 0.685267 | Anita Baker | rhythm and blues |
| 18 | 0.058171 | 0.503801 | 0.246655 | 0.323619 | 0.247526 | 0.212399 | 0.137969 | Anti-Nowhere League | punk rock |
| 20 | 0.028504 | 0.820412 | 0.606775 | 0.593966 | 0.054282 | 0.700646 | 0.616238 | Aretha Franklin | rhythm and blues |
| 21 | 0.022174 | 0.525510 | 0.147532 | 0.261213 | 0.145159 | 0.130319 | 0.106520 | Avenged Sevenfold | heavy metal |
| 22 | 0.000000 | 0.905792 | 0.458609 | 0.630811 | 0.038014 | 0.573877 | 0.607313 | B.B. King | rhythm and blues |
| 24 | 0.014794 | 0.378958 | 0.138150 | 0.217477 | 0.380938 | 0.129712 | 0.074173 | Bad Religion | punk rock |
| 27 | 0.000000 | 1.000000 | 0.745141 | 0.808837 | 0.000000 | 0.836872 | 0.943167 | Barry White | rhythm and blues |
| 28 | 0.017855 | 0.772672 | 0.545288 | 0.629006 | 0.000000 | 0.474196 | 0.525784 | Beyoncé Knowles | rhythm and blues |
| … | … | … | … | … | … | … | … | … | … |
| 364 | 0.010390 | 0.658518 | 0.299632 | 0.429956 | 0.027208 | 0.416839 | 0.212274 | Twisted Sister | heavy metal |
| 366 | 0.037013 | 0.457210 | 0.265354 | 0.285825 | 0.072690 | 0.329173 | 0.172931 | Type O Negative | heavy metal |
| 367 | 0.007734 | 0.730533 | 0.603650 | 0.657350 | 0.030378 | 0.315966 | 0.594983 | Tyrese Gibson | Hip Hop |
| 369 | 0.007463 | 0.624370 | 0.373314 | 0.370615 | 0.097705 | 0.360790 | 0.242315 | Uriah Heep | heavy metal |
| 370 | 0.016748 | 0.758233 | 0.523127 | 0.666694 | 0.005482 | 0.343238 | 0.544873 | Usher | rhythm and blues |
| 371 | 0.006117 | 0.664322 | 0.476895 | 0.535035 | 0.072075 | 0.457920 | 0.404772 | Van Halen | heavy metal |
| 374 | 0.035617 | 0.608620 | 0.344285 | 0.398376 | 0.233158 | 0.348415 | 0.246486 | Violent Femmes | punk rock |
| 378 | 0.000000 | 0.379652 | 0.234339 | 0.368102 | 0.139130 | 0.271915 | 0.235558 | White Zombie | heavy metal |
| 380 | 0.000000 | 0.812481 | 0.629915 | 0.574522 | 0.019814 | 0.612128 | 0.594845 | Whitney Houston | rhythm and blues |
| 388 | 0.011097 | 0.470948 | 0.201246 | 0.289181 | 0.217937 | 0.289195 | 0.153054 | Yngwie Malmsteen | heavy metal |
| 390 | 0.026043 | 0.679013 | 0.399591 | 0.385002 | 0.187538 | 0.324088 | 0.203139 | Zac Brown Band | Country |
392 rows × 9 columns
• Although possible, CFIDF is not meant to be used as a predictive algorithm. You're much better off feeding the learned representations from your word embeddings into a model directly as features.
• CFIDF is not designed for topic modeling as it doesn't "discover" topics. CFIDF assumes you've specified your topics (or concepts) of interest a priori. It may be helpful for a priori topic/concept analysis though.
On the other hand, it works great for texts where you have a set of concepts you're interested in analyzing, where you may have a lot of overlapping language (money vs cash), and/or where you know (roughly) what concepts exist in the text already.
This is a good start for determining which concepts, as opposed to terms, are unique to a document. But it's not perfect. One flaw with this specific example is my use of Word2Vec to train the word embeddings that produce the CFIDF scores. The problem is that Word2Vec produces a single vector representation for each word, regardless of the context of the word. This poses two problems. First, it doesn't take into account the complex characteristics of word use (word-sense ambiguity); second, it doesn't model polysemous words, whose meanings vary across linguistic contexts. J.R. Firth, a pioneer of distributional semantics, said:
The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously.
Although Word2Vec takes context words into account, it assigns one vector representation to each word, regardless of how the word is used. So although it does a great job of learning semantic similarity (hot-cold, computer-tv, hand-foot), it isn't as adept at learning semantic relatedness (hot-sun, computer-database, hand-ring) because of its one-embedding-per-word architecture.
This distinction is important. Take the pair cup-bottle versus necklace-neck. The first pair of words is semantically similar: they are roughly interchangeable and similar in function. The second pair, however, is merely semantically related; the words share a common theme but are neither interchangeable nor functionally similar. If, for example, I'm interested in finding conceptual similarities for partying, then cup and bottle would likely both make the cut. But if I'm interested in concepts similar to money, then although necklace may contribute, neck likely will not, because it's not semantically similar. Choosing a word-embedding architecture that supports multiple vector representations per word, depending on context, could likely improve the results of CFIDF.
Asr et al., in Querying Word Embeddings for Similarity and Relatedness, said:
It may be unrealistic to expect a single vector representation to account for qualitatively distinct similarity and relatedness data.
They went on to empirically show that although word embeddings are best for finding semantic similarities, contextualized word embeddings are better indicators of semantic relatedness.
ELMo, as opposed to Word2Vec, uses a language model (predicting the next word in a sequence of words; more on language models here) to solve our two problems: word context (word-sense disambiguation) and linguistic context (polysemy modeling). It does this by producing an embedding for a word based on the context in which it appears, generating a slightly different embedding for each of a word's occurrences, thus better disambiguating each word and providing more context to our text data!
Therefore, for better CFIDF performance, try deep contextualized word representations (my next step is to use ELMo embeddings to produce CFIDF scores).