Using Bigram Paragraph Vectors for Concept Detection
Recently, I was working on a project using paragraph vectors at work (with gensim's `Doc2Vec` model) and noticed that the `Doc2Vec` model didn't natively interact well with its `Phrases` class, and there was no easy workaround (that I noticed). I saw <a href="http://lmgtfy.com/?q=bigrams+doc2vec+gensim">very little activity</a> around the interwebs about using bigrams with paragraph vectors, which surprised me since paragraph vectors can be much more illuminating than word vectors, especially when trying to disambiguate the various meanings of a given text. This is the main reason I was looking to move from bigram word vectors to bigram *paragraph* vectors.
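To make the mismatch concrete, here's a minimal sketch with hypothetical toy data showing where the two classes talk past each other:

```python
from gensim.models.phrases import Phrases
from gensim.models.doc2vec import TaggedDocument

# Hypothetical toy corpus: Phrases learns and applies bigrams on
# plain lists of tokens...
token_lists = [['new', 'york', 'state', 'of', 'mind'],
               ['new', 'york', 'new', 'york']]
bigrams = Phrases(token_lists, min_count=1, threshold=1)
print(bigrams[token_lists[0]]) # detected bigrams (if any) come back joined, e.g. 'new_york'

# ...but Doc2Vec consumes TaggedDocument objects, which the stock
# Phrases transform doesn't know how to handle
docs = [TaggedDocument(words=w, tags=[str(i)]) for i, w in enumerate(token_lists)]
```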
So I decided to take a look at <a href="https://github.com/RaRe-Technologies/gensim">gensim's source code</a> and incorporate this interaction into its API. With this commit, you can build paragraph vectors with unigrams *and* bigrams simply by passing an additional argument to the `Phrases` class. If you want to dig into the code I added, you can find it <a href="https://github.com/RaRe-Technologies/gensim/pull/2158/files">on github</a>. Here, I'll just explain how to use this new code to detect concepts, something I have to do quite often.
First, let's create a helper class to stream in our documents, much like we did in my <a href="https://tmthyjames.github.io/posts/Analyzing-Rap-Lyrics-Using-Word-Vectors/">previous post</a> on analyzing rap lyrics with word vectors. Since the `Doc2Vec` model accepts as input a `TaggedDocument` object, that's what we'll `yield` from our `__iter__` method.
```python
from gensim.models.doc2vec import TaggedDocument
import nltk, csv

class Sentences(object):

    def __init__(self, filename=None, col=None,
                 stopwords=None, ID=None):
        self.filename = filename
        self.col = col
        self.stopwords = stopwords
        self.ID = ID

    def get_tokens(self, text):
        """Helper method for tokenizing the data;
        ignoring stemming and lemmatizing here for simplicity"""
        return [r.lower() for r in text.split()]

    def __iter__(self):
        reader = csv.DictReader(open(self.filename, 'r'))
        for row in reader:
            if not row[self.col]: continue
            words = self.get_tokens(row[self.col])
            tags = ['%s' % row[self.ID].strip()]
            yield TaggedDocument(words=words, tags=tags)
```
Now let's initialize a `Sentences` object and pass in the `id` column so that we can tag each document with an identifier.
```python
sentences = Sentences(
    filename='rap-lyrics.csv', # our filename
    col='lyric', # the text field
    ID='id', # our ID column for document tagging
    stopwords=nltk.corpus.stopwords.words('english') # default stopwords from NLTK
)
```
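As a quick sanity check (assuming `rap-lyrics.csv` is in place), we can peek at the first `TaggedDocument` the stream yields:

```python
# Grab the first TaggedDocument from the stream
first = next(iter(sentences))
print(first.words[:10]) # the first ten tokens of the lyric
print(first.tags)       # the song's id, e.g. ['1']
```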
Prior to this commit, you had to pass plain lists of strings (tokenized sentences) to the `Phrases` class in gensim, which is fine if you pass the output of `Phrases` to the `Word2Vec` model, since that's exactly what the `Word2Vec` model expects. But `Doc2Vec` expects a list (or any iterable) of `TaggedDocument` objects. This is really the only thing preventing the easy use of bigrams with `Doc2Vec`. So let's initialize a `Phrases` object and pass our `doc2vec` argument in.
```python
from gensim.models.phrases import Phrases

phrases = Phrases(sentences, doc2vec=True)
```
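With the patched `Phrases`, iterating over `phrases[sentences]` should still yield `TaggedDocument` objects, now with detected bigrams joined by underscores in their `words`. A quick way to eyeball this:

```python
# Inspect the first transformed document: bigrams like 'hip_hop'
# appear as single tokens, and the document tags are preserved
for doc in phrases[sentences]:
    print(doc.words[:10], doc.tags)
    break
```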
To those familiar with gensim's API, the only difference in the workflow is that now you must set `doc2vec=True` when initializing the `Phrases` object. That's it! **Now you can leverage the insight of bigrams while harnessing the totality of document-level vectors**. The following is the code I previously used to train a `Word2Vec` model using bigrams (phrases), with `Word2Vec` simply replaced by `Doc2Vec`. Previously, it would error out because the two weren't compatible. Now they interact seamlessly.
```python
from gensim.models.doc2vec import Doc2Vec

phrase2vec = Doc2Vec(
    workers=5,
    window=20,
    alpha=0.025,
    min_alpha=0.025,
    min_count=1
)

phrase2vec.build_vocab(phrases[sentences])

for epoch in range(3):
    phrase2vec.train(phrases[sentences], total_examples=phrase2vec.corpus_count, epochs=phrase2vec.iter)
    phrase2vec.alpha -= 0.002 # decrease the learning rate
    phrase2vec.min_alpha = phrase2vec.alpha # fix the learning rate, no decay

phrase2vec.save('rap-lyrics3.bigrams.doc2vec')
```
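Once saved, the model can be reloaded later with gensim's standard `load` method, so you don't have to retrain every time:

```python
from gensim.models.doc2vec import Doc2Vec

# Reload the trained bigram paragraph-vector model from disk
phrase2vec = Doc2Vec.load('rap-lyrics3.bigrams.doc2vec')
```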
Now, let's use these bigram paragraph vectors to detect various concepts. We'll use rap lyrics since I have those on my machine already.
```python
import pandas as pd

lyrics = pd.read_csv('rap-lyrics.csv') # the same csv we streamed above, as a DataFrame

concepts = ['weed', 'mom', 'religion', 'party', 'spanish', 'politics']

for concept in concepts:
    query = concept.split()
    scores = pd.DataFrame(
        phrase2vec.docvecs.most_similar(
            phrase2vec[query],
            topn=len(lyrics)
        ),
        columns=['id', concept + '_concept']
    )
    scores['id'] = scores['id'].astype(int)
    lyrics = lyrics.merge(scores, on=['id'])

lyrics.sort_values('politics_concept', ascending=False).head()
```
*(Output: the top five songs ranked by their `politics_concept` score.)*
Our model does a pretty good job of detecting very high-level concepts, as you can see with our politics example. I won't show the entire songs since most are NSFW, but you can look these songs up and see for yourself. For the rap connoisseurs out there, you can tell immediately that these are artists who wear their politics on their sleeves, especially Talib Kweli and Immortal Technique.