Using Bigram Paragraph Vectors for Concept Detection
Recently, I was working on a project at work using paragraph vectors (with gensim's `Doc2Vec` model) and noticed that the `Doc2Vec` model didn't natively interact well with gensim's `Phrases` class, and there was no easy workaround (that I noticed). I saw <a href="http://lmgtfy.com/?q=bigrams+doc2vec+gensim">very little activity</a> around the interwebs about using bigrams with paragraph vectors, which I thought was surprising, since paragraph vectors can be much more illuminating than word vectors, especially when trying to disambiguate the various meanings of a given text. This is the main reason I was looking to move from bigram word vectors to bigram *paragraph* vectors.

So I decided to take a look at <a href="https://github.com/RaRe-Technologies/gensim">gensim's source code</a> and incorporate this interaction into its API. With this commit, you can build paragraph vectors with unigrams *and* bigrams by passing just one additional argument to the `Phrases` class. If you want to dig into the code I added, you can find it <a href="https://github.com/RaRe-Technologies/gensim/pull/2158/files">on github</a>. Here, I'll just explain how to use this new code to detect concepts, something I have to do quite often.

First, let's create a helper class to stream in our documents, much like we did in my <a href="https://tmthyjames.github.io/posts/Analyzing-Rap-Lyrics-Using-Word-Vectors/">previous post</a> on analyzing rap lyrics with word vectors. Since the `Doc2Vec` model accepts as input a `TaggedDocument` object, that's what we'll `yield` from our `__iter__` method.
```python
from gensim.models.doc2vec import TaggedDocument
import nltk, csv

class Sentences(object):

    def __init__(self, filename=None, col=None, stopwords=None, ID=None):
        self.filename = filename
        self.col = col
        self.stopwords = stopwords
        self.ID = ID

    def get_tokens(self, text):
        """Helper function for tokenizing the data; stemming and
        lemmatizing are ignored here for simplicity"""
        return [r.lower() for r in text.split()]

    def __iter__(self):
        reader = csv.DictReader(open(self.filename, 'r'))
        for row in reader:
            if not row[self.col]:
                continue
            words = self.get_tokens(row[self.col])
            tags = ['%s' % (row[self.ID].strip())]
            yield TaggedDocument(words=words, tags=tags)
```

Now let's initialize a `Sentences` object and pass in the `id` column so that we can tag each document with an identifier.
```python
sentences = Sentences(
    filename='rap-lyrics.csv',  # our filename
    col='lyric',                # the text field
    ID='id',                    # our ID column for document tagging
    stopwords=nltk.corpus.stopwords.words('english')  # default stopwords from NLTK
)
```

Prior to this commit, you had to pass a list of strings to the `Phrases` class in gensim, which is fine if you pass the output of `Phrases` to the `Word2Vec` model, since that's exactly what `Word2Vec` expects. But `Doc2Vec` expects a list (or any iterable) of `TaggedDocument` objects. This was really the only thing preventing the easy use of bigrams with `Doc2Vec`. So let's initialize a `Phrases` object and pass our `doc2vec` argument in.
```python
from gensim.models.phrases import Phrases

phrases = Phrases(sentences, doc2vec=True)
```

To those familiar with gensim's API, the only difference in the workflow is that now you must set `doc2vec=True` when initializing the `Phrases` object. That's it! **Now you can leverage the insight of bigrams while harnessing the totality of document-level vectors**. The following is the code I used to train a `Word2Vec` model using bigrams (phrases), but with `Word2Vec` replaced by `Doc2Vec`. Previously, it would error out because the two weren't compatible. Now they interact seamlessly.
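To make the mechanics concrete, here's a minimal, self-contained sketch (not gensim's actual implementation) of what the `doc2vec`-aware path has to do: apply the learned bigram merges to each document's `words` while passing its `tags` through untouched. The `TaggedDocument` stand-in and both helper functions are hypothetical names used only for this illustration.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument, just for illustration
TaggedDocument = namedtuple('TaggedDocument', ['words', 'tags'])

def merge_bigrams(words, known_bigrams):
    """Greedily join adjacent tokens that form a known bigram,
    e.g. ['new', 'york'] -> ['new_york']."""
    out, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in known_bigrams:
            out.append(words[i] + '_' + words[i + 1])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out

def phrase_tagged_docs(tagged_docs, known_bigrams):
    """Bigram each TaggedDocument's words while preserving its tags --
    roughly the transformation that the doc2vec flag enables."""
    for doc in tagged_docs:
        yield TaggedDocument(words=merge_bigrams(doc.words, known_bigrams),
                             tags=doc.tags)
```

The key point is the last line: the document identifier survives the phrasing step, which is exactly what a plain list-of-tokens pipeline loses.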
```python
from gensim.models.doc2vec import Doc2Vec

phrase2vec = Doc2Vec(
    workers=5,
    window=20,
    alpha=0.025,
    min_alpha=0.025,
    min_count=1,
)

phrase2vec.build_vocab(phrases[sentences])

for epoch in range(3):
    phrase2vec.train(phrases[sentences],
                     total_examples=phrase2vec.corpus_count,
                     epochs=phrase2vec.iter)
    phrase2vec.alpha -= 0.002                # decrease the learning rate
    phrase2vec.min_alpha = phrase2vec.alpha  # fix the learning rate, no decay

phrase2vec.save('rap-lyrics3.bigrams.doc2vec')
```

Now, let's use these bigram paragraph vectors to detect various concepts. We'll use rap lyrics since I have those on my machine already.
```python
import pandas as pd

lyrics = pd.read_csv('rap-lyrics.csv')  # the same file we streamed from above

concepts = ['weed', 'mom', 'religion', 'party', 'spanish', 'politics']

for concept in concepts:
    query = concept.split()
    scores = pd.DataFrame(
        phrase2vec.docvecs.most_similar(
            phrase2vec[query],
            topn=len(lyrics)
        ),
        columns=['id', concept + '_concept']
    )
    scores['id'] = scores['id'].astype(int)
    lyrics = lyrics.merge(scores, on=['id'])

lyrics.sort_values('politics_concept', ascending=False).head()
```

Our model does a pretty good job of detecting very high-level concepts, as you can see with our politics example. I won't show the entire songs since most are NSFW, but you can look these songs up and see for yourself. For the rap connoisseurs out there, you can tell immediately from the artists that these are artists who wear their politics on their sleeves, especially Talib Kweli and Immortal Technique.
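Under the hood, `most_similar` is just ranking document vectors by cosine similarity to the concept's vector. Here's a tiny sketch of that scoring step using plain numpy; the function names are mine, not gensim's.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_documents(concept_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted by similarity to the concept
    vector -- the ranking most_similar performs for us."""
    scores = {doc_id: cosine_similarity(concept_vec, vec)
              for doc_id, vec in doc_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A document whose vector points in nearly the same direction as the concept vector scores near 1, which is why the top of each `_concept` column surfaces the songs most "about" that concept.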