Using Bigram Paragraph Vectors for Concept Detection
Recently, I was working on a project using paragraph vectors at work (with gensim's `Doc2Vec` model) and noticed that the `Doc2Vec` model didn't natively interact well with its `Phrases` class, and there was no easy workaround (that I noticed). I saw <a href="http://lmgtfy.com/?q=bigrams+doc2vec+gensim">very little activity</a> around the interwebs about using bigrams with paragraph vectors, which surprised me since paragraph vectors can be much more illuminating than word vectors, especially when trying to disambiguate the various meanings of a given text. This is the main reason I was looking to move from bigram word vectors to bigram *paragraph* vectors.
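To make the mismatch concrete, here's a minimal sketch with hypothetical toy data showing where the two classes talk past each other:

```python
from gensim.models.phrases import Phrases
from gensim.models.doc2vec import TaggedDocument

# Hypothetical toy corpus: Phrases learns and applies bigrams on
# plain lists of tokens...
token_lists = [['new', 'york', 'state', 'of', 'mind'],
               ['new', 'york', 'new', 'york']]
bigrams = Phrases(token_lists, min_count=1, threshold=1)
print(bigrams[token_lists[0]]) # detected bigrams (if any) come back joined, e.g. 'new_york'

# ...but Doc2Vec consumes TaggedDocument objects, which the stock
# Phrases transform doesn't know how to handle
docs = [TaggedDocument(words=w, tags=[str(i)]) for i, w in enumerate(token_lists)]
```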
So I decided to take a look at <a href="https://github.com/RaRe-Technologies/gensim">gensim's source code</a> and incorporate this interaction into its API. With this commit, you can build paragraph vectors with unigrams *and* bigrams simply by passing an additional argument to the `Phrases` class. If you want to dig into the code I added, you can find it <a href="https://github.com/RaRe-Technologies/gensim/pull/2158/files">on github</a>. Here, I'll just explain how to use this new code to detect concepts, something I have to do quite often.
First, let's create a helper class to stream in our documents, much like we did in my <a href="https://tmthyjames.github.io/posts/Analyzing-Rap-Lyrics-Using-Word-Vectors/">previous post</a> on analyzing rap lyrics with word vectors. Since the `Doc2Vec` model accepts as input a `TaggedDocument` object, that's what we'll `yield` from our `__iter__` method.
```python
from gensim.models.doc2vec import TaggedDocument
import nltk, csv

class Sentences(object):

    def __init__(self, filename=None, col=None,
                 stopwords=None, ID=None):
        self.filename = filename
        self.col = col
        self.stopwords = stopwords
        self.ID = ID

    def get_tokens(self, text):
        """Helper method for tokenizing the data;
        ignoring stemming and lemmatizing here for simplicity"""
        return [r.lower() for r in text.split()]

    def __iter__(self):
        reader = csv.DictReader(open(self.filename, 'r'))
        for row in reader:
            if not row[self.col]: continue
            words = self.get_tokens(row[self.col])
            tags = ['%s' % row[self.ID].strip()]
            yield TaggedDocument(words=words, tags=tags)
```
Now let's initialize a `Sentences` object and pass in the `id` column so that we can tag each document with an identifier.
```python
sentences = Sentences(
    filename='rap-lyrics.csv', # our filename
    col='lyric', # the text field
    ID='id', # our ID column for document tagging
    stopwords=nltk.corpus.stopwords.words('english') # default stopwords from NLTK
)
```
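As a quick sanity check (assuming `rap-lyrics.csv` is in place), we can peek at the first `TaggedDocument` the stream yields:

```python
# Grab the first TaggedDocument from the stream
first = next(iter(sentences))
print(first.words[:10]) # the first ten tokens of the lyric
print(first.tags)       # the song's id, e.g. ['1']
```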
Prior to this commit, you had to pass plain lists of strings (tokenized sentences) to the `Phrases` class in gensim, which is fine if you pass the output of `Phrases` to the `Word2Vec` model, since that's exactly what the `Word2Vec` model expects. But `Doc2Vec` expects a list (or any iterable) of `TaggedDocument` objects. This is really the only thing preventing the easy use of bigrams with `Doc2Vec`. So let's initialize a `Phrases` object and pass our `doc2vec` argument in.
```python
from gensim.models.phrases import Phrases

phrases = Phrases(sentences, doc2vec=True)
```
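With the patched `Phrases`, iterating over `phrases[sentences]` should still yield `TaggedDocument` objects, now with detected bigrams joined by underscores in their `words`. A quick way to eyeball this:

```python
# Inspect the first transformed document: bigrams like 'hip_hop'
# appear as single tokens, and the document tags are preserved
for doc in phrases[sentences]:
    print(doc.words[:10], doc.tags)
    break
```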
To those familiar with gensim's API, the only difference in the workflow is that now you must set `doc2vec=True` when initializing the `Phrases` object. That's it! **Now you can leverage the insight of bigrams while harnessing the totality of document-level vectors**. The following is the code I previously used to train a `Word2Vec` model using bigrams (phrases), with `Word2Vec` simply replaced by `Doc2Vec`. Previously, it would error out because the two weren't compatible. Now they interact seamlessly.
```python
from gensim.models.doc2vec import Doc2Vec

phrase2vec = Doc2Vec(
    workers=5,
    window=20,
    alpha=0.025,
    min_alpha=0.025,
    min_count=1
)

phrase2vec.build_vocab(phrases[sentences])

for epoch in range(3):
    phrase2vec.train(phrases[sentences], total_examples=phrase2vec.corpus_count, epochs=phrase2vec.iter)
    phrase2vec.alpha -= 0.002 # decrease the learning rate
    phrase2vec.min_alpha = phrase2vec.alpha # fix the learning rate, no decay

phrase2vec.save('rap-lyrics3.bigrams.doc2vec')
```
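Once saved, the model can be reloaded later with gensim's standard `load` method, so you don't have to retrain every time:

```python
from gensim.models.doc2vec import Doc2Vec

# Reload the trained bigram paragraph-vector model from disk
phrase2vec = Doc2Vec.load('rap-lyrics3.bigrams.doc2vec')
```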
Now, let's use these bigram paragraph vectors to detect various concepts. We'll use rap lyrics since I have those on my machine already.
```python
import pandas as pd

lyrics = pd.read_csv('rap-lyrics.csv') # the same csv we streamed above, as a DataFrame

concepts = ['weed', 'mom', 'religion', 'party', 'spanish', 'politics']

for concept in concepts:
    query = concept.split()
    scores = pd.DataFrame(
        phrase2vec.docvecs.most_similar(
            phrase2vec[query],
            topn=len(lyrics)
        ),
        columns=['id', concept + '_concept']
    )
    scores['id'] = scores['id'].astype(int)
    lyrics = lyrics.merge(scores, on=['id'])

lyrics.sort_values('politics_concept', ascending=False).head()
```
*(Output: the top five songs ranked by their `politics_concept` score.)*
Our model does a pretty good job of detecting very high-level concepts, as you can see with our politics example. I won't show the entire songs since most are NSFW, but you can look these songs up and see for yourself. For the rap connoisseurs out there, you can tell immediately that these are artists who wear their politics on their sleeves, especially Talib Kweli and Immortal Technique.