Recently, I was working on a project at work using paragraph vectors (with gensim's Doc2Vec model) and noticed that Doc2Vec didn't natively interact well with gensim's Phrases class, and I couldn't find an easy workaround. I saw very little activity around the interwebs about using bigrams with paragraph vectors, which surprised me, since paragraph vectors can be much more illuminating than word vectors, especially when trying to disambiguate the various meanings of a given text. This is the main reason I was looking to move from bigram word vectors to bigram paragraph vectors.
So I decided to take a look at gensim's source code and incorporate this interaction into its API. With this commit, you can build paragraph vectors with unigrams and bigrams just by passing one additional argument to the Phrases class. If you want to dig into the code I added, you can find it on GitHub. Here, I'll just explain how to use this new code to detect concepts, something I have to do quite often.
First, let's create a helper class to stream in our documents, much like we did in my previous post on analyzing rap lyrics with word vectors. Since the Doc2Vec model accepts TaggedDocument objects as input, that's what our helper class will yield.
Now let's initialize a Sentences object and pass in the id column so that we can tag each document with an identifier.
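Here is a minimal sketch of such a helper class and its initialization. It assumes the documents arrive as a list of dicts with hypothetical id and lyrics keys; the data layout, the parameter names, and the whitespace tokenization are all assumptions for illustration, not the original code. The fallback namedtuple only mirrors the words and tags fields of TaggedDocument so the sketch still runs without gensim installed.

```python
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same two fields, used only if gensim is unavailable.
    TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


class Sentences:
    """Stream records as TaggedDocument objects, one per document."""

    def __init__(self, records, text_key="lyrics", id_key="id"):
        # records: any re-iterable collection of dicts (hypothetical layout).
        self.records = records
        self.text_key = text_key
        self.id_key = id_key

    def __iter__(self):
        for record in self.records:
            # Naive lowercase/whitespace tokenization, purely for illustration.
            words = record[self.text_key].lower().split()
            yield TaggedDocument(words=words, tags=[record[self.id_key]])


# Initialize the streamer and tell it which field holds the identifier.
songs = [
    {"id": 0, "lyrics": "You have to speak the truth"},
    {"id": 1, "lyrics": "I wanna talk to the mayor"},
]
sentences = Sentences(songs, text_key="lyrics", id_key="id")
```

Because the tagging happens inside __iter__, the object can be iterated more than once, which matters for models that make several passes over the corpus.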
Before this commit, you had to pass a list of strings to the Phrases class in gensim, which is fine if you pass the output of Phrases to the Word2Vec model, since that's exactly what Word2Vec expects. But Doc2Vec expects a list (or any iterable) of TaggedDocument objects. This is really the only thing preventing the easy use of bigrams with Doc2Vec. So let's initialize a Phrases object and pass in our doc2vec argument.
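Since the doc2vec argument comes from the commit described here and may not exist in whatever gensim version you have installed, the sketch below shows the underlying idea in plain Python instead: pull the words out of each tagged document, merge the bigrams, and wrap the result back up with the original tags. The merge_bigrams function is a toy stand-in for a trained Phrases model (it uses a hand-picked set of bigrams), and both function names are hypothetical.

```python
from collections import namedtuple

# Stand-in mirroring the words/tags fields of gensim's TaggedDocument.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def merge_bigrams(words, bigrams, delimiter="_"):
    """Join adjacent token pairs listed in `bigrams` into single tokens."""
    merged, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in bigrams:
            merged.append(words[i] + delimiter + words[i + 1])
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(words[i])
            i += 1
    return merged


def phrase_tagged_docs(tagged_docs, bigrams):
    """Apply bigram merging to each document's words, keeping its tags."""
    for doc in tagged_docs:
        yield TaggedDocument(merge_bigrams(doc.words, bigrams), doc.tags)


docs = [TaggedDocument(["new", "york", "city", "is", "loud"], [0])]
phrased = list(phrase_tagged_docs(docs, {("new", "york")}))
```

The point of the design is that the tags never pass through the phrase detector at all, so the output is still something a document-level model can consume.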
To those familiar with gensim's API, the only difference in the workflow is that you must now set doc2vec=True when initializing the Phrases object. That's it! Now you can leverage the insight of bigrams while harnessing the totality of document-level vectors. The following is the code I used to train a Word2Vec model using bigrams (phrases), just with Word2Vec replaced by Doc2Vec. Previously, it would error out because the two weren't compatible. Now they interact seamlessly.
Now, let's use these bigram paragraph vectors to detect various concepts. We'll use rap lyrics since I have those on my machine already.
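One common way to score documents against a concept is cosine similarity between a concept vector and each document vector, sorted from highest to lowest. The sketch below assumes that approach, which may differ from the exact scoring used for the results that follow; the function names and the tiny two-dimensional vectors are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_concept(concept_vec, doc_vecs, top_n=5):
    """Return the top_n (doc_id, score) pairs most similar to the concept."""
    scored = [(doc_id, cosine(concept_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy vectors; in practice these would come from the trained model.
doc_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranked = rank_by_concept([1.0, 0.0], doc_vecs, top_n=2)
```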
|10036||10036||The Wormhole||2013.0||Gravitas (2013)||Talib_Kweli||symbologists at the college oh you n****s wann...||-0.011319||0.059534||0.451099||0.049866||0.076093||0.462608|
|8963||8963||Speak Your Mind (Hidden Track)||2001.0||Revolutionary Vol. 1 (2001)||Immortal_Technique||you have to speak the truth you have to speak ...||0.047033||0.066683||0.477586||-0.000200||0.032641||0.461523|
|4663||4663||I Want to Talk to You||1999.0||I Am (1999)||Nas||chorus repeat 2x i wanna talk to the mayor the...||-0.112713||0.040988||0.299852||-0.011642||0.187577||0.460955|
|8861||8861||Solidified||2003.0||Free Agents: The Murda Mixtape (2003)||Mobb_Deep||prodigy yeah you know the shit dont stop never...||-0.083604||0.032377||0.175557||-0.128048||0.116116||0.431326|
|7243||7243||Open Your Eyes||2008.0||The 3rd World (2008)||Immortal_Technique||were here because of you were here because you...||-0.035662||0.011329||0.379785||-0.144915||0.059664||0.424974|
Our model does a pretty good job of detecting very high-level concepts, as you can see with our politics example. I won't show the entire songs since most are NSFW, but you can look these songs up and see for yourself. For the rap connoisseurs out there, you can tell immediately that these are artists who wear their politics on their sleeves, especially Talib Kweli and Immortal Technique.