Doc2Vec is an extension of Word2Vec, an algorithm that employs a shallow neural network to map words to vectors in a vector space; these vectors are called word vectors (or word embeddings). Whereas Word2Vec produces word vectors, letting you run similarity queries between words, Doc2Vec produces document vectors, letting you run similarity queries on whole sentences, paragraphs, or documents. Finding semantic similarities rests on the distributional hypothesis, which states that words appearing in the same contexts tend to share similar meanings. Or, as the English linguist J. R. Firth put it, "a word is characterized by the company it keeps".
My aim for this post isn't to cover the theory or math behind Doc2Vec but to show its power. For a deeper overview of Doc2Vec, see here.
To get all the lyrics for the top 100 rappers, we'll use Cypher, a new Python library I recently released for retrieving music lyrics (to install: pip install thecypher). But first, we need a list of the top 100 rappers. For this, I just Googled "top rappers" and got a hit from ranker.com. This will suffice, although I don't think the list is perfect. Luckily, they source the data from an API, so we don't have to screen-scrape! Here's the code to get this list:
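(A sketch: the ranker.com endpoint and response fields aren't shown here, so the URL and JSON keys below are assumptions to verify in the browser's network tab.)

```python
import requests

# Assumed endpoint: the list page loads its items from a JSON API -- replace
# the placeholder list id with the one you see in the browser's network tab.
LIST_API = "https://api.ranker.com/lists/<list-id>/items?limit=100"

def get_top_rappers():
    response = requests.get(LIST_API)
    response.raise_for_status()
    items = response.json().get("listItems", [])   # assumed response key
    return [item["name"] for item in items]        # assumed item field

rappers = get_top_rappers()
```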
['Tupac', 'Eminem', 'The Notorious B.I.G.', 'Nas', 'Ice Cube', 'Jay-Z', 'Snoop Dogg', 'Dr. Dre', 'Kendrick Lamar', 'Rakim', 'André 3000', 'Eazy-E', 'Kanye West', '50 Cent', 'DMX', 'Busta Rhymes', 'Method Man', 'J. Cole', 'Mos Def', 'Ludacris', 'KRS-One', 'LL Cool J', 'Lil Wayne', 'Common', 'Big L', 'Ghostface Killah', 'Redman', 'T.I.', 'Big Pun', 'Nate Dogg', 'Tech N9ne', 'Lauryn Hill', 'Scarface', 'Slick Rick', 'Raekwon', 'Big Daddy Kane', "Ol' Dirty Bastard", 'The Game', 'Mobb Deep', 'Logic', 'Chance the Rapper', 'Cypress Hill', 'Ice-T', 'Lupe Fiasco', 'RZA', 'GZA', 'Q-Tip', 'Warren G', 'Talib Kweli', 'Xzibit', 'Missy Elliott', 'ASAP Rocky', 'Joey Badass', 'Immortal Technique', 'Twista', 'Big Sean', 'Kid Cudi', 'Big Boi', 'Chuck D', 'Donald Glover', 'Drake', 'Wiz Khalifa', 'Eric B. & Rakim', 'Schoolboy Q', 'DMC', 'Nelly', 'Hopsin', 'D12', 'Jadakiss', 'Tyler, the Creator', 'Kurupt', 'Grandmaster Flash and the Furious Five', 'Gang Starr', 'Too $hort', 'Royce da 5'9"', 'MC Ren', 'E-40', 'Pusha T', 'Coolio', 'De La Soul', 'Proof', 'Bad Meets Evil', 'Guru', 'Will Smith', 'Krayzie Bone', 'Black Thought', 'B.o.B', 'AZ', 'Yelawolf', 'The Sugarhill Gang', 'Earl Sweatshirt', 'Fabolous', 'Mac Miller', 'Fat Joe', 'Young Jeezy', 'Kool G Rap', 'Bizzy Bone', 'Queen Latifah', 'Prodigy', '2 Chainz']
To use Cypher to retrieve these lyrics, we'll loop over the list and run thecypher.get_lyrics on each artist. The following will call get_lyrics for each artist and then convert the results to a pandas DataFrame:
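(A sketch; thecypher.get_lyrics(artist)'s exact return shape isn't spelled out in the post, so this assumes it returns a list of records, one per lyric line.)

```python
import pandas as pd
import thecypher

records = []
for artist in rappers:
    # one record per lyric line for this artist
    records.extend(thecypher.get_lyrics(artist))

lyrics_df = pd.DataFrame(records)
lyrics_df.head()
```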
| | album | artist | genre | id | lyric | song | year |
|---|---|---|---|---|---|---|---|
| 0 | Infinite (1996) | Eminem | Hip_Hop | 14201 | Oh yeah, this is Eminem baby, back up in that motherfucking ass | Infinite | 1996 |
| 1 | Infinite (1996) | Eminem | Hip_Hop | 14202 | One time for your mother fucking mind, we represent the 313 | Infinite | 1996 |
| 2 | Infinite (1996) | Eminem | Hip_Hop | 14203 | You know what I'm saying?, 'cause they don't know shit about this | Infinite | 1996 |
| 3 | Infinite (1996) | Eminem | Hip_Hop | 14204 | For the 9-6 | Infinite | 1996 |
| 4 | Infinite (1996) | Eminem | Hip_Hop | 14205 | Ayo, my pen and paper cause a chain reaction | Infinite | 1996 |
By default, the data is delivered with one lyric per row. The following code will convert it to one song per row:
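(A sketch using the column names from the table above; the song_id column added at the end is my own addition, so every song gets an identifier we can use as a Doc2Vec tag later.)

```python
# join the individual lyric lines (kept in order by id) into one string per song
group_cols = ['year', 'album', 'genre', 'artist', 'song']

songs_df = (
    lyrics_df
    .sort_values('id')
    .groupby(group_cols)['lyric']
    .apply(' '.join)
    .reset_index()
    .rename(columns={'lyric': 'lyrics'})
)

# hypothetical identifier used as the document tag later on
songs_df['song_id'] = songs_df.index.astype(str)
```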
| | song_id | year | album | genre | artist | lyrics |
|---|---|---|---|---|---|---|
| 0 | 313 | 1996 | Infinite (1996) | Hip_Hop | Eminem | Eye-Kyu: Now what you know about a sweet MC, from the 313 None of these skills you bout to see come free So you wanna be a sweet MC, you gotta become me If you ever wanna be one see Eminem: Man what you know about a sweet MC, in the 313 None of these skills you bout to see come free So you wanna be a sweet MC, you better become me If you ever wanna be one see Verse 1: Eye-Kyu Yo some people say I'm whack, now if that's right I'm the freshest whack MC that you ever heard, in your lifetime My slick accapella sounds clever with the beats Boy I'm the deepest thing since potholes to ever hit the streets Forgot a gold digger's succubus, my souls thick with ruggedness With the mic.... |
Next, we need to load the data. Doc2Vec requires a lot of memory, so we'll create an iterator so that the entire corpus doesn't have to be loaded into memory at once. Instead, we load one document at a time, train the model on it, then discard it and move on to the next document. We could also stream this data from a database if we wanted. Here's how you stream the data from a file:
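(The post's actual class isn't reproduced here, so this sketch makes a few assumptions: the one-song-per-row data was saved to a CSV with lyrics and song_id columns, tokenization uses gensim's simple_preprocess, and each song's song_id doubles as its document tag. It also needs nltk's WordNet data, via nltk.download('wordnet').)

```python
import csv

from gensim.models.doc2vec import TaggedDocument
from gensim.utils import simple_preprocess
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

class Sentence:
    """Stream one TaggedDocument per song from a CSV file so the whole
    corpus never has to sit in memory at once."""

    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename, newline='', encoding='utf-8') as f:
            for row in csv.DictReader(f):
                # lemmatize each token so inflected forms ('cars') collapse
                # into a single form ('car')
                words = [wnl.lemmatize(token)
                         for token in simple_preprocess(row['lyrics'])]
                # tag the document with the song's identifier
                yield TaggedDocument(words=words, tags=[row['song_id']])
```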
A couple of things to note. First, the Doc2Vec model accepts a list of TaggedDocument elements, which allows us to identify each song. Second, we use wnl.lemmatize as part of our tokenization so we can group together the inflected forms of a word and analyze them as a single word. For instance, wnl.lemmatize will convert 'cars' into 'car'.
To initialize our Sentence object, we do the following:
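(A sketch; the filename is a placeholder, and this assumes the one-song-per-row DataFrame was written to CSV first.)

```python
songs_df.to_csv('songs.csv', index=False)

sentences = Sentence('songs.csv')
```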
To initialize our Doc2Vec model, we'll do the following:
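(A sketch; these values are placeholders rather than the post's exact settings, and newer gensim versions rename size to vector_size and iter to epochs.)

```python
from gensim.models import Doc2Vec

model = Doc2Vec(alpha=0.025,      # initial learning rate
                min_alpha=0.001,  # floor the learning rate can decay to
                workers=4,        # training threads
                min_count=5,      # ignore words rarer than this
                window=5,         # context words on either side
                size=300,         # dimensionality of the vectors
                iter=20,          # passes over the corpus per train() call
                sample=1e-4,      # downsampling threshold for frequent words
                negative=5)       # negative sampling rate
```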
Let's go over each argument:
• alpha is the initial learning rate. A very intuitive explanation of the learning rate can be found here; essentially, as stated in the link, it is "how quickly a network abandons old beliefs for new ones."
• min_alpha is exactly what it sounds like: the minimum the learning rate can drop to, which we reduce toward after every epoch.
• workers is the number of threads used to train the model.
• min_count is the minimum term frequency a word must have to be considered by the model.
• window is how many words in front of and behind the input word are considered when determining context.
• size is the dimensionality of the feature vectors. Unlike typical tabular datasets with a handful of columns, text embeddings commonly use hundreds of dimensions.
• iter is the number of iterations: how many times the training set passes through the algorithm.
• sample is the downsampling threshold. Words more frequent than this are eligible for downsampling.
• negative is the negative sampling rate; 0 means update all weights in the output layer of the neural network.
Now we'll build our vocabulary and train our model. We'll train our model for 10 epochs. To understand epochs and how they differ from iterations (from above), check out this StackOverflow post. Namely this answer:
In the neural network terminology:
one epoch = one forward pass and one backward pass of all the training examples
batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.
number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).
We use multiple epochs because neural networks typically require an iterative optimization method to produce good results, which usually means several passes over the data.
After each epoch, we'll decrease the learning rate (known as learning rate decay). This is to help speed up our training. For more on learning rate decay and the intuition behind it, see Andrew Ng's video on the subject.
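Putting that together, here's a sketch of the vocabulary build and the manual 10-epoch loop with per-epoch learning rate decay (the decay step of 0.002 is an assumption):

```python
model.build_vocab(sentences)

for epoch in range(10):
    print('training epoch', epoch)
    model.train(sentences,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # learning rate decay: lower alpha after every epoch
    model.alpha -= 0.002
    model.min_alpha = model.alpha   # stop train() from decaying it further
```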
To persist our model so we can use it later without training it again, we'll use model.save and load it back with Doc2Vec.load, like so:
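(A sketch; the filename is just a placeholder.)

```python
model.save('rap_lyrics.d2v')

model = Doc2Vec.load('rap_lyrics.d2v')
```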
Next, we'll find the most similar words for a given target word. Similar words, in this context, are words that have similar vector representations. Let's first see what one of these vector representations looks like:
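(The word queried in the original isn't shown; 'house' below stands in for any word in the vocabulary.)

```python
model.wv['house']
```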
array([ 0.08730847, -0.75961363, 1.38362062, -0.6143629 , 0.38046223, -0.27822378, 1.0065887 , 0.66717136, 0.53995496, -0.23645727, -0.54589874, 0.0852062 , -1.74815035, 0.11079719, -0.08960737, 0.529109 , -0.50958592, -0.17503066, -0.79260975, 0.14438754, 0.77649647, -0.45132214, 0.26107937, -0.94072151, 0.33201343, 0.06891677, 0.07961012, 0.4604567 , 0.59327006, -0.97538424, 0.72243172, -0.62705523, -0.67403787, -0.49406284, -0.12099945, 0.94990158, -0.13507502, -0.28207451, 0.26398847, -1.06900597, -0.00755116, 0.57757616, 1.11100399, -1.2982794 , -0.49452487, -0.87145579, 0.95555776, -0.11877067, -0.43198681, -0.93733525, 0.37859944, -0.30048838, -0.66467839, 0.18476482, 1.00505781, -0.32252848, 0.37282225, -0.25394279, -1.34661531, -0.52854782, 1.13223743, 0.99049121, 0.46284243, -0.1918252 , 0.13938105, -0.48491701, 0.51925433, 1.20754588, -0.96833384, 0.79104269, -0.73094076, 0.47804666, -0.83540857, 0.28851396, 0.63589162,...., dtype=float32)
The results produced by Doc2Vec are very impressive. To showcase them, we'll start with the most_similar method, which finds the top n words most similar to a target word. We can see from the following that the results are accurate.
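(The exact calls aren't shown in the post; judging from the two result lists below, the queries look like 'house' and 'weed'.)

```python
model.wv.most_similar('house', topn=10)
model.wv.most_similar('weed', topn=10)
```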
[('crib', 0.4296485483646393), ('room', 0.33615612983703613), ('club', 0.30419921875), ('place', 0.29620522260665894), ('mansion', 0.2891782522201538), ('spot', 0.2849082350730896), ('garage', 0.28439778089523315), ('town', 0.2630491256713867), ('south', 0.2609255313873291), ('trunk', 0.26089051365852356)]
[('tree', 0.45602014660835266), ('chronic', 0.3657829761505127), ('bud', 0.34473711252212524), ('reefer', 0.33160412311553955), ('blantz', 0.32347556948661804), ('dope', 0.3029516637325287), ('blunts', 0.2944639325141907), ('blunt', 0.2931532859802246), ('hahahahahaaa', 0.2876523733139038), ('drug', 0.2835467457771301)]
I found this next result very interesting. There is apparently a double meaning to the word 'seed', and our model captures both meanings: an offspring and another word for weed. That's cool!
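Querying 'seed' on its own with the same method:

```python
model.wv.most_similar('seed', topn=10)
```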
[('child', 0.30444782972335815), ('greed', 0.2916702926158905), ('leaf', 0.2634624242782593), ('weed', 0.262786328792572), ('breed', 0.25418415665626526), ('dream', 0.24939578771591187), ('loyalty', 0.2438662201166153), ('daughter', 0.23810240626335144), ('tree', 0.23642070591449738), ('kid', 0.2338743656873703)]
Even more interesting are the results we get when using negative keywords. We'll stick with the "seed" example. Positive words contribute positively towards the similarity score, while negative words contribute negatively. When we use "seed" as our target word and don't specify a negative word, we get the double meaning. But when we add "weed" as a negative word, the meaning becomes much more about offspring.
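A sketch of that query; passing the raw vectors rather than the word strings keeps 'seed' itself eligible to appear in the results, which matches the output below:

```python
seed = model.wv['seed']
weed = model.wv['weed']

model.wv.most_similar(positive=[seed], negative=[weed], topn=10)
```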
[('seed', 0.7398009300231934), ('responsibility', 0.26197338104248047), ('fetus', 0.25151997804641724), ('child', 0.24744100868701935), ('breddern', 0.23935382068157196), ('loyalty', 0.2368765026330948), ('embrace', 0.2257089465856552), ('yosemite', 0.22085259854793549), ('pallbearer', 0.2204713225364685), ('decomposed', 0.21810504794120789)]
Here are a couple more things you can do with the word vectors. The first will find the word that doesn't match. The second will find the word most similar to the target word.
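For example (the word lists here are my own picks, not the post's):

```python
# which word doesn't belong?
model.wv.doesnt_match(['weed', 'chronic', 'blunt', 'church'])

# single closest word to a target word
model.wv.most_similar('crib', topn=1)
```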
Let's first define a helper function so we can look up song titles given the song IDs.
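(A hypothetical helper, assuming the one-song-per-row DataFrame from earlier with its song_id, artist and song columns.)

```python
def lookup_song(song_id):
    """Return [artist, song title] for a Doc2Vec document tag."""
    row = songs_df.loc[songs_df['song_id'] == str(song_id)].iloc[0]
    return [row['artist'], row['song']]
```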
We can also find the top n most similar songs to a target word. When we pass in 'midwest' as our target word, it should be no surprise that Tech N9ne and Nelly make an appearance, since both rappers are from and rap about the Midwest.
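A sketch: grab the word vector for 'midwest', ask the document vectors which songs sit closest to it, and map the tags back to titles with the helper above:

```python
midwest = model.wv['midwest']
similar_songs = model.docvecs.most_similar(positive=[midwest], topn=10)

[[*lookup_song(tag), score] for tag, score in similar_songs]
```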
[['Tech_N9ne', 'Planet Rock 2K (Down South Mix)', 0.24092429876327515], ['Tech_N9ne', 'Strange', 0.24040183424949646], ['Tech_N9ne', 'Planet Rock 2K (Original Version)', 0.2370171993970871], ['Tech_N9ne', 'Strangeulation I', 0.2366243600845337], ['Warren_G', 'Gangsta Love', 0.23494234681129456], ['Tech_N9ne', "Now It's On", 0.2246299684047699], ['Tech_N9ne', 'P.R. 2K1', 0.22402459383010864], ['Nelly', 'L.A.', 0.2188049554824829], ['Lil_Wayne', 'Banned From TV', 0.21523958444595337], ['Method_Man', "Release Yo' Delf", 0.21435599029064178]]
Also not surprising is that when our target word is 'eminem', Eminem and Eminem's band D12 dominate the results.
[['Eminem', 'Ken Kaniff (Skit)', 0.26885756850242615], ['Eminem', 'Ken Kaniff (Skit)', 0.2669536769390106], ['D12', 'Commercial Break', 0.25570592284202576], ['Eminem', 'The Kiss (Skit)', 0.2381049543619156], ['D12', 'Steve Berman (Skit)', 0.2308967411518097], ['D12', 'Words Are Weapons', 0.2302926629781723], ['D12', 'American Psycho II', 0.2270514965057373], ['Eminem', "Drop the Bomb On 'Em", 0.2205711007118225], ['Eminem', 'My Name Is', 0.21902979910373688], ['Fat_Joe', 'My Fofo', 0.21522411704063416]]
The next one is probably the most fascinating result. When our target word is "church", we get results that clearly have an element of "church" in them. Just look at the first two results, The Game's Hallelujah and Ice Cube's When I Get to Heaven.
[['The_Game', 'Hallelujah', 0.2645237445831299], ['Ice_Cube', 'When I Get to Heaven', 0.2541915774345398], ['Yelawolf', 'The Last Song', 0.21845132112503052], ['Scarface', 'Crack', 0.21555303037166595], ['Lauryn_Hill', 'Interlude 5', 0.21299612522125244], ['KRS-One', "Ain't Ready", 0.21055129170417786], ['Missy_Elliott', 'Intro', 0.20643793046474457], ['Talib_Kweli', "Give 'Em Hell", 0.19774499535560608], ['Tech_N9ne', 'Sad Circus', 0.1967805027961731], ['Tech_N9ne', 'Show Me a God', 0.19402025640010834]]
We can also find songs that are semantically similar to each other by looking up a song's document vector using its tag.
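A sketch, with a small hypothetical helper to find a song's tag before querying the document vectors by that tag:

```python
def get_tag(artist, song):
    """Look up the Doc2Vec tag for an artist/song pair."""
    match = songs_df[(songs_df['artist'] == artist) & (songs_df['song'] == song)]
    return match['song_id'].iloc[0]

tag = get_tag('Eminem', 'The Way I Am')
similar = model.docvecs.most_similar(tag, topn=10)

[[*lookup_song(t), score] for t, score in similar]
```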
[['Eminem', 'The Way I Am', 0.9999999403953552], ['Eminem', 'The Way I Am (Danny Lohner Remix)', 0.9738667011260986], ['Eminem', 'The Way I Am', 0.9721589088439941], ['Nas', 'Album Intro', 0.5593523383140564], ['Immortal_Technique', 'Understand Why', 0.4280475974082947], ['Nate_Dogg', "I Don't Wanna Hurt No More", 0.41760876774787903], ['LL_Cool_J', 'Skit', 0.41418007016181946], ['Big_L', 'Platinum Plus', 0.4120897054672241], ['Big_L', 'Platinum Plus', 0.4068526029586792], ['Gang_Starr', 'My Advice 2 You', 0.40361344814300537]]
Many of these are duplicates, since the lyrics site that powers Cypher is community-generated, but you get the idea. We can also detect which document doesn't belong in a list of documents by using the doesnt_match method. Here, we ask which song doesn't match among Eminem's The Way I Am, The Game's Hallelujah, and Ice Cube's When I Get to Heaven. The result seems sensible.
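For example, reusing the hypothetical get_tag helper from above:

```python
model.docvecs.doesnt_match([get_tag('Eminem', 'The Way I Am'),
                            get_tag('The_Game', 'Hallelujah'),
                            get_tag('Ice_Cube', 'When I Get to Heaven')])
```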
Lastly, we'll use our test data to see which songs are the most semantically similar to each other. First, let's load our test data and choose a song as input to the infer_vector method; we'll choose Eminem's Just the Two of Us. We'll feed the lyrics into infer_vector to get a vector representation of the song, then pass that vector to model.docvecs.most_similar to return the 10 most similar songs. You can look all the songs up using their IDs.
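A sketch of those steps, assuming the test lyrics sit in a plain text file and get tokenized the same way as the training data:

```python
# hypothetical test file holding the lyrics to "Just the Two of Us"
with open('just_the_two_of_us.txt', encoding='utf-8') as f:
    test_lyrics = f.read()

tokens = [wnl.lemmatize(t) for t in simple_preprocess(test_lyrics)]

inferred = model.infer_vector(tokens)
similar = model.docvecs.most_similar(positive=[inferred], topn=10)

[[*lookup_song(t), score] for t, score in similar]
```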
[['Gang_Starr', 'Daily Operation (Intro)', 0.6264939308166504], ['Gang_Starr', 'My Advice 2 You', 0.6025712490081787], ['De_La_Soul', 'The Dawn Brings Smoke', 0.6022455096244812], ['De_La_Soul', 'Stickabush', 0.574888288974762], ['Fabolous', 'Niggas Know', 0.5698577761650085], ['Too_$hort', "Can't Stay Away (Outro)", 0.5679160356521606], ['Immortal_Technique', 'Apocrypha (Interlude)', 0.5677438378334045], ['Twista', 'Wide Open', 0.5660445690155029], ['Big_Daddy_Kane', 'Looks Like A Job For...', 0.5610144734382629], ['Method_Man', 'Dooney Boy (Skit)', 0.5602116584777832]]
As you can see, Doc2Vec provides a lot of insight. But we haven't even gotten to the good stuff: using this data to train machine learning models. Doc2Vec produces numpy feature vectors, which means we can use them as training data for machine learning algorithms. In the next post, we'll do just that: I'll train a model that predicts an artist given a song's lyrics, using two classification algorithms, Naive Bayes and Support Vector Machines. See you next time.
• Lyric Attribution using Naive Bayes and Support Vector Machines
• Predicting A Song's Genre Given Its Lyrics
• Topic Modeling with Latent Dirichlet Allocation