Analyzing Rap Lyrics Using Word Vectors

22 minute read | Updated:

In this post, we'll analyze lyrics from the best rappers of all time. To do this, we'll use Python and [gensim's](https://radimrehurek.com/gensim/) implementation of the [Doc2Vec](https://arxiv.org/abs/1405.4053) algorithm. To get the data, we'll use [Cypher](https://github.com/tmthyjames/cypher), a new Python package [I recently released](https://tmthyjames.github.io/tools/Cypher/) that retrieves music lyrics.

In this post, we'll analyze lyrics from the best rappers of all time. To do this, we'll use Python and gensim's implementation of the Doc2Vec algorithm. To get the data, we'll use Cypher, a new Python package I recently released that retrieves music lyrics.

## Contents
[Quick Note On Doc2Vec](#Quick-Note-on-Doc2Vec)<br/>
[Getting the Data](#Getting-the-Data)<br/>
[Loading the Data](#Loading-the-Data)<br/>
[Initializing the Model](#Initializing-the-Model)<br/>
[Training the Model](#Training-the-Model)<br/>
[Finding Most Similar Words](#Finding-Most-Similar-Words)<br/>
[Finding Most Similar Documents](#Finding-Most-Similar-Documents)<br/>
[Inferring Vectors](#Inferring-Vectors)<br/>
[Up Next](#Up-Next)
## Quick Note on Doc2Vec

Quick Note on Doc2Vec

Doc2Vec is an extension of Word2Vec, an algorithm that employs a shallow neural network to map words to a vector space called word vectors (or word embeddings). Whereas Word2Vec produces word vectors so you can run similarity queries between <i>words</i>, Doc2Vec produces document vectors so you can run similarity queries on whole sentences, paragraphs, or documents. Finding semantic similarities is based on the distributional hypothesis that states words that appear in the same contexts share the same meaning. Or, as the English linguist J. R. Firth put it, "a word is characterized by the company it keeps".
My aim for this post isn't to cover the theory or math behind Doc2Vec but to show its power. For a deeper overview of Doc2Vec, see [here](https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e).

Doc2Vec is an extension of Word2Vec, an algorithm that employs a shallow neural network to map words to a vector space called word vectors (or word embeddings). Whereas Word2Vec produces word vectors so you can run similarity queries between words, Doc2Vec produces document vectors so you can run similarity queries on whole sentences, paragraphs, or documents. Finding semantic similarities is based on the distributional hypothesis that states words that appear in the same contexts share the same meaning. Or, as the English linguist J. R. Firth put it, "a word is characterized by the company it keeps".

My aim for this post isn't to cover the theory or math behind Doc2Vec but to show its power. For a deeper overview of Doc2Vec, see here.

## Getting the Data

Getting the Data

To get all the lyrics for the top 100 rappers, we'll use [Cypher](https://github.com/tmthyjames/cypher), a new python library [I released](https://tmthyjames.github.io/tools/Cypher/) recently to retrieve music lyrics (to install: `pip install thecypher`). But first, we need to get a list of the top 100 rappers. For this, I just Googled "top rappers" and got a hit from [ranker.com](https://www.ranker.com/crowdranked-list/the-greatest-rappers-of-all-time). This will suffice, although I don't think this list is perfect. Luckily they source the data from an API so we don't have to screen scrape! Here's the code to get this list:

To get all the lyrics for the top 100 rappers, we'll use Cypher, a new python library I released recently to retrieve music lyrics (to install: pip install thecypher). But first, we need to get a list of the top 100 rappers. For this, I just Googled "top rappers" and got a hit from ranker.com. This will suffice, although I don't think this list is perfect. Luckily they source the data from an API so we don't have to screen scrape! Here's the code to get this list:

In [1]:
import requests
url = 'https://cache-api.ranker.com/lists/855723/items'\
      '?limit=100&offset=0&include=votes,wikiText,rankings,'\
      'openListItemContributors&propertyFetchType=ALL&liCacheKey=null'
r = requests.get(url)
data = r.json()
artists = [i['name'] for i in data['listItems']]
print(artists)
['Tupac', 'Eminem', 'The Notorious B.I.G.', 'Nas', 'Ice Cube', 'Jay-Z', 'Snoop Dogg', 'Dr. Dre', 'Kendrick Lamar', 'Rakim', 'André 3000', 'Eazy-E', 'Kanye West', '50 Cent', 'DMX', 'Busta Rhymes', 'Method Man', 'J. Cole', 'Mos Def', 'Ludacris', 'KRS-One', 'LL Cool J', 'Lil Wayne', 'Common', 'Big L', 'Ghostface Killah', 'Redman', 'T.I.', 'Big Pun', 'Nate Dogg', 'Tech N9ne', 'Lauryn Hill', 'Scarface', 'Slick Rick', 'Raekwon', 'Big Daddy Kane', "Ol' Dirty Bastard", 'The Game', 'Mobb Deep', 'Logic', 'Chance the Rapper', 'Cypress Hill', 'Ice-T', 'Lupe Fiasco', 'RZA', 'GZA', 'Q-Tip', 'Warren G', 'Talib Kweli', 'Xzibit', 'Missy Elliott', 'ASAP Rocky', 'Joey Badass', 'Immortal Technique', 'Twista', 'Big Sean', 'Kid Cudi', 'Big Boi', 'Chuck D', 'Donald Glover', 'Drake', 'Wiz Khalifa', 'Eric B. & Rakim', 'Schoolboy Q', 'DMC', 'Nelly', 'Hopsin', 'D12', 'Jadakiss', 'Tyler, the Creator', 'Kurupt', 'Grandmaster Flash and the Furious Five', 'Gang Starr', 'Too $hort', 'Royce da 5&#39;9&#34;', 'MC Ren', 'E-40', 'Pusha T', 'Coolio', 'De La Soul', 'Proof', 'Bad Meets Evil', 'Guru', 'Will Smith', 'Krayzie Bone', 'Black Thought', 'B.o.B', 'AZ', 'Yelawolf', 'The Sugarhill Gang', 'Earl Sweatshirt', 'Fabolous', 'Mac Miller', 'Fat Joe', 'Young Jeezy', 'Kool G Rap', 'Bizzy Bone', 'Queen Latifah', 'Prodigy', '2 Chainz']
To use Cypher to retrieve these lyrics we'll loop over the list and run `thecypher.get_lyrics` on each artist. The following will `get_lyrics` and then convert it to a `DataFrame`.

To use Cypher to retrieve these lyrics we'll loop over the list and run thecypher.get_lyrics on each artist. The following will get_lyrics and then convert it to a DataFrame.

In [2]:
import thecypher
import pandas as pd
lyrics = []
for artist in artists:
    
    # our Cypher code
    artist_lyrics = thecypher.get_lyrics(artist)
    
    # append each record
    [lyrics.append(i) for i in artist_lyrics]
# convert to a DataFrame
lyrics_df = pd.DataFrame(lyrics)
lyrics_df.head()
Out[2]:
album artist genre id lyric song year
0 Infinite (1996) Eminem Hip_Hop 14201 Oh yeah, this is Eminem baby, back up in that motherfucking ass Infinite 1996
1 Infinite (1996) Eminem Hip_Hop 14202 One time for your mother fucking mind, we represent the 313 Infinite 1996
2 Infinite (1996) Eminem Hip_Hop 14203 You know what I'm saying?, 'cause they don't know shit about this Infinite 1996
3 Infinite (1996) Eminem Hip_Hop 14204 For the 9-6 Infinite 1996
4 Infinite (1996) Eminem Hip_Hop 14205 Ayo, my pen and paper cause a chain reaction Infinite 1996
By default, the data is delivered with one lyric per row. The following code will convert it to one song per row:

By default, the data is delivered with one lyric per row. The following code will convert it to one song per row:

In [3]:
group = ['song', 'year', 'album', 'genre', 'artist']
lyrics_by_song = lyrics_df.sort_values(group)\
       .groupby(group).lyric\
       .apply(' '.join)\
       .reset_index(name='lyric')
    
lyrics_by_song.head(1)
Out[3]:
song year album genre artist lyric
0 313 1996 Infinite (1996) Hip_Hop Eminem Eye-Kyu: Now what you know about a sweet MC, from the 313 None of these skills you bout to see come free So you wanna be a sweet MC, you gotta become me If you ever wanna be one see Eminem: Man what you know about a sweet MC, in the 313 None of these skills you bout to see come free So you wanna be a sweet MC, you better become me If you ever wanna be one see Verse 1: Eye-Kyu Yo some people say I'm whack, now if that's right I'm the freshest whack MC that you ever heard, in your lifetime My slick accapella sounds clever with the beats Boy I'm the deepest thing since potholes to ever hit the streets Forgot a gold digger's succubus, my souls thick with ruggedness With the mic....
You can find the data [here](https://github.com/tmthyjames/cypher/tree/master/data), randomly split into training and testing sets.

You can find the data here, randomly split into training and testing sets.

## Loading the Data

Loading the Data

Next, we need to load the data. Doc2Vec requires A LOT of memory, so we'll create an iterator so our data doesn't have to be loaded into memory simultaneously. Instead, we load one document at a time, train the model on it, then discard it and move on to the next document. We could also stream this data from a database if we wanted. Here's how you stream the data from a file:

Next, we need to load the data. Doc2Vec requires A LOT of memory, so we'll create an iterator so our data doesn't have to be loaded into memory simultaneously. Instead, we load one document at a time, train the model on it, then discard it and move on to the next document. We could also stream this data from a database if we wanted. Here's how you stream the data from a file:

In [4]:
import csv
from nltk.stem import WordNetLemmatizer
from gensim.models.doc2vec import TaggedDocument
wnl = WordNetLemmatizer()
class Sentences(object):
    
    def __init__(self, filename, column):
        self.filename = filename
        self.column = column
        
    @staticmethod
    def get_tokens(text):
        """Helper function for tokenizing data"""
        return [wnl.lemmatize(r.lower()) for r in text.split()]
 
    def __iter__(self):
        reader = csv.DictReader(open(self.filename, 'r' ))
        for row in reader:
            words = self.get_tokens(row[self.column])
            tags = ['%s|%s' % (row['artist'], row['song_id'])]
            yield TaggedDocument(words=words, tags=tags)
A couple things to note. First, the Doc2Vec model accepts a list of `TaggedDocument` elements which will allow us to identify a song. Second, we use `wnl.lemmatize` as apart of our tokenization so we can group together the inflected forms of a word so they can be analysed as a single word. For instance, `wnl.lemmatize` will convert 'cars' into 'car'.
To initialize our Sentence object, we do the following:

A couple things to note. First, the Doc2Vec model accepts a list of TaggedDocument elements which will allow us to identify a song. Second, we use wnl.lemmatize as apart of our tokenization so we can group together the inflected forms of a word so they can be analysed as a single word. For instance, wnl.lemmatize will convert 'cars' into 'car'.

To initialize our Sentence object, we do the following:

In [5]:
filename = 'lyrics_train.csv'
sentences = Sentences(filename=filename, column='word')
# for song lookups
df_train = pd.read_csv(filename)
## Initializing the Model

Initializing the Model

To initialize our `Doc2Vec` model, we'll do the following:

To initialize our Doc2Vec model, we'll do the following:

In [6]:
from gensim.models.doc2vec import Doc2Vec
model = Doc2Vec(
    alpha=0.025,
    min_alpha=0.025,
    workers=15, 
    min_count=2,
    window=10,
    size=300,
    iter=20,
    sample=0.001,
    negative=5
)
Let's go over each argument.
`alpha` is the initial learning rate. A very intuitive explanation for learning rate can be found [here](https://www.quora.com/What-is-the-learning-rate-in-neural-networks). Essentially, the learning rate is, as stated in the link, "how quickly a network abandons old beliefs for new ones." <br/>
`min_alpha` is exactly what it sounds like, the minimum `alpha` can be, which we reduce after every epoch. <br/>
`workers` is the number of threads used to train the model. <br/>
`min_count` specifies a term frequency that must be met for a word to be considered by the model. <br/>
`window` is how many words in front and behind the input word should be considered when determining context. <br/>
`size` is the number of dimensions. Unlike most numerical datasets that have 2 dimensions, text data can have hundreds or even more.<br/>
`iter` is the number of iterations, the number of times the training set passes through the algorithm. <br/>
`sample` is the downsampling rate. Words representing more than this will be eligible for downsampling.<br/>
`negative` is the negative sampling rate. 0 means update all weights in the output layer of the neural network.

Let's go over each argument.

alpha is the initial learning rate. A very intuitive explanation for learning rate can be found here. Essentially, the learning rate is, as stated in the link, "how quickly a network abandons old beliefs for new ones."
min_alpha is exactly what it sounds like, the minimum alpha can be, which we reduce after every epoch.
workers is the number of threads used to train the model.
min_count specifies a term frequency that must be met for a word to be considered by the model.
window is how many words in front and behind the input word should be considered when determining context.
size is the number of dimensions. Unlike most numerical datasets that have 2 dimensions, text data can have hundreds or even more.
iter is the number of iterations, the number of times the training set passes through the algorithm.
sample is the downsampling rate. Words representing more than this will be eligible for downsampling.
negative is the negative sampling rate. 0 means update all weights in the output layer of the neural network.

## Training the Model

Training the Model

Now we'll build our vocabulary and train our model. We'll train our model for 10 epochs. To understand epochs and how they differ from iterations (from above), check out [this](https://stackoverflow.com/questions/4752626/epoch-vs-iteration-when-training-neural-networks) StackOverflow post. Namely this answer:
> In the neural network terminology:
> one epoch = one forward pass and one backward pass of all the training examples
> batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.
> number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).
We use multiple epochs because neural networks typically require an iterative optimization method to produce good results, which usually means several passes over the data.
After each epoch, we'll decrease the learning rate (known as learning rate decay). This is to help speed up our training. For more on learning rate decay and the intuition behind it, see Andrew Ng's [video](https://www.coursera.org/learn/deep-neural-network/lecture/hjgIA/learning-rate-decay) on the subject.

Now we'll build our vocabulary and train our model. We'll train our model for 10 epochs. To understand epochs and how they differ from iterations (from above), check out this StackOverflow post. Namely this answer:

In the neural network terminology:

one epoch = one forward pass and one backward pass of all the training examples

batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.

number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).

We use multiple epochs because neural networks typically require an iterative optimization method to produce good results, which usually means several passes over the data.

After each epoch, we'll decrease the learning rate (known as learning rate decay). This is to help speed up our training. For more on learning rate decay and the intuition behind it, see Andrew Ng's video on the subject.

In [7]:
model.build_vocab(sentences)
epochs = 10
for epoch in range(epochs):
    model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay
To persist our model so we can use it later without training it again, we'll use `model.save` and load it using `Doc2Vec.load` like so:

To persist our model so we can use it later without training it again, we'll use model.save and load it using Doc2Vec.load like so:

In [8]:
model.save('rap-lyrics.doc2vec')
model = Doc2Vec.load('rap-lyrics.doc2vec')
Next we'll find the most similar words given a target word. Similar words, in this context, refers to words that have similar vector representations. Let's first see what one of these vector representations looks like:

Next we'll find the most similar words given a target word. Similar words, in this context, refers to words that have similar vector representations. Let's first see what one of these vector representations looks like:

In [9]:
model.wv.word_vec('rap')
Out[9]:
array([ 0.08730847, -0.75961363,  1.38362062, -0.6143629 ,  0.38046223,
       -0.27822378,  1.0065887 ,  0.66717136,  0.53995496, -0.23645727,
       -0.54589874,  0.0852062 , -1.74815035,  0.11079719, -0.08960737,
        0.529109  , -0.50958592, -0.17503066, -0.79260975,  0.14438754,
        0.77649647, -0.45132214,  0.26107937, -0.94072151,  0.33201343,
        0.06891677,  0.07961012,  0.4604567 ,  0.59327006, -0.97538424,
        0.72243172, -0.62705523, -0.67403787, -0.49406284, -0.12099945,
        0.94990158, -0.13507502, -0.28207451,  0.26398847, -1.06900597,
       -0.00755116,  0.57757616,  1.11100399, -1.2982794 , -0.49452487,
       -0.87145579,  0.95555776, -0.11877067, -0.43198681, -0.93733525,
        0.37859944, -0.30048838, -0.66467839,  0.18476482,  1.00505781,
       -0.32252848,  0.37282225, -0.25394279, -1.34661531, -0.52854782,
        1.13223743,  0.99049121,  0.46284243, -0.1918252 ,  0.13938105,
       -0.48491701,  0.51925433,  1.20754588, -0.96833384,  0.79104269,
       -0.73094076,  0.47804666, -0.83540857,  0.28851396,  0.63589162,...., dtype=float32)
## Finding Most Similar Words

Finding Most Similar Words

The results produced by Doc2vec are very impressive. To showcase, we'll start with the `most_similar` method which finds the top n words most similar to the target word. We can see from the following that the results are accurate.

The results produced by Doc2vec are very impressive. To showcase, we'll start with the most_similar method which finds the top n words most similar to the target word. We can see from the following that the results are accurate.

In [10]:
model.wv.most_similar('house')
Out[10]:
[('crib', 0.4296485483646393),
 ('room', 0.33615612983703613),
 ('club', 0.30419921875),
 ('place', 0.29620522260665894),
 ('mansion', 0.2891782522201538),
 ('spot', 0.2849082350730896),
 ('garage', 0.28439778089523315),
 ('town', 0.2630491256713867),
 ('south', 0.2609255313873291),
 ('trunk', 0.26089051365852356)]
In [11]:
model.wv.most_similar('weed')
Out[11]:
[('tree', 0.45602014660835266),
 ('chronic', 0.3657829761505127),
 ('bud', 0.34473711252212524),
 ('reefer', 0.33160412311553955),
 ('blantz', 0.32347556948661804),
 ('dope', 0.3029516637325287),
 ('blunts', 0.2944639325141907),
 ('blunt', 0.2931532859802246),
 ('hahahahahaaa', 0.2876523733139038),
 ('drug', 0.2835467457771301)]
I found this next result to be very interesting. There apparently is a double meaning to the word 'seed' and our model captures both meanings, an offspring and another word for weed. That's cool!

I found this next result to be very interesting. There apparently is a double meaning to the word 'seed' and our model captures both meanings, an offspring and another word for weed. That's cool!

In [12]:
model.wv.most_similar('seed')
Out[12]:
[('child', 0.30444782972335815),
 ('greed', 0.2916702926158905),
 ('leaf', 0.2634624242782593),
 ('weed', 0.262786328792572),
 ('breed', 0.25418415665626526),
 ('dream', 0.24939578771591187),
 ('loyalty', 0.2438662201166153),
 ('daughter', 0.23810240626335144),
 ('tree', 0.23642070591449738),
 ('kid', 0.2338743656873703)]
Even more interesting are the results we get from using the `positive` and `negative` keywords. We'll use the "seed" example. The positive words contribute positively towards the similarity score; the negative words contribute negatively. When we use "seed" as our target word and don't specify a `negative` word, we get a double meaning. But when we add "weed" as a `negative` word, the meaning becomes much more about offspring. 

Even more interesting are the results we get from using the positive and negative keywords. We'll use the "seed" example. The positive words contribute positively towards the similarity score; the negative words contribute negatively. When we use "seed" as our target word and don't specify a negative word, we get a double meaning. But when we add "weed" as a negative word, the meaning becomes much more about offspring.

In [13]:
model.wv.most_similar(
    positive=[model['seed']],
    negative=[model['weed']]
)
Out[13]:
[('seed', 0.7398009300231934),
 ('responsibility', 0.26197338104248047),
 ('fetus', 0.25151997804641724),
 ('child', 0.24744100868701935),
 ('breddern', 0.23935382068157196),
 ('loyalty', 0.2368765026330948),
 ('embrace', 0.2257089465856552),
 ('yosemite', 0.22085259854793549),
 ('pallbearer', 0.2204713225364685),
 ('decomposed', 0.21810504794120789)]
Here are a couple more things you can do with the word vectors. The first will find the word that doesn't match. The second will find the word most similar to the target word.

Here are a couple more things you can do with the word vectors. The first will find the word that doesn't match. The second will find the word most similar to the target word.

In [14]:
model.wv.doesnt_match(['south', 'east', 'west', 'atlanta'])
Out[14]:
'atlanta'
In [15]:
model.wv.most_similar_to_given(
    'god', 
    ['street', 'house', 'baby', 'church', 'party', 'struggle', 'loyalty']
)
Out[15]:
'church'
## Finding Most Similar Documents

Finding Most Similar Documents

Let's first define a helper function so we can look up song titles given the song IDs.

Let's first define a helper function so we can look up song titles given the song IDs.

In [16]:
def print_titles(results):
    lookup = lambda x: df_train[
        df_train.song_id==int(x)
    ].song.values[0]
    return [
        [
            i[0].split('|')[0], 
            lookup(i[0].split('|')[1]), 
            i[1]
        ] for i in results
    ]
We can also find the top n most similar <i>songs</i> to a target word. When we pass in 'midwest' as our target word, it should be no surprise that Tech N9ne and Nelly have an appearance since both rappers are from and rap about the Midwest.

We can also find the top n most similar songs to a target word. When we pass in 'midwest' as our target word, it should be no surprise that Tech N9ne and Nelly have an appearance since both rappers are from and rap about the Midwest.

In [17]:
print_titles(
    model.docvecs.most_similar([model['midwest']], topn=10)
)
Out[17]:
[['Tech_N9ne', 'Planet Rock 2K (Down South Mix)', 0.24092429876327515],
 ['Tech_N9ne', 'Strange', 0.24040183424949646],
 ['Tech_N9ne', 'Planet Rock 2K (Original Version)', 0.2370171993970871],
 ['Tech_N9ne', 'Strangeulation I', 0.2366243600845337],
 ['Warren_G', 'Gangsta Love', 0.23494234681129456],
 ['Tech_N9ne', "Now It's On", 0.2246299684047699],
 ['Tech_N9ne', 'P.R. 2K1', 0.22402459383010864],
 ['Nelly', 'L.A.', 0.2188049554824829],
 ['Lil_Wayne', 'Banned From TV', 0.21523958444595337],
 ['Method_Man', "Release Yo' Delf", 0.21435599029064178]]
Also not surprising is that when our target word is 'eminem', Eminem and Eminem's band D12 dominate the results.

Also not surprising is that when our target word is 'eminem', Eminem and Eminem's band D12 dominate the results.

In [18]:
print_titles(
    model.docvecs.most_similar([model['eminem']], topn=10)
)
Out[18]:
[['Eminem', 'Ken Kaniff (Skit)', 0.26885756850242615],
 ['Eminem', 'Ken Kaniff (Skit)', 0.2669536769390106],
 ['D12', 'Commercial Break', 0.25570592284202576],
 ['Eminem', 'The Kiss (Skit)', 0.2381049543619156],
 ['D12', 'Steve Berman (Skit)', 0.2308967411518097],
 ['D12', 'Words Are Weapons', 0.2302926629781723],
 ['D12', 'American Psycho II', 0.2270514965057373],
 ['Eminem', "Drop the Bomb On 'Em", 0.2205711007118225],
 ['Eminem', 'My Name Is', 0.21902979910373688],
 ['Fat_Joe', 'My Fofo', 0.21522411704063416]]
The next one is probably the most fascinating result. When our target word is "church", we get results that clearly have an element of "church" in them. Just look at the first two results, The Game's Hallelujah and Ice Cube's When I Get to Heaven. 

The next one is probably the most fascinating result. When our target word is "church", we get results that clearly have an element of "church" in them. Just look at the first two results, The Game's Hallelujah and Ice Cube's When I Get to Heaven.

In [19]:
print_titles(
    model.docvecs.most_similar([model['church']], topn=10)
)
Out[19]:
[['The_Game', 'Hallelujah', 0.2645237445831299],
 ['Ice_Cube', 'When I Get to Heaven', 0.2541915774345398],
 ['Yelawolf', 'The Last Song', 0.21845132112503052],
 ['Scarface', 'Crack', 0.21555303037166595],
 ['Lauryn_Hill', 'Interlude 5', 0.21299612522125244],
 ['KRS-One', "Ain't Ready", 0.21055129170417786],
 ['Missy_Elliott', 'Intro', 0.20643793046474457],
 ['Talib_Kweli', "Give 'Em Hell", 0.19774499535560608],
 ['Tech_N9ne', 'Sad Circus', 0.1967805027961731],
 ['Tech_N9ne', 'Show Me a God', 0.19402025640010834]]
We can also find songs that are semantically similar to each other by looking up a word vector using the document tag.

We can also find songs that are semantically similar to each other by looking up a word vector using the document tag.

In [20]:
print_titles(
    model.docvecs.most_similar([model.docvecs['Eminem|3006']], topn=10)
)
Out[20]:
[['Eminem', 'The Way I Am', 0.9999999403953552],
 ['Eminem', 'The Way I Am (Danny Lohner Remix)', 0.9738667011260986],
 ['Eminem', 'The Way I Am', 0.9721589088439941],
 ['Nas', 'Album Intro', 0.5593523383140564],
 ['Immortal_Technique', 'Understand Why', 0.4280475974082947],
 ['Nate_Dogg', "I Don't Wanna Hurt No More", 0.41760876774787903],
 ['LL_Cool_J', 'Skit', 0.41418007016181946],
 ['Big_L', 'Platinum Plus', 0.4120897054672241],
 ['Big_L', 'Platinum Plus', 0.4068526029586792],
 ['Gang_Starr', 'My Advice 2 You', 0.40361344814300537]]
Many of these are duplicates due to the lyrics site that powers Cypher being community generated, but you get the idea. We can also detect which documents do not belong in a list of documents by using the `doesnt_match` method. Here, we choose which song doesn't match among Eminem's The Way I Am, The Game's Hallelujah and Ice Cube's When I Get to Heaven. The result seems sensible.

Many of these are duplicates due to the lyrics site that powers Cypher being community generated, but you get the idea. We can also detect which documents do not belong in a list of documents by using the doesnt_match method. Here, we choose which song doesn't match among Eminem's The Way I Am, The Game's Hallelujah and Ice Cube's When I Get to Heaven. The result seems sensible.

In [21]:
model.docvecs.doesnt_match(['Eminem|3006', 'The_Game|10060', 'Ice_Cube|644'])
Out[21]:
'Eminem|3006'
## Inferring Vectors

Inferring Vectors

Lastly, we'll use our test data to see which songs are the most semantically similar to each other. First, let's load our test data then choose a song as input into the `infer_vector` method. We'll choose Eminem's Just the Two of Us, which is `song_id` 1644.

Lastly, we'll use our test data to see which songs are the most semantically similar to each other. First, let's load our test data then choose a song as input into the infer_vector method. We'll choose Eminem's Just the Two of Us, which is song_id 1644.

In [22]:
filename = 'lyrics_test.csv'
test_sentences = Sentences(filename=filename, column='word')
df = pd.read_csv(filename)
lyrics_str = df[df.song_id==1644].word.values[0]
Next, we'll feed the lyrics into `infer_vector` to return a vector representation of the song. We'll then input that vector representation into `model.docvecs.most_similar` to return back the 10 most similar songs. You can look all the songs up using the ID.

Next, we'll feed the lyrics into infer_vector to return a vector representation of the song. We'll then input that vector representation into model.docvecs.most_similar to return back the 10 most similar songs. You can look all the songs up using the ID.

In [23]:
word_sample = sentence.words
ivec = model.infer_vector(
    doc_words=lyrics_str, 
    steps=500, 
    alpha=0.5
)
print_titles(
    model.docvecs.most_similar([ivec], topn=10) 
)
Out[23]:
[['Gang_Starr', 'Daily Operation (Intro)', 0.6264939308166504],
 ['Gang_Starr', 'My Advice 2 You', 0.6025712490081787],
 ['De_La_Soul', 'The Dawn Brings Smoke', 0.6022455096244812],
 ['De_La_Soul', 'Stickabush', 0.574888288974762],
 ['Fabolous', 'Niggas Know', 0.5698577761650085],
 ['Too_$hort', "Can't Stay Away (Outro)", 0.5679160356521606],
 ['Immortal_Technique', 'Apocrypha (Interlude)', 0.5677438378334045],
 ['Twista', 'Wide Open', 0.5660445690155029],
 ['Big_Daddy_Kane', 'Looks Like A Job For...', 0.5610144734382629],
 ['Method_Man', 'Dooney Boy (Skit)', 0.5602116584777832]]
Pretty cool!
As you can see, Doc2Vec provides a lot of insight. But we didn't even get to the good stuff: using this data to train machine learning models. Doc2Vec produces `numpy` feature vectors which allow us to use them as training data for machine learning algorithms. In the next post, we'll do just this. I'll train a model that predicts an artist given a song's lyrics. To do this, we'll employ two machine learning classification algorithms, Naive Bayes and Support Vector Machines. See you next time.

Pretty cool!

As you can see, Doc2Vec provides a lot of insight. But we didn't even get to the good stuff: using this data to train machine learning models. Doc2Vec produces numpy feature vectors which allow us to use them as training data for machine learning algorithms. In the next post, we'll do just this. I'll train a model that predicts an artist given a song's lyrics. To do this, we'll employ two machine learning classification algorithms, Naive Bayes and Support Vector Machines. See you next time.

## Up Next

Up Next

• Lyric Attribution using Naive Bayes and Support Vector Machines <br/>
• Predicting A Song's Genre Given Its Lyrics <br/>
• Topic Modeling with Latent Dirichlet Allocation

• Lyric Attribution using Naive Bayes and Support Vector Machines
• Predicting A Song's Genre Given Its Lyrics
• Topic Modeling with Latent Dirichlet Allocation