# Analyzing Rap Lyrics Using Word Vectors
In this post, we'll analyze lyrics from the best rappers of all time. To do this, we'll use Python and [gensim's](https://radimrehurek.com/gensim/) implementation of the [Doc2Vec](https://arxiv.org/abs/1405.4053) algorithm. To get the data, we'll use [Cypher](https://github.com/tmthyjames/cypher), a new Python package [I recently released](https://tmthyjames.github.io/tools/Cypher/) that retrieves music lyrics.
## Contents
• [Quick Note On Doc2Vec](#Quick-Note-on-Doc2Vec)<br/>
• [Getting the Data](#Getting-the-Data)<br/>
• [Loading the Data](#Loading-the-Data)<br/>
• [Initializing the Model](#Initializing-the-Model)<br/>
• [Training the Model](#Training-the-Model)<br/>
• [Finding Most Similar Words](#Finding-Most-Similar-Words)<br/>
• [Finding Most Similar Documents](#Finding-Most-Similar-Documents)<br/>
• [Inferring Vectors](#Inferring-Vectors)<br/>
• [Up Next](#Up-Next)
## Quick Note on Doc2Vec
Doc2Vec is an extension of Word2Vec, an algorithm that employs a shallow neural network to map words to a vector space of word vectors (also called word embeddings). Whereas Word2Vec produces word vectors so you can run similarity queries between <i>words</i>, Doc2Vec produces document vectors so you can run similarity queries on whole sentences, paragraphs, or documents. Finding semantic similarities rests on the distributional hypothesis, which states that words appearing in the same contexts share similar meanings. Or, as the English linguist J. R. Firth put it, "a word is characterized by the company it keeps".
My aim for this post isn't to cover the theory or math behind Doc2Vec but to show its power. For a deeper overview of Doc2Vec, see [here](https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e).
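To make the distinction concrete, here's a minimal sketch of the two kinds of query we'll run later in this post. It assumes a trained gensim `Doc2Vec` model named `model` and document tags in the `artist|song_id` format we'll define below:

```python
# word-level query (the Word2Vec side): words used in similar contexts to 'rap'
model.wv.most_similar('rap')

# document-level query (the Doc2Vec addition): songs most similar to a tagged song
model.docvecs.most_similar(['Eminem|3006'], topn=5)
```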
## Getting the Data
To get all the lyrics for the top 100 rappers, we'll use [Cypher](https://github.com/tmthyjames/cypher), a new Python library [I released](https://tmthyjames.github.io/tools/Cypher/) recently to retrieve music lyrics (to install: `pip install thecypher`). But first, we need a list of the top 100 rappers. For this, I just Googled "top rappers" and got a hit from [ranker.com](https://www.ranker.com/crowdranked-list/the-greatest-rappers-of-all-time). This will suffice, although I don't think the list is perfect. Luckily, they source the data from an API, so we don't have to screen-scrape! Here's the code to get the list:
```python
import requests

url = 'https://cache-api.ranker.com/lists/855723/items'\
      '?limit=100&offset=0&include=votes,wikiText,rankings,'\
      'openListItemContributors&propertyFetchType=ALL&liCacheKey=null'

r = requests.get(url)
data = r.json()

artists = [i['name'] for i in data['listItems']]
print(artists)
```
To use Cypher to retrieve these lyrics, we'll loop over the list and run `thecypher.get_lyrics` on each artist. The following code calls `get_lyrics` for each artist and converts the results to a `DataFrame`.
```python
import thecypher
import pandas as pd

lyrics = []
for artist in artists:

    # our Cypher code
    artist_lyrics = thecypher.get_lyrics(artist)

    # append each record
    lyrics.extend(artist_lyrics)

# convert to a DataFrame
lyrics_df = pd.DataFrame(lyrics)
lyrics_df.head()
```
By default, the data is delivered with one lyric line per row. The following code will convert it to one song per row:
```python
group = ['song', 'year', 'album', 'genre', 'artist']

lyrics_by_song = lyrics_df.sort_values(group)\
    .groupby(group).lyric\
    .apply(' '.join)\
    .reset_index(name='lyric')

lyrics_by_song.head(1)
```
You can find the data [here](https://github.com/tmthyjames/cypher/tree/master/data), randomly split into training and testing sets.
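If you'd like to recreate the split yourself, a minimal sketch with pandas might look like the following (the 80/20 ratio, random seed, and file names here are just illustrative, not necessarily what produced the published files):

```python
# randomly split the songs into training and testing sets
train = lyrics_by_song.sample(frac=0.8, random_state=42)
test = lyrics_by_song.drop(train.index)

train.to_csv('lyrics_train.csv', index=False)
test.to_csv('lyrics_test.csv', index=False)
```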
## Loading the Data
Next, we need to load the data. Doc2Vec requires A LOT of memory, so we'll create an iterator so the data doesn't all have to be loaded into memory at once. Instead, we load one document at a time, train the model on it, then discard it and move on to the next document. We could also stream this data from a database if we wanted. Here's how you stream the data from a file:
```python
import csv

from nltk.stem import WordNetLemmatizer
from gensim.models.doc2vec import TaggedDocument

wnl = WordNetLemmatizer()

class Sentences(object):

    def __init__(self, filename, column):
        self.filename = filename
        self.column = column

    def get_tokens(self, text):
        """Helper function for tokenizing data"""
        return [wnl.lemmatize(r.lower()) for r in text.split()]

    def __iter__(self):
        reader = csv.DictReader(open(self.filename, 'r'))
        for row in reader:
            words = self.get_tokens(row[self.column])
            tags = ['%s|%s' % (row['artist'], row['song_id'])]
            yield TaggedDocument(words=words, tags=tags)
```
A couple of things to note. First, the Doc2Vec model accepts a list of `TaggedDocument` elements, which will allow us to identify a song. Second, we use `wnl.lemmatize` as part of our tokenization so we can group together the inflected forms of a word and analyze them as a single word. For instance, `wnl.lemmatize` will convert 'cars' into 'car'.
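Here's a quick way to see the lemmatizer in action on its own (the 'churches' example is mine, just for illustration):

```python
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize('cars'))      # -> 'car'
print(wnl.lemmatize('churches'))  # -> 'church'
```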
To initialize our `Sentences` object, we do the following:
```python
filename = 'lyrics_train.csv'

sentences = Sentences(filename=filename, column='word')

# for song lookups
df_train = pd.read_csv(filename)
```
## Initializing the Model
To initialize our `Doc2Vec` model, we'll do the following:
```python
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(
    alpha=0.025,
    min_alpha=0.025,
    workers=15,
    min_count=2,
    window=10,
    size=300,
    iter=20,
    sample=0.001,
    negative=5
)
```
Let's go over each argument.
• `alpha` is the initial learning rate. A very intuitive explanation for learning rate can be found [here](https://www.quora.com/What-is-the-learning-rate-in-neural-networks). Essentially, the learning rate is, as stated in the link, "how quickly a network abandons old beliefs for new ones." <br/>
• `min_alpha` is exactly what it sounds like, the minimum `alpha` can be, which we reduce after every epoch. <br/>
• `workers` is the number of threads used to train the model. <br/>
• `min_count` specifies a term frequency that must be met for a word to be considered by the model. <br/>
• `window` is how many words in front and behind the input word should be considered when determining context. <br/>
• `size` is the dimensionality of the vectors; here, each word and each song will be represented by a 300-dimensional vector.<br/>
• `iter` is the number of iterations, the number of times the training set passes through the algorithm. <br/>
• `sample` is the downsampling threshold; words whose frequency exceeds it become eligible for downsampling.<br/>
• `negative` is the negative sampling rate, i.e., how many "noise words" are drawn; 0 means negative sampling is not used and all weights in the output layer of the neural network are updated.
## Training the Model
Now we'll build our vocabulary and train our model. We'll train our model for 10 epochs. To understand epochs and how they differ from iterations (from above), check out [this](https://stackoverflow.com/questions/4752626/epoch-vs-iteration-when-training-neural-networks) StackOverflow post. Namely this answer:
> In the neural network terminology:
> one epoch = one forward pass and one backward pass of all the training examples
> batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.
> number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).
We use multiple epochs because neural networks typically require an iterative optimization method to produce good results, which usually means several passes over the data.
After each epoch, we'll decrease the learning rate (known as learning rate decay). This is to help speed up our training. For more on learning rate decay and the intuition behind it, see Andrew Ng's [video](https://www.coursera.org/learn/deep-neural-network/lecture/hjgIA/learning-rate-decay) on the subject.
```python
model.build_vocab(sentences)

epochs = 10
for epoch in range(epochs):
    model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay
```
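As a quick sanity check, here's the learning-rate schedule that decay step produces, just to illustrate the arithmetic (assuming the same 0.025 start and 0.002 step used above):

```python
alpha, step = 0.025, 0.002
for epoch in range(10):
    print('epoch %2d: alpha = %.3f' % (epoch, alpha))
    alpha -= step
# alpha goes 0.025, 0.023, 0.021, ... down to 0.007 by the final epoch
```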
To persist our model so we can use it later without training it again, we'll use `model.save` and load it using `Doc2Vec.load` like so:
```python
model.save('rap-lyrics.doc2vec')
model = Doc2Vec.load('rap-lyrics.doc2vec')
```
Next, we'll find the most similar words given a target word. "Similar words", in this context, means words that have similar vector representations. Let's first see what one of these vector representations looks like:
```python
model.wv.word_vec('rap')
```
## Finding Most Similar Words
The results produced by Doc2Vec are very impressive. To showcase them, we'll start with the `most_similar` method, which finds the top n words most similar to the target word. We can see from the following that the results are accurate.
```python
model.wv.most_similar('house')
```

```python
model.wv.most_similar('weed')
```
I found this next result very interesting. There is apparently a double meaning to the word 'seed', and our model captures both senses: an offspring and another word for weed. That's cool!
```python
model.wv.most_similar('seed')
```
Even more interesting are the results we get from using the `positive` and `negative` keywords. We'll use the "seed" example. The positive words contribute positively towards the similarity score; the negative words contribute negatively. When we use "seed" as our target word and don't specify a `negative` word, we get a double meaning. But when we add "weed" as a `negative` word, the meaning becomes much more about offspring.
```python
model.wv.most_similar(
    positive=[model['seed']],
    negative=[model['weed']]
)
```
Here are a couple more things you can do with the word vectors. The first will find the word that doesn't match. The second will find the word most similar to the target word.
```python
model.wv.doesnt_match(['south', 'east', 'west', 'atlanta'])
```

```python
model.wv.most_similar_to_given(
    'god',
    ['street', 'house', 'baby', 'church', 'party', 'struggle', 'loyalty']
)
```
## Finding Most Similar Documents
Let's first define a helper function so we can look up song titles given the song IDs.
```python
def print_titles(results):
    lookup = lambda x: df_train[
        df_train.song_id == int(x)
    ].song.values[0]
    return [
        [
            i[0].split('|')[0],
            lookup(i[0].split('|')[1]),
            i[1]
        ] for i in results
    ]
```
We can also find the top n most similar <i>songs</i> to a target word. When we pass in 'midwest' as our target word, it should be no surprise that Tech N9ne and Nelly have an appearance since both rappers are from and rap about the Midwest.
```python
print_titles(
    model.docvecs.most_similar([model['midwest']], topn=10)
)
```
Also not surprising is that when our target word is 'eminem', Eminem and Eminem's band D12 dominate the results.
```python
print_titles(
    model.docvecs.most_similar([model['eminem']], topn=10)
)
```
The next one is probably the most fascinating result. When our target word is "church", we get results that clearly have an element of "church" in them. Just look at the first two results, The Game's Hallelujah and Ice Cube's When I Get to Heaven.
```python
print_titles(
    model.docvecs.most_similar([model['church']], topn=10)
)
```
We can also find songs that are semantically similar to each other by looking up a word vector using the document tag.
```python
print_titles(
    model.docvecs.most_similar([model.docvecs['Eminem|3006']], topn=10)
)
```
Many of these are duplicates, since the lyrics site that powers Cypher is community generated, but you get the idea. We can also detect which document does not belong in a list of documents by using the `doesnt_match` method. Here, we ask which song doesn't match among Eminem's The Way I Am, The Game's Hallelujah, and Ice Cube's When I Get to Heaven. The result seems sensible.
```python
model.docvecs.doesnt_match(['Eminem|3006', 'The_Game|10060', 'Ice_Cube|644'])
```
## Inferring Vectors
Lastly, we'll use our test data to see which songs are the most semantically similar to each other. First, let's load our test data then choose a song as input into the `infer_vector` method. We'll choose Eminem's Just the Two of Us, which is `song_id` 1644.
```python
filename = 'lyrics_test.csv'

test_sentences = Sentences(filename=filename, column='word')
df = pd.read_csv(filename)

lyrics_str = df[df.song_id==1644].word.values[0]
```
Next, we'll feed the lyrics into `infer_vector` to get a vector representation of the song. We'll then pass that vector into `model.docvecs.most_similar` to return the 10 most similar songs. You can look all the songs up using the ID.
```python
# infer_vector expects a list of tokens (like the words our Sentences iterator yields),
# so we split the raw lyric string first
ivec = model.infer_vector(
    doc_words=lyrics_str.split(),
    steps=500,
    alpha=0.5
)

print_titles(
    model.docvecs.most_similar([ivec], topn=10)
)
```
Pretty cool!
As you can see, Doc2Vec provides a lot of insight. But we didn't even get to the good stuff: using this data to train machine learning models. Doc2Vec produces `numpy` feature vectors which allow us to use them as training data for machine learning algorithms. In the next post, we'll do just this. I'll train a model that predicts an artist given a song's lyrics. To do this, we'll employ two machine learning classification algorithms, Naive Bayes and Support Vector Machines. See you next time.
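Before then, here's a rough sketch of the general idea, assuming scikit-learn is available and using the `artist|song_id` tag format from above to derive labels (the classifier choice and tag parsing are just illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC

# one 300-dimensional document vector per training song
X = np.array([model.docvecs[tag] for tag in model.docvecs.doctags])

# the artist name is everything before the '|' in each tag
y = [tag.split('|')[0] for tag in model.docvecs.doctags]

clf = LinearSVC()
clf.fit(X, y)
```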
## Up Next
• Lyric Attribution using Naive Bayes and Support Vector Machines <br/>
• Predicting A Song's Genre Given Its Lyrics <br/>
• Topic Modeling with Latent Dirichlet Allocation