---

In the [last post](https://tmthyjames.github.io/posts/Analyzing-Rap-Lyrics-Using-Word-Vectors/) we analyzed rap lyrics using word vectors. We mostly did some first-pass analysis and very little prediction. In this post, we'll focus on prediction and on visualizing our results. I'll use Python's machine-learning library, [scikit-learn](http://scikit-learn.org/stable/), to build a [naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) to predict a song's genre given its lyrics. To get the data, we'll use [Cypher](https://github.com/tmthyjames/cypher), a new Python package [I recently released](https://tmthyjames.github.io/tools/Cypher/) that retrieves music lyrics. To visualize the results, I'll use [D3](https://d3js.org/) and [D3Plus](https://d3plus.org/), which is a nice wrapper for D3.


## Contents

• [Quick Note on Naive Bayes](#Quick-Note-on-Naive-Bayes)<br/>
• [Getting the Data](#Getting-the-Data)<br/>
• [Loading the Data](#Loading-the-Data)<br/>
• [Splitting the Data](#Splitting-the-Data)<br/>
• [Training the Model](#Training-the-Model)<br/>
• [Top Hip Hop Songs](#Top-Hip-Hop-Songs)<br/>
• [Hip Hop Songs that have Alt Rock and Country Lyrics](#Hip-Hop-Songs-that-have-Alt-Rock-and-Country-Lyrics)<br/>
• [Visualizing Our Results](#Visualizing-Our-Results) (With d3.js)<br/>
• [Up Next](#Up-Next)<br/>


## Quick Note on Naive Bayes


The naive Bayes classifier is based on [Bayes' Theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem) and is known for its simplicity, accuracy, and speed, particularly when it comes to text classification, which is our aim in this post. In short, as Wikipedia puts it, Bayes' Theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. For example, if a musical genre is related to lyrics, then, with Bayes' Theorem, we can more accurately assess the probability that a certain song belongs to a particular genre, compared to an assessment made without knowledge of the song's lyrics. For more on Bayes' Theorem, check out this [post](https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/).
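To make the idea concrete, here is a minimal sketch of the calculation a naive Bayes classifier performs. The priors and per-word probabilities are made-up numbers for illustration only (they are not estimated from the lyrics data), and `posterior` is a hypothetical helper, not part of scikit-learn.

```python
import math

# Hypothetical toy numbers, chosen only for illustration.
priors = {"country": 0.5, "hip hop": 0.5}       # P(genre)
word_probs = {                                   # P(word | genre)
    "country": {"truck": 0.08, "flow": 0.01},
    "hip hop": {"truck": 0.01, "flow": 0.09},
}

def posterior(words):
    """Return P(genre | words), treating words as independent given the genre."""
    # Sum logs instead of multiplying probabilities to avoid underflow.
    log_scores = {
        genre: math.log(priors[genre])
        + sum(math.log(word_probs[genre][w]) for w in words)
        for genre in priors
    }
    # Normalize so the scores sum to 1 across genres.
    total = sum(math.exp(s) for s in log_scores.values())
    return {genre: math.exp(s) / total for genre, s in log_scores.items()}

result = posterior(["truck", "truck", "flow"])
print(result)
```

The "naive" part is the independence assumption: each word contributes its own factor, regardless of the words around it.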


## Getting the Data

The data was retrieved with [Cypher](https://github.com/tmthyjames/cypher). The data and code used for this post are available on Cypher's [GitHub page](https://github.com/tmthyjames/cypher/tree/master/notebooks). Since the data takes so long to retrieve (there are over 900 artists), I plan on adding a feature to Cypher that lets the user load already-retrieved data if it exists; otherwise, it will retrieve the data as normal. For now, you can just download it from the [GitHub page](https://github.com/tmthyjames/cypher/tree/master/notebooks).

I started this post with the intention of trying to classify 10 genres: pop, blues, heavy metal, classic rock, indie folk, RnB, punk rock, screamo, country, and rap.

I ran into a few problems with this, as classic rock was lyrically very similar to country; indie folk was also similar to country; punk rock, heavy metal, and screamo were all similar; and RnB and rap were very similar. It's not surprising: as the number of classes grows, it becomes harder to classify correctly. I may write a post on my trouble with this approach if there is interest in it, or just post the results of trying to predict all 10 genres.

Anyway, to get the data, I used [Ranker](https://ranker.com) to get a list of the top 100 artists of each genre. They have a nice API endpoint you can hit to get all the artists, so you don't have to web scrape.


## Loading the Data


To load the data, we'll use [pandas'](https://pandas.pydata.org/) `read_csv` method. We'll also clean up the genres due to the problems mentioned above about lyrical similarity. The three genres we'll try to predict are country, rap, and alt rock since those genres are clearly different. For our purposes, we'll classify metal, punk, and screamo as "alt rock". Here's how we do it:


In [38]:

    import pandas as pd
    import numpy as np

    df = pd.read_csv('lyrics.csv')

    # fold screamo, punk rock, and heavy metal into a single "alt rock" genre
    df['ranker_genre'] = np.where(
        (df['ranker_genre'] == 'screamo') |
        (df['ranker_genre'] == 'punk rock') |
        (df['ranker_genre'] == 'heavy metal'),
        'alt rock',
        df['ranker_genre']
    )

The data is available as one lyric per row. To train our classifier, we'll need to transform it into one *song* per row. We'll also go ahead and convert the data to lowercase with `.apply(lambda x: x.lower())`. To do that, we do the following:


In [39]:

    group = ['song', 'year', 'album', 'genre', 'artist', 'ranker_genre']
    lyrics_by_song = df.sort_values(group)\
            .groupby(group).lyric\
            .apply(' '.join)\
            .apply(lambda x: x.lower())\
            .reset_index(name='lyric')

    # strip punctuation; regex=True is required because pandas >= 2.0
    # treats the pattern as a literal string by default
    lyrics_by_song["lyric"] = lyrics_by_song['lyric'].str.replace(r'[^\w\s]', '', regex=True)
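To see what this reshaping does, here is a small sketch on a made-up three-row frame. The column names mirror the real data, but `toy` and its contents are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of the lyrics data: one lyric line per row.
toy = pd.DataFrame({
    "artist": ["A", "A", "B"],
    "song": ["Song 1", "Song 1", "Song 2"],
    "lyric": ["Hello, World", "Second LINE!", "Only line"],
})

by_song = (
    toy.groupby(["artist", "song"])["lyric"]
    .apply(" ".join)                           # one string per song
    .str.lower()                               # normalize case
    .str.replace(r"[^\w\s]", "", regex=True)   # drop punctuation
    .reset_index(name="lyric")
)
print(by_song)
```

The two rows for "Song 1" collapse into a single lowercase, punctuation-free string, which is the one-song-per-row shape the classifier needs.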

## Splitting the Data

Next we'll split our data into a training set and a testing set using only Country, Alt Rock, and Hip Hop. A quick note: because the lyrics are community-sourced, some of the songs have incomplete or incorrect lyrics. A lot of the songs with fewer than 400 characters are just strings of nonsense characters, so I filtered those songs out, as they didn't contribute any value or insight to the model.


In [40]:

    from sklearn.utils import shuffle
    from nltk.corpus import stopwords

    genres = [
        'Country', 'alt rock', 'Hip Hop',
    ]

    LYRIC_LEN = 400 # each song has to be > 400 characters
    N = 10000 # number of records to pull from each genre
    RANDOM_SEED = 200 # random seed to make results repeatable

    train_parts, test_parts = [], []
    for genre in genres: # loop over each genre
        subset = lyrics_by_song[ # create a subset
            (lyrics_by_song.ranker_genre==genre) &
            (lyrics_by_song.lyric.str.len() > LYRIC_LEN)
        ]
        train_set = subset.sample(n=N, random_state=RANDOM_SEED)
        test_set = subset.drop(train_set.index) # everything not sampled for training
        train_parts.append(train_set) # collect subsets for the master sets
        test_parts.append(test_set)

    # DataFrame.append was removed in pandas 2.0, so concatenate instead
    train_df = shuffle(pd.concat(train_parts))
    test_df = shuffle(pd.concat(test_parts))
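The sample-then-drop pattern used for the split can be checked on a toy frame. This is only a sketch with invented data (`toy`, 6 rows per genre, 4 sampled for training); it shows that the training and testing rows never overlap and that each genre contributes the same number of training rows.

```python
import pandas as pd

# Hypothetical toy data: 6 songs for each of two genres.
toy = pd.DataFrame({
    "genre": ["Country"] * 6 + ["Hip Hop"] * 6,
    "lyric": [f"lyric {i}" for i in range(12)],
})

train_parts, test_parts = [], []
for genre, subset in toy.groupby("genre"):
    train_set = subset.sample(n=4, random_state=200)  # fixed-size training sample
    test_set = subset.drop(train_set.index)           # the remaining rows
    train_parts.append(train_set)
    test_parts.append(test_set)

train_toy = pd.concat(train_parts)
test_toy = pd.concat(test_parts)
print(train_toy.genre.value_counts())
```

One consequence of this design is that the training set is balanced across genres, while the testing set's size per genre depends on how many songs are left over.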

## Training the Model


Next, we'll train a model using word frequencies and `sklearn`'s `CountVectorizer`. The `CountVectorizer` is a quick and dirty way to train a language model by using simple word counts. Later we'll try a more sophisticated approach with the `TfidfVectorizer`.


In [31]:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    # define our model
    text_clf = Pipeline(
        [('vect', CountVectorizer()),
         ('clf', MultinomialNB(alpha=0.1))])

    # train our model on training data
    text_clf.fit(train_df.lyric, train_df.ranker_genre)

    # score our model on testing data
    predicted = text_clf.predict(test_df.lyric)
    np.mean(predicted == test_df.ranker_genre)

Out[31]:

0.87216733086518738

Not a bad first-pass model!

Word frequencies work fine here, but let's see if we can get a better model by using the `TfidfVectorizer`.

`tf-idf` stands for "term frequency-inverse document frequency". `tf` summarizes how often a given word appears within a document, while `idf` scales down words that appear frequently across documents. For example, if we were trying to figure out which rap artists were lyrically similar, the term `police` may not be very helpful, as almost every rapper uses this term. But the term `detroit` may carry more weight, as only a handful of rappers use it. Thus, although `police` would have a higher `tf` score, `detroit` would have a higher `tf-idf` score and would be a more important feature in a language model.

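The `police` versus `detroit` intuition can be sketched on a tiny invented corpus. The six "documents" below are made up for illustration; the only assumption is scikit-learn's default `TfidfVectorizer` settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "police" appears in every document, "detroit" in one.
docs = [
    "police police detroit",
    "police on the block",
    "police in the city",
    "police at the door",
    "police in the street",
    "police on the corner",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# idf weight learned for each term
idf = {term: vec.idf_[i] for term, i in vec.vocabulary_.items()}

# tf-idf weights of the terms in the first document
row = X[0].toarray().ravel()
weights = {term: row[i] for term, i in vec.vocabulary_.items() if row[i] > 0}
print(weights)
```

In the first document `police` occurs twice and `detroit` once, yet `detroit` ends up with the larger tf-idf weight because it is rare across the corpus.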

So let's train a model using `tf-idf` scores as features.

In [7]:

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer

    # define our model
    text_clf = Pipeline(
        [('vect', TfidfVectorizer()),
         ('clf', MultinomialNB(alpha=0.1))])

    # train our model on training data
    text_clf.fit(train_df.lyric, train_df.ranker_genre)

    # score our model on testing data
    predicted = text_clf.predict(test_df.lyric)
    np.mean(predicted == test_df.ranker_genre)

Out[7]:

0.8641969486169736

Hmmm. Our model seems to have gotten worse. Let's try tuning a few hyperparameters, lemmatizing our data, customizing our tokenizer a bit, and filtering our words with `nltk`'s built-in stopword list.


In [99]:

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    from nltk import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    stop = list(set(stopwords.words('english'))) # stopwords
    wnl = WordNetLemmatizer() # lemmatizer

    def tokenizer(x): # custom tokenizer
        return [
            wnl.lemmatize(w)
            for w in word_tokenize(x)
            if len(w) > 2 and w.isalnum() # only alphanumeric words
        ]                                 # longer than 2 characters

    # define our model
    text_clf = Pipeline(
        [('vect', TfidfVectorizer(
            ngram_range=(1, 2), # include bigrams
            tokenizer=tokenizer,
            stop_words=stop,
            max_df=0.4, # ignore terms that appear in more than 40% of documents
            min_df=4)), # ignore terms that appear in fewer than 4 documents
         ('tfidf', TfidfTransformer()),
         ('clf', MultinomialNB(alpha=0.1))])

    # train our model on training data
    text_clf.fit(train_df.lyric, train_df.ranker_genre)

    # score our model on testing data
    predicted = text_clf.predict(test_df.lyric)
    np.mean(predicted == test_df.ranker_genre)

Out[99]:

0.88133131322658995

Hey! About one percentage point better than our first model. I'll take it. We could keep tuning these hyperparameters to squeeze out more accuracy. For example, a more fine-tuned stopword list could help a lot; there are a [few strategies](https://stackoverflow.com/questions/16927494/how-to-select-stop-words-using-tf-idf-non-english-corpus) for constructing a good stopword list. For now, we'll go with our current model.

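If we did want to keep tuning, one option is to let scikit-learn search the hyperparameters for us. The sketch below is not what was run for this post: it uses `GridSearchCV` on a tiny invented corpus, and the parameter grid is an arbitrary example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy lyrics, four per genre.
texts = [
    "truck porch sweet tea", "dog truck dirt road",
    "porch whiskey river", "dirt road sweet home",
    "flow rap bars mic", "rap money flow",
    "mic bars cognac", "money cognac rap",
    "madness scream heart", "heart choke scream",
    "madness drugs invisible", "choke drugs heart",
]
labels = ["Country"] * 4 + ["Hip Hop"] * 4 + ["alt rock"] * 4

pipe = Pipeline([("vect", TfidfVectorizer()), ("clf", MultinomialNB())])

# Step names from the pipeline prefix each parameter: <step>__<param>.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.01, 0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=2)  # 2-fold cross-validation
search.fit(texts, labels)
best = search.best_params_
print(best, search.best_score_)
```

On the real data this would be far slower than a single fit, since every parameter combination is trained once per fold.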

Now let's go beyond raw accuracy and see how it performs by looking at our confusion matrix for this model.

In [36]:

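    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.metrics import confusion_matrix

    # labels=genres keeps the matrix rows and columns in the same order as the tick labels
    mat = confusion_matrix(test_df.ranker_genre, predicted, labels=genres)
    sns.heatmap(
        mat.T, square=True, annot=True, fmt='d', cbar=False,
        xticklabels=genres,
        yticklabels=genres)
    plt.xlabel('true label')
    plt.ylabel('predicted label');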

Given this confusion matrix, we can calculate precision, recall, and f-score, which can be better metrics for evaluating a classifier than raw accuracy.

<b>Recall</b> is the ability of the classifier to find all the positive results. That is, to classify a rap song *as* a rap song.

<b>Precision</b> is the ability of the classifier to not label a negative result as a positive one. That is, to not classify a country song as a rap song.

<b>F-score</b> is the [harmonic mean](https://en.wikipedia.org/wiki/Harmonic_mean) of precision and recall.

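These three definitions can be worked through by hand on a small confusion matrix. The counts below are invented for illustration and are not the results of our model; rows are true labels and columns are predicted labels.

```python
import numpy as np

labels = ["Country", "alt rock", "Hip Hop"]

# Hypothetical confusion matrix: mat[i, j] = songs of true genre i predicted as genre j.
mat = np.array([
    [80, 15, 5],
    [10, 70, 20],
    [5, 10, 85],
])

tp = np.diag(mat)                  # correctly classified songs per genre
precision = tp / mat.sum(axis=0)   # divide by column totals (everything predicted as the genre)
recall = tp / mat.sum(axis=1)      # divide by row totals (everything truly in the genre)
fscore = 2 * precision * recall / (precision + recall)  # harmonic mean

for name, p, r, f in zip(labels, precision, recall, fscore):
    print(f"{name}: precision={p:.3f} recall={r:.3f} fscore={f:.3f}")
```

For the Country row, 80 of 100 true country songs are found (recall 0.8), while 80 of the 95 songs predicted as country really are country (precision of about 0.842).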

To compute recall, precision, and f-score, we'll use `precision_recall_fscore_support` from `sklearn.metrics`.

In [35]:

    from sklearn.metrics import precision_recall_fscore_support

    # labels=genres pins the order of the returned arrays; without it,
    # scikit-learn sorts the labels as 'Country', 'Hip Hop', 'alt rock',
    # which would not line up with the order of `genres`
    precision, recall, fscore, support = precision_recall_fscore_support(
        test_df.ranker_genre, predicted, labels=genres)

    for n, genre in enumerate(genres):
        genre = genre.upper()
        print(genre+'_precision: {}'.format(precision[n]))
        print(genre+'_recall: {}'.format(recall[n]))
        print(genre+'_fscore: {}'.format(fscore[n]))
        print(genre+'_support: {}'.format(support[n]))
        print()

COUNTRY_precision: 0.9034659567125178
COUNTRY_recall: 0.9000933248194901
COUNTRY_fscore: 0.9017764873775898
COUNTRY_support: 20359

ALT ROCK_precision: 0.9093471353899765
ALT ROCK_recall: 0.8597383720930233
ALT ROCK_fscore: 0.883847188324949
ALT ROCK_support: 20640

HIP HOP_precision: 0.5072796934865901
HIP HOP_recall: 0.9403409090909091
HIP HOP_fscore: 0.6590343454454953
HIP HOP_support: 1408

<b>Support</b> is the number of songs of each class in the true test set. The first thing I notice is that there aren't many hip hop songs being scored. Adding more hip hop songs could possibly improve our model.

We do a good job all around on classifying country and alt rock songs. For hip hop songs, the recall score is great; that is, when it's actually a hip hop song, the model classifies it as a hip hop song 94% of the time. But, as we can see from our hip hop precision score and confusion matrix, the model classifies many alt rock songs as hip hop (963, to be exact), which is the main reason this score is so low.

Let's throw some new data at our model and see how well it does predicting what genre these lyrics belong to.

In [13]:

    text_clf.predict(
        [
            "i stand for the red white and blue",
            "flow so smooth they say i rap in cursive", #bars *insert fire emoji*
            "take my heart and carve it out",
            "there is no end to the madness",
            "sitting on my front porch drinking sweet tea",
            "sitting on my front porch sippin on cognac",
            "dog died and my pick up truck wont start",
            "im invisible and the drugs wont help",
            "i hope you choke in your sleep thinking of me",
            "i wonder what genre a song about data science and naive bayes and hyper parameters and maybe a little scatter plots would be"
        ])

Out[13]:

array(['Country', 'Hip Hop', 'alt rock', 'alt rock', 'Country', 'Hip Hop',
       'Country', 'alt rock', 'alt rock', 'Hip Hop'], 
      dtype='U8')

This seems to classify lyrics pretty well. Not sure about that last lyric though. But, then again, maybe the classifier does as good a job as any human would do classifying those cool data science lyrics?


## Top Hip Hop Songs


Let's retrieve the songs with the highest probability of being hip hop. I'm guessing this will be a prolific artist whose language influences the entire genre. First, though, we need to score each song and then merge the scores into our dataset.


In [100]:

    data = pd.concat([train_df, test_df]) # entire dataset (DataFrame.append was removed in pandas 2.0)
    predicts = text_clf.predict_proba(data.lyric) # score each song

    # predict_proba columns follow text_clf.classes_: 'Country', 'Hip Hop', 'alt rock'
    data['Country'] = predicts[:, 0]
    data['Hip_Hop'] = predicts[:, 1]
    data['Alt_Rock'] = predicts[:, 2]
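The column order of `predict_proba` is easy to get wrong, so here is a small sketch that checks it on invented data. The toy lyrics and labels are made up; the point is only that the columns follow `classes_`, which scikit-learn stores in sorted order (uppercase letters sort before lowercase ones).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training data, two lyrics per genre.
texts = [
    "truck porch tea", "flow rap bars", "madness scream heart",
    "porch truck dog", "rap flow cognac", "heart madness choke",
]
labels = ["Country", "Hip Hop", "alt rock", "Country", "Hip Hop", "alt rock"]

clf = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB(alpha=0.1))])
clf.fit(texts, labels)

classes = list(clf.classes_)  # column order used by predict_proba
probs = clf.predict_proba(["sitting on my porch with my truck"])[0]
by_genre = dict(zip(classes, probs))  # pair each probability with its genre
print(by_genre)
```

Zipping the probabilities with `classes_`, rather than relying on hard-coded positions, avoids attaching a score to the wrong genre.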

The 20 most hip hop songs are:


In [48]:

    columns_of_interest = [
        'artist', 'song', 'album',
        'ranker_genre', 'Hip_Hop',
        'Alt_Rock', 'Country'
    ]

    data[columns_of_interest]\
        .sort_values(['Hip_Hop', 'Alt_Rock', 'Country'], ascending=[False, True, True])\
        .head(20)

Out[48]:

	artist	song	album	ranker_genre	Hip_Hop	Alt_Rock	Country
89883	Snoop_Dogg	Set It Off	Tha Last Meal (2000)	Hip Hop	1	1.02507e-12	2.59403e-13
42315	2Pac	Hit 'Em Up	Live (2004)	Hip Hop	1	3.22494e-12	7.39967e-15
42316	2Pac	Hit 'em Up	Greatest Hits (1998)	Hip Hop	1	3.25809e-12	8.41491e-15
34832	Too_$hort	Get In Where You Fit In	Get In Where You Fit In (1993)	Hip Hop	1	3.50784e-12	2.47664e-15
33580	Snoop_Dogg	Freestyle Conversation	Tha Doggfather (1996)	Hip Hop	1	3.01647e-12	9.0946e-13
25109	Snoop_Dogg	Doggy Dogg World	Doggystyle (1993)	Hip Hop	1	1.38902e-11	8.70554e-13
25110	Snoop_Dogg	Doggy Dogg World	Death Row's Snoop Doggy Dogg At His Best (2001)	Hip Hop	1	1.53723e-11	8.72738e-13
79165	Twista	Overdose	Adrenaline Rush (1997)	Hip Hop	1	1.76744e-11	2.68287e-14
25253	Snoop_Dogg	Don Doggy	Paid Tha Cost To Be Da Bo$$ (2002)	Hip Hop	1	1.8457e-11	1.12283e-12
91632	MC_Ren	Shot Caller	Ruthless for Life (1998)	Hip Hop	1	2.29e-11	6.95705e-13
85027	Jay-Z	Reservoir Dogs	Vol. 2... Hard Knock Life (1998)	Hip Hop	1	3.23132e-11	1.87116e-13
76063	Scarface	O.G. To Me	The Last of a Dying Breed (2000)	Hip Hop	1	3.2801e-11	2.43453e-13
2711	Krayzie_Bone	A Thugga Level	Thug On Da Line (2001)	Hip Hop	1	4.52958e-11	7.43026e-13
119078	Too_$hort	What Happened to the Groupies	Can't Stay Away (1999)	Hip Hop	1	4.58972e-11	4.83204e-12
50805	T.I.	I'm Straight	King (2006)	Hip Hop	1	4.30728e-11	7.70825e-12
19296	Tech_N9ne	Come Gangsta	Bad Season (2010)	Hip Hop	1	4.98534e-11	2.54802e-12
19295	Tech_N9ne	Come Gangsta	Everready (The Religion) (2006)	Hip Hop	1	4.98534e-11	2.54802e-12
97664	MC_Ren	Still the Same Nigga	The Villain in Black (1996)	Hip Hop	1	6.05274e-11	2.06619e-12
3266	D12	Activity As Phuctivity	The Underground EP (1997)	Hip Hop	1	6.54908e-11	6.91093e-14
32646	Snoop_Dogg	For All My Niggaz & Bitches	Doggystyle (1993)	Hip Hop	1	6.36369e-11	2.30267e-12

And the most hip hop song is "Set It Off" by Snoop Dogg, who also seems to be the most hip hop rapper, as he has 6 of the top 20 most hip hop songs. Also, it shouldn't be surprising that a lot of these songs are pre-2000, which is the era when hip hop really began to take shape. From this analysis, it seems a lot of the language of hip hop was being defined during those years.

Because the lyrics are community-sourced, there are some duplicate songs. In the real world, we'd want to get rid of these duplicate rows.


## Hip Hop Songs that have Alt Rock and Country Lyrics

Next, let's see which hip hop songs have the most alt rock lyrics. To do this, we'll query our data for only hip hop songs and then sort by the `Alt_Rock` column. I don't have any guesses as to which songs this will be. Maybe songs by Childish Gambino? Or Tech N9ne? Let's see.


In [116]:

    data[data.ranker_genre=='Hip Hop'][columns_of_interest]\
        .sort_values(['Alt_Rock', 'Hip_Hop'], ascending=[False, True])\
        .head(20) # Top 20

Out[116]:

	artist	song	album	ranker_genre	Hip_Hop	Alt_Rock	Country
109588	Busta_Rhymes	There's Only One Year Left!!! (Intro)	E.L.E. (Extinction Level Event): The Final World Front (1998)	Hip Hop	0.000587058	0.999413	7.39381e-08
3277	Lauryn_Hill	Adam Lives In Theory	MTV Unplugged (2002)	Hip Hop	0.00145159	0.998285	0.000263027
114762	Immortal_Technique	Ultimas Palabras	The Martyr (2011)	Hip Hop	0.00282868	0.997171	4.61514e-07
46223	Lauryn_Hill	I Get Out	MTV Unplugged (2002)	Hip Hop	0.00272789	0.996981	0.000291271
33562	Lauryn_Hill	Freedom Time	MTV Unplugged (2002)	Hip Hop	0.00346428	0.996294	0.000241836
61391	Lupe_Fiasco	Letting Go	Lasers (2011)	Hip Hop	0.00417445	0.994444	0.00138115
30230	Kid_Cudi	Fade 2 Red	Speedin' Bullet 2 Heaven (2015)	Hip Hop	0.00555762	0.993203	0.00123917
115250	Lupe_Fiasco	Unforgivable Youth	Food & Liquor II: The Great American Rap Album Pt. 1 (2012)	Hip Hop	0.00624787	0.993113	0.000638979
20096	Kid_Cudi	Copernicus Landing	Satellite Flight: The Journey to Mother Moon (2014)	Hip Hop	0.00505283	0.993016	0.00193124
19897	Kid_Cudi	Confused!	Speedin' Bullet 2 Heaven (2015)	Hip Hop	0.00505283	0.993016	0.00193124
68598	Kid_Cudi	Melting	Speedin' Bullet 2 Heaven (2015)	Hip Hop	0.00579108	0.991223	0.00298592
76331	Lauryn_Hill	Oh Jerusalem	MTV Unplugged (2002)	Hip Hop	0.00301647	0.991191	0.00579247
90235	Yelawolf	Shadows	Trial by Fire (2017)	Hip Hop	0.00304338	0.990284	0.00667227
57560	Lauryn_Hill	Just Like Water	MTV Unplugged (2002)	Hip Hop	0.00161961	0.98694	0.0114401
81732	Common	Pops Belief	The Dreamer/The Believer (2011)	Hip Hop	0.0102264	0.985932	0.00384123
106671	Tech_N9ne	The Noose	Welcome to Strangeland (2011)	Hip Hop	0.0164305	0.982261	0.00130803
100753	Cypress_Hill	Take My Pain	Rise Up (2010)	Hip Hop	0.018484	0.98088	0.000635567
27546	Tech_N9ne	Drowning	Something Else (2013)	Hip Hop	0.0161966	0.978309	0.00549422
28263	Kid_Cudi	Edge of the Earth / Post Mortem Boredom	Speedin' Bullet 2 Heaven (2015)	Hip Hop	0.0082987	0.977075	0.0146258
43911	Chance_The_Rapper	How Great	Coloring Book (2016)	Hip Hop	0.0240995	0.972694	0.00320653

Wow. Didn't expect some of these results. Lauryn Hill seems to be the alt rock hip hop queen. Although Busta Rhymes has the most alt rock song, Lauryn Hill has 5 of the top 20 and, as we'll see from our visualization below, 12 of the top 100 most alt rock hip hop songs.

Now, let's see which hip hop songs have the most country lyrics. Again, no guesses. Maybe a southern rapper, like Ludacris or Yelawolf?


In [117]:

    data[data.ranker_genre=='Hip Hop'][columns_of_interest]\
        .sort_values(['Country'], ascending=[False])\
        .head(20)

Out[117]:

	artist	song	album	ranker_genre	Hip_Hop	Alt_Rock	Country
35371	Ghostface_Killah	Ghostface X-Mas	GhostDeini The Great (2008)	Hip Hop	0.0104447	0.00121425	0.988341
51878	Queen_Latifah	If I Had You	The Dana Owens Album (2004)	Hip Hop	0.0099924	0.0174937	0.972514
25455	Queen_Latifah	Don't Cry Baby	Trav'lin' Light (2007)	Hip Hop	0.0229387	0.0271649	0.949896
107717	Queen_Latifah	The Same Love That Made Me Laugh	The Dana Owens Album (2004)	Hip Hop	0.0181416	0.0324947	0.949364
8226	Childish_Gambino	Baby Boy	"Awaken, My Love!" (2016)	Hip Hop	0.0130007	0.0379516	0.949048
101930	Yelawolf	Tennessee Love	Trunk Muzik Returns (2013)	Hip Hop	0.0537624	0.000793593	0.945444
101931	Yelawolf	Tennessee Love	Love Story (2015)	Hip Hop	0.0537624	0.000793593	0.945444
56413	Bizzy_Bone	Jesus	The Gift (2001)	Hip Hop	0.0556137	0.0303563	0.91403
31606	Drake	Find Your Love	Thank Me Later (2010)	Hip Hop	0.0194045	0.0760984	0.904497
94728	Scarface	Someday	The Fix (2002)	Hip Hop	0.0289326	0.0727989	0.898269
102174	DMX	Thank You	Grand Champ (2003)	Hip Hop	0.0420652	0.0918535	0.866081
85518	Yelawolf	Ride or Die	Trial by Fire (2017)	Hip Hop	0.119646	0.019628	0.860726
30491	DMX	Fallin'	For The Love Of Money (2010)	Hip Hop	0.0965481	0.0566869	0.846765
56470	DMX	Jesus Loves Me	Walk With Me Now And You'll Fly With Me Later (The Mixtape) (2011)	Hip Hop	0.0592676	0.0952431	0.845489
69851	Eminem	Mockingbird	Encore (2004)	Hip Hop	0.0682687	0.0878662	0.843865
69854	Eminem	Mockingbird	Curtain Call: The Hits (2005)	Hip Hop	0.0682687	0.0878662	0.843865
25638	Ol'_Dirty_Bastard	Don't Go Breaking My Heart	A Son Unique (2005)	Hip Hop	0.0614408	0.0950812	0.843478
51547	Childish_Gambino	I. Flight of the Navigator	Because the Internet (2013)	Hip Hop	0.00828495	0.166198	0.825517
4056	Yelawolf	Alabama Gotdamn	Friday The 13th (2011)	Hip Hop	0.193608	0.00277894	0.803613
122788	Will_Smith	Willow Is a Player	Born to Reign (2002)	Hip Hop	0.105539	0.0913535	0.803108

Well damn. If Lauryn Hill is the alt rock hip hop queen, then Queen Latifah is the queen of country hip hop, at least lyrically.


## Visualizing Our Results


xxxxxxxxxx
I've also created a dashboard that you can play around with. It visualizes what we just did with our dataframes. Namely, you can look up which songs are most likely to belong to a different genre. In the upper left quadrant, you have the top 1,000 hip hop songs that have alt rock lyrics; you can also choose which genre you'd like to analyze with the drop down options. In the upper right quandrant, there's a table of the top 100 songs based on the filter of the upper left quadrant. In the lower left quadrant, you can see the lyrics weighted by tf-idf scores to allow you to visualize which words are hip hop, alt rock, and country. Lastly, in the lower right quadrant, you have a scatter plot with the tf-idf scores for each word for each genre. This graph is another way of visualizing the lower left quadrant. ​With these graphs, you'll get more insight into why exactly the model classified a song a certain way.​bTo get started, first select a song from the upper left scatter plot./b

I've also created a dashboard that you can play around with. It visualizes what we just did with our dataframes. Namely, you can look up which songs are most likely to belong to a different genre. In the upper left quadrant, you have the top 1,000 hip hop songs that have alt rock lyrics; you can also choose which genre you'd like to analyze with the drop down options. In the upper right quandrant, there's a table of the top 100 songs based on the filter of the upper left quadrant. In the lower left quadrant, you can see the lyrics weighted by tf-idf scores to allow you to visualize which words are hip hop, alt rock, and country. Lastly, in the lower right quadrant, you have a scatter plot with the tf-idf scores for each word for each genre. This graph is another way of visualizing the lower left quadrant.

With these graphs, you'll get more insight into why exactly the model classified a song a certain way.

**To get started, first select a song from the upper left scatter plot.** (The dashboard is best viewed on a non-mobile device.)

Try these songs to get you started:
Immortal Technique's , which is a rap song that has lots of alt rock lyrics.
Joan Jett and the Blackhearts's , an alt rock song with country lyrics.
Deftones' , an alt rock song with hip hop lyrics.

---

A quick note on the lower right scatter plot. For each genre-word combination, we have a tf-idf score. The genre that has the highest tf-idf score for a given word will have that genre's color (legend at the top). Additionally, the points are sized by tf (term frequency) to show how often that word is used within a certain genre. With this graph, you can get an idea of how lyrically dominant a certain genre is in a given song.
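To make the coloring and sizing rule concrete, here is a minimal sketch in plain Python. It is not the dashboard's actual code: it assumes each genre's lyrics are pooled into a single document, uses a smoothed idf of the same form as scikit-learn's default (without the L2 normalization), and emits one point per word. The toy lyrics and the names `GENRE_LYRICS`, `genre_word_scores`, and `dominant_genre` are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy lyrics pooled by genre; the real data has thousands of songs.
GENRE_LYRICS = {
    "Hip_Hop": "money flow money street rhyme",
    "Alt_Rock": "fade away scream street",
    "Country": "truck road whiskey road away",
}


def genre_word_scores(genre_lyrics):
    """Return {word: {genre: (tf, tfidf)}}, treating each genre as one document."""
    counts = {g: Counter(text.lower().split()) for g, text in genre_lyrics.items()}
    n_genres = len(counts)
    # Document frequency: how many genres use the word at least once.
    df = Counter(word for c in counts.values() for word in c)
    scores = {}
    for genre, c in counts.items():
        total = sum(c.values())
        for word, n in c.items():
            tf = n / total  # share of this genre's words that are `word`
            idf = math.log((1 + n_genres) / (1 + df[word])) + 1  # smoothed idf
            scores.setdefault(word, {})[genre] = (tf, tf * idf)
    return scores


def dominant_genre(by_genre):
    """Genre with the highest tf-idf for a word; this decides the point's color."""
    return max(by_genre, key=lambda g: by_genre[g][1])


scores = genre_word_scores(GENRE_LYRICS)

# One point per word: colored by the dominant genre, sized by tf in that genre.
points = {}
for word, by_genre in scores.items():
    genre = dominant_genre(by_genre)
    points[word] = {"color": genre, "size": by_genre[genre][0]}

for word, point in sorted(points.items()):
    print(word, point["color"], round(point["size"], 2))
```

A word used by only one genre always takes that genre's color. A shared word such as "street" goes to whichever genre uses it relatively more often, because the idf factor is the same across genres for a given word.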

These results look pretty good, even the alt rock songs. If you choose "Alt_Rock songs that have Hip_Hop lyrics", the top song is Rage Against The Machine's "F\*ck Tha Police", which has obvious hip hop overtones. Some may even say it *is* a hip hop song. Also near the top of that list are the songs born from the Jay-Z and Linkin Park collaboration. Again, these are arguably hip hop songs, so the classifier does well here.

Also, if you choose "Country songs that have Hip_Hop lyrics", you'll notice that the top song is Taylor Swift's "Thug Story" featuring T-Pain. The lyrics in the lower left box and the lower right tf-idf scatter plot will show that this song is lyrically hip hop even if musically it couldn't be further from it.
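The "X songs that have Y lyrics" filters can be thought of as ranking the songs labeled X by the probability the classifier assigns to genre Y. The sketch below shows that idea with scikit-learn's `CountVectorizer` and `MultinomialNB`. The five-song corpus, the titles, and the helper `songs_with_lyrics_of` are hypothetical, and for brevity it scores the same songs it was trained on, which you would avoid with real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: (title, labeled genre, lyrics).
songs = [
    ("rap_a", "Hip_Hop", "money flow rhyme street money"),
    ("rap_b", "Hip_Hop", "truck road whiskey rhyme"),
    ("country_a", "Country", "truck road whiskey dirt road"),
    ("country_b", "Country", "money flow rhyme truck"),
    ("rock_a", "Alt_Rock", "fade scream away fade"),
]
titles, genres, lyrics = zip(*songs)

# Bag-of-words counts fed into a multinomial naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(lyrics, genres)


def songs_with_lyrics_of(labeled, other, top_n=100):
    """Rank songs labeled `labeled` by the model's probability of genre `other`."""
    classes = list(model.named_steps["multinomialnb"].classes_)
    col = classes.index(other)
    probs = model.predict_proba(lyrics)[:, col]
    ranked = [(t, float(p)) for t, g, p in zip(titles, genres, probs) if g == labeled]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)[:top_n]


# "Country songs that have Hip_Hop lyrics"
top = songs_with_lyrics_of("Country", "Hip_Hop")
for title, prob in top:
    print(title, round(prob, 3))
```

Sorting on the probability column rather than the predicted label keeps songs that the model still classifies as country but that lean noticeably toward hip hop.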

## Up Next

Next, I'd like to perform some topic modeling on musical lyrics. But I may be putting most of my effort into [Achoo](https://tmthyjames.github.io/tools/prediction/Achoo-beta-0.1/) for the foreseeable future. Either way, I'll be reporting back soon.
