Using Naive Bayes to Predict a Song’s Genre Given its Lyrics
---
In the [last post](https://tmthyjames.github.io/posts/Analyzing-Rap-Lyrics-Using-Word-Vectors/) we analyzed rap lyrics using word vectors. We mostly did some first-pass analysis and very little prediction. In this post, we'll actually focus on predictions and visualizing our results. I'll use Python's machine-learning library, [scikit-learn](http://scikit-learn.org/stable/), to build a [naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) to predict a song's genre given its lyrics. To get the data, we'll use [Cypher](https://github.com/tmthyjames/cypher), a new Python package [I recently released](https://tmthyjames.github.io/tools/Cypher/) that retrieves music lyrics. To visualize the results, I'll use [D3](https://d3js.org/) and [D3Plus](https://d3plus.org/), which is a nice wrapper for D3.
## Contents
• [Quick Note on Naive Bayes](#Quick-Note-on-Naive-Bayes)<br/>
• [Getting the Data](#Getting-the-Data)<br/>
• [Loading the Data](#Loading-the-Data)<br/>
• [Splitting the Data](#Splitting-the-Data)<br/>
• [Training the Model](#Training-the-Model)<br/>
• [Top Hip Hop Songs](#Top-Hip-Hop-Songs)<br/>
• [Hip Hop Songs that have Alt Rock and Country Lyrics](#Hip-Hop-Songs-that-have-Alt-Rock-and-Country-Lyrics)<br/>
• [Visualizing Our Results](#Visualizing-Our-Results) (With d3.js)<br/>
• [Up Next](#Up-Next)<br/>
## Quick Note on Naive Bayes
The naive Bayes classifier is based on [Bayes' Theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem) and is known for its simplicity, accuracy, and speed, particularly when it comes to text classification, which is our aim for this post. In short, as Wikipedia puts it, Bayes' Theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. For example, if a musical genre is related to lyrics, then, with Bayes' Theorem, we can more accurately assess the probability that a certain song belongs to a particular genre, compared to an assessment made without knowledge of the song's lyrics. For more on Bayes' Theorem, check out this [post](https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/).
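To make the idea concrete, here's a minimal, standard-library-only sketch of a multinomial naive Bayes classifier. This is not the code used in this post (that comes later, with scikit-learn); the three training "songs", the `train` and `predict` helper names, and the smoothing value are all made up for illustration. Each genre is scored as log P(genre) plus the sum of log P(word | genre) over the words in the lyric, and the "naive" part is the assumption that words are independent of each other given the genre.

```python
import math
from collections import Counter, defaultdict

def train(docs, labels, alpha=1.0):
    """Collect the counts a multinomial naive Bayes model needs."""
    word_counts = defaultdict(Counter)   # genre -> word -> count
    doc_counts = Counter(labels)         # genre -> number of songs
    for text, label in zip(docs, labels):
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab, alpha

def predict(model, text):
    """Return the genre with the highest log posterior for `text`."""
    word_counts, doc_counts, vocab, alpha = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for genre in doc_counts:
        total_words = sum(word_counts[genre].values())
        score = math.log(doc_counts[genre] / total_docs)  # log prior
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # additive (Laplace) smoothing keeps unseen words from zeroing the score
            score += math.log(
                (word_counts[genre][word] + alpha)
                / (total_words + alpha * len(vocab))
            )
        scores[genre] = score
    return max(scores, key=scores.get)

# hypothetical toy data, not taken from the lyrics dataset
docs = [
    "truck beer porch truck",
    "flow rap street flow",
    "scream blood madness scream",
]
labels = ["Country", "Hip Hop", "alt rock"]
model = train(docs, labels)
print(predict(model, "porch truck"))
print(predict(model, "rap flow"))
```

The `alpha` argument here plays the same role as the `alpha` passed to `MultinomialNB` later in the post: it is the additive smoothing term.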
## Getting the Data
The data was retrieved with [Cypher](https://github.com/tmthyjames/cypher). The data and code used for this post are available on Cypher's [GitHub page](https://github.com/tmthyjames/cypher/tree/master/notebooks). Since the data takes so long to retrieve (there are over 900 artists), I plan on adding a feature to Cypher that lets the user load already-retrieved data if it exists; otherwise it will retrieve the data as usual. For now, you can just download it from the [GitHub page](https://github.com/tmthyjames/cypher/tree/master/notebooks).
I started this post with the intention of trying to classify 10 genres: pop, blues, heavy metal, classic rock, indie folk, RnB, punk rock, screamo, country, and rap.
I ran into a few problems with this as classic rock lyrically was very similar to country; indie folk was also similar to country; punk rock, heavy metal, and screamo were all similar; and RnB and rap were very similar. It's not surprising; as the number of classes grows, it becomes harder to correctly classify. I may write a post on my trouble with this approach if there is interest in it, or just post the results of trying to predict all 10 genres.
Anyways, to get the data, I used [Ranker](https://ranker.com) to get a list of the top 100 artists of each genre. They have a nice API endpoint you can hit to get all the artists so you don't have to web scrape.
## Loading the Data
To load the data, we'll use [pandas'](https://pandas.pydata.org/) `read_csv` method. We'll also clean up the genres due to the problems mentioned above about lyrical similarity. The three genres we'll try to predict are country, rap, and alt rock since those genres are clearly different. For our purposes, we'll classify metal, punk, and screamo as "alt rock". Here's how we do it:

    import pandas as pd
    import numpy as np

    df = pd.read_csv('lyrics.csv')

    # fold screamo, punk rock, and heavy metal into a single "alt rock" genre
    df['ranker_genre'] = np.where(
        (df['ranker_genre'] == 'screamo') |
        (df['ranker_genre'] == 'punk rock') |
        (df['ranker_genre'] == 'heavy metal'),
        'alt rock',
        df['ranker_genre']
    )

The data is available as one lyric per row. To train our classifier, we'll need to transform it into one *song* per row. We'll also go ahead and convert the data to lowercase with `.apply(lambda x: x.lower())`. To do that, we do the following:
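
    group = ['song', 'year', 'album', 'genre', 'artist', 'ranker_genre']

    # join every lyric line of a song into one lowercase string per song
    lyrics_by_song = df.sort_values(group)\
        .groupby(group).lyric\
        .apply(' '.join)\
        .apply(lambda x: x.lower())\
        .reset_index(name='lyric')

    # strip punctuation; regex=True is needed in pandas >= 2.0, where it is no longer the default
    lyrics_by_song['lyric'] = lyrics_by_song['lyric'].str.replace(r'[^\w\s]', '', regex=True)
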
## Splitting the Data
Next we'll split our data into a training set and a testing set using only Country, Alt Rock, and Hip Hop. A quick note: because the lyrics are community-sourced, some of the songs have incomplete or incorrect lyrics. A lot of the songs with fewer than 400 characters are just strings of nonsense characters. Therefore, I filtered those songs out as they didn't contribute any value or insight to the model.

    from sklearn.utils import shuffle
    from nltk.corpus import stopwords

    genres = [
        'Country', 'alt rock', 'Hip Hop',
    ]

    LYRIC_LEN = 400    # each song has to be longer than 400 characters
    N = 10000          # number of records to pull from each genre
    RANDOM_SEED = 200  # random seed to make results repeatable

    train_sets, test_sets = [], []

    for genre in genres:  # loop over each genre
        subset = lyrics_by_song[  # create a subset
            (lyrics_by_song.ranker_genre == genre) &
            (lyrics_by_song.lyric.str.len() > LYRIC_LEN)
        ]
        train_set = subset.sample(n=N, random_state=RANDOM_SEED)
        test_set = subset.drop(train_set.index)
        train_sets.append(train_set)  # collect subsets for the master sets
        test_sets.append(test_set)

    # DataFrame.append was removed in pandas 2.0, so concatenate instead
    train_df = shuffle(pd.concat(train_sets))
    test_df = shuffle(pd.concat(test_sets))

## Training the Model
Next, we'll train a model using word frequencies and `sklearn`'s `CountVectorizer`. The `CountVectorizer` is a quick and dirty way to train a language model by using simple word counts. Later we'll try a more sophisticated approach with the `TfidfVectorizer`.
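For intuition about what "simple word counts" means, here's a rough standard-library sketch of a bag-of-words representation. It only approximates what `CountVectorizer` does (the real class has its own tokenization rules and returns a sparse matrix); the two lyrics and the `bag_of_words` helper are hypothetical.

```python
from collections import Counter

def bag_of_words(docs):
    """Turn each document into a row of word counts over a shared vocabulary."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))  # shared, alphabetically ordered vocabulary
    matrix = [[c[word] for word in vocab] for c in counts]
    return vocab, matrix

# hypothetical lyrics for illustration only
vocab, matrix = bag_of_words([
    "sitting on my front porch",
    "sitting on my front porch sippin on cognac",
])
print(vocab)
print(matrix)
```

Each row is one song and each column is one word, which is the shape of input the naive Bayes classifier is trained on.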

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    # define our model
    text_clf = Pipeline(
        [('vect', CountVectorizer()),
         ('clf', MultinomialNB(alpha=0.1))])

    # train our model on training data
    text_clf.fit(train_df.lyric, train_df.ranker_genre)

    # score our model on testing data
    predicted = text_clf.predict(test_df.lyric)
    np.mean(predicted == test_df.ranker_genre)

Not a bad first-pass model!
Word frequencies work fine here, but let's see if we can get a better model by using the `TfidfVectorizer`.
`tf-idf` stands for "term frequency-inverse document frequency". `tf` summarizes how often a given word appears within a document, while `idf` scales down words that appear frequently across documents. For example, if we were trying to figure out which rap artists were lyrically similar, the term `police` may not be very helpful as almost every rapper uses this term. But the term `detroit` may carry more weight as only a handful of rappers use it. Thus, although `police` would have a higher `tf` score, `detroit` would have a higher `tf-idf` score and would be a more important feature in a language model.
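Here's a small sketch of that arithmetic, using the smoothed idf formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1. The ten-song corpus and the helper names `idf` and `tf_idf` are hypothetical, scikit-learn also L2-normalizes each row (skipped here), and whether the rarer word actually wins depends on the counts involved.

```python
import math

# hypothetical corpus: every song says "police", only the first says "detroit"
corpus = [["police", "police", "detroit"]] + [["police", "money"] for _ in range(9)]

def idf(term, docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in doc for doc in docs)  # number of documents containing the term
    return math.log((1 + len(docs)) / (1 + df)) + 1

def tf_idf(term, doc, docs):
    """Raw term count in `doc` scaled by the term's idf."""
    return doc.count(term) * idf(term, docs)

song = corpus[0]
print(song.count("police"), song.count("detroit"))                     # tf
print(tf_idf("police", song, corpus), tf_idf("detroit", song, corpus))  # tf-idf
```

In this toy corpus `police` appears more often in the song, but because it appears in every song its idf is the minimum possible, so `detroit` ends up with the larger `tf-idf` weight.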
So let's train a model using `tf-idf` scores as features.

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer

    # define our model
    text_clf = Pipeline(
        [('vect', TfidfVectorizer()),
         ('clf', MultinomialNB(alpha=0.1))])

    # train our model on training data
    text_clf.fit(train_df.lyric, train_df.ranker_genre)

    # score our model on testing data
    predicted = text_clf.predict(test_df.lyric)
    np.mean(predicted == test_df.ranker_genre)

Hmmm. Our model seems to have gotten worse. Let's try tuning a few hyperparameters, lemmatizing our data, customizing our tokenizer a bit, and filtering our words with `nltk`'s built-in stopword list.

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    from nltk import word_tokenize
    from nltk.stem import WordNetLemmatizer

    stop = list(set(stopwords.words('english')))  # nltk's built-in English stopwords
    wnl = WordNetLemmatizer()  # lemmatizer

    def tokenizer(x):  # custom tokenizer
        return [
            wnl.lemmatize(w)
            for w in word_tokenize(x)
            if len(w) > 2 and w.isalnum()  # only words longer than 2 characters
        ]                                  # that are alphanumeric

    # define our model
    text_clf = Pipeline(
        [('vect', TfidfVectorizer(
            ngram_range=(1, 2),  # include bigrams
            tokenizer=tokenizer,
            stop_words=stop,
            max_df=0.4,  # ignore terms that appear in more than 40% of documents
            min_df=4)),  # ignore terms that appear in fewer than 4 documents
         ('tfidf', TfidfTransformer()),  # re-weights the already tf-idf-weighted matrix
         ('clf', MultinomialNB(alpha=0.1))])

    # train our model on training data
    text_clf.fit(train_df.lyric, train_df.ranker_genre)

    # score our model on testing data
    predicted = text_clf.predict(test_df.lyric)
    np.mean(predicted == test_df.ranker_genre)

Hey! 1% better. I'll take it. We could keep tuning these hyperparameters to squeeze out more accuracy. For example, a more fine-tuned stopword list could help a lot; there are a [few strategies](https://stackoverflow.com/questions/16927494/how-to-select-stop-words-using-tf-idf-non-english-corpus) for constructing a good stopword list. For now, we'll go with our current model.
Now let's go beyond raw accuracy and see how it performs by looking at our confusion matrix for this model.
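
    from sklearn.metrics import confusion_matrix
    import matplotlib.pyplot as plt
    import seaborn as sns

    # pass labels explicitly so rows and columns line up with the tick labels
    mat = confusion_matrix(test_df.ranker_genre, predicted, labels=genres)
    sns.heatmap(
        mat.T, square=True, annot=True, fmt='d', cbar=False,
        xticklabels=genres,
        yticklabels=genres
    )
    plt.xlabel('true label')
    plt.ylabel('predicted label');
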
Given this confusion matrix, we can calculate precision, recall, and f-score, which can be better metrics for evaluating a classifier than raw accuracy.
<b>Recall</b> is the ability of the classifier to find all the positive results. That is, to classify a rap song *as* a rap song.
<b>Precision</b> is the ability of the classifier to not label a negative result as a positive one. That is, to not classify a country song as a rap song.
<b>F-score</b> is the [harmonic mean](https://en.wikipedia.org/wiki/Harmonic_mean) of precision and recall.
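As a quick sketch of how these definitions fall out of a confusion matrix, the snippet below computes them for one class by hand. The 3x3 matrix is invented for illustration (it is not this model's output), rows are true labels and columns are predicted labels, and `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(matrix, k):
    """Precision, recall, and F-score for class index k.

    matrix[i][j] = number of items whose true class is i and predicted class is j.
    """
    true_pos = matrix[k][k]
    predicted_k = sum(row[k] for row in matrix)  # column sum: everything labeled k
    actual_k = sum(matrix[k])                    # row sum: everything that really is k
    precision = true_pos / predicted_k
    recall = true_pos / actual_k
    fscore = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, fscore

# made-up counts, rows = true class, columns = predicted class
toy = [
    [80, 10, 10],
    [5, 90, 5],
    [20, 20, 60],
]
print(per_class_metrics(toy, 0))
```

A class can have high recall and low precision at the same time: that happens when the classifier finds most of the true members but also sweeps in many songs from other classes.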
To compute recall, precision, and f-score, we'll use `precision_recall_fscore_support` from `sklearn.metrics`.

    from sklearn.metrics import precision_recall_fscore_support

    # pass labels explicitly so the scores come back in the same order as `genres`
    precision, recall, fscore, support = precision_recall_fscore_support(
        test_df.ranker_genre, predicted, labels=genres)

    for n, genre in enumerate(genres):
        genre = genre.upper()
        print(genre+'_precision: {}'.format(precision[n]))
        print(genre+'_recall: {}'.format(recall[n]))
        print(genre+'_fscore: {}'.format(fscore[n]))
        print(genre+'_support: {}'.format(support[n]))
        print()

<b>Support</b> is the number of occurrences of each class in the true labels. The first thing I notice is that there aren't many alt rock songs being scored. Adding more alt rock songs could possibly improve our model.
We do a good job all around on classifying hip hop and country songs. For alt rock songs, the recall score is great; that is, when it's actually an alt rock song, the model classifies it as an alt rock song 94% of the time. But, as we can see from our alt rock precision score and confusion matrix, the model classifies many hip hop songs as alt rock (963, to be exact), which is the main reason this score is so low.
Let's throw some new data at our model and see how well it does predicting what genre these lyrics belong to.

    text_clf.predict(
        [
            "i stand for the red white and blue",
            "flow so smooth they say i rap in cursive",  # bars *insert fire emoji*
            "take my heart and carve it out",
            "there is no end to the madness",
            "sitting on my front porch drinking sweet tea",
            "sitting on my front porch sippin on cognac",
            "dog died and my pick up truck wont start",
            "im invisible and the drugs wont help",
            "i hope you choke in your sleep thinking of me",
            "i wonder what genre a song about data science and naive bayes and hyper parameters and maybe a little scatter plots would be"
        ]
    )

This seems to classify lyrics pretty well. Not sure about that last lyric though. But, then again, maybe the classifier does as good a job as any human would do classifying those cool data science lyrics?
## Top Hip Hop Songs
Let's retrieve the songs with the highest probability of being hip hop. I'm guessing this will be a prolific artist whose language influences the entire genre. First, though, we need to score each song and then merge the scores into our dataset.

    data = pd.concat([train_df, test_df])  # entire dataset
    predicts = text_clf.predict_proba(data.lyric)  # score each song
    data['Country'], data['Hip_Hop'], data['Alt_Rock'] = ['', '', '']  # create empty columns
    for n, row in enumerate(data.itertuples()):  # merge scores into our dataset
        data.loc[row.Index, 'Country'] = predicts[n][0]
        data.loc[row.Index, 'Hip_Hop'] = predicts[n][1]
        data.loc[row.Index, 'Alt_Rock'] = predicts[n][2]

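One thing worth double-checking in the loop above is which column of `predict_proba` belongs to which genre. scikit-learn orders the columns by the classifier's `classes_` attribute, which holds the sorted unique labels, and for strings that means uppercase letters sort before lowercase ones. A minimal sketch of that ordering in plain Python (the `genres` list is the one defined earlier; the `column_for` lookup is a hypothetical convenience):

```python
genres = ['Country', 'alt rock', 'Hip Hop']

# sorted unique labels, the same order scikit-learn uses for classes_
class_order = sorted(set(genres))
print(class_order)

# hypothetical lookup from label to predict_proba column index
column_for = {label: i for i, label in enumerate(class_order)}
print(column_for)
```

So columns 0, 1, and 2 are Country, Hip Hop, and alt rock, which matches the assignments in the loop even though `genres` lists them in a different order. Checking `text_clf.classes_` directly is the safest way to confirm this on a fitted model.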
The top 20 most-hip hop songs are:
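
    columns_of_interest = [
        'artist', 'song', 'album',
        'ranker_genre', 'Hip_Hop',
        'Alt_Rock', 'Country'
    ]

    # highest hip hop probability first, ties broken by lowest alt rock and country
    data[columns_of_interest]\
        .sort_values(['Hip_Hop', 'Alt_Rock', 'Country'], ascending=[False, True, True])\
        .head(20)
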
And the most hip hop song is Set it Off by Snoop Dogg, who also seems to be the most hip hop rapper, as he has 6 of the top 20 most hip hop songs. Also, it shouldn't be surprising that a lot of these songs are pre-2000, which is the age that hip hop really began to take shape. From this analysis, it seems a lot of the language of hip hop was being defined during those years.
Because the lyrics are community-sourced, there are some duplicate songs. In the real world, we'd want to get rid of these duplicate rows.
## Hip Hop Songs that have Alt Rock and Country Lyrics
Next, let's see which hip hop songs have the most alt rock lyrics. To do this, we'll query our data for only hip hop songs and then sort by the `Alt_Rock` column. I don't have any guesses as to which songs this will be. Maybe songs by Childish Gambino? Or Tech N9ne? Let's see.

    data[data.ranker_genre == 'Hip Hop'][columns_of_interest]\
        .sort_values(['Alt_Rock', 'Hip_Hop'], ascending=[False, True])\
        .head(20)  # Top 20

Wow. Didn't expect some of these results. Lauryn Hill seems to be the alt rock hip hop queen. Although Busta Rhymes has the most alt rock song, Lauryn Hill has 5 of the top 20 and, as we'll see from our visualization below, 12 of the top 100 most alt rock hip hop songs.
Now, let's see which hip hop songs have the most country lyrics. Again, no guesses. Maybe a southern rapper, like Ludacris or Yelawolf?

    data[data.ranker_genre == 'Hip Hop'][columns_of_interest]\
        .sort_values(['Country'], ascending=[False])\
        .head(20)

Well damn. If Lauryn Hill is the alt rock hip hop queen, then Queen Latifah is the queen of country hip hop, at least lyrically.
## Visualizing Our Results
I've also created a dashboard that you can play around with. It visualizes what we just did with our dataframes. Namely, you can look up which songs are most likely to belong to a different genre. In the upper left quadrant, you have the top 1,000 hip hop songs that have alt rock lyrics; you can also choose which genre you'd like to analyze with the drop-down options. In the upper right quadrant, there's a table of the top 100 songs based on the filter of the upper left quadrant. In the lower left quadrant, you can see the lyrics weighted by tf-idf scores to allow you to visualize which words are hip hop, alt rock, and country. Lastly, in the lower right quadrant, you have a scatter plot with the tf-idf scores for each word for each genre. This graph is another way of visualizing the lower left quadrant.
With these graphs, you'll get more insight into why exactly the model classified a song a certain way.
<b>To get started, first select a song from the upper left scatter plot.</b>
(Dashboard is best viewed on a non-mobile device.)
Try these songs to get you started:
Immortal Technique's , which is a rap song that has lots of alt rock lyrics.
Joan Jett and the Blackhearts's , an alt rock song with country lyrics.
Deftones' , an alt rock song with hip hop lyrics.
[Interactive dashboard: song scatter plot, table of top songs (Genre, Artist, Song, Hip_Hop, Alt_Rock, Country), tf-idf-weighted lyrics, and per-word tf-idf scatter plot. Click on a song in the scatter plot to see more.]
---
A quick note on the lower right scatter plot. For each genre-word combination, we have a `tf-idf` score. The genre that has the highest `tf-idf` score for a given word will have that genre's color (legend at the top). Additionally, the points are sized by `tf` (term frequency) to show how often that word is used within a certain genre. With this graph, you can get an idea of how lyrically dominant a certain genre is in a given song.
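Here's a rough sketch of that coloring-and-sizing rule, in case it helps to see it as code. The dashboard itself is built with D3; the scores, counts, and names below are invented for illustration, and sizing by the term frequency within the winning genre is my reading of the rule, not necessarily how the dashboard computes it.

```python
# hypothetical per-genre tf-idf scores and term frequencies for two words
tfidf_scores = {
    "truck": {"Country": 0.42, "Hip_Hop": 0.05, "Alt_Rock": 0.03},
    "flow": {"Country": 0.01, "Hip_Hop": 0.37, "Alt_Rock": 0.02},
}
term_freq = {
    "truck": {"Country": 12, "Hip_Hop": 2, "Alt_Rock": 1},
    "flow": {"Country": 0, "Hip_Hop": 9, "Alt_Rock": 1},
}

points = {}
for word, scores in tfidf_scores.items():
    top_genre = max(scores, key=scores.get)     # genre with the highest tf-idf
    points[word] = {
        "color_by": top_genre,                   # point takes that genre's color
        "size_by": term_freq[word][top_genre],   # point is sized by term frequency
    }
print(points)
```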
These results look pretty good, even the alt rock songs. If you choose "Alt_Rock songs that have Hip_Hop lyrics", the top song is Rage Against The Machine's "F\*ck Tha Police" which has obvious hip hop overtones. Some may even say it *is* a hip hop song. Also, among the top of that list are the songs birthed from the Jay-Z-Linkin Park collaboration. Again, arguably hip hop songs, so the classifier does well here.
Also, if you choose "Country songs that have Hip_Hop lyrics" you'll notice that the top song is Taylor Swift's Thug Story featuring T-Pain. The lyrics in the lower left box and the lower right tf-idf scatter plot will show that this song is lyrically hip hop even if musically it couldn't be further from it.
## Up Next
Next, I'd like to perform some topic modeling on musical lyrics. But I may be putting most of my effort into [Achoo](https://tmthyjames.github.io/tools/prediction/Achoo-beta-0.1/) for the foreseeable future. Either way, I'll be reporting back soon.