In the last post we analyzed rap lyrics using word vectors. We mostly did some first-pass analysis and very little prediction. In this post, we'll actually focus on predictions and visualizing our results. I'll use Python's machine-learning library, scikit-learn, to build a naive Bayes classifier to predict a song's genre given its lyrics. To get the data, we'll use Cypher, a new Python package I recently released that retrieves music lyrics. To visualize the results, I'll use D3 and D3Plus, which is a nice wrapper for D3.
The naive Bayes classifier is based on Bayes' Theorem and is known for its simplicity, accuracy, and speed, particularly when it comes to text classification, which is our aim in this post. In short, as Wikipedia puts it, Bayes' Theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. For example, if a musical genre is related to lyrics, then, with Bayes' Theorem, we can more accurately assess the probability that a certain song belongs to a particular genre than we could without knowledge of the song's lyrics. For more on Bayes' Theorem, check this post out.
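To make that concrete, here is the rule written out for our setting. The notation is introduced here for illustration only: $g$ is a genre and $w_1, \dots, w_n$ are the words of a song. The first equality is Bayes' Theorem; the final step is the "naive" assumption that words are independent given the genre, with the denominator dropped because it is the same for every genre.

```latex
P(g \mid w_1, \dots, w_n)
  = \frac{P(g)\, P(w_1, \dots, w_n \mid g)}{P(w_1, \dots, w_n)}
  \;\propto\; P(g) \prod_{i=1}^{n} P(w_i \mid g)
```

The classifier picks the genre $g$ that makes the right-hand side largest.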
The data was retrieved with Cypher. The data and code used for this post are available on Cypher's GitHub page. Since the data takes so long to retrieve (there are over 900 artists), I plan on adding a feature to Cypher that loads already-retrieved data if it exists and otherwise retrieves the data as normal. For now, you can just download it from the GitHub page.
I started this post with the intention of trying to classify 10 genres: pop, blues, heavy metal, classic rock, indie folk, RnB, punk rock, screamo, country, and rap.
I ran into a few problems with this as classic rock lyrically was very similar to country; indie folk was also similar to country; punk rock, heavy metal, and screamo were all similar; and RnB and rap were very similar. It's not surprising; as the number of classes grows, it becomes harder to correctly classify. I may write a post on my trouble with this approach if there is interest in it, or just post the results of trying to predict all 10 genres.
Anyways, to get the data, I used Ranker to get a list of the top 100 artists of each genre. They have a nice API endpoint you can hit to get all the artists so you don't have to web scrape.
To load the data, we'll use pandas' read_csv method. We'll also clean up the genres due to the problems mentioned above about lyrical similarity. The three genres we'll try to predict are country, rap, and alt rock since those genres are clearly different. For our purposes, we'll classify metal, punk, and screamo as "alt rock". Here's how we do it:
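A minimal sketch of this step follows. The column names (artist, song, lyric, genre), the sample rows, and the file name lyrics.csv are assumptions made for illustration, not the actual layout of the Cypher data.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV; column names and rows are hypothetical.
csv_text = """artist,song,lyric,genre
Artist A,Song A,first line,rap
Artist A,Song A,second line,rap
Artist B,Song B,some line,heavy metal
Artist C,Song C,some line,punk rock
Artist D,Song D,some line,country
Artist E,Song E,some line,pop
"""

# With a real file this would be something like pd.read_csv("lyrics.csv").
df = pd.read_csv(io.StringIO(csv_text))

# Collapse the lyrically similar genres and keep only the three targets.
genre_map = {
    "heavy metal": "Alt Rock",
    "punk rock": "Alt Rock",
    "screamo": "Alt Rock",
    "rap": "Hip Hop",
    "country": "Country",
}
df["genre"] = df["genre"].map(genre_map)  # unmapped genres become NaN
df = df.dropna(subset=["genre"]).reset_index(drop=True)

print(df["genre"].value_counts())
```

Using a mapping plus dropna keeps the relabeling and the filtering in one place, so adding or removing a genre is a one-line change.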
The data is available as one lyric per row. To train our classifier, we'll need to transform it into one song per row. We'll also go ahead and convert the data to lowercase with .apply(lambda x: x.lower()). To do that, we do the following:
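One way to do this reshaping is a groupby that joins each song's lines into a single string. This is a sketch under the assumption that the frame has artist, song, genre, and lyric columns; the rows are made up.

```python
import pandas as pd

# Hypothetical one-lyric-per-row data.
lines = pd.DataFrame({
    "artist": ["A", "A", "A", "B"],
    "song": ["S1", "S1", "S2", "S3"],
    "genre": ["Hip Hop", "Hip Hop", "Hip Hop", "Country"],
    "lyric": ["Hello World", "Second LINE", "Only line", "Howdy"],
})

# Join all lines that belong to the same song into one string.
songs = (
    lines.groupby(["artist", "song", "genre"], as_index=False)["lyric"]
    .agg(lambda s: " ".join(s))
)

# Lowercase the joined lyrics.
songs["lyric"] = songs["lyric"].apply(lambda x: x.lower())

print(songs)
```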
Next we'll split our data into a training set and a testing set using only Country, Alt Rock, and Hip Hop. A quick note: because the lyrics are community-sourced, some of the songs have incomplete or incorrect lyrics. A lot of the songs with fewer than 400 characters are just strings of nonsense characters. Therefore, I filtered those songs out, as they didn't contribute any value or insight to the model.
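The filter and split might look like the sketch below. The 400-character cutoff comes from the text; the test size, the random seed, and the toy rows are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical one-song-per-row data; two rows are too short to keep.
songs = pd.DataFrame({
    "lyric": ["la " * 200, "x", "yo " * 150, "hey " * 120, "zz", "na " * 180],
    "genre": ["Country", "Country", "Hip Hop", "Alt Rock", "Hip Hop", "Hip Hop"],
})

# Keep the three target genres and drop songs under 400 characters.
keep = (
    songs["genre"].isin(["Country", "Alt Rock", "Hip Hop"])
    & (songs["lyric"].str.len() >= 400)
)
songs = songs[keep]

# test_size and random_state are illustrative choices.
X_train, X_test, y_train, y_test = train_test_split(
    songs["lyric"], songs["genre"], test_size=0.25, random_state=42
)

print(len(X_train), "training songs,", len(X_test), "test songs")
```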
Next, we'll train a naive Bayes model using word frequencies. CountVectorizer is a quick and dirty way to train a language model by using simple word counts. Later we'll try a more sophisticated approach with tf-idf.
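A word-count model of this kind can be sketched with a scikit-learn pipeline. Pairing CountVectorizer with MultinomialNB is an assumption about the setup (the text only says naive Bayes), and the toy lyrics are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented, already-lowercased training lyrics.
train_lyrics = [
    "truck beer dirt road whiskey",
    "whiskey river truck boots",
    "money block hustle streets",
    "streets money rhymes mic",
    "scream bleed darkness rage",
    "rage darkness burn scream",
]
train_genres = ["Country", "Country", "Hip Hop", "Hip Hop", "Alt Rock", "Alt Rock"]

# CountVectorizer turns each song into word counts; MultinomialNB models them.
count_model = make_pipeline(CountVectorizer(), MultinomialNB())
count_model.fit(train_lyrics, train_genres)

pred = count_model.predict(["truck whiskey road", "money streets mic"])
print(pred)
```

Wrapping both steps in one pipeline means the same vocabulary is applied at training and prediction time without extra bookkeeping.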
Not a bad first-pass model!
Word frequencies work fine here, but let's see if we can get a better model by using tf-idf.
tf-idf stands for "term frequency-inverse document frequency". tf summarizes how often a given word appears within a document, while idf scales down words that appear frequently across documents. For example, if we were trying to figure out which rap artists were lyrically similar, the term police may not be very helpful, as almost every rapper uses this term. But the term detroit may carry more weight, as only a handful of rappers use it. Thus, although police would have a higher tf score, detroit would have a higher tf-idf score and would be a more important feature in a language model.
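The police/detroit intuition can be checked on a toy corpus. The three "documents" below are invented, and the idf values are whatever scikit-learn's default smoothed formula produces, which may differ from other tf-idf variants.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented documents: "police" is in all of them, "detroit" in only one.
docs = [
    "police police sirens block",
    "police detroit detroit mile",
    "police money streets",
]

vec = TfidfVectorizer()
vec.fit(docs)

# Map each word to its learned inverse document frequency.
idf = {word: vec.idf_[col] for word, col in vec.vocabulary_.items()}

print("idf(police) =", idf["police"])
print("idf(detroit) =", idf["detroit"])
```

The word that appears everywhere gets the smallest possible idf, so its raw counts are weighted less than those of the rarer word.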
So let's train a model using tf-idf scores as features.
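Swapping the features is a small change in scikit-learn: replace the count step with TfidfVectorizer. As before, this is a sketch on invented lyrics, and the choice of MultinomialNB is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_lyrics = [
    "truck beer dirt road whiskey",
    "whiskey river truck boots",
    "money block hustle streets",
    "streets money rhymes mic",
    "scream bleed darkness rage",
    "rage darkness burn scream",
]
train_genres = ["Country", "Country", "Hip Hop", "Hip Hop", "Alt Rock", "Alt Rock"]

test_lyrics = ["truck whiskey road", "money streets mic", "rage darkness scream"]
test_genres = ["Country", "Hip Hop", "Alt Rock"]

# Same classifier as before, but fed tf-idf weights instead of raw counts.
tfidf_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
tfidf_model.fit(train_lyrics, train_genres)

# Pipeline.score reports mean accuracy on the held-out songs.
accuracy = tfidf_model.score(test_lyrics, test_genres)
print("accuracy:", accuracy)
```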
Hmmm. Our model seems to have gotten worse. Let's try tuning a few hyperparameters, lemmatizing our data, customizing our tokenizer a bit, and filtering our words with nltk's built-in stopword list.
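The sketch below shows the general shape of such tuning, with two deliberate substitutions so it runs without extra downloads: scikit-learn's built-in English stop-word list stands in for nltk's, and lemmatization is left out. The specific values (bigrams, sublinear tf, alpha=0.1) are illustrative guesses, not the settings used for the results in this post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented lyrics that include common stop words.
train_lyrics = [
    "the truck and the whiskey on the dirt road",
    "a whiskey river and my old truck",
    "the money and the streets and the mic",
    "i got money on my block in the streets",
    "the scream in the darkness and the rage",
    "we burn in the rage and the darkness",
]
train_genres = ["Country", "Country", "Hip Hop", "Hip Hop", "Alt Rock", "Alt Rock"]

tuned_model = make_pipeline(
    TfidfVectorizer(
        stop_words="english",  # stand-in for nltk's stopword list
        ngram_range=(1, 2),    # unigrams and bigrams
        sublinear_tf=True,     # use 1 + log(tf) instead of raw tf
    ),
    MultinomialNB(alpha=0.1),  # lighter smoothing than the default of 1.0
)
tuned_model.fit(train_lyrics, train_genres)

vocabulary = tuned_model.named_steps["tfidfvectorizer"].vocabulary_
pred = tuned_model.predict(["whiskey in the truck"])
print(pred)
```

In practice these values would be chosen with a search such as scikit-learn's GridSearchCV rather than by hand.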
Hey! 1% better. I'll take it. We could keep tuning these hyperparameters to squeeze out more accuracy. For example, a more fine-tuned stopword list could help a lot; there are a few strategies for constructing a good stopword list. For now, we'll go with our current model.
Now let's go beyond raw accuracy and see how it performs by looking at our confusion matrix for this model.
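For reference, a confusion matrix can be built with scikit-learn as sketched here. The true and predicted labels are invented, so the counts are not the ones from our model.

```python
from sklearn.metrics import confusion_matrix

labels = ["Country", "Alt Rock", "Hip Hop"]

# Invented labels: one country and one hip hop song are mistaken for alt rock.
y_true = ["Country", "Country", "Alt Rock", "Hip Hop", "Hip Hop", "Hip Hop"]
y_pred = ["Country", "Alt Rock", "Alt Rock", "Hip Hop", "Alt Rock", "Hip Hop"]

# Rows are the true genres, columns are the predicted genres.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

Passing labels explicitly fixes the row and column order, which otherwise defaults to sorted label order.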
Given this confusion matrix, we can calculate precision, recall, and f-score, which can be better metrics for evaluating a classifier than raw accuracy.
Recall is the ability of the classifier to find all the positive results. That is, to classify a rap song as a rap song.
Precision is the ability of the classifier to not label a negative result as a positive one. That is, to not classify a country song as a rap song.
F-score is the harmonic mean of precision and recall.
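These three definitions can be computed directly from a confusion matrix and cross-checked against scikit-learn. The labels below are invented, and using precision_recall_fscore_support is an assumption about which helper fits here.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

labels = ["Country", "Alt Rock", "Hip Hop"]
y_true = ["Country", "Country", "Alt Rock", "Hip Hop", "Hip Hop", "Hip Hop"]
y_pred = ["Country", "Alt Rock", "Alt Rock", "Hip Hop", "Alt Rock", "Hip Hop"]

cm = confusion_matrix(y_true, y_pred, labels=labels)

# By hand: the diagonal holds the correctly classified songs per genre.
tp = np.diag(cm)
precision = tp / cm.sum(axis=0)  # correct / everything predicted as the genre
recall = tp / cm.sum(axis=1)     # correct / everything truly in the genre
fscore = 2 * precision * recall / (precision + recall)

# The same numbers from scikit-learn, plus the support (true count per genre).
p, r, f, support = precision_recall_fscore_support(y_true, y_pred, labels=labels)

for name, pi, ri, fi, si in zip(labels, p, r, f, support):
    print(f"{name.upper()}_precision: {pi}")
    print(f"{name.upper()}_recall: {ri}")
    print(f"{name.upper()}_fscore: {fi}")
    print(f"{name.upper()}_support: {si}")
```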
To compute recall, precision, and f-score, we'll use scikit-learn:
COUNTRY_precision: 0.9034659567125178
COUNTRY_recall: 0.9000933248194901
COUNTRY_fscore: 0.9017764873775898
COUNTRY_support: 20359
ALT ROCK_precision: 0.5072796934865901
ALT ROCK_recall: 0.9403409090909091
ALT ROCK_fscore: 0.6590343454454953
ALT ROCK_support: 1408
HIP HOP_precision: 0.9093471353899765
HIP HOP_recall: 0.8597383720930233
HIP HOP_fscore: 0.883847188324949
HIP HOP_support: 20640
Support is the number of each class in the actual true set. And the first thing I notice is that there aren't many alt rock songs being scored. Adding more alt rock songs could possibly improve our model.
We do a good job all around on classifying hip hop and country songs. For alt rock songs, the recall score is great; that is, when it's actually an alt rock song, the model classifies it as an alt rock song 94% of the time. But, as we can see from our alt rock precision score and confusion matrix, the model classifies many hip hop songs as alt rock (963, to be exact), which is the main reason this score is so low.
Let's throw some new data at our model and see how well it does predicting what genre these lyrics belong to.
array(['Country', 'Hip Hop', 'alt rock', 'alt rock', 'Country', 'Hip Hop', 'Country', 'alt rock', 'alt rock', 'Hip Hop'], dtype='U8')
This seems to classify lyrics pretty well. Not sure about that last lyric though. But, then again, maybe the classifier does as good a job as any human would do classifying those cool data science lyrics?
Let's retrieve the songs with the highest probability of being hip hop. I'm guessing this will be a prolific artist whose language influences the entire genre. First, though, we need to score each song and then merge the scores into our dataset.
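Scoring and merging could be done along these lines. It is a sketch on invented songs: the column names Hip_Hop, Alt_Rock, and Country are derived from the class labels, and the count-based model is just a stand-in for the tuned one.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented songs with hypothetical column names.
songs = pd.DataFrame({
    "song": ["S1", "S2", "S3", "S4", "S5", "S6"],
    "genre": ["Country", "Country", "Hip Hop", "Hip Hop", "Alt Rock", "Alt Rock"],
    "lyric": [
        "truck beer dirt road whiskey",
        "whiskey river truck boots",
        "money block hustle streets",
        "streets money rhymes mic",
        "scream bleed darkness rage",
        "rage darkness burn scream",
    ],
})

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(songs["lyric"], songs["genre"])

# One probability column per genre, aligned to the songs by index.
proba = pd.DataFrame(
    model.predict_proba(songs["lyric"]),
    columns=[genre.replace(" ", "_") for genre in model.classes_],
    index=songs.index,
)
scored = songs.join(proba)

# The songs the model considers most hip hop.
top = scored.sort_values("Hip_Hop", ascending=False).head(2)
print(top[["song", "genre", "Hip_Hop", "Alt_Rock", "Country"]])
```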
The top 20 most-hip hop songs are:
| Index | Artist | Song | Album | Genre | Hip_Hop | Alt_Rock | Country |
|---|---|---|---|---|---|---|---|
| 89883 | Snoop_Dogg | Set It Off | Tha Last Meal (2000) | Hip Hop | 1 | 1.02507e-12 | 2.59403e-13 |
| 42315 | 2Pac | Hit 'Em Up | Live (2004) | Hip Hop | 1 | 3.22494e-12 | 7.39967e-15 |
| 42316 | 2Pac | Hit 'em Up | Greatest Hits (1998) | Hip Hop | 1 | 3.25809e-12 | 8.41491e-15 |
| 34832 | Too_$hort | Get In Where You Fit In | Get In Where You Fit In (1993) | Hip Hop | 1 | 3.50784e-12 | 2.47664e-15 |
| 33580 | Snoop_Dogg | Freestyle Conversation | Tha Doggfather (1996) | Hip Hop | 1 | 3.01647e-12 | 9.0946e-13 |
| 25109 | Snoop_Dogg | Doggy Dogg World | Doggystyle (1993) | Hip Hop | 1 | 1.38902e-11 | 8.70554e-13 |
| 25110 | Snoop_Dogg | Doggy Dogg World | Death Row's Snoop Doggy Dogg At His Best (2001) | Hip Hop | 1 | 1.53723e-11 | 8.72738e-13 |
| 79165 | Twista | Overdose | Adrenaline Rush (1997) | Hip Hop | 1 | 1.76744e-11 | 2.68287e-14 |
| 25253 | Snoop_Dogg | Don Doggy | Paid Tha Cost To Be Da Bo$$ (2002) | Hip Hop | 1 | 1.8457e-11 | 1.12283e-12 |
| 91632 | MC_Ren | Shot Caller | Ruthless for Life (1998) | Hip Hop | 1 | 2.29e-11 | 6.95705e-13 |
| 85027 | Jay-Z | Reservoir Dogs | Vol. 2... Hard Knock Life (1998) | Hip Hop | 1 | 3.23132e-11 | 1.87116e-13 |
| 76063 | Scarface | O.G. To Me | The Last of a Dying Breed (2000) | Hip Hop | 1 | 3.2801e-11 | 2.43453e-13 |
| 2711 | Krayzie_Bone | A Thugga Level | Thug On Da Line (2001) | Hip Hop | 1 | 4.52958e-11 | 7.43026e-13 |
| 119078 | Too_$hort | What Happened to the Groupies | Can't Stay Away (1999) | Hip Hop | 1 | 4.58972e-11 | 4.83204e-12 |
| 50805 | T.I. | I'm Straight | King (2006) | Hip Hop | 1 | 4.30728e-11 | 7.70825e-12 |
| 19296 | Tech_N9ne | Come Gangsta | Bad Season (2010) | Hip Hop | 1 | 4.98534e-11 | 2.54802e-12 |
| 19295 | Tech_N9ne | Come Gangsta | Everready (The Religion) (2006) | Hip Hop | 1 | 4.98534e-11 | 2.54802e-12 |
| 97664 | MC_Ren | Still the Same Nigga | The Villain in Black (1996) | Hip Hop | 1 | 6.05274e-11 | 2.06619e-12 |
| 3266 | D12 | Activity As Phuctivity | The Underground EP (1997) | Hip Hop | 1 | 6.54908e-11 | 6.91093e-14 |
| 32646 | Snoop_Dogg | For All My Niggaz & Bitches | Doggystyle (1993) | Hip Hop | 1 | 6.36369e-11 | 2.30267e-12 |
And the most hip hop song is Set It Off by Snoop Dogg, who also seems to be the most hip hop rapper, as he has 6 of the top 20 most hip hop songs. Also, it shouldn't be surprising that a lot of these songs are pre-2000, the era when hip hop really began to take shape. From this analysis, it seems a lot of the language of hip hop was being defined during those years.
Because the lyrics are community-sourced, there are some duplicate songs. In the real world, we'd want to get rid of these duplicate rows.
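A sketch of that cleanup, using a few artist/song/album values from the table above: it treats two rows as duplicates when the artist matches and the titles match ignoring case. That matching rule is an assumption, and stricter or looser rules may suit the real data better.

```python
import pandas as pd

# Four rows taken from the table above; two songs appear on two albums each.
df = pd.DataFrame({
    "artist": ["2Pac", "2Pac", "Snoop_Dogg", "Snoop_Dogg"],
    "song": ["Hit 'Em Up", "Hit 'em Up", "Doggy Dogg World", "Doggy Dogg World"],
    "album": [
        "Live (2004)",
        "Greatest Hits (1998)",
        "Doggystyle (1993)",
        "Death Row's Snoop Doggy Dogg At His Best (2001)",
    ],
})

# Compare titles case-insensitively so "Hit 'Em Up" matches "Hit 'em Up".
df["song_key"] = df["song"].str.lower()
deduped = (
    df.drop_duplicates(subset=["artist", "song_key"], keep="first")
    .drop(columns="song_key")
)

print(deduped)
```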
Next, let's see which hip hop songs have the most alt rock lyrics. To do this, we'll query our data for only hip hop songs and then sort by the Alt_Rock column. I don't have any guesses as to which songs this will be. Maybe songs by Childish Gambino? Or Tech N9ne? Let's see.
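The query itself is a filter followed by a sort. The rows below are hypothetical placeholders; only the genre and Alt_Rock column names come from the text.

```python
import pandas as pd

# Hypothetical scored songs (one is not hip hop and should be filtered out).
scored = pd.DataFrame({
    "artist": ["Rapper A", "Rapper B", "Band C", "Rapper D"],
    "genre": ["Hip Hop", "Hip Hop", "Alt Rock", "Hip Hop"],
    "Alt_Rock": [0.10, 0.95, 0.99, 0.40],
})

# Keep hip hop songs only, then rank them by alt rock probability.
crossover = (
    scored[scored["genre"] == "Hip Hop"]
    .sort_values("Alt_Rock", ascending=False)
    .head(20)
)

print(crossover)
```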
| Index | Artist | Song | Album | Genre | Hip_Hop | Alt_Rock | Country |
|---|---|---|---|---|---|---|---|
| 109588 | Busta_Rhymes | There's Only One Year Left!!! (Intro) | E.L.E. (Extinction Level Event): The Final World Front (1998) | Hip Hop | 0.000587058 | 0.999413 | 7.39381e-08 |
| 3277 | Lauryn_Hill | Adam Lives In Theory | MTV Unplugged (2002) | Hip Hop | 0.00145159 | 0.998285 | 0.000263027 |
| 114762 | Immortal_Technique | Ultimas Palabras | The Martyr (2011) | Hip Hop | 0.00282868 | 0.997171 | 4.61514e-07 |
| 46223 | Lauryn_Hill | I Get Out | MTV Unplugged (2002) | Hip Hop | 0.00272789 | 0.996981 | 0.000291271 |
| 33562 | Lauryn_Hill | Freedom Time | MTV Unplugged (2002) | Hip Hop | 0.00346428 | 0.996294 | 0.000241836 |
| 61391 | Lupe_Fiasco | Letting Go | Lasers (2011) | Hip Hop | 0.00417445 | 0.994444 | 0.00138115 |
| 30230 | Kid_Cudi | Fade 2 Red | Speedin' Bullet 2 Heaven (2015) | Hip Hop | 0.00555762 | 0.993203 | 0.00123917 |
| 115250 | Lupe_Fiasco | Unforgivable Youth | Food & Liquor II: The Great American Rap Album Pt. 1 (2012) | Hip Hop | 0.00624787 | 0.993113 | 0.000638979 |
| 20096 | Kid_Cudi | Copernicus Landing | Satellite Flight: The Journey to Mother Moon (2014) | Hip Hop | 0.00505283 | 0.993016 | 0.00193124 |
| 19897 | Kid_Cudi | Confused! | Speedin' Bullet 2 Heaven (2015) | Hip Hop | 0.00505283 | 0.993016 | 0.00193124 |
| 68598 | Kid_Cudi | Melting | Speedin' Bullet 2 Heaven (2015) | Hip Hop | 0.00579108 | 0.991223 | 0.00298592 |
| 76331 | Lauryn_Hill | Oh Jerusalem | MTV Unplugged (2002) | Hip Hop | 0.00301647 | 0.991191 | 0.00579247 |
| 90235 | Yelawolf | Shadows | Trial by Fire (2017) | Hip Hop | 0.00304338 | 0.990284 | 0.00667227 |
| 57560 | Lauryn_Hill | Just Like Water | MTV Unplugged (2002) | Hip Hop | 0.00161961 | 0.98694 | 0.0114401 |
| 81732 | Common | Pops Belief | The Dreamer/The Believer (2011) | Hip Hop | 0.0102264 | 0.985932 | 0.00384123 |
| 106671 | Tech_N9ne | The Noose | Welcome to Strangeland (2011) | Hip Hop | 0.0164305 | 0.982261 | 0.00130803 |
| 100753 | Cypress_Hill | Take My Pain | Rise Up (2010) | Hip Hop | 0.018484 | 0.98088 | 0.000635567 |
| 27546 | Tech_N9ne | Drowning | Something Else (2013) | Hip Hop | 0.0161966 | 0.978309 | 0.00549422 |
| 28263 | Kid_Cudi | Edge of the Earth / Post Mortem Boredom | Speedin' Bullet 2 Heaven (2015) | Hip Hop | 0.0082987 | 0.977075 | 0.0146258 |
| 43911 | Chance_The_Rapper | How Great | Coloring Book (2016) | Hip Hop | 0.0240995 | 0.972694 | 0.00320653 |
Wow. Didn't expect some of these results. Lauryn Hill seems to be the alt rock hip hop queen. Although Busta Rhymes has the most alt rock song, Lauryn Hill has 5 of the top 20 and, as we'll see from our visualization below, 12 of the top 100 most alt rock hip hop songs.
Now, let's see which hip hop songs have the most country lyrics. Again, no guesses. Maybe a southern rapper, like Ludacris or Yelawolf?
| Index | Artist | Song | Album | Genre | Hip_Hop | Alt_Rock | Country |
|---|---|---|---|---|---|---|---|
| 35371 | Ghostface_Killah | Ghostface X-Mas | GhostDeini The Great (2008) | Hip Hop | 0.0104447 | 0.00121425 | 0.988341 |
| 51878 | Queen_Latifah | If I Had You | The Dana Owens Album (2004) | Hip Hop | 0.0099924 | 0.0174937 | 0.972514 |
| 25455 | Queen_Latifah | Don't Cry Baby | Trav'lin' Light (2007) | Hip Hop | 0.0229387 | 0.0271649 | 0.949896 |
| 107717 | Queen_Latifah | The Same Love That Made Me Laugh | The Dana Owens Album (2004) | Hip Hop | 0.0181416 | 0.0324947 | 0.949364 |
| 8226 | Childish_Gambino | Baby Boy | "Awaken, My Love!" (2016) | Hip Hop | 0.0130007 | 0.0379516 | 0.949048 |
| 101930 | Yelawolf | Tennessee Love | Trunk Muzik Returns (2013) | Hip Hop | 0.0537624 | 0.000793593 | 0.945444 |
| 101931 | Yelawolf | Tennessee Love | Love Story (2015) | Hip Hop | 0.0537624 | 0.000793593 | 0.945444 |
| 56413 | Bizzy_Bone | Jesus | The Gift (2001) | Hip Hop | 0.0556137 | 0.0303563 | 0.91403 |
| 31606 | Drake | Find Your Love | Thank Me Later (2010) | Hip Hop | 0.0194045 | 0.0760984 | 0.904497 |
| 94728 | Scarface | Someday | The Fix (2002) | Hip Hop | 0.0289326 | 0.0727989 | 0.898269 |
| 102174 | DMX | Thank You | Grand Champ (2003) | Hip Hop | 0.0420652 | 0.0918535 | 0.866081 |
| 85518 | Yelawolf | Ride or Die | Trial by Fire (2017) | Hip Hop | 0.119646 | 0.019628 | 0.860726 |
| 30491 | DMX | Fallin' | For The Love Of Money (2010) | Hip Hop | 0.0965481 | 0.0566869 | 0.846765 |
| 56470 | DMX | Jesus Loves Me | Walk With Me Now And You'll Fly With Me Later (The Mixtape) (2011) | Hip Hop | 0.0592676 | 0.0952431 | 0.845489 |
| 69851 | Eminem | Mockingbird | Encore (2004) | Hip Hop | 0.0682687 | 0.0878662 | 0.843865 |
| 69854 | Eminem | Mockingbird | Curtain Call: The Hits (2005) | Hip Hop | 0.0682687 | 0.0878662 | 0.843865 |
| 25638 | Ol'_Dirty_Bastard | Don't Go Breaking My Heart | A Son Unique (2005) | Hip Hop | 0.0614408 | 0.0950812 | 0.843478 |
| 51547 | Childish_Gambino | I. Flight of the Navigator | Because the Internet (2013) | Hip Hop | 0.00828495 | 0.166198 | 0.825517 |
| 4056 | Yelawolf | Alabama Gotdamn | Friday The 13th (2011) | Hip Hop | 0.193608 | 0.00277894 | 0.803613 |
| 122788 | Will_Smith | Willow Is a Player | Born to Reign (2002) | Hip Hop | 0.105539 | 0.0913535 | 0.803108 |
Well damn. If Lauryn Hill is the alt rock hip hop queen, then Queen Latifah is the queen of country hip hop, at least lyrically.
I've also created a dashboard that you can play around with. It visualizes what we just did with our dataframes. Namely, you can look up which songs are most likely to belong to a different genre. In the upper left quadrant, you have the top 1,000 hip hop songs that have alt rock lyrics; you can also choose which genre you'd like to analyze with the drop-down options. In the upper right quadrant, there's a table of the top 100 songs based on the filter of the upper left quadrant. In the lower left quadrant, you can see the lyrics weighted by tf-idf scores to allow you to visualize which words are hip hop, alt rock, and country. Lastly, in the lower right quadrant, you have a scatter plot with the tf-idf scores for each word for each genre. This graph is another way of visualizing the lower left quadrant.
With these graphs, you'll get more insight into why exactly the model classified a song a certain way.
To get started, first select a song from the upper left scatter plot. (Dashboard is best viewed on non-mobile device.)
Try these songs to get you started:
Immortal Technique's , which is a rap song that has lots of alt rock lyrics.
Joan Jett and the Blackhearts's , an alt rock song with country lyrics.
Deftones' , an alt rock song with hip hop lyrics.
A quick note on the lower right scatter plot. For each genre-word combination, we have a tf-idf score. The genre that has the highest tf-idf score for a given word will have that genre's color (legend at the top). Additionally, the points are sized by tf (term frequency) to show how often that word is used within a certain genre. With this graph, you can get an idea of how lyrically dominant a certain genre is in a given song.
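One way to get a score per genre-word pair is to treat all of a genre's lyrics as a single document and run tf-idf over the three documents. That is an assumption about how such scores could be computed, not necessarily how the dashboard does it, and the lyrics are invented.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented lyrics, with all songs of a genre concatenated into one document.
genre_docs = {
    "Hip_Hop": "money streets money mic love",
    "Alt_Rock": "scream rage love darkness rage",
    "Country": "truck whiskey love truck road",
}

vec = TfidfVectorizer()
matrix = vec.fit_transform(list(genre_docs.values()))

# Recover the words in column order, then build a word-by-genre score table.
words = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
scores = pd.DataFrame(matrix.toarray(), index=list(genre_docs), columns=words).T

# The genre with the highest tf-idf score decides each word's color.
scores["dominant_genre"] = scores.idxmax(axis=1)

print(scores)
```

Point sizes could come from a parallel CountVectorizer fit on the same three documents, which gives the raw term frequency per genre.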
These results look pretty good, even the alt rock songs. If you choose "Alt_Rock songs that have Hip_Hop lyrics", the top song is Rage Against The Machine's which has obvious hip hop overtones. Some may even say it is a hip hop song. Also, among the top of that list are the songs birthed from the Jay-Z-Linkin Park collaboration. Again, arguably hip hop songs, so the classifier does well here.
Also, if you choose "Country songs that have Hip_Hop lyrics" you'll notice that the top song is Taylor Swift's featuring T-Pain. The lyrics in the lower left box and the lower right tf-idf scatter plot will show that this song is lyrically hip hop even if musically it couldn't be further from it.