Adku: My Music Tastes are 13.3% Terrible, Says Algorithm

Thursday, January 12, 2012

Adku engineers devote 20% of their time to projects of their own choosing. These projects are usually a little further removed from our daily work and are a great outlet for chasing down some of our crazier ideas. Earlier, Leah wrote about Mildred, a “20% time” project for visualizing your friends’ Facebook likes. Visualizing data is a great way to intuitively spot patterns and draw some quick conclusions, but the real fun comes in automating that process with machine learning. Today, I’m going to explain how Latent Dirichlet Allocation works and how we used it to draw conclusions about people’s music tastes from their Facebook likes.

Latent Dirichlet Allocation (LDA) was invented in 2003 primarily for automatically inferring the topics in a text document (the original paper also mentions applications for collaborative filtering). For example, LDA could be used to determine that a scientific paper on bioinformatics is about evolutionary biology and computers, or perhaps that a particular court case is concerned with torts and constitutional law. It works by applying statistical analysis to the words in the document. For the bioinformatics paper, LDA would recognize that the document contained words related to evolutionary biology like “genome” and “selection” as well as computer words like “big O” and “algorithm.”

The basic idea behind LDA is that every document is generated by a simple probabilistic model. Suppose we wanted to randomly generate a document using this probabilistic model. To begin, we select the topic(s) of the document by drawing a green die out of a bag. There are many dice in the bag and each side of a die represents a topic. If we decide that there are 6 topics in all, then each green die will have 6 sides. Drawing a die that is heavily weighted towards the “computer” and “evolutionary biology” topics means that the document will be about computers and evolutionary biology. There are also 6 red dice--one die for each topic. Each side of a red die represents a word. The red die for the computer topic, for example, will be heavily weighted toward words like “RAM” and “processor.”

Here’s how each word in the document is generated. First, the green die is rolled to determine the topic of the word. The red die corresponding to that topic is then rolled to determine the word itself.

An example roll could have the green die giving us the topic “computers.” We then find the “computers” red die and roll it, giving us the word “RAM.” “RAM” is appended to the document. Note that the order of the words generated is not taken into account at all; the document is just a “bag of words.”
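The dice rolling described above can be sketched in a few lines of Python. This is only a toy illustration: the topics, words and weights below are made up for the example, and `draw_green_die`, `roll` and `generate_document` are hypothetical helper names, not part of any library.

```python
import random

# Hypothetical topics and word weights, chosen only for illustration.
TOPICS = ["computers", "evolutionary biology", "law"]

# One "red die" per topic: a weighted distribution over words.
RED_DICE = {
    "computers": {"RAM": 0.5, "processor": 0.3, "algorithm": 0.2},
    "evolutionary biology": {"genome": 0.6, "selection": 0.4},
    "law": {"tort": 0.7, "constitution": 0.3},
}

def draw_green_die(alpha, rng):
    """Draw a green die out of the bag: sample topic weights from a
    symmetric Dirichlet by normalizing independent gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in TOPICS]
    total = sum(draws)
    return {topic: d / total for topic, d in zip(TOPICS, draws)}

def roll(die, rng):
    """Roll a weighted die given as {face: weight} and return one face."""
    faces = list(die)
    weights = [die[face] for face in faces]
    return rng.choices(faces, weights=weights, k=1)[0]

def generate_document(green_die, n_words, rng):
    """Roll the green die for a topic, then that topic's red die for a word."""
    words = []
    for _ in range(n_words):
        topic = roll(green_die, rng)
        words.append(roll(RED_DICE[topic], rng))
    return words  # a bag of words: the order carries no meaning

rng = random.Random(0)

# A green die drawn at random from the bag.
random_die = draw_green_die(alpha=0.5, rng=rng)

# A hand-picked green die weighted towards computers and evolutionary biology.
green_die = {"computers": 0.6, "evolutionary biology": 0.4, "law": 0.0}
doc = generate_document(green_die, 10, rng)
print(doc)
```

Smaller values of `alpha` tend to produce green dice that are heavily weighted towards just a few topics, which is the kind of die described in the example.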

All the variables involved in generating documents--the green and red dice, the assortment of dice in the bag--are so-called latent variables. The dice represent multinomials; the bag of dice represents a Dirichlet. In the example above, we assumed that the latent variables were known to us (i.e. we had dice and bags of dice to draw from). In real life, however, the latent variables are hidden from us. The only information we have is the words that are in each document. The goal of LDA is to figure out the values of the latent variables by working backwards (using Bayesian inference) from a collection of documents, and allocate each word in each document to a particular topic.
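To make "working backwards" a bit more concrete, here is a minimal sketch of one common way to do the inference: collapsed Gibbs sampling. This is an assumption on my part for illustration; it is not necessarily the method we ran, and the original LDA paper uses a variational approach instead. The function name `lda_gibbs` and the values of `alpha`, `beta` and `n_iters` are hypothetical choices.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (a sketch, not tuned for speed).

    docs is a list of documents, each a list of word strings.
    Returns (z, doc_topic, topic_word), where z[d][i] is the topic
    allocated to word i of document d.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * n_topics for _ in docs]               # words per topic, per document
    topic_word = [defaultdict(int) for _ in range(n_topics)]  # count of each word, per topic
    topic_total = [0] * n_topics                              # total words per topic

    # Start from a random allocation of every word to a topic.
    z = []
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            z[d].append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this word's current allocation from the counts.
                k = z[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Weight of each topic given every other allocation:
                # (how much the document likes the topic) *
                # (how much the topic likes the word).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights, k=1)[0]

                # Record the new allocation.
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return z, doc_topic, topic_word

# Toy documents built from the example words used earlier.
docs = [
    ["genome", "selection", "genome", "selection"],
    ["RAM", "processor", "algorithm", "RAM"],
    ["genome", "algorithm", "selection", "processor"],
]
z, doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
for d, counts in enumerate(doc_topic):
    print(f"document {d}: words per topic = {counts}")
```

The sampler never sees the dice. It only sees the words, and it repeatedly re-allocates each word to a topic based on how all the other words are currently allocated.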

Now that we know a little bit about how LDA works, applying it to Facebook likes is straightforward. Each person represents a document, and that person's likes are analogous to the words in a document. The Mildred visualization indicates that people tend to have more music likes than likes of other kinds, so we'll just look at music likes for now. This makes it easier to get an intuitive grasp of how good the LDA results are. If the topics recovered by LDA end up matching recognizable music genres, then it's probably doing something right.
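As a rough sketch of that mapping, the snippet below groups likes by person and keeps only the music ones. The record layout, the names and the `build_documents` helper are all hypothetical; this is not how our pipeline or Facebook's API actually represents likes.

```python
from collections import defaultdict

# Hypothetical records of (person, category, liked page), for illustration only.
likes = [
    ("alice", "Music", "Radiohead"),
    ("alice", "Music", "Wilco"),
    ("alice", "Movies", "Inception"),
    ("bob", "Music", "Jay-Z"),
]

def build_documents(records, category="Music"):
    """Turn like records into one 'document' (a list of likes) per person."""
    docs = defaultdict(list)
    for person, cat, page in records:
        if cat == category:
            docs[person].append(page)
    return dict(docs)

documents = build_documents(likes)
print(documents)
# {'alice': ['Radiohead', 'Wilco'], 'bob': ['Jay-Z']}
```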

The five most common likes for each topic as computed by LDA are shown in the tables below. The number of topics was arbitrarily set at 20. Some of the topics are labelled with a very heavy hand :-).

1. Indie music I (n=869)	2. Contemporary Pop (n=709)	3. Oldies Pop (n=740)	4. Rap and Hip Hop (n=736)	5. 90’s (n=643)
Radiohead	Rihanna	Pink Floyd	Jay-Z	Red Hot Chili Peppers
Sufjan Stevens	Lady Gaga	The Beatles	Kanye West	Incubus
Bon Iver	Beyoncé	Radiohead	Lil Wayne	Nirvana
Belle & Sebastian	Kanye West	Queen	Kid Cudi	Sublime
Wilco	Katy Perry	Metallica	Eminem	Pearl Jam

6. Pop (n=576)	7. Electronic/Trance/House (n=442)	8. 00’s (n=759)	9. 90’s/00’s I (n=400)	10. (n=375)
Lady Gaga	Tiësto	Death Cab for Cutie	U2	Lady Gaga
Taylor Swift	deadmau5	Coldplay	Muse	Coldplay
Michael Jackson	Lady Gaga	The Killers	Coldplay	Jay Chou
Beyoncé	Daft Punk	Muse	Weezer	THE FU
Katy Perry	Armin van Buuren	Weezer	Oasis	Eminem

11. Emotional Male Singers I (n=404)	12. Classics (n=700)	13. (n=520)	14. Rock (n=376)	15. (n=276)
John Mayer	The Beatles	Bob Marley	Red Hot Chili Peppers	Lady Gaga
Dave Matthews Band	Bob Dylan	Jack Johnson	AC/DC	Carousel
U2	Pink Floyd	Johnny Cash	Aerosmith	The Beatles
Jason Mraz	Queen	Ben Harper	Queen	Dream Theater
Jack Johnson	The Doors	O.A.R.	Van Halen	Justin Diamond

16. Emotional Male Singers II (n=620)	17. 90’s/00’s II (n=872)	18. Indie Music II (n=864)	19. (n=497)	20. Classical (n=459)
Ben Folds	Linkin Park	Radiohead	Regina Spektor	The Beatles
Coldplay	Coldplay	Arcade Fire	The Beatles	Beethoven
Jack Johnson	Green Day	Daft Punk	Frank Sinatra	Mozart
Dave Matthews Band	Nickelback	Beck	Coldplay	Chopin
John Mayer	The Fray	MGMT	Michael Bublé	Bach

Table 1. The twenty topics recovered by running LDA on Facebook music likes. Only the five most common likes for each topic are shown. The number of likes classified within each topic is shown in parentheses. 1517 people with a collective total of 11837 music likes were analyzed.

Just for fun, let’s take a look at how LDA classifies my music tastes. I have 15 music likes:

Vampire Weekend, Arctic Monkeys, Death Cab for Cutie, Eric Clapton, Jack Johnson, The Killers, Oasis, Pink Floyd, Postal Service, Radiohead, Third Eye Blind, Weezer, Wilco, Muse and The Beatles. The majority of my likes (9/15) are classified under topic 8: 00’s music. Three of my likes are classified as topic 3, Oldies Pop. I also have two likes in topic 16 and one like in topic 19.
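Turning those counts into percentages is simple arithmetic. The short snippet below uses only the per-topic counts stated above, since the allocation of each individual like is not listed.

```python
# Number of my likes allocated to each topic, as reported above.
counts = {8: 9, 3: 3, 16: 2, 19: 1}
total = sum(counts.values())  # 15 likes in all

shares = {topic: n / total for topic, n in counts.items()}
for topic, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"topic {topic}: {share:.1%}")
# topic 8: 60.0%
# topic 3: 20.0%
# topic 16: 13.3%
# topic 19: 6.7%
```

The two likes in topic 16 work out to 2/15, which matches the 13.3% in the title.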

We can also look at Carlos’ music preferences. He likes The Smashing Pumpkins, The Killers, Coldplay, Pearl Jam, The Verve, Weezer, Quartus, Michael Jackson and Bush. Carlos’ likes are evenly split between topics 3, 8 and 1: Oldies Pop, 00’s and Indie Music.

You could imagine several ways of building a simple recommendation engine from these results. A simple one is to use the generative process, just like how we generated a “computer” document. The music recommendations for Carlos would come from a document that is generated from the Oldies Pop, 00’s and Indie Music topics. These recommendations are explainable. We can tell Carlos that Metallica was recommended because he likes Oldies Pop.
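Here is a minimal sketch of what that could look like, assuming we only know the top five likes per topic from Table 1. The real red dice have a weight for every like, which is not shown here, so the sketch treats the five top likes as equally likely (a simplification). The `recommend` function and its parameters are hypothetical.

```python
import random

# Top five likes for Carlos' three topics, copied from Table 1.
TOP_LIKES = {
    "Indie Music I": ["Radiohead", "Sufjan Stevens", "Bon Iver",
                      "Belle & Sebastian", "Wilco"],
    "Oldies Pop": ["Pink Floyd", "The Beatles", "Radiohead", "Queen", "Metallica"],
    "00's": ["Death Cab for Cutie", "Coldplay", "The Killers", "Muse", "Weezer"],
}

def recommend(topic_mix, already_liked, n, rng, max_rolls=1000):
    """Roll the green die for a topic, then pick a like from that topic.

    Returns {artist: topic}, so every recommendation carries its explanation.
    Assumption: likes within a topic are equally weighted.
    """
    topics = list(topic_mix)
    weights = [topic_mix[t] for t in topics]
    recs = {}
    for _ in range(max_rolls):
        if len(recs) >= n:
            break
        topic = rng.choices(topics, weights=weights, k=1)[0]
        artist = rng.choice(TOP_LIKES[topic])
        # Skip artists the person already likes or has already been given.
        if artist not in already_liked and artist not in recs:
            recs[artist] = topic
    return recs

carlos_likes = {"The Smashing Pumpkins", "The Killers", "Coldplay", "Pearl Jam",
                "The Verve", "Weezer", "Quartus", "Michael Jackson", "Bush"}
# Carlos' likes are evenly split between the three topics.
carlos_mix = {"Oldies Pop": 1 / 3, "00's": 1 / 3, "Indie Music I": 1 / 3}

recs = recommend(carlos_mix, carlos_likes, n=3, rng=random.Random(0))
for artist, topic in recs.items():
    print(f"{artist}: recommended because you like {topic}")
```

Because each recommendation is stored alongside the topic that produced it, the explanation comes for free.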

A fully fleshed out recommendation engine sounds like another 20% project, though. Until then, try not to snicker at my music tastes too much!