Adku engineers devote 20% of their time to projects of their own choosing. These projects are usually a little further removed from our daily work and are a great outlet for chasing down some of our crazier ideas. Earlier, Leah wrote about Mildred, a “20% time” project for visualizing your friends’ Facebook likes. Visualizing data is a great way to intuitively spot patterns and draw some quick conclusions, but the real fun comes in automating that process with machine learning. Today, I’m going to explain how Latent Dirichlet Allocation works and how we used it to draw conclusions about people’s music tastes from their Facebook likes.
Latent Dirichlet Allocation (LDA) was invented in 2003 primarily for automatically inferring the topics in a text document (the original paper also mentions applications for collaborative filtering). For example, LDA could be used to determine that a bioinformatics scientific paper is about evolutionary biology and computers, or perhaps that a particular court case is concerned with torts and constitutional law. It works by applying statistical analysis to the words in the document. For the bioinformatics paper, LDA would recognize that the document contained words related to evolutionary biology like “genome” and “selection” as well as computer words like “big O” and “algorithm.”
The basic idea behind LDA is that every document is generated by a simple probabilistic model. Suppose we wanted to randomly generate a document using this probabilistic model. To begin, we select the topic(s) of the document by drawing a green die out of a bag. There are many dice in the bag and each side of a die represents a topic. If we decide that there are 6 topics in all, then each green die will have 6 sides. Drawing a die that is heavily weighted towards the “computer” and “evolutionary biology” topics means that the document will be about computers and evolutionary biology. There are also 6 red dice--one die for each topic. Each side of a red die represents a word. The red die for the computer topic, for example, will be heavily weighted toward words like “RAM” and “processor.”
Here’s how each word in the document is generated. First, the green die is rolled to determine the topic of the word. The red die corresponding to that topic is then rolled to determine the word itself.
An example roll could have the green die giving us the topic “computers.” We then find the “computers” red die and roll it, giving us the word “RAM.” “RAM” is appended to the document. Note that the order of the words generated is not taken into account at all; the document is just a “bag of words.”
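The dice rolling above can be sketched in a few lines of Python. This is only a toy illustration: the two topics, the tiny vocabulary and all the weights are made up for the example, and `draw_green_die`, `roll` and `generate_document` are hypothetical helper names, not part of any library.

```python
import random

# Hypothetical red dice: one weighted die (word distribution) per topic.
RED_DICE = {
    "computers": {"RAM": 0.4, "processor": 0.4, "algorithm": 0.2},
    "evolutionary biology": {"genome": 0.5, "selection": 0.4, "algorithm": 0.1},
}

def draw_green_die(topics, alpha, rng):
    """Draw a green die out of the bag: a Dirichlet sample over topics,
    built by normalizing independent gamma draws."""
    raw = [rng.gammavariate(alpha, 1.0) for _ in topics]
    total = sum(raw)
    return {topic: value / total for topic, value in zip(topics, raw)}

def roll(die, rng):
    """Roll a weighted die given as {face: probability}."""
    faces = list(die)
    weights = [die[face] for face in faces]
    return rng.choices(faces, weights=weights, k=1)[0]

def generate_document(green_die, red_dice, n_words, rng):
    """Build a bag of words: roll the green die for a topic,
    then roll that topic's red die for the word itself."""
    words = []
    for _ in range(n_words):
        topic = roll(green_die, rng)
        words.append(roll(red_dice[topic], rng))
    return words

rng = random.Random(0)
green_die = draw_green_die(list(RED_DICE), alpha=0.5, rng=rng)
doc = generate_document(green_die, RED_DICE, n_words=10, rng=rng)
print(green_die)
print(doc)
```

Because word order is ignored, shuffling `doc` would give an equally valid document under this model.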
All the variables involved in generating documents (the green and red dice, and the assortment of dice in the bag) are so-called latent variables. The dice represent multinomial distributions; the bag of dice represents a Dirichlet distribution. In the example above, we assumed that the latent variables were known to us (i.e. we had dice and bags of dice to draw from). In real life, however, the latent variables are hidden from us. The only information we have is the words in each document. The goal of LDA is to figure out the values of the latent variables by working backwards (using Bayesian inference) from a collection of documents, and to allocate each word in each document to a particular topic.
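For readers who prefer symbols to dice, the same story can be written down compactly. The notation here is the one commonly used for LDA and is an addition to this post: θ_d is the green die for document d, β_k is the red die for topic k, z_{d,n} is the topic of the n-th word and w_{d,n} is the word itself.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
  && \text{draw a green die out of the bag} \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d)
  && \text{roll the green die to pick a topic} \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Multinomial}(\beta_{z_{d,n}})
  && \text{roll that topic's red die to pick a word}
\end{align*}

% Working backwards: the posterior over the hidden variables given the words
\[
p(\theta, z \mid w, \alpha, \beta)
  = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}
\]
```

The denominator is what makes exact inference hard, which is why LDA is fitted in practice with approximate methods such as variational inference or sampling.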
Now that we know a little bit about how LDA works, applying it to Facebook likes is straightforward. Each person represents a document, and that person's likes are analogous to the words in a document. The Mildred visualization indicates that people tend to have more music likes than other kinds of likes, so we'll just look at music likes for now. This makes it easier to get an intuitive grasp of how good the LDA results are. If the topics recovered by LDA end up matching recognizable music genres, then it's probably doing something right.
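To make the person-as-document mapping concrete, here is a minimal sketch of fitting LDA with collapsed Gibbs sampling, one common approximate inference method. This post does not say which inference method or library was actually used, so treat the sampler, the hyperparameters `alpha` and `beta`, and the four made-up people below as assumptions for illustration only.

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: one list of likes per person.
    Returns (assignments, topic_item), where assignments[d][i] is the topic
    allocated to the i-th like of person d, and topic_item[k] counts how
    often each like was allocated to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({item for doc in docs for item in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # per person, likes per topic
    topic_item = [Counter() for _ in range(n_topics)]  # per topic, count of each like
    topic_total = [0] * n_topics                       # total likes per topic

    # Start from a random allocation of every like to a topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for item in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_item[k][item] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, item in enumerate(doc):
                # Take this like out of the counts...
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_item[k][item] -= 1
                topic_total[k] -= 1
                # ...then weight each topic by how much this person uses it
                # times how strongly the topic favors this like.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_item[t][item] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights, k=1)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_item[k][item] += 1
                topic_total[k] += 1
    return assignments, topic_item

# Four made-up people; each list of likes plays the role of a document.
people = [
    ["Radiohead", "Bon Iver", "Wilco", "Sufjan Stevens"],
    ["Bon Iver", "Wilco", "Radiohead", "Belle & Sebastian"],
    ["Jay-Z", "Kanye West", "Lil Wayne", "Eminem"],
    ["Kanye West", "Eminem", "Kid Cudi", "Jay-Z"],
]
assignments, topic_item = lda_gibbs(people, n_topics=2)
for k, counts in enumerate(topic_item):
    print(k, [item for item, n in counts.most_common(3) if n > 0])
```

With real data you would more likely reach for an existing LDA implementation, but the loop above is the whole idea: repeatedly reallocate each like to a topic given everyone else's allocations.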
The five most common likes for each topic as computed by LDA are shown in the tables below. The number of topics was arbitrarily set at 20. Some of the topics are labelled with a very heavy hand :-).
| 1. Indie music I (n=869) | 2. Contemporary Pop (n=709) | 3. Oldies Pop (n=740) | 4. Rap and Hip Hop (n=736) | 5. 90’s (n=643) |
|---|---|---|---|---|
| Radiohead | Rihanna | Pink Floyd | Jay-Z | Red Hot Chili Peppers |
| Sufjan Stevens | Lady Gaga | The Beatles | Kanye West | Incubus |
| Bon Iver | Beyoncé | Radiohead | Lil Wayne | Nirvana |
| Belle & Sebastian | Kanye West | Queen | Kid Cudi | Sublime |
| Wilco | Katy Perry | Metallica | Eminem | Pearl Jam |

| 6. Pop (n=576) | 7. Electronic/Trance/House (n=442) | 8. 00’s (n=759) | 9. 90’s/00’s I (n=400) | 10. (n=375) |
|---|---|---|---|---|
| Lady Gaga | Tiësto | Death Cab for Cutie | U2 | Lady Gaga |
| Michael Jackson | Lady Gaga | The Killers | Coldplay | Jay Chou |
| Beyoncé | Daft Punk | Muse | Weezer | THE FU |
| Katy Perry | Armin van Buuren | Weezer | Oasis | Eminem |

| 11. Emotional Male Singers I (n=404) | 12. Classics (n=700) | 13. (n=520) | 14. Rock (n=376) | 15. (n=276) |
|---|---|---|---|---|
| John Mayer | The Beatles | Bob Marley | Red Hot Chili Peppers | Lady Gaga |
| Dave Matthews Band | Bob Dylan | Jack Johnson | AC/DC | Carousel |
| U2 | Pink Floyd | Johnny Cash | Aerosmith | The Beatles |
| Jason Mraz | Queen | Ben Harper | Queen | Dream Theater |
| Jack Johnson | The Doors | O.A.R. | Van Halen | Justin Diamond |

| 16. Emotional Male Singers II (n=620) | 17. 90’s/00’s II (n=872) | 18. Indie Music II (n=864) | 19. (n=497) | 20. Classical (n=459) |
|---|---|---|---|---|
| Ben Folds | Linkin Park | Radiohead | Regina Spektor | The Beatles |
| Coldplay | Coldplay | Arcade Fire | The Beatles | Beethoven |
| Jack Johnson | Green Day | Daft Punk | Frank Sinatra | Mozart |
| Dave Matthews Band | Nickelback | Beck | Coldplay | Chopin |
| John Mayer | The Fray | MGMT | Michael Bublé | Bach |

Table 1. The twenty topics recovered by running LDA on Facebook music likes. Only the five most common likes for each topic are shown. The number of likes classified within each topic is shown in parentheses. 1517 people with a collective total of 11837 music likes were analyzed.
Just for fun, let’s take a look at how LDA classifies my music tastes. I have 15 music likes:
Vampire Weekend, Arctic Monkeys, Death Cab for Cutie, Eric Clapton, Jack Johnson, The Killers, Oasis, Pink Floyd, Postal Service, Radiohead, Third Eye Blind, Weezer, Wilco, Muse and The Beatles. The majority of my likes (9/15) are classified under topic 8: 00’s music. Three of my likes are classified as topic 3, Oldies Pop. I also have two likes in topic 16 and one like in topic 19.
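Those per-topic counts are effectively an estimate of my own "green die." A quick sketch of the arithmetic, using only the counts quoted above (the variable names are just for illustration):

```python
from collections import Counter

# How many of the 15 likes were allocated to each topic, as reported above.
topic_counts = Counter({8: 9, 3: 3, 16: 2, 19: 1})
total = sum(topic_counts.values())

# Normalize the counts into a topic mixture, largest share first.
mixture = {topic: count / total for topic, count in topic_counts.most_common()}
for topic, share in mixture.items():
    print(f"topic {topic}: {share:.0%}")
# topic 8: 60%
# topic 3: 20%
# topic 16: 13%
# topic 19: 7%
```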
We can also look at Carlos’ music preferences. He likes The Smashing Pumpkins, The Killers, Coldplay, Pearl Jam, The Verve, Weezer, Quartus, Michael Jackson and Bush. Carlos’ likes are evenly split between topics 3, 8 and 1: Oldies Pop, 00’s and Indie Music.
You could imagine several ways of building a simple recommendation engine from these results. One is to use the generative process, just as we did when generating a document about computers. The music recommendations for Carlos would come from a document generated from the Oldies Pop, 00’s and Indie Music topics. These recommendations are explainable: we can tell Carlos that Metallica was recommended because he likes Oldies Pop.
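Here is a rough sketch of what that generative recommender could look like. It is not something we have built: `recommend` is a hypothetical helper, and since Table 1 only shows the top likes per topic, the sketch pretends each listed like is equally probable within its topic. A real version would use the fitted per-topic weights instead.

```python
import random

# Top likes for Carlos' three topics, copied from Table 1.
# Assumption: every like listed for a topic is equally probable.
TOPIC_TOP_LIKES = {
    "Oldies Pop": ["Pink Floyd", "The Beatles", "Radiohead", "Queen", "Metallica"],
    "00's": ["Death Cab for Cutie", "The Killers", "Muse", "Weezer"],
    "Indie music I": ["Radiohead", "Sufjan Stevens", "Bon Iver",
                      "Belle & Sebastian", "Wilco"],
}

def recommend(topic_mixture, topic_likes, already_liked, n, rng, max_rolls=1000):
    """Sample up to n new likes, remembering which topic produced each one."""
    topics = list(topic_mixture)
    weights = [topic_mixture[t] for t in topics]
    recommendations = {}
    for _ in range(max_rolls):
        if len(recommendations) >= n:
            break
        topic = rng.choices(topics, weights=weights, k=1)[0]  # roll the green die
        like = rng.choice(topic_likes[topic])                 # roll the red die
        if like not in already_liked and like not in recommendations:
            recommendations[like] = topic
    return recommendations

carlos_likes = {"The Smashing Pumpkins", "The Killers", "Coldplay", "Pearl Jam",
                "The Verve", "Weezer", "Quartus", "Michael Jackson", "Bush"}
# Carlos' likes are evenly split between the three topics.
carlos_mixture = {"Oldies Pop": 1 / 3, "00's": 1 / 3, "Indie music I": 1 / 3}

recs = recommend(carlos_mixture, TOPIC_TOP_LIKES, carlos_likes, n=3,
                 rng=random.Random(0))
for like, topic in recs.items():
    print(f"{like}: recommended because you like {topic}")
```

Keeping the topic alongside each recommendation is what makes the explanation possible.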
A fully fleshed out recommendation engine sounds like another 20% project, though. Until then, try not to snicker at my music tastes too much!