Thursday, January 12, 2012

My Music Tastes are 13.3% Terrible, Says Algorithm

Adku engineers devote 20% of their time to projects of their own choosing. These projects are usually a bit further removed from our daily work and are a great outlet for chasing down some of our crazier ideas. Earlier, Leah wrote about Mildred, a “20% time” project for visualizing your friends’ Facebook likes. Visualizing data is a great way to spot patterns intuitively and draw some quick conclusions, but the real fun comes in automating that process with machine learning. Today, I’m going to explain how Latent Dirichlet Allocation works and how we used it to draw conclusions about people’s music tastes from their Facebook likes.

Latent Dirichlet Allocation (LDA) was invented in 2003, primarily for automatically inferring the topics in a text document (the original paper also mentions applications to collaborative filtering). For example, LDA could be used to determine that a bioinformatics paper is about evolutionary biology and computers, or perhaps that a particular court case is concerned with torts and constitutional law. It works by applying statistical analysis to the words in the document. For the bioinformatics paper, LDA would recognize that the document contains words related to evolutionary biology like “genome” and “selection” as well as computer words like “big O” and “algorithm.”

The basic idea behind LDA is that every document is generated by a simple probabilistic model. Suppose we wanted to randomly generate a document using this probabilistic model. To begin, we select the topic(s) of the document by drawing a green die out of a bag. There are many dice in the bag and each side of a die represents a topic. If we decide that there are 6 topics in all, then each green die will have 6 sides. Drawing a die that is heavily weighted towards the “computer” and “evolutionary biology” topics means that the document will be about computers and evolutionary biology. There are also 6 red dice--one die for each topic. Each side of a red die represents a word. The red die for the computer topic, for example, will be heavily weighted toward words like “RAM” and “processor.”

Here’s how each word in the document is generated. First, the green die is rolled to determine the topic of the word. The red die corresponding to that topic is then rolled to determine the word itself.

An example roll could have the green die giving us the topic “computers.” We then find the “computers” red die and roll it, giving us the word “RAM.” “RAM” is appended to the document. Note that the order of the words generated is not taken into account at all; the document is just a “bag of words.”
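The dice-rolling procedure above can be sketched in a few lines of Python. This is only an illustration: the two topics, the word lists and every probability below are made-up toy values (the post gives none), and `roll` and `generate_document` are hypothetical helper names.

```python
import random

# Green die: this document's weights over topics (toy values, assumed).
green_die = {"computers": 0.7, "evolutionary biology": 0.3}

# Red dice: one per topic, each weighting that topic's words (toy values, assumed).
red_dice = {
    "computers": {"RAM": 0.4, "processor": 0.3, "algorithm": 0.3},
    "evolutionary biology": {"genome": 0.5, "selection": 0.5},
}


def roll(die, rng):
    """Roll a weighted die given as {side: probability}."""
    sides = list(die)
    weights = [die[side] for side in sides]
    return rng.choices(sides, weights=weights, k=1)[0]


def generate_document(green_die, red_dice, n_words, rng):
    """Generate a bag of words: roll green for a topic, then that topic's red die."""
    words = []
    for _ in range(n_words):
        topic = roll(green_die, rng)        # green die picks the topic
        word = roll(red_dice[topic], rng)   # that topic's red die picks the word
        words.append(word)
    return words


rng = random.Random(0)  # fixed seed so the sketch is repeatable
print(generate_document(green_die, red_dice, 10, rng))
```

Because the document is a bag of words, shuffling the returned list would give an equally valid document.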

All the variables involved in generating documents--the green and red dice, the assortment of dice in the bag--are so-called latent variables. The dice represent multinomials; the bag of dice represents a Dirichlet. In the computer example, we assumed that the latent variables were known to us (i.e., we had dice and bags of dice to draw from). In real life, however, the latent variables are hidden from us. The only information we have is the words that are in each document. The goal of LDA is to figure out the values of the latent variables by working backwards (using Bayesian inference) from a collection of documents, and to allocate each word in each document to a particular topic.
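To make the "bag of dice" concrete, here is a minimal sketch of drawing one six-sided die from a Dirichlet, using the standard construction of normalizing independent gamma draws. The function name `draw_die` and the parameter values are assumptions chosen for illustration; the post does not say what parameters were used.

```python
import random


def draw_die(alpha, rng):
    """Draw one die (a probability vector) from a Dirichlet with parameters alpha."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


rng = random.Random(0)

# Small parameters: the bag mostly holds dice weighted toward a few sides.
lopsided_die = draw_die([0.5] * 6, rng)

# Large parameters: the bag mostly holds dice that are close to fair.
nearly_fair_die = draw_die([100.0] * 6, rng)

print([round(p, 3) for p in lopsided_die])
print([round(p, 3) for p in nearly_fair_die])
```

Each drawn die is a multinomial over six sides, so its weights are non-negative and sum to one.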

Now that we know a little bit about how LDA works, applying it to Facebook likes is straightforward. Each person represents a document, and that person's likes are analogous to the words in a document. The Mildred visualization indicates that people tend to have more music likes than other kinds of likes, so we'll just look at music likes for now. Restricting ourselves to music also makes it easier to get an intuitive grasp of how good the LDA results are. If the topics recovered by LDA end up matching recognizable music genres, then it's probably doing something right.
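As a rough illustration of the person-as-document mapping, the sketch below runs collapsed Gibbs sampling, one common approximate inference method for LDA. The post does not say which inference method or parameters were actually used, so the function `lda_gibbs`, the values of `alpha` and `beta`, and the four toy "people" are all assumptions.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: one list of liked artists per person.
    Returns (z, topic_total): z[d][i] is the topic allocated to like i of
    person d, and topic_total[k] is the number of likes allocated to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * n_topics for _ in docs]                 # likes per topic, per person
    topic_word = [defaultdict(int) for _ in range(n_topics)]   # artist counts per topic
    topic_total = [0] * n_topics                               # total likes per topic

    # Start from a random allocation of every like to a topic.
    z = []
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            z[d].append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this like's current allocation from the counts.
                k = z[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample: topics popular with this person and with this artist win.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights, k=1)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return z, topic_total


# Four hypothetical people, each a "document" of music likes.
people_likes = [
    ["Radiohead", "Wilco", "Bon Iver"],
    ["Radiohead", "Bon Iver", "Sufjan Stevens"],
    ["Jay-Z", "Kanye West", "Eminem"],
    ["Kanye West", "Lil Wayne", "Jay-Z"],
]
z, topic_total = lda_gibbs(people_likes, n_topics=2)
print(z)
print(topic_total)
```

The per-topic totals returned here play the same role as the n values reported for each topic in Table 1.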

The five most common likes for each topic as computed by LDA are shown in the tables below. The number of topics was arbitrarily set at 20. Some of the topics are labelled with a very heavy hand :-).

| 1. Indie music I (n=869) | 2. Contemporary Pop (n=709) | 3. Oldies Pop (n=740) | 4. Rap and Hip Hop (n=736) | 5. 90’s (n=643) |
|---|---|---|---|---|
| Radiohead | Rihanna | Pink Floyd | Jay-Z | Red Hot Chili Peppers |
| Sufjan Stevens | Lady Gaga | The Beatles | Kanye West | Incubus |
| Bon Iver | Beyoncé | Radiohead | Lil Wayne | Nirvana |
| Belle & Sebastian | Kanye West | Queen | Kid Cudi | Sublime |
| Wilco | Katy Perry | Metallica | Eminem | Pearl Jam |

| 6. Pop (n=576) | 7. Electronic/Trance/House (n=442) | 8. 00’s (n=759) | 9. 90’s/00’s I (n=400) | 10. (n=375) |
|---|---|---|---|---|
| Lady Gaga | Tiësto | Death Cab for Cutie | U2 | Lady Gaga |
| Taylor Swift | deadmau5 | Coldplay | Muse | Coldplay |
| Michael Jackson | Lady Gaga | The Killers | Coldplay | Jay Chou |
| Beyoncé | Daft Punk | Muse | Weezer | THE FU |
| Katy Perry | Armin van Buuren | Weezer | Oasis | Eminem |

| 11. Emotional Male Singers I (n=404) | 12. Classics (n=700) | 13. (n=520) | 14. Rock (n=376) | 15. (n=276) |
|---|---|---|---|---|
| John Mayer | The Beatles | Bob Marley | Red Hot Chili Peppers | Lady Gaga |
| Dave Matthews Band | Bob Dylan | Jack Johnson | AC/DC | Carousel |
| U2 | Pink Floyd | Johnny Cash | Aerosmith | The Beatles |
| Jason Mraz | Queen | Ben Harper | Queen | Dream Theater |
| Jack Johnson | The Doors | O.A.R. | Van Halen | Justin Diamond |

| 16. Emotional Male Singers II (n=620) | 17. 90’s/00’s II (n=872) | 18. Indie Music II (n=864) | 19. (n=497) | 20. Classical (n=459) |
|---|---|---|---|---|
| Ben Folds | Linkin Park | Radiohead | Regina Spektor | The Beatles |
| Coldplay | Coldplay | Arcade Fire | The Beatles | Beethoven |
| Jack Johnson | Green Day | Daft Punk | Frank Sinatra | Mozart |
| Dave Matthews Band | Nickelback | Beck | Coldplay | Chopin |
| John Mayer | The Fray | MGMT | Michael Bublé | Bach |

Table 1. The twenty topics recovered by running LDA on Facebook music likes. Only the five most common likes for each topic are shown. The number of likes classified within each topic is shown in parentheses. 1517 people with a collective total of 11837 music likes were analyzed.

Just for fun, let’s take a look at how LDA classifies my music tastes. I have 15 music likes:

Vampire Weekend, Arctic Monkeys, Death Cab for Cutie, Eric Clapton, Jack Johnson, The Killers, Oasis, Pink Floyd, Postal Service, Radiohead, Third Eye Blind, Weezer, Wilco, Muse and The Beatles.

The majority of my likes (9/15) are classified under topic 8: 00’s music. Three of my likes are classified as topic 3, Oldies Pop. I also have two likes in topic 16 and one like in topic 19.
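Turning those counts into percentages is simple arithmetic; a small sketch follows. The counts per topic come from the paragraph above, while the helper name `topic_shares` is hypothetical.

```python
from collections import Counter

# Topic allocations for the 15 likes, as counted in the post:
# nine in topic 8, three in topic 3, two in topic 16, one in topic 19.
assignments = [8] * 9 + [3] * 3 + [16] * 2 + [19] * 1


def topic_shares(assignments):
    """Return the fraction of likes allocated to each topic."""
    counts = Counter(assignments)
    total = len(assignments)
    return {topic: count / total for topic, count in counts.items()}


shares = topic_shares(assignments)
for topic, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"topic {topic}: {share:.1%}")
# topic 8: 60.0%
# topic 3: 20.0%
# topic 16: 13.3%
# topic 19: 6.7%
```

The two likes in topic 16 work out to 13.3%, which may be where the figure in the title comes from.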

We can also look at Carlos’ music preferences. He likes The Smashing Pumpkins, The Killers, Coldplay, Pearl Jam, The Verve, Weezer, Quartus, Michael Jackson and Bush. Carlos’ likes are split evenly among topics 3, 8 and 1: Oldies Pop, 00’s and Indie Music.

You could imagine several ways of building a simple recommendation engine from these results. A simple one is to use the generative process, just as we did when generating a “computer” document. The music recommendations for Carlos would come from a document that is generated from the Oldies Pop, 00’s and Indie Music topics. These recommendations are explainable: we can tell Carlos that Metallica was recommended because he likes Oldies Pop.
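A minimal sketch of that generative recommender might look like the following. The artist lists are the top-five likes of topics 1, 3 and 8 from Table 1; everything else is an assumption made for illustration: the uniform weighting within each topic (the real per-artist weights are not given in the post), the equal topic weights for Carlos, and the `recommend` function itself.

```python
import random

# Top-five likes for three topics, copied from Table 1.
TOP_LIKES = {
    "Indie Music I": ["Radiohead", "Sufjan Stevens", "Bon Iver", "Belle & Sebastian", "Wilco"],
    "Oldies Pop": ["Pink Floyd", "The Beatles", "Radiohead", "Queen", "Metallica"],
    "00's": ["Death Cab for Cutie", "Coldplay", "The Killers", "Muse", "Weezer"],
}


def recommend(topic_weights, already_liked, n, rng, max_rolls=1000):
    """Generate up to n new artists, remembering which topic produced each one."""
    topics = list(topic_weights)
    weights = [topic_weights[t] for t in topics]
    recs = {}
    for _ in range(max_rolls):
        if len(recs) >= n:
            break
        topic = rng.choices(topics, weights=weights, k=1)[0]  # roll the green die
        artist = rng.choice(TOP_LIKES[topic])                 # roll a (uniform) red die
        if artist not in already_liked and artist not in recs:
            recs[artist] = topic
    return recs


carlos_likes = {
    "The Smashing Pumpkins", "The Killers", "Coldplay", "Pearl Jam", "The Verve",
    "Weezer", "Quartus", "Michael Jackson", "Bush",
}
# Evenly split among the three topics, as described in the post.
carlos_topics = {"Oldies Pop": 1 / 3, "00's": 1 / 3, "Indie Music I": 1 / 3}

for artist, topic in recommend(carlos_topics, carlos_likes, 3, random.Random(0)).items():
    print(f"{artist}: recommended because Carlos likes {topic}")
```

Keeping the topic alongside each generated artist is what makes the recommendation explainable.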

A fully fleshed out recommendation engine sounds like another 20% project, though. Until then, try not to snicker at my music tastes too much!


  22. interesting.. where else you gonna use that knowledge in your life?