Users can listen to tracks (music) and a list of tracks listened to by each user is stored.

I store a ranking score for each track, and it is an integer value. The higher this "top rank" is, the higher it ranks in their top tracks list. I normalize this value for the top 200 tracks per user into a value greater than 0, and less than or equal to 1, where 1 is assigned to the track with the highest top rank out of the top 200 tracks and some value greater than 0 to the lowest ranked track.

I'd like to know the best way to quickly find similar users for a given user based on their track data. Ideally the solution would require just a query with a sort, but I'm guessing it's probably going to require some aggregation and data processing. One solution I have is to separate tracks into genres, and then compare aggregate genres, but this seems to be a weak way of solving the problem. Also I'm not sure how well comparing every user to each other is going to scale out.


asked Jun 23 '10 at 16:48

chrischen's gravatar image


2 Answers:

This problem is known as Collaborative Filtering. I think the most popular solutions are based on linear models based on a Singular Value Decomposition of the Users/Items score matrix (which is very sparse hence efficient implementation should leverage that fact). SVD allows you to explain most of the variance in a lower dimensional space where similarity distances are faster to compute.

There is an implementation of such a recommender system in the Apache Mahout project (disclaimer I have not tried it myself).

answered Jun 23 '10 at 18:04

ogrisel's gravatar image


What you are interested in is defining an appropriate similarity measure for users, and finding the most similiar users under this measure.

Convert each user to a feature vector, one feature (AKA component or dimension) per song. Most entries will be zero. In fact, only 200 will be non-zero, based upon how you describe it.

I'm not sure that your normalization scheme is appropriate, because the weight of a particular user is proportional to the how much bigger his top choice is than all the other choices. For example, if his top song has weight 1000 and his second best song has weight 100, then he is normalized to 1, 0.1. Whereas if his top song has weight 1000 and his second best song has weight 900, then he is normalized to 1, 0.9. The second user has more weight. Why?

Consider doing an l2-normalization (so that the sum of the squares is 1). This corresponds to the intuition that each user gets equal weight, when using the dot product as the similarity measure.

Or, if you want some users to be more important (because they are tastemakers), consider l2-normalizing the scores and having a separate "user weight" parameter for each user, which you then multiply by the normalized scores.

Your feature vectors will end up looking like term-document matrices from information retrieval, so you can try their similarity functions. Your similarity measure between two users can be the dot product. People also use the Cosine similarity, but that's the same thing as the dot product with normalization. If you use the dot product, then you are free to make some users have higher weight, as I described earlier.

There are two difficulties with this approach:

  1. Retrieval, using the naive approach of comparing against every other user, is slow. Considering using a faster retrieval algorithm, like LSH, or just using Lucene to do retrieval. But before you make retrieval faster, first do it the slow way to at least make sure that the similarity measure is sensible.
  2. If your feature vectors are very sparse, e.g. most features only appear for one user (i.e. most songs only are present in one user's list), then you will have to smooth your feature vectors in some way. Smoothing by genre or artist is a good place to start.

answered Jun 23 '10 at 18:10

Joseph%20Turian's gravatar image

Joseph Turian ♦♦

edited Jun 23 '10 at 18:31

Your feature vectors will end up looking like term-document matrices from information retrieval, so you can try their similarity functions.

What do you mean by feature vectors, and could you elaborate on term-document matrices? I'm not exactly sure how the dot-product of the vectors results in correlation between two users' music preferences.


(Jun 23 '10 at 18:55) chrischen

I read up on term-document matrices. So each user vector would resemble a row in the matrix. The columns would be songs, or artists, or genres.

(Jun 24 '10 at 01:51) chrischen
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.