Users can listen to tracks (music) and a list of tracks listened to by each user is stored.
I store a ranking score for each track as an integer value; the higher this "top rank", the higher the track sits in the user's top-tracks list. For each user's top 200 tracks I normalize this value into the range (0, 1]: the track with the highest top rank out of the 200 gets 1, and the lowest-ranked track gets some value greater than 0.
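For reference, my normalization looks roughly like this (track IDs and field names are just illustrative):

```python
def normalize_top_tracks(scores, limit=200):
    """Scale a user's raw rank scores into (0, 1] by dividing by the maximum.

    `scores` maps track_id -> positive integer rank score. Only the top
    `limit` tracks are kept; the highest-scoring one maps to exactly 1.0.
    """
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:limit]
    max_score = top[0][1]
    return {track: score / max_score for track, score in top}

print(normalize_top_tracks({"track_a": 1000, "track_b": 100, "track_c": 50}))
# {'track_a': 1.0, 'track_b': 0.1, 'track_c': 0.05}
```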
I'd like to know the best way to quickly find similar users for a given user based on their track data. Ideally the solution would require just a query with a sort, but I'm guessing it will need some aggregation and data processing. One solution I have is to separate tracks into genres and then compare aggregate genres, but this seems a weak way of solving the problem. I'm also not sure how well comparing every user against every other user is going to scale.
asked Jun 23 '10 at 16:48
This problem is known as Collaborative Filtering. The most popular solutions are linear models built on a Singular Value Decomposition (SVD) of the user/item score matrix (which is very sparse, so an efficient implementation should exploit that sparsity). SVD lets you explain most of the variance in a lower-dimensional space, where similarity distances are faster to compute.
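A toy sketch of the idea with a dense SVD (real data would be a large sparse matrix, handled with a truncated sparse SVD instead; the scores here are made up):

```python
import numpy as np

# Toy user x track score matrix (3 users, 4 tracks).
scores = np.array([
    [1.0, 0.8, 0.0, 0.0],   # user 0
    [0.9, 1.0, 0.0, 0.0],   # user 1: similar taste to user 0
    [0.0, 0.0, 1.0, 0.7],   # user 2: disjoint taste
])

# Keep the k largest latent factors; each user becomes a k-dim vector.
k = 2
u, s, vt = np.linalg.svd(scores, full_matrices=False)
user_factors = u[:, :k] * s[:k]

def cosine(a, b):
    """Cosine similarity between two dense factor vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(user_factors[0], user_factors[1]))  # close to 1
print(cosine(user_factors[0], user_factors[2]))  # close to 0
```

In the reduced k-dimensional space, finding the most similar users is just a nearest-neighbour search over short vectors instead of a comparison over every track.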
There is an implementation of such a recommender system in the Apache Mahout project (disclaimer: I have not tried it myself).
answered Jun 23 '10 at 18:04
What you are interested in is defining an appropriate similarity measure for users, and finding the most similar users under this measure.
Convert each user to a feature vector, one feature (AKA component or dimension) per song. Most entries will be zero. In fact, only 200 will be non-zero, based upon how you describe it.
I'm not sure that your normalization scheme is appropriate, because a user's total weight ends up depending on how much bigger his top choice is than all the other choices. For example, if his top song has weight 1000 and his second-best song has weight 100, then he is normalized to (1, 0.1); whereas if his top song has weight 1000 and his second-best song has weight 900, then he is normalized to (1, 0.9). The second user carries more total weight. Why?
Consider doing an l2-normalization (so that the sum of the squares is 1). This corresponds to the intuition that each user gets equal weight, when using the dot product as the similarity measure.
Or, if you want some users to be more important (because they are tastemakers), consider l2-normalizing the scores and having a separate "user weight" parameter for each user, which you then multiply by the normalized scores.
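A minimal sketch of l2-normalization on the two users from the example above, with an optional per-user weight (dict-based vectors, hypothetical track IDs):

```python
import math

def l2_normalize(scores):
    """Rescale a {track_id: score} dict so the sum of squares is 1."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {track: v / norm for track, v in scores.items()}

steep = l2_normalize({"a": 1000, "b": 100})  # dominant top track
flat = l2_normalize({"a": 1000, "b": 900})   # nearly even scores

# Both users now contribute equal total weight under the dot product.
print(sum(v * v for v in steep.values()))  # 1.0 (up to rounding)
print(sum(v * v for v in flat.values()))   # 1.0 (up to rounding)

# Optional tastemaker weighting: scale a user's normalized vector.
weighted = {track: 2.0 * v for track, v in flat.items()}
```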
Your feature vectors will end up looking like the term-document matrices from information retrieval, so you can borrow their similarity functions. The similarity measure between two users can be the dot product. People also use cosine similarity, but that is just the dot product of the l2-normalized vectors. If you use the plain dot product, you are free to give some users higher weight, as I described earlier.
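Since most entries are zero, the vectors are naturally stored sparsely; a sketch of both measures over dict-based vectors (track IDs hypothetical):

```python
import math

def dot(u, v):
    """Dot product of two sparse vectors stored as {track_id: score} dicts.

    Iterates over the smaller dict, so cost is proportional to the
    smaller vector, not the full track catalogue.
    """
    if len(u) > len(v):
        u, v = v, u
    return sum(score * v[track] for track, score in u.items() if track in v)

def cosine(u, v):
    """Cosine similarity: the dot product after l2-normalizing both sides.

    Assumes both vectors are non-empty.
    """
    nu = math.sqrt(sum(s * s for s in u.values()))
    nv = math.sqrt(sum(s * s for s in v.values()))
    return dot(u, v) / (nu * nv)

alice = {"track_a": 1.0, "track_b": 0.5}
bob = {"track_a": 0.9, "track_c": 0.4}
print(dot(alice, bob))     # only the shared track_a contributes
print(cosine(alice, bob))
```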
There are two difficulties with this approach: