Revision history

Revision n. 2
Jun 23 '10 at 18:31
Joseph Turian

What you are interested in is defining an appropriate similarity measure for users, and then finding the most similar users under that measure.

Convert each user to a feature vector, one feature (AKA component or dimension) per song. Most entries will be zero. In fact, only 200 will be non-zero, based upon how you describe it.
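
As a rough sketch (Python; the song IDs and function name are made up for illustration), the representation can be as simple as one dict per user holding only the non-zero entries:

    # Sketch only: one dimension per song, stored sparsely as {song_id: score}.
    # The raw numbers stand in for whatever counts/ratings you actually have.
    def to_feature_vector(raw_scores):
        """Keep only the non-zero entries; every other song is implicitly 0."""
        return {song: float(score) for song, score in raw_scores.items() if score}

    # e.g. two users, reused in the normalization discussion below
    user_a = to_feature_vector({"song_x": 1000, "song_y": 100})
    user_b = to_feature_vector({"song_x": 1000, "song_y": 900})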

I'm not sure that your normalization scheme is appropriate, because the total weight a user gets depends on how much bigger his top choice is than all of his other choices. For example, if his top song has weight 1000 and his second-best song has weight 100, then he is normalized to 1, 0.1. Whereas if his top song has weight 1000 and his second-best song has weight 900, then he is normalized to 1, 0.9. The second user ends up with more total weight. Why should he?

Consider doing an l2-normalization (so that the sum of the squared scores is 1). This corresponds to the intuition that each user gets equal weight when you use the dot product as the similarity measure.

Or, if you want some users to be more important (because they are tastemakers), consider l2-normalizing the scores and having a separate "user weight" parameter for each user, which you then multiply by the normalized scores.
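
Here is a minimal sketch of both steps in plain Python (the per-user weight is just an illustrative scalar, not something with a canonical value):

    import math

    def l2_normalize(vec):
        """Scale a sparse {song: score} vector so the sum of squares is 1."""
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm == 0.0:
            return dict(vec)
        return {song: v / norm for song, v in vec.items()}

    def apply_user_weight(normalized_vec, user_weight=1.0):
        """Optionally scale an already l2-normalized vector by a per-user
        'tastemaker' weight, so that user counts for more in comparisons."""
        return {song: user_weight * v for song, v in normalized_vec.items()}

    # Both users from the example above end up with unit length,
    # no matter how peaked their raw scores were (1000/100 vs. 1000/900).
    a = l2_normalize({"song_x": 1000.0, "song_y": 100.0})
    b = l2_normalize({"song_x": 1000.0, "song_y": 900.0})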

Your feature vectors will end up looking like the term-document matrices from information retrieval, so you can try the similarity functions used there. Your similarity measure between two users can be the dot product. People also use cosine similarity, but that is just the dot product of l2-normalized vectors. If you use the dot product, then you are free to give some users higher weight, as described earlier.
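
For sparse dict vectors like these, the dot product is only a few lines; on l2-normalized vectors it coincides with cosine similarity. A sketch (the numbers are just the two example users, roughly l2-normalized):

    def dot(u, v):
        """Dot product of two sparse {song: score} vectors."""
        if len(u) > len(v):   # iterate over the shorter vector
            u, v = v, u
        return sum(score * v.get(song, 0.0) for song, score in u.items())

    # Roughly the l2-normalized versions of the (1000, 100) and (1000, 900) users:
    a = {"song_x": 0.995, "song_y": 0.100}
    b = {"song_x": 0.743, "song_y": 0.669}
    print(dot(a, b))   # ~0.81; scaling a or b by a user weight scales this score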

There are two difficulties with this approach:

  1. Retrieval, using the naive approach of comparing against every other user, is slow. Consider using a faster retrieval algorithm, like LSH, or just using Lucene to do retrieval. But before you make retrieval faster, first do it the slow way, if only to make sure that the similarity measure is sensible (see the brute-force sketch after this list).
  2. If your feature vectors are very sparse, e.g. most features appear for only one user (i.e. most songs appear in only one user's list), then you will have to smooth your feature vectors in some way. Smoothing by genre or artist is a good place to start (a rough artist-level example is included in the sketch below).
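
To make both points concrete, here is a deliberately slow sketch: brute-force retrieval of the nearest users (point 1), and a crude artist-level smoothing pass (point 2). The song-to-artist lookup and the mixing weight alpha are made up for illustration, not part of any real API.

    import math

    def l2_normalize(vec):
        norm = math.sqrt(sum(v * v for v in vec.values()))
        return {k: v / norm for k, v in vec.items()} if norm else dict(vec)

    def dot(u, v):
        if len(u) > len(v):
            u, v = v, u
        return sum(s * v.get(k, 0.0) for k, s in u.items())

    def most_similar(query_vec, all_users, k=10):
        """Point 1, the slow way: score the query against every other user.
        Use this to sanity-check the similarity measure before reaching
        for LSH or Lucene."""
        scored = [(dot(query_vec, vec), user_id) for user_id, vec in all_users.items()]
        scored.sort(reverse=True)
        return scored[:k]

    def smooth_by_artist(vec, song_to_artist, alpha=0.5):
        """Point 2: add an artist-level feature for each song a user played,
        so users who like different songs by the same artist still overlap."""
        smoothed = dict(vec)
        for song, score in vec.items():
            artist = song_to_artist.get(song)
            if artist is None:
                continue
            key = ("artist", artist)
            smoothed[key] = smoothed.get(key, 0.0) + alpha * score
        return l2_normalize(smoothed)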
