I am having trouble understanding whether I am using adjusted cosine similarity correctly for an item-based CF recommender system (5-star ratings). In most cases adjusted cosine similarity works great, but some instances of the data seem to really throw off the algorithm. Here is an example of two vectors that perform poorly:

vector1 = [0.1875, 10.1666, 0.4545, 0.8750, 0.0999, 0.4166, -0.2270, 1]

vector2 = [0.1875, 10.1666, 0.4545, 0.8750, 0.0999, 0.4166, -0.2270, -1]

When these two vectors are input into the formula, a value of around 0.11 is output. However, these two vectors are exactly the same except for the last values, namely 1 and -1. If these last values are deleted, the similarity jumps to 1.0.
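A minimal sketch of why a single entry can dominate here. It assumes the second entry is 0.1666: the value as printed (10.1666) is outside the range a mean-centered 5-star rating can take and does not reproduce the reported figure. The helper name `cosine` is illustrative, not taken from any library. The last pair contributes -1 to the dot product and +1 to each squared norm, while the seven shared entries together contribute only about 1.27, so the similarity is roughly (1.27 - 1) / (1.27 + 1).

```python
import math

def cosine(u, v):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Assumption: second entry taken as 0.1666, not 10.1666 as printed.
shared = [0.1875, 0.1666, 0.4545, 0.8750, 0.0999, 0.4166, -0.2270]
vector1 = shared + [1.0]
vector2 = shared + [-1.0]

# Sum of squares of the seven identical entries (about 1.27).
shared_energy = sum(x * x for x in shared)

sim_full = cosine(vector1, vector2)               # last entries included
sim_trimmed = cosine(vector1[:-1], vector2[:-1])  # last entries removed

print(f"shared sum of squares: {shared_energy:.4f}")
print(f"similarity with last entry:    {sim_full:.4f}")
print(f"similarity without last entry: {sim_trimmed:.4f}")
```

Under that assumption the full similarity comes out at roughly 0.12, in line with the "around 0.11" reported, and the trimmed similarity is 1.0. The last entry is the largest in magnitude, so it carries close to half of each vector's squared length.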

How can I combat this issue of one "rating" having so much impact on the adj cosine sim value?

asked Jul 22 '11 at 17:50

arasraj

edited Jul 22 '11 at 17:50

Have you tried normalising the ratings to [0, 1]? The distance should not weight any one entry so much then.

(Jul 23 '11 at 03:44) Robert Layton

By subtracting each user's average rating from that user's actual ratings, we obtain negative values for below-average ratings and positive values for above-average ratings. So when two users have "opposite" ratings, this should create vectors pointing in almost opposite directions and result in a low cosine similarity. This accounts for the fact that users have different rating scales. If I normalize the ratings (vector dimensions) to [0, 1], won't I lose the gain that adjusted cosine similarity provides?

(Jul 23 '11 at 15:17) arasraj
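To make the mean-centering described above concrete, here is a small sketch of adjusted cosine similarity between two items. The `{user: {item: rating}}` layout, the function names, and the toy ratings are all hypothetical and chosen only for illustration. Both users rank item A lowest and item C highest, but on different parts of the scale.

```python
import math

def adjusted_cosine(ratings, item_i, item_j):
    """Adjusted cosine between two items over users who rated both.

    ratings: {user: {item: rating}} (hypothetical layout).
    Each rating is centered on that user's own mean rating.
    """
    num = den_i = den_j = 0.0
    for user_ratings in ratings.values():
        if item_i in user_ratings and item_j in user_ratings:
            mean = sum(user_ratings.values()) / len(user_ratings)
            di = user_ratings[item_i] - mean
            dj = user_ratings[item_j] - mean
            num += di * dj
            den_i += di * di
            den_j += dj * dj
    if den_i == 0.0 or den_j == 0.0:
        return 0.0  # no variation to compare
    return num / math.sqrt(den_i * den_j)

def raw_cosine(u, v):
    """Cosine on uncentered ratings, for comparison."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy data: a harsh rater and a generous rater with the same preferences.
ratings = {
    "harsh":    {"A": 1, "B": 2, "C": 3},
    "generous": {"A": 3, "B": 4, "C": 5},
}

sim_adjusted = adjusted_cosine(ratings, "A", "C")
sim_raw = raw_cosine([1, 3], [3, 5])  # item A vs item C, uncentered

print(f"adjusted cosine(A, C): {sim_adjusted:.3f}")
print(f"raw cosine(A, C):      {sim_raw:.3f}")
```

In this toy case the uncentered vectors look nearly parallel, while the centered ones point in opposite directions. Rescaling ratings to [0, 1] without centering keeps every entry non-negative, so it cannot produce that sign flip.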

There is no canonical solution I know of, but you can robustify in a variety of ways. If you know the individual users, you can normalize each user's ratings by that user's variance. You can threshold the amount that any one rating can affect the similarity. You can compare the median n out of n+k quantiles of ratings. You can remove users that give weird ratings. You can perform a more complex regression on rating vectors that incorporates beliefs you have about structural similarity (e.g. a Bayesian model).

(Jul 23 '11 at 18:15) Jacob Jensen
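One way to read "normalize by individual variance" is to z-score each user's ratings: subtract the user's mean and divide by the user's standard deviation. The sketch below assumes that reading; the function name `zscore_user` and the sample ratings are made up for illustration.

```python
import statistics

def zscore_user(user_ratings):
    """Center a user's ratings on their mean and scale by their std dev.

    user_ratings: {item: rating} for a single user (hypothetical layout).
    A user who gives every item the same rating carries no preference
    signal, so all of their scores are set to 0.0.
    """
    values = list(user_ratings.values())
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0.0:
        return {item: 0.0 for item in user_ratings}
    return {item: (r - mean) / std for item, r in user_ratings.items()}

# A user who uses the whole scale and one who stays in a narrow band.
wide = zscore_user({"A": 1, "B": 5, "C": 3})
narrow = zscore_user({"A": 3, "B": 4, "C": 3.5})

print(wide)
print(narrow)
```

After scaling, both users end up with the same spread, so a user who habitually uses the extremes of the scale no longer produces larger vector entries than a user who stays near the middle.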

With the threshold technique, I understand how you would weight an overall item similarity score, but I am less clear on how to apply a threshold to individual ratings.

(Jul 25 '11 at 13:51) arasraj
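One possible interpretation of a per-rating threshold, offered only as a sketch: cap each mean-centered rating to a fixed range before computing the cosine, so that no single entry can exceed a chosen magnitude. The cap of 0.5, the helper names, and the reading of the second vector entry as 0.1666 (rather than 10.1666 as printed in the question) are all assumptions.

```python
import math

def cosine(u, v):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip(vector, cap):
    """Limit every entry to the interval [-cap, cap]."""
    return [max(-cap, min(cap, x)) for x in vector]

# Assumption: second entry taken as 0.1666, not 10.1666 as printed.
shared = [0.1875, 0.1666, 0.4545, 0.8750, 0.0999, 0.4166, -0.2270]
vector1 = shared + [1.0]
vector2 = shared + [-1.0]

cap = 0.5  # hypothetical threshold; would need tuning on real data
sim_plain = cosine(vector1, vector2)
sim_clipped = cosine(clip(vector1, cap), clip(vector2, cap))

print(f"without clipping: {sim_plain:.4f}")
print(f"with cap {cap}:     {sim_clipped:.4f}")
```

Clipping shrinks the disagreeing entry from magnitude 1 to 0.5, so its share of the squared norm drops and the similarity rises. It also shrinks large entries on which the vectors agree (0.8750 here), so the cap trades sensitivity for robustness.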


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.