I am having trouble understanding whether I am using adjusted cosine similarity correctly for an item-based CF recommender system (5-star scale). In most cases adjusted cosine similarity works well, but some instances of the data really throw off the algorithm. Here is an example of two vectors that perform poorly:

vector1 = [0.1875, 0.1666, 0.4545, 0.8750, 0.0999, 0.4166, -0.2270, 1]
vector2 = [0.1875, 0.1666, 0.4545, 0.8750, 0.0999, 0.4166, -0.2270, -1]

When these two vectors are fed into the formula, the output is around 0.11. However, the two vectors are identical except for their last values, namely 1 and -1. If those last values are deleted, the similarity jumps to 1.0. How can I combat this issue of a single "rating" having so much impact on the adjusted cosine similarity value?
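For reference, the behaviour can be reproduced with plain cosine applied directly to the vectors, since the adjustment (mean-centering) has already been baked into them. A minimal sketch; note the second entry is taken as 0.1666, the value with which the arithmetic actually reproduces the reported ~0.11 and 1.0:

```python
import math

def cosine_sim(a, b):
    # Plain cosine on the already mean-centered (adjusted) vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [0.1875, 0.1666, 0.4545, 0.8750, 0.0999, 0.4166, -0.2270, 1]
v2 = [0.1875, 0.1666, 0.4545, 0.8750, 0.0999, 0.4166, -0.2270, -1]

print(round(cosine_sim(v1, v2), 2))            # ~0.12
print(round(cosine_sim(v1[:-1], v2[:-1]), 2))  # 1.0
```

The effect is purely geometric: the first seven entries are small, so the final ±1 pair contributes most of the squared norm and drags the cosine down on its own.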
Have you tried normalising the ratings to [0, 1]? That way the similarity should not weight any one entry so heavily.
By subtracting the average user rating from each actual rating, we get negative values for ratings below a user's average and positive values for ratings above it. So when two users have "opposite" ratings, the centered vectors point in nearly opposite directions and produce a strongly negative cosine similarity. This accounts for the fact that users have different rating scales. If I normalize the ratings (vector dimensions) to [0, 1], won't I lose the benefit that adjusted cosine similarity provides?
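As a toy illustration of that point (hypothetical 5-star ratings, not from the question's data): two users with opposite tastes can look fairly similar under raw cosine, but become maximally dissimilar once each user's mean is subtracted.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def mean_center(ratings):
    # Adjusted-cosine preprocessing: subtract the user's average rating,
    # so each entry measures above/below that user's personal baseline.
    mean = sum(ratings) / len(ratings)
    return [r - mean for r in ratings]

# Hypothetical 5-star ratings: user_a loves exactly what user_b dislikes.
user_a = [5, 4, 1, 2]
user_b = [1, 2, 5, 4]

print(round(cosine(user_a, user_b), 2))                            # raw: 0.57
print(round(cosine(mean_center(user_a), mean_center(user_b)), 2))  # centered: -1.0
```

Raw cosine is misleadingly positive because all ratings live in the positive orthant; centering is what lets disagreement show up as a negative similarity.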
There is no canonical solution I know of, but you can robustify the similarity in a variety of ways:

- If you know individual users, normalize each user's ratings by that user's variance.
- Threshold the amount that any one rating can affect the similarity.
- Compare only the median n out of n+k quantiles of ratings.
- Remove users that give weird ratings.
- Perform a more complex regression on rating vectors that incorporates beliefs you have about structural similarity (e.g. build a Bayesian model).
If using the threshold technique, I understand how you would weight an overall item similarity score, but not how to apply a threshold to individual ratings.