|
I am building a recommender system based on item-based CF, using adjusted cosine similarity as my similarity metric. From what I have learned in textbooks, two items can be considered very similar if they have a similarity score of 0.7 or greater. However, these textbook examples do not resemble the real world. Especially when dealing with web data, I would think that what counts as a high similarity between items is somewhat lower. In my experience so far, I have seen some fairly low scores (0.2) return relevant results. What is considered a "good" similarity score between items in CF when you are dealing with web data? |
|
If you knew the correct threshold to use, you would be a very rich person. The threshold gives a recall/precision trade-off: a higher required similarity gives higher precision but worse recall. This means fewer products will be recommended, but those that are recommended are more likely to be relevant. There are two ways to work out which threshold to use. First, you can get some training data, train a model on past data, and determine a threshold from it. Second, in practice, you adjust the threshold through A/B testing and work out whether the value should be increased or decreased based on what makes more profit. That said, you can sidestep the issue and just rank by similarity, then return the top 10 items. |
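To make the trade-off concrete, here is a minimal sketch of the two options described above: filtering by a similarity threshold versus ranking and returning the top N. The function names and the example scores are hypothetical, made up purely for illustration.

```python
from typing import Dict, List, Tuple


def recommend_by_threshold(sims: Dict[str, float], threshold: float) -> List[Tuple[str, float]]:
    """Keep every candidate whose similarity meets the threshold, best first."""
    kept = [(item, score) for item, score in sims.items() if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def recommend_top_n(sims: Dict[str, float], n: int = 10) -> List[Tuple[str, float]]:
    """Ignore the absolute score and return the n most similar candidates."""
    ranked = sorted(sims.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Hypothetical similarities between one query item and five candidates.
example_sims = {"A": 0.72, "B": 0.35, "C": 0.21, "D": 0.05, "E": -0.10}

strict = recommend_by_threshold(example_sims, 0.7)  # few results, higher precision
loose = recommend_by_threshold(example_sims, 0.2)   # more results, better recall
top3 = recommend_top_n(example_sims, 3)             # fixed-size list, no threshold
```

With a strict threshold only the strongest match survives, while the top-N version always returns a list of the requested size (as long as enough candidates exist), whatever the absolute scores happen to be.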
|
I don't think there are any good rules of thumb for this. It's going to depend on how much web data you have, how clean it is (are advertisements/headers/footers stripped out?), what vector space you're measuring in (LSA? words? bigrams? stemming? If so, with which stemmer?), how similar the content you're looking for needs to be, how you want to trade off precision/recall, etc. What I'd recommend is that you manually determine the similarity of a few hundred webpage pairs. Then you can see what transformation you need to make to line up your metric values with your own human judgement. This also gives you data to test alternate settings on: how much accuracy do things like stemming/bigrams buy you?
My data is actually user IDs, item IDs, and a 5-star rating by each user of an item. By web data I meant data gathered from users interacting on the web, not necessarily web pages. The item vector has user ratings as its dimensions. I would think this should simplify things. Wouldn't a possible problem with manually determining item similarity be that CF can find unrelated items that users tend to both enjoy?
(Jul 12 '11 at 16:09)
arasraj
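For reference, the setup described in this comment (items as vectors of user ratings, compared with adjusted cosine similarity) can be sketched as follows. This uses the common definition of adjusted cosine, where each user's mean rating is subtracted before the cosine is taken over users who rated both items. The function name, the `(user, item, rating)` tuple layout, and the toy ratings are assumptions for illustration, not the poster's actual code or data.

```python
from collections import defaultdict
from math import sqrt
from typing import Iterable, Tuple


def adjusted_cosine(ratings: Iterable[Tuple[str, str, float]], item_a: str, item_b: str) -> float:
    """Adjusted cosine similarity between two items over their co-rating users."""
    by_user = defaultdict(dict)
    for user, item, rating in ratings:
        by_user[user][item] = float(rating)

    num = den_a = den_b = 0.0
    for items in by_user.values():
        if item_a in items and item_b in items:
            # Centre on the user's own mean to remove per-user rating bias.
            mean = sum(items.values()) / len(items)
            da = items[item_a] - mean
            db = items[item_b] - mean
            num += da * db
            den_a += da * da
            den_b += db * db

    if den_a == 0.0 or den_b == 0.0:
        return 0.0  # no co-raters, or no variation: treat as "no evidence"
    return num / (sqrt(den_a) * sqrt(den_b))


# Hypothetical toy ratings: i1 and i2 are liked by the same users, i3 by others.
toy_ratings = [
    ("u1", "i1", 5), ("u1", "i2", 4), ("u1", "i3", 1),
    ("u2", "i1", 4), ("u2", "i2", 5), ("u2", "i3", 2),
    ("u3", "i1", 1), ("u3", "i2", 2), ("u3", "i3", 5),
]

sim_12 = adjusted_cosine(toy_ratings, "i1", "i2")  # positive: rated alike
sim_13 = adjusted_cosine(toy_ratings, "i1", "i3")  # negative: rated oppositely
```

Returning 0.0 when there are no co-raters is a design choice in this sketch; skipping such pairs entirely would be equally defensible.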
Even better, then. If you've already got user ratings, you can use those. That way the human judgement comes from a group rather than from one person's preference. Just hold some of the user ratings out as a test set. You should then be able to check how well your metric tracks similar ratings. Then you can see things like "at similarity > 0.5, 40% of 5-star reviews for one item correspond to a 5-star review of the other", and use that to choose an appropriate cutoff.
(Jul 13 '11 at 09:26)
Paul Barba
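A rough sketch of that hold-out check, under some assumptions: the item-item similarities are computed from the training ratings only and passed in as a dict keyed by item pair, and the held-out ratings are `(user, item, rating)` tuples. The function name `five_star_agreement` and all the data below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def five_star_agreement(
    held_out: Iterable[Tuple[str, str, int]],
    sims: Dict[Tuple[str, str], float],
    threshold: float,
) -> Optional[float]:
    """For item pairs with similarity above `threshold`, return the fraction of
    held-out 5-star ratings of one item that come with a 5-star rating of the
    other item from the same user. Returns None if there is nothing to count."""
    by_user = defaultdict(dict)
    for user, item, rating in held_out:
        by_user[user][item] = rating

    hits = total = 0
    for (a, b), sim in sims.items():
        if sim <= threshold:
            continue
        for ratings in by_user.values():
            if a in ratings and b in ratings:
                # Count both directions: a -> b and b -> a.
                for x, y in ((a, b), (b, a)):
                    if ratings[x] == 5:
                        total += 1
                        hits += int(ratings[y] == 5)
    return hits / total if total else None


# Hypothetical held-out ratings and training-set similarities.
held_out = [
    ("u1", "i1", 5), ("u1", "i2", 5), ("u1", "i3", 1),
    ("u2", "i1", 5), ("u2", "i2", 3), ("u2", "i3", 5),
]
sims = {("i1", "i2"): 0.6, ("i1", "i3"): 0.1, ("i2", "i3"): 0.3}

for cutoff in (0.0, 0.25, 0.5):
    print(cutoff, five_star_agreement(held_out, sims, cutoff))
```

Sweeping the cutoff like this shows how the agreement rate changes as the threshold rises, which is the number you would use to pick a cutoff. On real data you would want enough co-rated pairs above each cutoff for the fraction to be meaningful.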
|
When working on short sentence data (e.g. a sentence in a review or news story) one of my professors heavily favors 0.15 as his cutoff.