1
2

I have a very huge social network database (4m+) of users connections and their interests. I plan to write friend recommendation algorithm with weighted cosine measure/other solutions. What is the best way to generate training dataset, and optimize weight on it? I think about getting for sampled users random friends and random non friend and optimize separate plane.

This question is marked "community wiki".

asked Sep 06 '10 at 11:31

yura's gravatar image

yura
771294048

edited Sep 07 '10 at 04:53


3 Answers:

If you have the time that each friend was made, I would recommend to use the previous year's connections to predict the previous month's (a caveat below). This will enable you to ensure that you can predict future entries rather than just random entries. The issue you may has is that your graph is not completed: there may be friends not met simply because the person hasn't seen them yet (which, is where your algorithm comes in). Practically, this means that you shouldn't use 'lack of link' as a method for saying people aren't similar, just use 'existence of a link' to say that they are similar.

answered Sep 07 '10 at 19:58

Robert%20Layton's gravatar image

Robert Layton
1520102337

What was done in the netflix dataset is (translated to your problem) to keep all the users and delete randomly some connections, and then try to add them back with your model. This naturally reflects that connections follow themselves over time. Also, if you can, choose a treshold in time and keep all connections before that for training and use all connections from after that for testing.

Do be careful when reducing collaborative filtering to binary classification; it's not a natural setting for this problem and you might make it artificially too easy. It's more readily seen as a matrix completion problem, or as a label propagation problem, or things like that. So don't just divide the training set into positive and negative examples and divide those between training/validation/test.

You should read the winner papers in the netflix dataset for more information, but calibrating good baselines (accounting for the fact that people with lots of friends should have a higher prior probability of having other friends, etc) and matrix factorization (using regularized stochastic gradient descent to fit the SVD, probably) go a long way. You should certainly use these techniques as baselines. See Bellkor's page for more info.

The hardest thing in recommendation algorithms is being sure that there are no remaining confounding factors, which can be pretty much impossible. For example, if there's already some proto-recommendation-system in effect on the site where the data was gathered this might bias the answers to systems that mimic its behavior. For example, in early versions of facebook where you were encouraged to browse your friends' friends for contacts a recommendation algorithm based on counting the number of friends in common would probably beat one based on common interests, while if the data came from a more interest-based site (like a web forum with many subforums) interest-based recommendation systems will predict better than friendship-based, even though in either of these scenarios what would be most beneficial for the community could be recommendations orthogonal to what is already obvious.

answered Sep 06 '10 at 13:43

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
1893744214333

edited Sep 06 '10 at 15:18

thanks, idea with prediction future by time treshold is very useful.

(Sep 08 '10 at 15:52) yura

If you are planning to optimize ensembles, a nice way to do this is to obtain a small set of rankings for items (i.e. triplets (x, y, z) saying that y is a better recommendation for x than z) and train these using SVMs. See http://www.joachims.org/publications/joachims_02c.pdf

answered Sep 06 '10 at 16:18

Erik's gravatar image

Erik
1

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.