|
Hi all, I've been playing around for fun with writing a recommendation system that suggests who to follow on Twitter. I've collected my entire depth-2 graph (i.e. the union of the people I follow and everyone they follow, ~14000 people in total) with the Twitter API (via Tweepy, recommended), and also collected a sample of 100 tweets from every user. Given this database, I would like to recommend the best people to follow.
What already works very well is to just look at the graph and find people who are often followed by someone in the group but whom I don't follow. I found about 10 great new people just yesterday using this. However, I am curious whether one can do much better by also looking at the tweet text, finding people who talk about the same kinds of things. I am interested in hearing what approaches people think could work well on this kind of data.
Things I tried: as preprocessing, convert everything to lower case and get rid of most punctuation and stopwords. Then
where numoccurs(w, x) returns the number of times user x said word w, or 0 if they never did. Basically, the min acts like an AND operation: if both users talk about something a lot, that word gets a large score, otherwise not as large. I also tried scaling each word's contribution based on how often it comes up overall (so common words don't contribute as much), and I think this improved it a bit. This approach alone already found many more interesting people than using only the graph, but I'm wondering if one can do better.
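A minimal sketch of the scoring described above, assuming the score for a candidate is the sum over words of the min of the two users' counts. The log inverse-frequency weight is just one assumed way to do the "scale by how common the word is" step, and the names `overlap_score`, `users` and `doc_freq` are made up for illustration.

```python
import math
from collections import Counter

def overlap_score(me, other, doc_freq=None, n_users=None):
    """Sum over words of min(count for me, count for other).

    If doc_freq is given, each word is weighted by log(n_users / doc_freq[word]),
    so words used by almost everyone contribute little (an assumed weighting).
    """
    score = 0.0
    for word, my_count in me.items():
        their_count = other.get(word, 0)
        if their_count == 0:
            continue  # min(...) is 0, nothing to add
        weight = 1.0
        if doc_freq is not None:
            weight = math.log(n_users / doc_freq[word])
        score += weight * min(my_count, their_count)
    return score

# Toy word counts per user (the real data would come from the collected tweets).
users = {
    "me": Counter({"svm": 5, "python": 3, "coffee": 1}),
    "alice": Counter({"svm": 2, "python": 4, "coffee": 1}),
    "bob": Counter({"football": 6, "coffee": 1}),
}

# Number of users who used each word at least once.
doc_freq = Counter(word for counts in users.values() for word in counts)

plain_alice = overlap_score(users["me"], users["alice"])
plain_bob = overlap_score(users["me"], users["bob"])
weighted_alice = overlap_score(users["me"], users["alice"], doc_freq, len(users))
weighted_bob = overlap_score(users["me"], users["bob"], doc_freq, len(users))
```

In this toy data "coffee" is used by everyone, so with the weighting it stops counting as evidence of shared interests, while the unweighted score still gives bob credit for it.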
What do you guys think? I'm a bit confused about what I'm doing because I'm kinda new to NLP; most of my experience is in vision, but there are no HOG/SIFT features to compute here :) I'll try to collect a larger database, because 100 tweets per user is indeed not many. It's complicated because the Twitter API only gives 20 tweets per API call, and you can't make more than 350 calls an hour. Also, the API calls tend to fail randomly, and for some users they always fail for some reason. Annoying! PS: here http://cs.ubc.ca/~andrejk/tweets.txt is a link to a Python dictionary containing 150 of the users and their counts for each word. It can be eval'd if you'd like a sample of the data. |
|
This seems like a great binary classifier problem, but with no negative examples, so I would treat it as one-class classification. You train on only the known follower relationships, using whatever features you like. Then you run the model on potential follower pairs and recommend those with a high score. This is also known as "novelty detection": learning a model of "why" people follow each other, then quantifying how surprising it is when they don't. If you were to turn this into a real service, you might eventually collect some negative examples, namely when people fail to follow the recommendations. At that point, you might be able to incorporate a real binary classifier. Other features I might suggest involve implicit social relations, like how often people reply to or retweet someone else. As suggested, the links people include in their tweets are probably indicative as well. If you don't want to follow these and collect the text of the targets, simply using prefixes of the (unshortened) URLs might be useful: if you and I both keep sending out links to metaoptimize.com, that indicates something about our mutual interests.
thank you! & welcome to MetaOptimize ;)
(Mar 31 '11 at 22:49)
karpathy
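The URL-prefix idea from the answer above could be sketched roughly like this, assuming the "prefix" is simply the host name of the already-unshortened link. The helper names `domain_counts` and `shared_domain_score` are invented for the example, and the min-based overlap is only one possible way to compare two users.

```python
from collections import Counter
from urllib.parse import urlparse

def domain_counts(urls):
    """Count how often each host name appears in a user's (unshortened) links."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]  # treat www.example.com and example.com alike
        if host:  # skip strings that have no host part
            counts[host] += 1
    return counts

def shared_domain_score(urls_a, urls_b):
    """Overlap of two users' link domains: sum of min counts over shared hosts."""
    a, b = domain_counts(urls_a), domain_counts(urls_b)
    return sum(min(a[d], b[d]) for d in a.keys() & b.keys())

# Toy data standing in for links extracted from two users' tweets.
mine = [
    "http://metaoptimize.com/qa/1",
    "http://www.metaoptimize.com/qa/2",
    "http://example.org/x",
]
yours = [
    "http://metaoptimize.com/blog",
    "http://example.com/y",
]

score = shared_domain_score(mine, yours)
```

The per-domain counts could equally be fed in as extra features next to the word counts rather than scored on their own.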
|
|
Hi Andrej, funny to see you working on this, I was planning to do exactly the same. A couple of notes to get you started: you should start by having a look at the vector space model. You can compute the relatedness between two pieces of text using the cosine similarity of their TF-IDF vectors. TF-IDF of words and bi-grams are the SIFT/HOG features of NLPers. I would also not restrict myself to the words and follower / followee features, but also extract the common interactions with events such as "user You should also extract the text of the links occurring in tweets using a tool such as boilerpipe and use it as a new set of features to describe your users (using TF-IDF of words and bi-grams too). |
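A small pure-Python sketch of the TF-IDF plus cosine idea mentioned above, assuming raw term counts for TF and log(N / df) for IDF (there are several common variants). The function names and the toy "users" are made up for illustration.

```python
import math
from collections import Counter

def with_bigrams(tokens):
    """Return the unigrams plus adjacent-word bi-grams of a token list."""
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def tfidf_vectors(docs):
    """Map each document name to a sparse {term: tf * idf} dictionary."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        vectors[name] = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# One "document" per user: all of their (preprocessed) tweets concatenated.
docs = {
    "me": with_bigrams("deep learning for vision".split()),
    "alice": with_bigrams("deep learning for speech".split()),
    "bob": with_bigrams("football scores tonight".split()),
}

vecs = tfidf_vectors(docs)
sim_alice = cosine(vecs["me"], vecs["alice"])
sim_bob = cosine(vecs["me"], vecs["bob"])
```

Here `sim_alice` comes out positive and `sim_bob` is zero because the two users share no terms. On real data a library implementation would probably be the more practical choice.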
|
Yet another approach to this problem is collaborative filtering plus features. Imagine you have a matrix with one row per user and one column per feature, with a feature for each user that the row-user follows, plus things such as the presence of certain words in that user's tweets, geographical region, type of account, etc. Then you compute a low-rank SVD of this matrix (this should be fast, as the matrix is sparse), and you can compute a "score" for how much a user u should follow a user v by the dot product U_u Sigma V_v, where the SVD of the matrix is U Sigma V. Of course, this can be enriched in N different ways, especially since this SVD will have a strong bias towards not following (as most users do not follow most users), and choosing the truncation level is hard. Read, for example, the papers on the Netflix Prize to get an idea of how recommendations work for more complex datasets. |
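A toy sketch of the low-rank SVD scoring described above, using NumPy's dense SVD on a tiny made-up matrix. The matrix layout (five "follows user j" columns followed by two topic columns), the rank of 2, and the helper names are all assumptions for illustration; on a real sparse matrix one would more likely reach for a sparse solver such as `scipy.sparse.linalg.svds`.

```python
import numpy as np

n_users = 5
# Rows are users. Columns 0-4 mean "follows user j"; columns 5-6 are two
# hypothetical topic features (say "talks about ML" and "talks about sports").
A = np.array([
    [0, 1, 0, 0, 0, 1, 0],  # user 0 follows 1 (but not 2)
    [1, 0, 1, 0, 0, 1, 0],  # user 1 follows 0 and 2
    [1, 1, 0, 0, 0, 1, 0],  # user 2 follows 0 and 1
    [0, 0, 0, 0, 1, 0, 1],  # user 3 follows 4
    [0, 0, 0, 1, 0, 0, 1],  # user 4 follows 3
], dtype=float)

def low_rank_scores(matrix, rank):
    """Rank-truncated reconstruction U_k Sigma_k V_k^T of the matrix."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

def recommend(user, scores, matrix, n_users):
    """Best-scoring user that `user` does not already follow."""
    candidates = [v for v in range(n_users)
                  if v != user and matrix[user, v] == 0]
    return max(candidates, key=lambda v: scores[user, v])

scores = low_rank_scores(A, rank=2)
best_for_user_0 = recommend(0, scores, A, n_users)
```

In this toy example the reconstruction gives user 0 a clearly positive score for following user 2 (same cluster, same topic) and an essentially zero score for users 3 and 4. With real data the truncation level and the bias towards "not following" would still need the care mentioned above.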