Hi all,
I've been playing around for fun with writing a recommendation system that suggests Twitter users to follow. So I've collected my entire depth-2 graph (i.e. the union of the people I follow and everyone they follow, ~14000 people in total) with the Twitter API (via Tweepy, which I recommend), and also collected a sample of 100 tweets from every user.
Given this database, I would like to recommend the best people to follow. What already works very well is to just look at the graph and find people who are often followed by people in the group, but whom I don't follow yet. I found about 10 great new people just yesterday using this. However, I am curious whether one can do much better by also looking at the tweets and text, i.e. by finding people who talk about the same kinds of things. I am interested in hearing what approaches people think could work well on this kind of data.
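For concreteness, here is a minimal sketch of that graph-only baseline. It assumes "the group" means the people I follow, and that the graph is stored as a dict mapping each user to the set of users they follow; the name `follows`, the function `recommend_by_graph` and the toy data are all made up for illustration.

```python
from collections import Counter

def recommend_by_graph(me, follows, top_n=10):
    """Rank accounts by how many of the people I follow also follow them.

    `follows` is an assumed structure: {user: set of users they follow}.
    """
    mine = follows.get(me, set())
    counts = Counter()
    for friend in mine:
        for candidate in follows.get(friend, set()):
            # Skip myself and anyone I already follow.
            if candidate != me and candidate not in mine:
                counts[candidate] += 1
    return counts.most_common(top_n)

# Toy graph, purely illustrative.
follows = {
    "me": {"a", "b", "c"},
    "a": {"b", "x", "y"},
    "b": {"x", "me"},
    "c": {"x", "y", "a"},
}
print(recommend_by_graph("me", follows))  # [('x', 3), ('y', 2)]
```

Here "x" is followed by all three of the people I follow and "y" by two of them, so they come out on top.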
Things I tried. Preprocessing: convert everything to lower case, get rid of most punctuation and stopwords. Then:
- Created a histogram of words and their occurrences for every user. The compatibility score of two users x, y is then:
for every word w of x: score += min(numoccurs(w, x), numoccurs(w, y))
where numoccurs(w, x) returns the number of times user x said word w, or 0 if they never did. Basically, the min acts like an AND operation: if both users talk about something a lot, that word contributes a large score, otherwise a smaller one. I also tried scaling each word's contribution based on how often it comes up overall (so common words don't give as much score), and I think this improved it a bit. This approach alone already found many more interesting people than using only the graph, but I'm wondering if one can do better.
- I used Turian's 50-D word embeddings to project every word into the embedding space, but it's not clear what should be done next. Every user has about 600 unique words they used. Many, though not all, are found in the embeddings database (this is not English, it's Twitter :))
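To make the histogram score from the first item concrete, here is a rough sketch. The min-overlap part follows the description above; the weighting is an IDF-style formula that I am assuming for illustration, since the exact scaling isn't pinned down. The names `word_weights` and `compatibility` and the toy histograms are made up.

```python
import math
from collections import Counter

def word_weights(histograms):
    """Assumed IDF-style weights: words used by many users count for less."""
    n_users = len(histograms)
    doc_freq = Counter()
    for hist in histograms.values():
        doc_freq.update(hist.keys())  # count each word once per user
    return {w: math.log(n_users / df) + 1.0 for w, df in doc_freq.items()}

def compatibility(hist_x, hist_y, weights=None):
    """Sum over words of x of min(count in x, count in y), optionally weighted."""
    score = 0.0
    for w, count_x in hist_x.items():
        count_y = hist_y.get(w, 0)  # 0 if y never said w
        weight = 1.0 if weights is None else weights.get(w, 1.0)
        score += weight * min(count_x, count_y)
    return score

# Toy per-user word histograms, purely illustrative.
histograms = {
    "x": {"python": 5, "vision": 2, "lol": 9},
    "y": {"python": 3, "lol": 1, "nlp": 4},
    "z": {"lol": 7, "cats": 2},
}
weights = word_weights(histograms)
print(compatibility(histograms["x"], histograms["y"]))           # 4.0
print(compatibility(histograms["x"], histograms["y"], weights))  # weighted score
```

Since min is symmetric and words missing from either user contribute 0, the score is the same whichever user you iterate over.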
What do you guys think? I'm a bit confused about what I'm doing because I'm kinda new to NLP; most of my experience is in vision, but there are no HOG/SIFT features to compute here :) I'll try to collect a larger database, because 100 tweets per user is indeed not that many. It's complicated because the Twitter API only gives 20 tweets per call, and you can't make more than 350 calls an hour. Also, the API calls sometimes fail randomly, and for some users they always fail for some reason. Annoying!
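To put numbers on those limits, here is a back-of-the-envelope sketch using only the figures above (20 tweets per call, 350 calls per hour, ~14000 users). It ignores failed calls and the calls needed for the graph itself, so treat the result as an optimistic lower bound; the function name is made up.

```python
import math

TWEETS_PER_CALL = 20   # from the limits described above
CALLS_PER_HOUR = 350

def hours_to_collect(n_users, tweets_per_user):
    """Lower bound on collection time, assuming no failed calls."""
    calls_per_user = math.ceil(tweets_per_user / TWEETS_PER_CALL)
    return n_users * calls_per_user / CALLS_PER_HOUR

print(hours_to_collect(14000, 100))   # 200.0 hours, a bit over 8 days
print(hours_to_collect(14000, 1000))  # 2000.0 hours, roughly 83 days
```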
PS: http://cs.ubc.ca/~andrejk/tweets.txt is a link to a Python dictionary containing 150 of the users and their counts for each word. It can be eval'd if you'd like a sample of the data.
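If you'd rather not call eval on a downloaded file, `ast.literal_eval` parses plain Python literals without executing code. A small sketch, assuming the file holds a single dictionary literal of the form {user: {word: count}}; that layout is a guess for illustration, and `parse_counts` and the sample string are made up.

```python
import ast

def parse_counts(text):
    """Parse a dictionary literal without executing arbitrary code."""
    data = ast.literal_eval(text)
    if not isinstance(data, dict):
        raise ValueError("expected a dictionary literal")
    return data

# Stand-in for the file contents; the real layout may differ.
sample = "{'alice': {'python': 3, 'vision': 1}, 'bob': {'nlp': 2}}"
counts = parse_counts(sample)
print(len(counts), counts["alice"]["python"])  # 2 3
```

With the file saved locally as tweets.txt, this would be `parse_counts(open("tweets.txt").read())`.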