I’m thinking of adding a feature to the TalkingPuffin Twitter client, where, after some training with the user, it can rank incoming tweets according to their predicted value. What solutions are there for the Java virtual machine (Scala or Java preferred) to do this sort of thing?
In general you'd probably need some supervision for that, and a user marking 10 or so tweets as high-value is far too little supervision (it is not enough to zero in on specific words, for example, or to learn weights for the people your user follows). Luckily, this is essentially the same problem behind the Gmail Priority Inbox. A paper at a recent workshop describes their approach in more detail, but the essential ideas are:
I recommend you read that paper, as it also explains details of their algorithm, environment, and assumptions.
For 1. you need to find a way to estimate the impressions each tweet got; then you get a much better score for each tweet: #clicks/#impressions. I think that inducing good semantic representations for the tweets should help the classifier. So, Alexandre, what is the best way to do this for short documents such as tweets? [I know it has been posted somewhere, but I can't find it.] Alternatively, Leon's suggestion seems plausible: concatenate all tweets for each user and run LDA or some other unsupervised clustering suggested here, which should give you a good representation of each user's preferences.
(Feb 21 '11 at 06:46)
Oliver Mitevski
Doing LDA only on the tweets of each user seems wasteful, especially as new users won't have many tweets in their timeline, and it would mean the topics for each user would be different. I'd run online LDA (there is a good implementation in Vowpal Wabbit) on a subset of the firehose (maybe even the free subset) and use the topics this model assigns to each tweet as features for the global and per-user regressor that predicts the likelihood of something interesting happening.
(Feb 21 '11 at 09:10)
Alexandre Passos ♦
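To make the "global and per-user regressor" idea concrete, here is a minimal sketch. It assumes the per-tweet topic proportions are already available as a `double[]` (for example, from an online LDA run), and it assumes logistic regression trained by stochastic gradient descent, which the comment above does not specify. The class name `GlobalPlusUserModel`, its methods, and the learning rate are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: logistic regression whose weights are the sum of a shared
// (global) vector and a per-user correction vector. All names are hypothetical.
final class GlobalPlusUserModel {
    private final double[] global;                               // shared by all users
    private final Map<String, double[]> perUser = new HashMap<>(); // per-user corrections
    private final double learningRate;

    GlobalPlusUserModel(int numFeatures, double learningRate) {
        this.global = new double[numFeatures];
        this.learningRate = learningRate;
    }

    /** Estimated probability that the tweet with features x interests the user. */
    double predict(String user, double[] x) {
        double[] u = perUser.get(user); // null for a user with no feedback yet
        double z = 0.0;
        for (int i = 0; i < global.length; i++) {
            z += (global[i] + (u == null ? 0.0 : u[i])) * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** One SGD step on the log loss, applied to both weight vectors. */
    void update(String user, double[] x, boolean interesting) {
        double error = (interesting ? 1.0 : 0.0) - predict(user, x);
        double[] u = perUser.computeIfAbsent(user, k -> new double[global.length]);
        for (int i = 0; i < global.length; i++) {
            double step = learningRate * error * x[i];
            global[i] += step;
            u[i] += step;
        }
    }

    public static void main(String[] args) {
        GlobalPlusUserModel model = new GlobalPlusUserModel(2, 0.1);
        double[] topicA = {1.0, 0.0};
        double[] topicB = {0.0, 1.0};
        for (int i = 0; i < 200; i++) {
            model.update("alice", topicA, true);   // alice engages with topic A
            model.update("alice", topicB, false);  // and ignores topic B
        }
        System.out.println("alice, topic A: " + model.predict("alice", topicA));
        System.out.println("new user, topic A: " + model.predict("bob", topicA));
    }
}
```

A user with no feedback falls back to the global weights alone, which is one way to handle the cold-start concern raised above; whether the two vectors should share a learning rate or be regularized differently is left open here.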
How about the following idea for modeling user preferences? For all tweets in a user's history, you compute tf-idf (or only term-frequency) feature vectors. You weight each of these vectors by some quality score, computed as suggested in Alexandre's answer (say #clicks/#impressions), and also by time, to make the latest tweets more important. Then you sum them all up, so the resulting feature vector won't be as sparse as the feature vectors of the individual tweets. Then you can use LDA, LSI, or some unsupervised clustering to get better, more semantic representations for these pseudo-documents, which model user preferences. The next time you want to rank new tweets for a user, compute their tf-idf vectors, map them into the latent/semantic space, and rank them according to Euclidean distance to the pseudo-document for this particular user. (There are some details to be worked out, but generally I think it should work.)
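The weighted-sum profile described above can be sketched roughly as follows. This is a simplified illustration, not a full implementation of the proposal: it uses raw term frequencies rather than tf-idf, skips the LDA/LSI mapping and ranks directly in term space, and L2-normalizes vectors before measuring distance (an assumption added here so that vector length does not dominate). The class `UserProfileRanker`, the exponential half-life decay, and all parameter names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: builds a per-user pseudo-document and ranks tweets by distance to it.
final class UserProfileRanker {
    /** Raw term counts for one tweet (lowercased; keeps letters, digits, # and @). */
    static Map<String, Double> termFrequencies(String text) {
        Map<String, Double> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{Nd}#@]+")) {
            if (!token.isEmpty()) tf.merge(token, 1.0, Double::sum);
        }
        return tf;
    }

    /** Assumed weighting: quality score times an exponential time decay. */
    static double weight(double quality, double ageDays, double halfLifeDays) {
        return quality * Math.pow(0.5, ageDays / halfLifeDays);
    }

    /** Adds weight * vector into the running user profile. */
    static void addWeighted(Map<String, Double> profile, Map<String, Double> vector, double weight) {
        for (Map.Entry<String, Double> e : vector.entrySet()) {
            profile.merge(e.getKey(), weight * e.getValue(), Double::sum);
        }
    }

    /** Returns a copy scaled to unit L2 norm (zero vectors are returned unchanged). */
    static Map<String, Double> normalize(Map<String, Double> v) {
        double norm = 0.0;
        for (double x : v.values()) norm += x * x;
        norm = Math.sqrt(norm);
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : v.entrySet()) {
            out.put(e.getKey(), norm == 0.0 ? e.getValue() : e.getValue() / norm);
        }
        return out;
    }

    static double euclideanDistance(Map<String, Double> a, Map<String, Double> b) {
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        double sum = 0.0;
        for (String k : keys) {
            double d = a.getOrDefault(k, 0.0) - b.getOrDefault(k, 0.0);
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Returns the tweets sorted from closest to farthest from the profile. */
    static List<String> rank(Map<String, Double> profile, List<String> tweets) {
        Map<String, Double> p = normalize(profile);
        List<String> sorted = new ArrayList<>(tweets);
        sorted.sort(Comparator.comparingDouble(
                (String t) -> euclideanDistance(p, normalize(termFrequencies(t)))));
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, Double> profile = new HashMap<>();
        addWeighted(profile, termFrequencies("scala compiler release"), weight(0.9, 1, 7));
        addWeighted(profile, termFrequencies("lunch photo"), weight(0.05, 1, 7));
        List<String> incoming = new ArrayList<>();
        incoming.add("what I had for lunch");
        incoming.add("new scala release on the jvm");
        System.out.println(rank(profile, incoming));
    }
}
```

Swapping `termFrequencies` for a tf-idf or topic-space mapping would leave the rest of the sketch unchanged, since everything downstream only sees sparse vectors.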
This is kind of a simplistic approach, but: try running LDA over your current tweets, so you know which topics the user is more interested in. I'm not sure how well LDA would work with 140 characters, but you can always concatenate tweets. Then, with that, try running online LDA to analyze incoming tweets, and from that you might get an idea of how interesting a tweet is to the user. You can go crazy and try hierarchical models to look for meta-topics relating all of the user's tweets.
I would search for Java implementations of a naive Bayes classifier aimed at text classification. A quick search brings up http://classifier4j.sourceforge.net/usage.html
I asked a similar question for this purpose here and here.
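For a sense of what a naive Bayes text classifier involves, here is a small self-contained sketch of the multinomial variant with add-one (Laplace) smoothing. It does not use or mirror the classifier4j API; the class `NaiveBayesTweetClassifier`, its method names, and the label strings are hypothetical, and the tokenizer is deliberately crude.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch only: multinomial naive Bayes with add-one smoothing. Names are hypothetical.
final class NaiveBayesTweetClassifier {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // label -> word -> count
    private final Map<String, Integer> totalWords = new HashMap<>();              // label -> token total
    private final Map<String, Integer> docCounts = new HashMap<>();               // label -> tweet total
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    private static String[] tokenize(String text) {
        return text.toLowerCase().split("[^\\p{L}\\p{Nd}#@]+");
    }

    /** Records one labeled tweet, e.g. a tweet the user marked as interesting. */
    void train(String label, String text) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Log prior plus smoothed log likelihood; the label must have been trained. */
    double logScore(String label, String text) {
        double score = Math.log(docCounts.get(label) / (double) totalDocs);
        Map<String, Integer> counts = wordCounts.get(label);
        double denominator = totalWords.getOrDefault(label, 0) + vocabulary.size();
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            score += Math.log((counts.getOrDefault(token, 0) + 1.0) / denominator);
        }
        return score;
    }

    /** Returns the trained label with the highest score, or null if untrained. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            double s = logScore(label, text);
            if (s > bestScore) {
                bestScore = s;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesTweetClassifier nb = new NaiveBayesTweetClassifier();
        nb.train("interesting", "new scala compiler release");
        nb.train("boring", "eating lunch now");
        System.out.println(nb.classify("scala release notes"));
    }
}
```

To rank rather than classify, one option is to sort tweets by the difference between the two labels' `logScore` values. As noted earlier in the thread, a handful of labeled tweets is unlikely to be enough training data for this to work well on its own.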