
I’m thinking of adding a feature to the TalkingPuffin Twitter client, where, after some training with the user, it can rank incoming tweets according to their predicted value. What solutions are there for the Java virtual machine (Scala or Java preferred) to do this sort of thing?

asked Feb 19 '11 at 21:40

Dave Briccetti

I would search for Java implementations of a naive Bayes classifier aimed at text classification. A quick search brings up http://classifier4j.sourceforge.net/usage.html

(Feb 19 '11 at 23:04) Yaroslav Bulatov

I asked a similar question for this purpose here and here.

(Feb 21 '11 at 06:16) Oliver Mitevski

3 Answers:

In general you'd probably need some supervision for that, and a user marking 10 or so tweets as high-value is far too little supervision (it is not enough to zero in on specific words, for example, or to learn weights for the people your user follows). Luckily, this is essentially the same problem behind the Gmail priority inbox. There was a paper at a recent workshop describing their approach in more detail, but the essential ideas are:

  1. Define a metric of what a valuable tweet is for your users. One simple way of doing this is to say that a tweet is valuable if its links are clicked, it is retweeted, or it is replied to within a certain timeframe after being posted.
  2. Learn a classifier that tries to predict whether or not a tweet will fit those criteria. This is a natural setting for online learning algorithms, since the algorithm first has to predict a relevance for a tweet and is told soon afterwards whether it got that right or wrong (this is why the timeframe above is important). The Google people use an online logistic regression algorithm that has the nice bonus of giving you well-calibrated probabilities that you can threshold as needed. Another nice feature of online algorithms for this problem is that they are naturally suited to forgetting past behavior and adapting to changing preferences (this is achieved by using a deliberately aggressive learning rate schedule).
  3. Use as many interesting features as you can in that classifier. A nice thing about the Google approach is that they learn both a global classifier for all users and a local classifier for each user that adapts to his or her preferences. Hence a new user will already have some pretty good defaults as soon as they install your application (for example, @replies are usually important, tweets from people you started following recently are more likely to be important, as are tweets from people with many followers or many retweets).

I recommend you read that paper, as it also explains details of their algorithm, environment, and assumptions.
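The three steps above can be sketched roughly as follows. This is a minimal illustration and not the algorithm from the paper: the class name `TweetRanker`, the feature names, and the constant learning rate are all assumptions made for the example. It keeps one global weight vector plus one per user, sums them when predicting, and takes a single gradient step whenever the label (did the user engage within the timeframe?) becomes known.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: online logistic regression with global and per-user weights. */
class TweetRanker {
    private final Map<String, Double> globalWeights = new HashMap<>();
    private final Map<String, Map<String, Double>> userWeights = new HashMap<>();
    // Assumption: a constant (non-decaying) rate, so recent feedback keeps mattering.
    private final double learningRate;

    TweetRanker(double learningRate) {
        this.learningRate = learningRate;
    }

    private static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Predicted probability that the user engages with a tweet having these features. */
    double predict(String user, Map<String, Double> features) {
        Map<String, Double> local =
                userWeights.getOrDefault(user, Collections.<String, Double>emptyMap());
        double z = 0.0;
        for (Map.Entry<String, Double> f : features.entrySet()) {
            // Effective weight = shared default + this user's personal adjustment.
            double w = globalWeights.getOrDefault(f.getKey(), 0.0)
                     + local.getOrDefault(f.getKey(), 0.0);
            z += w * f.getValue();
        }
        return sigmoid(z);
    }

    /** One gradient step, called once the outcome for the tweet is known. */
    void update(String user, Map<String, Double> features, boolean engaged) {
        double error = (engaged ? 1.0 : 0.0) - predict(user, features);
        Map<String, Double> local = userWeights.computeIfAbsent(user, u -> new HashMap<>());
        for (Map.Entry<String, Double> f : features.entrySet()) {
            double step = learningRate * error * f.getValue();
            globalWeights.merge(f.getKey(), step, Double::sum);
            local.merge(f.getKey(), step, Double::sum);
        }
    }
}
```

Because every update also moves the global weights, a user the model has never seen still gets predictions shaped by everyone else's feedback, which is the "good defaults" effect described in step 3.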

answered Feb 20 '11 at 14:04

Alexandre Passos ♦

For step 1, you need a way to estimate the number of impressions each tweet got; then you get a much better score for each tweet: #clicks/#impressions.

I think that inducing good semantic representations for the tweets should help the classifier. So, Alexandre, what is the best way to do this for short documents such as tweets? [I know it has been posted somewhere, but I can't find it.]

Alternatively, Leon's suggestion seems plausible: concatenating all tweets for each user and running LDA, or some other unsupervised clustering method suggested here, should give you a good representation of each user's preferences.

(Feb 21 '11 at 06:46) Oliver Mitevski

Doing LDA only on the tweets for each user seems wasteful, especially as new users won't have many tweets in their timeline, and it would mean the topics for each user would be different. I'd run online LDA (there is a good implementation in Vowpal Wabbit) on a subset of the firehose (maybe even the free subset) and use the topics this model assigns to each tweet as features for the global and per-user regressors that predict the probability of something interesting happening.

(Feb 21 '11 at 09:10) Alexandre Passos ♦

How about the following idea for modeling user preferences? For all tweets in a user's history, compute tf-idf (or just term-frequency) feature vectors. Weight each of these vectors by a quality score computed as suggested in Alexandre's answer, say #clicks/#impressions, and also by time, to make the latest tweets more important. Then sum them all up, so that the resulting feature vector won't be as sparse as the feature vectors for the individual tweets. You can then use LDA, LSI, or some unsupervised clustering to get better, more semantic representations of these pseudo-documents, which model user preferences. The next time you want to rank new tweets for a user, compute their tf-idf vectors, map them into the latent/semantic space, and rank them by Euclidean distance to the pseudo-document for that particular user. (There are some details to be worked out, but in general I think it should work.)
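A rough sketch of this pseudo-document idea, under simplifying assumptions: it uses plain term frequencies rather than tf-idf, leaves out the time weighting, and skips the LDA/LSI mapping entirely, so distances are computed in the raw term space. The class and method names are made up for the example, and the quality scores are assumed to be supplied by the caller.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: rank tweets by distance to a quality-weighted user profile. */
class UserProfileRanker {

    /** Scales a sparse vector to unit length in place (no-op for the zero vector). */
    static Map<String, Double> normalize(Map<String, Double> v) {
        double sumSq = 0.0;
        for (double x : v.values()) sumSq += x * x;
        final double norm = Math.sqrt(sumSq);
        if (norm > 0) v.replaceAll((term, x) -> x / norm);
        return v;
    }

    /** Unit-length term-frequency vector for one tweet (idf omitted for brevity). */
    static Map<String, Double> termVector(String tweet) {
        Map<String, Double> tf = new HashMap<>();
        for (String token : tweet.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) tf.merge(token, 1.0, Double::sum);
        }
        return normalize(tf);
    }

    /** Sums past tweet vectors weighted by quality (e.g. clicks / impressions). */
    static Map<String, Double> buildProfile(List<String> tweets, List<Double> quality) {
        Map<String, Double> profile = new HashMap<>();
        for (int i = 0; i < tweets.size(); i++) {
            final double q = quality.get(i);
            termVector(tweets.get(i))
                    .forEach((term, v) -> profile.merge(term, q * v, Double::sum));
        }
        return normalize(profile);
    }

    /** Euclidean distance between two sparse vectors. */
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        Set<String> terms = new HashSet<>(a.keySet());
        terms.addAll(b.keySet());
        double sum = 0.0;
        for (String t : terms) {
            double d = a.getOrDefault(t, 0.0) - b.getOrDefault(t, 0.0);
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Returns the incoming tweets ordered from closest to farthest from the profile. */
    static List<String> rank(Map<String, Double> profile, List<String> incoming) {
        List<String> ranked = new ArrayList<>(incoming);
        ranked.sort(Comparator.comparingDouble(
                (String t) -> distance(profile, termVector(t))));
        return ranked;
    }
}
```

Normalizing both the profile and each tweet vector keeps tweet length from dominating the distance; in a fuller version the latent-space projection would be applied to both vectors before `distance` is called.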

answered Feb 21 '11 at 08:38

Oliver Mitevski

This is kind of a simplistic approach, but:

Try running LDA over the user's current tweets, so you know which topics the user is most interested in. I'm not sure how well LDA would work with 140-character documents, but you can always concatenate tweets.

Then run an online LDA to analyze incoming tweets; from that you might get an idea of how interesting each tweet is for the user.

You can go crazy and try to use hierarchical models to look for meta-topics relating all the tweets of the user.

answered Feb 20 '11 at 22:08

Leon Palafox


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.