Suppose you have a dataset consisting of pairs of tweets and the number of clicks each one received, collected from bit.ly for example. How would you go about predicting the popularity/quality of a tweet?

I think the question is fair, because the clicks depend mostly on the text (and perhaps on some larger context that we have to ignore). They also depend on the user and the followers, but we can assume the dataset comes from a single user and his followers.

So first, what models could we employ? And second, what other features could boost performance, such as the time of the clicks, the time of publishing, etc.?

asked Jan 28 '11 at 06:33

Oliver Mitevski

edited Jan 28 '11 at 06:35


One Answer:

I think you should start with linear regression and see where that takes you.
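To make the baseline concrete, here is a minimal sketch of a linear regression over bag-of-words features. Everything in it is an assumption for illustration: the function names, the toy data, the log(1 + clicks) target (chosen because click counts tend to be heavy-tailed), and the plain SGD training loop with its hyperparameters.

```python
import math
from collections import defaultdict

def bag_of_words(text):
    """Map a tweet to sparse word-count features plus a bias term."""
    feats = defaultdict(float)
    feats["__bias__"] = 1.0
    for token in text.lower().split():
        feats["w=" + token] += 1.0
    return feats

def fit_linear_regression(examples, epochs=200, lr=0.05, l2=1e-3):
    """Fit weights with stochastic gradient descent on squared error.

    `examples` is a list of (text, clicks) pairs. The target is
    log(1 + clicks), an assumed transform, not something from the dataset.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, clicks in examples:
            feats = bag_of_words(text)
            target = math.log1p(clicks)
            pred = sum(weights[f] * v for f, v in feats.items())
            err = pred - target
            for f, v in feats.items():
                # Gradient step with a small L2 penalty on active features.
                weights[f] -= lr * (err * v + l2 * weights[f])
    return weights

def predict_clicks(weights, text):
    """Predict a click count by undoing the log(1 + clicks) transform."""
    feats = bag_of_words(text)
    score = sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return math.expm1(score)

# Made-up toy data: (tweet text, clicks).
train = [
    ("new blog post on kernels", 40),
    ("new paper on topic models", 55),
    ("lunch was good", 2),
    ("coffee was good", 3),
]
weights = fit_linear_regression(train)
high = predict_clicks(weights, "new post on topic models")
low = predict_clicks(weights, "lunch was good coffee")
```

On real data you would hold out some tweets and compare the error against a trivial predictor such as the mean click count, to see whether the text carries any signal at all.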

Time of publishing / day could certainly help. You could bin times by the hour, for example. Perhaps different messages get liked at different times of the day? You could add a combined feature (text/time) to test this.
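One possible way to encode the hourly bins and the combined text/time feature is sketched below. The function name `time_text_features` and the feature-key format are hypothetical; the idea is just one indicator per hour, one count per word, and one crossed feature per (word, hour) pair.

```python
from datetime import datetime

def time_text_features(text, published_at):
    """Build sparse features: hour bin, words, and word-by-hour crosses."""
    hour = published_at.hour  # bin the publishing time by hour (0-23)
    feats = {"hour=%d" % hour: 1.0}
    for token in text.lower().split():
        word_key = "w=" + token
        feats[word_key] = feats.get(word_key, 0.0) + 1.0
        # Crossed feature: lets a linear model learn that a word
        # works better (or worse) at a particular hour.
        cross_key = "w=%s&hour=%d" % (token, hour)
        feats[cross_key] = feats.get(cross_key, 0.0) + 1.0
    return feats

feats = time_text_features("new paper on kernels",
                           datetime(2011, 1, 28, 6, 33))
```

The crossed features multiply the number of parameters by up to 24, so they will be even sparser than the plain word features; coarser bins (morning/afternoon/evening) might be a reasonable compromise with little data.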

You will get sparse text representations, but perhaps you could augment them with word clusters / topic models. Another option is to use gappy character n-gram kernels, which would take care of morphological issues but might also introduce noise.
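As a rough illustration of the gappy character n-gram idea, the sketch below uses an explicit feature map instead of a true kernel implementation: it counts every ordered selection of `n` characters that fits inside a window of `n + max_gap` characters, and takes the dot product of two such count vectors. The names `gappy_char_ngrams` and `gappy_kernel` and the default parameters are assumptions, and this is a simplified, unweighted variant (no gap penalty).

```python
from collections import Counter
from itertools import combinations

def gappy_char_ngrams(text, n=3, max_gap=1):
    """Count character n-grams that may skip up to `max_gap` characters."""
    text = text.lower()
    span = n + max_gap
    feats = Counter()
    for i in range(len(text)):
        window_end = min(len(text), i + span)
        # Anchor the n-gram at position i, then choose the remaining
        # n - 1 characters (in order) from the rest of the window.
        for rest in combinations(range(i + 1, window_end), n - 1):
            feats[text[i] + "".join(text[j] for j in rest)] += 1
    return feats

def gappy_kernel(a, b, n=3, max_gap=1):
    """Dot product of the two gappy n-gram count vectors."""
    fa = gappy_char_ngrams(a, n, max_gap)
    fb = gappy_char_ngrams(b, n, max_gap)
    return sum(count * fb[gram] for gram, count in fa.items())

small = gappy_char_ngrams("abcd", n=2, max_gap=1)
related = gappy_kernel("tweet", "tweets")
unrelated = gappy_kernel("tweet", "click")
```

Inflected forms such as "tweet" and "tweets" share most of their n-grams, which is the morphological benefit; the noise comes from unrelated words that happen to share character patterns.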

It is reasonable to model the relationships between users. If A likes B's tweets, it is much more likely that B likes A's as well (but you would have to normalize by each user's total number of likes in some way). Could you run a PageRank-style algorithm on the network of users?
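A PageRank-style score on the like network could look roughly like the sketch below. It assumes a hypothetical `likes` dictionary mapping each user to the set of users whose tweets they liked. Dividing each user's rank by their number of outgoing likes is one way to do the normalization mentioned above; the damping factor and iteration count are conventional choices, not tuned values.

```python
def pagerank(likes, damping=0.85, iterations=50):
    """Power-iteration PageRank over a 'user likes user' graph.

    `likes` maps each user to the set of users whose tweets they liked.
    """
    users = set(likes)
    for targets in likes.values():
        users.update(targets)
    users = sorted(users)
    n = len(users)
    rank = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in users}
        for u in users:
            # Users with no outgoing likes spread their rank evenly.
            targets = likes.get(u) or users
            # Normalize by the user's total number of likes.
            share = damping * rank[u] / len(targets)
            for v in targets:
                new_rank[v] += share
        rank = new_rank
    return rank

# Made-up example: alice and carol like bob's tweets, bob likes alice's.
likes = {"alice": {"bob"}, "carol": {"bob"}, "bob": {"alice"}}
ranks = pagerank(likes)
```

The resulting score per user could then be fed into the regression as one more feature, for example the rank of the author or the summed rank of the users who retweeted.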

answered Jan 28 '11 at 07:06

Oscar Täckström

edited Jan 28 '11 at 07:07


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.