
I'm trying to pin a sentiment score (-1 to +1) on the features of a product by using the sentiment scores of the adjectives that describe those features. Right now, I do this by averaging the sentiment scores (+1, -1, or 0) of all the adjectives that apply to a feature. The problem is that some features have many (>20) adjectives, while others have just 1 or 2. I want to shrink my estimated sentiment towards some average value in cases where there aren't many observations. What are some principled ways to do this, and where can I learn more?

A secondary problem is that the sentiment scores attached to adjectives only reach about 75% F1, so an adjective labeled -1 (negative) or +1 (positive) may actually be neutral. While this may not be a problem when there are a lot of adjectives describing a feature, how do I adjust for it when there are only a few? Can I shrink these scores as well?

asked Jul 14 '10 at 18:17


aditi

edited Jul 14 '10 at 18:22


One Answer:

You can use a generative beta-binomial model. For each feature, assume its "positiveness probability" is sampled from a Beta distribution,

p_f ~ Beta(alpha, beta)

and assume each word relating to this feature's polarity is 1 with probability p_f. Let w_f be the total (out of n_f) positive words observed; then,

w_f ~ Binomial(n_f, p_f)

You can get shrinkage to the mean by choosing alpha = beta > 1, and shrink towards another default value by setting alpha and beta so that the prior mode, (alpha - 1)/(alpha + beta - 2), equals the default probability. Then, when you compute, for each review, its average positiveness (by averaging the posterior p_fs of its features), features with higher confidence (i.e., more observed adjectives) are going to "affect" the probability more than features with lower confidence (which will have a more diffuse distribution).
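As a minimal sketch of the shrinkage described above: because the Beta prior is conjugate to the Binomial, observing w_f positive words out of n_f gives the posterior Beta(alpha + w_f, beta + n_f - w_f), whose mean is (alpha + w_f)/(alpha + beta + n_f). The function names, the default alpha = beta = 2, and the mapping 2p - 1 back to a [-1, +1] score are illustrative assumptions, not part of the answer. The sketch also assumes each adjective is counted as either positive or not positive, so neutral adjectives would have to be dropped or treated as non-positive beforehand.

```python
def beta_posterior(n_pos, n_total, alpha=2.0, beta=2.0):
    """Return (alpha, beta) of the posterior over p_f.

    Conjugacy: a Beta(alpha, beta) prior plus n_pos positives out of
    n_total gives Beta(alpha + n_pos, beta + n_total - n_pos).
    The default alpha = beta = 2 is an assumed prior centred on 0.5.
    """
    if not 0 <= n_pos <= n_total:
        raise ValueError("need 0 <= n_pos <= n_total")
    return alpha + n_pos, beta + (n_total - n_pos)


def posterior_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def posterior_variance(a, b):
    """Variance of a Beta(a, b) distribution; smaller means more confident."""
    return a * b / ((a + b) ** 2 * (a + b + 1))


def to_sentiment(p):
    """Map a positiveness probability in [0, 1] to a score in [-1, +1] (assumed mapping)."""
    return 2.0 * p - 1.0


few = beta_posterior(1, 1)     # one adjective, and it is positive
many = beta_posterior(20, 20)  # twenty adjectives, all positive

print(to_sentiment(posterior_mean(*few)), posterior_variance(*few))
print(to_sentiment(posterior_mean(*many)), posterior_variance(*many))
```

A raw average would score both features +1. With this prior the single-adjective feature lands at roughly 0.2 and the twenty-adjective feature at roughly 0.83, and the posterior variance can serve as the confidence weight when combining features.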

A nice thing about this sort of model is that you can incorporate a confidence parameter about each word's positiveness, by saying that the word will, with some global (or per-word) probability, actually be used with the opposite polarity. The model then stops being

p_f ~ Beta(alpha, beta)

w_fi ~ Bernoulli(p_f)

and becomes

p_f ~ Beta(alpha, beta)

R_w ~ Beta(alpha', beta')

z_fi ~ Bernoulli(R_w)

w_fi ~ Bernoulli(p_f) if z_fi = 1, else Bernoulli(1 - p_f)

where R_w is the probability that word w is used with its usual polarity, or something like this. See, for example, the noise model of this paper or this paper.
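One way to see the effect of the noise model is to marginalize out z_fi: an observed label is then positive with probability R_w * p_f + (1 - R_w) * (1 - p_f). The sketch below makes two simplifying assumptions that are not in the answer: R_w is a single fixed global value `r` rather than a Beta-distributed variable, and the posterior over p_f is approximated on a grid because it is no longer a Beta distribution. Treating the 75% F1 from the question as r = 0.75 is only a rough stand-in for illustration.

```python
import math


def noisy_posterior_mean(n_pos, n_neg, r=0.75, alpha=2.0, beta=2.0, n_grid=2000):
    """Grid approximation of E[p_f | data] under the label-noise model.

    r is an assumed fixed probability that a label has its usual polarity.
    Each observed label is positive with probability
    q = r * p + (1 - r) * (1 - p).
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be in [0, 1]")
    # Interior grid points only, so every log below is finite.
    grid = [i / n_grid for i in range(1, n_grid)]
    log_w = []
    for p in grid:
        q = r * p + (1.0 - r) * (1.0 - p)
        log_prior = (alpha - 1.0) * math.log(p) + (beta - 1.0) * math.log(1.0 - p)
        log_lik = n_pos * math.log(q) + n_neg * math.log(1.0 - q)
        log_w.append(log_prior + log_lik)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_w)
    weights = [math.exp(v - top) for v in log_w]
    total = sum(weights)
    return sum(p * w for p, w in zip(grid, weights)) / total


clean = noisy_posterior_mean(1, 0, r=1.0)   # labels fully trusted
noisy = noisy_posterior_mean(1, 0, r=0.75)  # labels trusted 75% of the time
coin = noisy_posterior_mean(1, 0, r=0.5)    # labels carry no information
print(clean, noisy, coin)
```

With r = 1 this reduces to the plain beta-binomial posterior, and with r = 0.5 the data are ignored and the prior mean is returned. Values in between shrink a single noisy label further towards the prior, which is the behaviour the question asks for when only a few adjectives are available.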

answered Jul 14 '10 at 19:31


Alexandre Passos ♦

Thanks! I ended up using pretty much this model.

(Jul 15 '10 at 18:56) aditi

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.