|
I'm trying to assign a sentiment score (-1 to +1) to the features of a product by using the sentiment scores of the adjectives that describe those features. Right now, I do this by averaging the sentiment scores (+1, -1, or 0) of all the adjectives that apply to a feature. The problem is that some features have many (>20) adjectives, while others have just 1 or 2. I want to shrink my estimated sentiment towards some average value in cases where there aren't many observations. What are some principled ways to do this, and where can I learn more? A secondary problem is that the sentiment scores attached to adjectives only have about 75% F1, so an adjective labeled -1 (negative) or +1 (positive) may actually be neutral. While this may not be a problem when there are a lot of adjectives describing a feature, how do I adjust for it when there are only a few? Can I shrink these scores as well? |
|
You can use a generative beta-binomial model. For each feature, assume its "positiveness probability" is sampled from a Beta distribution, p_f ~ Beta(alpha, beta), and assume that each word describing the feature has positive polarity with probability p_f. Let w_f be the number of positive words observed out of n_f; then w_f ~ Binomial(n_f, p_f).

You get shrinkage towards 0.5 by choosing alpha = beta > 1, and you can shrink towards another default value by setting alpha and beta so that the prior mode, (alpha - 1)/(alpha + beta - 2), equals that default. Then, when you compute, for each review, its average positiveness (by averaging the posterior p_f of its features), features with higher confidence (i.e., more observed adjectives) are going to "affect" the probability more than features with lower confidence (which will have a more diffuse distribution).

A nice thing about this sort of model is that you can incorporate a confidence parameter about each word's positiveness, by saying that the word will, with some global (or per-word) probability, actually be used with the opposite polarity. The model then stops being

    p_f ~ Beta(alpha, beta)
    w_fi ~ Bernoulli(p_f)

and becomes

    p_f ~ Beta(alpha, beta)
    R_w ~ Beta(alpha', beta')
    z_fi ~ Bernoulli(R_w)
    w_fi ~ Bernoulli(p_f) if z_fi = 1, else Bernoulli(1 - p_f)

where R_w is the probability of each word having its usual polarity, or something like this. See, for example, the noise model of this paper or this paper.

Thanks! I ended up using pretty much this model.
(Jul 15 '10 at 18:56)
aditi
|
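For anyone who wants to try the beta-binomial idea from the answer, here is a minimal Python sketch, not a definitive implementation. Several things in it are assumptions rather than part of the answer: the function names (`posterior_mean`, `to_score`, `noisy_posterior_mean`) are made up for illustration, neutral adjectives are assumed to be dropped before counting, the posterior mean is used as the point estimate, and the noise model is simplified to a single known probability `r` that a label has its usual polarity (instead of placing a Beta prior on R_w). The value `r=0.75` is only a placeholder; an F1 of 75% is not the same thing as label accuracy.

```python
def posterior_mean(w, n, alpha=2.0, beta=2.0):
    """Posterior mean of p_f after seeing w positive adjectives out of n.

    With a Beta(alpha, beta) prior and a Binomial likelihood, the posterior
    is Beta(alpha + w, beta + n - w), whose mean is computed here.
    """
    return (alpha + w) / (alpha + beta + n)


def to_score(p):
    """Map a positiveness probability in [0, 1] to a score in [-1, +1]."""
    return 2.0 * p - 1.0


def noisy_posterior_mean(w, n, r=0.75, alpha=2.0, beta=2.0, grid=2001):
    """Posterior mean of p_f when each label is right with probability r.

    Assumption: r is a known constant (a simplification of the R_w prior).
    An observed label is positive with probability q = r*p + (1-r)*(1-p).
    The posterior is no longer a Beta distribution, so it is approximated
    by a weighted sum over an evenly spaced grid of p values in (0, 1).
    """
    num = 0.0
    den = 0.0
    for i in range(1, grid):
        p = i / grid
        q = r * p + (1.0 - r) * (1.0 - p)
        prior = p ** (alpha - 1.0) * (1.0 - p) ** (beta - 1.0)
        likelihood = q ** w * (1.0 - q) ** (n - w)
        weight = prior * likelihood
        num += p * weight
        den += weight
    return num / den


if __name__ == "__main__":
    # One positive adjective out of one vs. twenty out of twenty.
    for w, n in [(1, 1), (20, 20)]:
        clean = to_score(posterior_mean(w, n))
        noisy = to_score(noisy_posterior_mean(w, n))
        print(f"w={w} n={n} score={clean:.3f} noise-adjusted={noisy:.3f}")
```

With the default alpha = beta = 2, a feature described by a single positive adjective gets a posterior mean of 0.6 (a score of about 0.2), while a feature with 20 positive adjectives out of 20 gets about 0.917 (a score of about 0.83). Allowing for label noise pulls the sparse estimate further towards the middle, which is the behaviour the question asks for.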