
Can anyone describe techniques that are unsupervised for inducing word representations that can be used for word polarity detection?

[edit: Just to be clear: I would like to avoid using a seed set of words, or any form of initial supervision. I want an approach that is purely unsupervised in the first phase, then has a subsequent supervised or semi-supervised phase. (The unsupervised phase should do most of the learning; the subsequent fine-tuning should just be icing on the cake.) I don't want to do semi-supervised training initially, which is what using a seed set of words does.]

I imagine something of the following form would work, but I am looking for more concrete references (not just speculation): In the unsupervised step, represent each word as the distribution of words that co-occur within a window of k words of the focus word. (This assumes that words with the same polarity tend to co-occur in windows of size k.) In the possible supervised step, use a very small number (<< vocabulary size, e.g. 100) of labeled examples to learn a model, taking the co-occurrence distribution (distributional word representation) and mapping it to a probability of having positive valence. I don't actually care about this supervised step; I care only that my unsupervised step captures sufficient information in the word representations that they could be used for polarity detection without much training data.
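To make that concrete, here is a rough sketch of what I mean. The corpus format, window size, and the choice of a logistic-regression classifier are just placeholder assumptions, not a reference implementation:

    # Sketch: unsupervised co-occurrence representations, then a tiny supervised mapping.
    # `sentences` (tokenized), the window size k, and the classifier are illustrative assumptions.
    from collections import Counter, defaultdict

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def cooccurrence_vectors(sentences, k=2, top_n=5000):
        """Represent each word by the distribution of words co-occurring within k tokens."""
        word_counts = Counter(w for s in sentences for w in s)
        context_vocab = [w for w, _ in word_counts.most_common(top_n)]
        context_index = {w: i for i, w in enumerate(context_vocab)}

        counts = defaultdict(lambda: np.zeros(len(context_vocab)))
        for s in sentences:
            for i, w in enumerate(s):
                for j in range(max(0, i - k), min(len(s), i + k + 1)):
                    if j != i and s[j] in context_index:
                        counts[w][context_index[s[j]]] += 1

        # Normalize counts into distributions: these are the unsupervised word representations.
        return {w: v / v.sum() for w, v in counts.items() if v.sum() > 0}

    def fit_polarity(representations, labeled_words):
        """Supervised step: map ~100 labeled words (e.g. {"great": 1, "awful": 0}) to polarity."""
        words = [w for w in labeled_words if w in representations]
        X = np.array([representations[w] for w in words])
        y = np.array([labeled_words[w] for w in words])
        return LogisticRegression(max_iter=1000).fit(X, y)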

Does anyone have any references about this topic?

asked Jul 05 '10 at 15:36


Joseph Turian ♦♦

edited Jul 10 '10 at 16:29


6 Answers:

The approach you described is sensible: take a seed set of words and use context distributional similarity to induce more words. Another thing I'd try is to take text which already has sentiment attached, like movie, game, restaurant, or other kinds of reviews, and correlate user ratings with words. There aren't many products for which you can't find sentiment-annotated data. Lastly, for a more resource-intensive version of the bootstrapping approach you described, there was a good paper at the most recent NAACL, The Viability of Web-derived Polarity Lexicons, that might be of interest.
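For the rating-correlation idea, something as simple as a per-word average rating already gives a rough lexicon; a minimal sketch, assuming a list of (tokenized review, star rating) pairs:

    # Sketch: score each word by the average star rating of the reviews it appears in.
    from collections import defaultdict

    def word_rating_scores(reviews, min_count=20):
        """`reviews` is assumed to be a list of (tokens, rating) pairs."""
        totals, counts = defaultdict(float), defaultdict(int)
        for tokens, rating in reviews:
            for w in set(tokens):
                totals[w] += rating
                counts[w] += 1
        # Words whose average rating sits far from the corpus mean act as polarity cues.
        return {w: totals[w] / counts[w] for w in totals if counts[w] >= min_count}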

answered Jul 05 '10 at 16:04


aria42

Good reference, Aria! I missed this one.

(Jul 05 '10 at 16:49) Delip Rao

The problem is, I would like to avoid using a seed set of words. I want an approach that is purely unsupervised, then has a subsequent supervised or semi-supervised phase. I don't want to do semi-supervised training initially.

Thanks for the NAACL 2010 reference, I'll check it out.

(Jul 05 '10 at 16:49) Joseph Turian ♦♦

Eh, no free lunch. I don't think it's possible to get this in a totally unsupervised way, but I could be wrong.

(Jul 06 '10 at 14:42) aria42

Depends. For example, you can't get great POS tags in a purely unsupervised way (I think?), but you can train an unsupervised model and then at the end add a little supervision to transform the unsupervised representation into the desired supervised POS tags.

Similarly, my hope is to induce unsupervised word representations that contain enough lexico-semantic information that they can be fed to a supervised classifier with just a few labelled examples, and produce a high-quality model. Do you not believe this is possible?

(Jul 06 '10 at 19:06) Joseph Turian ♦♦

@Joseph, I think the idea you propose is similar to:

Peter D. Turney and Michael L. Littman. 2003. Measuring praise and criticism: Inference of semantic orientation from association. ACM Transactions on Information Systems, 21(4):315–346

The earliest work on this was by Hatzivassiloglou and McKeown (1997) who used the insight that, for English, the conjunction "and" links adjectives of similar polarity and the conjunction "but" links adjectives of opposite polarity.

Vasileios Hatzivassiloglou and Kathleen McKeown. 1997. Predicting the semantic orientation of adjectives. In Proceedings of the ACL, pages 174–181.
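A toy version of that conjunction heuristic might look like the following; the POS tagging and the naive propagation from a tiny seed are my simplifications, not what the original paper does (it clusters the adjective graph into two sets):

    # Sketch: "ADJ and ADJ" suggests same polarity, "ADJ but ADJ" suggests opposite polarity.
    import nltk  # assumes the POS-tagger models have been downloaded

    def conjunction_pairs(sentences):
        """Yield (adj1, adj2, same_polarity) triples from tokenized sentences."""
        for sent in sentences:
            tagged = nltk.pos_tag(sent)
            for i in range(len(tagged) - 2):
                (w1, t1), (conj, _), (w2, t2) = tagged[i], tagged[i + 1], tagged[i + 2]
                if t1.startswith("JJ") and t2.startswith("JJ") and conj in ("and", "but"):
                    yield w1.lower(), w2.lower(), conj == "and"

    def propagate(pairs, seeds):
        """Naively propagate +1/-1 labels from a small seed dict over the extracted pairs."""
        pairs = list(pairs)
        labels = dict(seeds)
        changed = True
        while changed:
            changed = False
            for a, b, same in pairs:
                for x, y in ((a, b), (b, a)):
                    if x in labels and y not in labels:
                        labels[y] = labels[x] if same else -labels[x]
                        changed = True
        return labels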

Jan Wiebe used distributional clustering to label adjectives. This is a "cluster-and-label" kind of semi-supervised learning and quite similar to @Joseph's suggestion.

Janyce M. Wiebe. 2000. Learning subjective adjectives from corpora. In Proceedings of AAAI-2000.

In the presence of additional lexical resources like WordNet, graph-based semi-supervised learning approaches can be applied.

Delip Rao and Deepak Ravichandran. 2009. Semi-Supervised Polarity Lexicon Induction. In Proceedings of the EACL '09.

answered Jul 05 '10 at 16:42


Delip Rao

edited Jul 05 '10 at 16:51

That's kind of cheating on your no-supervision requirement, but if your data contains emoticons (smileys), these may serve as weak and noisy labels. For instance, this paper labels text using emoticons (J. Read, 2005. Using Emoticons to Reduce Dependency in Machine Learning Techniques for Sentiment Classification).

Here is a more n-gram-based approach using the same method: Twitter as a Corpus for Sentiment Analysis and Opinion Mining.
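To make the emoticon trick concrete, the weak labeling step can be as simple as the following sketch (the emoticon lists are just examples):

    # Sketch: use emoticons as weak, noisy sentiment labels, then strip them from the text.
    POSITIVE = {":)", ":-)", ":D", "=)"}
    NEGATIVE = {":(", ":-(", ":'(", "=("}

    def weak_label(tokens):
        """Return (label, tokens_without_emoticons); label is None if no clear signal."""
        pos = any(t in POSITIVE for t in tokens)
        neg = any(t in NEGATIVE for t in tokens)
        if pos == neg:  # no emoticon found, or conflicting signals
            return None, tokens
        cleaned = [t for t in tokens if t not in POSITIVE | NEGATIVE]
        return (1 if pos else 0), cleaned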

answered Jul 06 '10 at 06:22


Amaç Herdağdelen

It is unclear to me why you would want to be fully unsupervised in the first place. For English, there are a lot of freely available polarity lexicons that are pretty good. Also, getting seed sets (or even labeled training instances) for other languages is a relatively easy thing. Humans (even non-NLP experts) are pretty good at it, assuming that the sentiment is expressed fairly directly, as opposed to the sort of phenomena that were studied in Greene + Resnick (2009). Of course, we also have a huge amount of supervised data in terms of rated user reviews for a number of domains. Though this supervision is not direct if your goal is to learn lexicons, it is easy to imagine a number of scenarios where you could generate domain-specific lexicons from it (e.g., vanilla mutual information with positive/negative reviews). Perhaps I am missing something obvious, but why would you not want to use all this easy-to-acquire knowledge? Even if you have a new domain with no resources, information from other domains should not be ignored, given that there are so many cross-domain sentiment terms (e.g., 'great', 'awesome', ...).
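To make the mutual-information suggestion concrete, a rough sketch, assuming a list of (tokenized review, label) pairs where the label comes from the user rating:

    # Sketch: score words by pointwise mutual information with positive vs. negative reviews.
    import math
    from collections import Counter

    def pmi_lexicon(reviews, min_count=10):
        """`reviews` is assumed to be a list of (tokens, label) pairs with label in {0, 1}."""
        word_pos, word_all = Counter(), Counter()
        n_pos = sum(label for _, label in reviews)
        p_pos = n_pos / len(reviews)
        for tokens, label in reviews:
            for w in set(tokens):
                word_all[w] += 1
                word_pos[w] += label
        scores = {}
        for w, c in word_all.items():
            if c < min_count:
                continue
            p_pos_given_w = word_pos[w] / c
            # PMI(w, pos) - PMI(w, neg): positive values suggest positive polarity.
            scores[w] = (math.log((p_pos_given_w + 1e-9) / p_pos)
                         - math.log((1 - p_pos_given_w + 1e-9) / (1 - p_pos)))
        return scores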

Once you have prior knowledge (via seed sets or some other polarity lexicon or some auxiliary signal) you can do something like a joint-sentiment-topic model or any number of different ways to piggy-back on this prior knowledge.

At some point, I think you will need to inject some weak prior knowledge, such as seed sets, into the model. Sentiment is often expressed via adjectives, which are distributionally similar to each other at a local level, e.g., "the shirt was great", "the shirt was bad", "the shirt was green". It might be possible to split off sentiment from non-sentiment adjectives, but positive-versus-negative adjectives will be tough, since every focus word that can be 'good' can also be 'bad'. You may get something better by looking at document-level co-occurrence, but even then, without a little prior knowledge to push the model in the right direction (even just one or two words), I would expect to get noisy results.

answered Jul 06 '10 at 11:11


Ryan McDonald

edited Jul 06 '10 at 19:07 by Joseph Turian ♦♦

Ryan, you are correct that it is easy to specify seeds. For this reason, I am back and forth on whether I want to use them or not. The argument against using them is that my main interest is in purely unsupervised techniques for inducing word representations, such that all lexico-semantic information is contained in the word representation. So giving seed words for polarity does not lead to more general-purpose word representations. Does this make sense?

Regarding this idea of looking at document-wide co-occurrence distributions, this is also what I had in mind. The question is whether a word representation based upon this document-level co-occurrence is too noisy. I believe that clustering these word representations would be too noisy, but I am cautiously optimistic that these word representations could be mapped to polarity by using just the seed set as supervision. Not sure, though.

(Jul 06 '10 at 19:03) Joseph Turian ♦♦

I have some questions about your suggestion:

  1. How is the unsupervised step different from standard distributed word encoding algorithms (like Collobert & Weston, or Mnih & Hinton)?

  2. I think if you're looking explicitly for polarity, it would do you good to think at the sentence and document level as well, since this might incorporate some information about polarity that would not be present in a k-word window.

  3. Isn't it weird to assume that words in the same small window have the same polarity? Shouldn't this suffer a lot from neutral words (such as stopwords, discourse markers, etc.)? For example, this extract from an Amazon review, "has written an ambitious, hugely human novel", is clearly of positive polarity, but no word in there except "hugely" and "human" could or should be generally interpreted as positive. I somewhat feel that a good word representation that captures polarity should be less local and n-gram-like, and more global and topic-like. Although, in some experiments I did with sentiment classification, bigrams were a lot more discriminative and easier to interpret than words. I'm not sure that this has been reproduced, however.

answered Jul 10 '10 at 16:47


Alexandre Passos ♦

I can describe an approach that I use in my work and link to a good paper about the topic.

I first trained the Rainbow classifier on the standard movie reviews corpus that is included in the Python NLTK. Then I sorted the list of words in the classification corpus by information gain, i.e., got a list of the words that were most polar between the positive and negative categories. Then I did a simple count on each word to see if it occurred more often in the positive or negative reviews.

I used these words as seed lists and followed the procedure outlined in this paper, "Large-Scale Sentiment Analysis for News and Blogs". Essentially, you start with your seed lists and then recursively query the WordNet database (also included with the Python NLTK) to find synonyms. The details of the process are outlined in the paper. It can be a bit involved, but it's produced decent results for me. The polarity of words can change slightly between categories (e.g., for technology content "speedy" is positive, but for dating content it might not be), but the movie reviews provide a good general starting point with 1000 documents in each category.
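For reference, a rough sketch of the seed-extraction and WordNet-expansion steps; this simplifies both the information-gain ranking and the recursion in the paper to a count ratio and a fixed expansion depth:

    # Sketch: pick seed words from the NLTK movie_reviews corpus by how skewed their
    # counts are across the pos/neg categories, then expand the seeds via WordNet synonyms.
    from collections import Counter

    from nltk.corpus import movie_reviews, wordnet

    def seed_words(n=50, min_count=20):
        pos = Counter(w.lower() for w in movie_reviews.words(categories="pos"))
        neg = Counter(w.lower() for w in movie_reviews.words(categories="neg"))
        ratio = {w: (pos[w] + 1) / (neg[w] + 1)
                 for w in set(pos) | set(neg) if pos[w] + neg[w] >= min_count}
        ranked = sorted(ratio, key=ratio.get)
        return set(ranked[-n:]), set(ranked[:n])  # (positive seeds, negative seeds)

    def expand(seeds, depth=2):
        """Recursively add WordNet synonyms of the seed words."""
        words, frontier = set(seeds), set(seeds)
        for _ in range(depth):
            new = set()
            for w in frontier:
                for syn in wordnet.synsets(w):
                    new.update(lemma.name().lower() for lemma in syn.lemmas())
            frontier = new - words
            words |= frontier
        return words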

answered Jul 11 '10 at 04:50


Joel H
