|
Can anyone describe unsupervised techniques for inducing word representations that can be used for word polarity detection?

[edit: Just to be clear: I would like to avoid using a seed set of words, or any form of initial supervision. I want an approach that is purely unsupervised in the first phase, followed by a supervised or semi-supervised phase. (The unsupervised phase should do most of the learning; the subsequent fine-tuning should just be icing on the cake.) I don't want to do semi-supervised training initially, which is what using a seed set of words amounts to.]

I imagine something of the following form would work, but I am looking for concrete references (not just speculation):

In the unsupervised step, represent each word as the distribution of words that cooccur within a window of k words of the focus word. (This assumes that words with the same polarity tend to cooccur in windows of size k.)

In the possible supervised step, use a very small number (<< vocabulary size, e.g. 100) of labeled examples to learn a model that takes the cooccurrence distribution (distributional word representation) and maps it to a probability of having positive valence.

I don't actually care about this supervised step; I care only that my unsupervised step captures sufficient information in the word representations that they could be used for polarity detection without much training data. Does anyone have any references on this topic?
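To make the two steps concrete, here is a minimal toy sketch: an unsupervised window-cooccurrence representation, followed by a tiny supervised step that maps it to polarity via nearest neighbor over a handful of labeled words. The corpus and labels are invented for illustration, and in practice you would need far more data for the representations to be meaningful:

```python
from collections import Counter, defaultdict
import math

def cooccurrence_vectors(sentences, k=2):
    """Unsupervised step: represent each word by the counts of
    words that cooccur within k tokens of it."""
    vecs = defaultdict(Counter)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
                if j != i:
                    vecs[w][tokens[j]] += 1
    return vecs

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def polarity(word, vecs, labeled):
    """Tiny supervised step: 1-nearest-neighbor over a handful
    of labeled (word, polarity) pairs."""
    return max(labeled, key=lambda lw: cosine(vecs[word], vecs[lw[0]]))[1]

# Invented toy corpus: same-polarity words share context words.
sentences = [s.split() for s in [
    "great fun", "wonderful fun", "good fun",
    "awful junk", "terrible junk", "bad junk",
]]
vecs = cooccurrence_vectors(sentences, k=2)
labeled = [("great", "pos"), ("awful", "neg")]
print(polarity("wonderful", vecs, labeled))  # -> pos
print(polarity("terrible", vecs, labeled))  # -> neg
```

The supervised step here is deliberately trivial; the point is that all lexico-semantic information lives in the unsupervised vectors, and the labeled examples only orient them.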
|
The approach you described is sensible: take a seed set of words and use context distributional similarity to induce more words. Another thing I'd try is to take text that already carries sentiment (movie, game, restaurant, or other kinds of reviews) and correlate user ratings with words. There aren't many products for which you can't find sentiment-annotated data. Lastly, for a more resource-intensive version of the bootstrapping approach you described, there was a good paper at this most recent NAACL, "The Viability of Web-derived Polarity Lexicons", that might be of interest.

Good reference, Aria! I missed this one.
(Jul 05 '10 at 16:49)
Delip Rao
The problem is, I would like to avoid using a seed set of words. I want an approach that is purely unsupervised, then has a subsequent supervised or semi-supervised phase. I don't want to do semi-supervised training initially. Thanks for the NAACL 2010 reference, I'll check it out.
(Jul 05 '10 at 16:49)
Joseph Turian ♦♦
Eh, no free lunch. I don't think it's possible to get this in a totally unsupervised way, but I could be wrong.
(Jul 06 '10 at 14:42)
aria42
Depends. For example, you can't get great POS tags in a purely unsupervised way (I think?), but you can train an unsupervised model and then at the end add a little supervision to transform the unsupervised representation into the desired supervised POS tags. Similarly, my hope is to induce unsupervised word representations that contain enough lexico-semantic information that they can be fed to a supervised classifier with just a few labelled examples, and produce a high-quality model. Do you not believe this is possible?
(Jul 06 '10 at 19:06)
Joseph Turian ♦♦
|
|
@Joseph, I think the idea you propose is similar to:

Peter D. Turney and Michael L. Littman. 2003. Measuring praise and criticism: Inference of semantic orientation from association. ACM Transactions on Information Systems, 21(4):315–346.

The earliest work on this was by Hatzivassiloglou and McKeown (1997), who used the insight that, for English, the conjunction "and" links adjectives of similar polarity and the conjunction "but" links adjectives of opposite polarity.

Vasileios Hatzivassiloglou and Kathleen McKeown. 1997. Predicting the semantic orientation of adjectives. In Proceedings of the ACL, pages 174–181.

Jan Wiebe used distributional clustering to label adjectives. This is a "cluster-and-label" kind of semi-supervised learning and quite similar to @Joseph's suggestion.

Janyce M. Wiebe. 2000. Learning subjective adjectives from corpora. In Proceedings of the 2000 National Conference on Artificial Intelligence. AAAI.

In the presence of additional lexical resources like WordNet, graph-based semi-supervised learning approaches can be applied.

Delip Rao and Deepak Ravichandran. 2009. Semi-supervised polarity lexicon induction. In Proceedings of EACL '09.
|
That's kind of cheating on your unsupervision requirement, but if your data contains emoticons (smileys), they may serve as weak and noisy labels. For instance, this paper labels text using the emoticons: J. Read. 2005. Using Emoticons to Reduce Dependency in Machine Learning Techniques for Sentiment Classification. Here is a more ngram-based approach using the same method: Twitter as a Corpus for Sentiment Analysis and Opinion Mining.
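A minimal sketch of the emoticons-as-noisy-labels idea: treat any text containing a smiley as distantly labeled, then count which words show up under each label. The emoticon sets and example tweets below are invented for illustration:

```python
from collections import Counter

# Illustrative emoticon sets; a real system would use a larger list.
POS_EMOTICONS = {":)", ":-)", ":D"}
NEG_EMOTICONS = {":(", ":-("}

def emoticon_label(text):
    """Distant supervision: use emoticons as noisy polarity labels."""
    tokens = text.split()
    if any(t in POS_EMOTICONS for t in tokens):
        return "pos"
    if any(t in NEG_EMOTICONS for t in tokens):
        return "neg"
    return None  # unlabeled

def word_polarity_counts(texts):
    """Count word occurrences under each noisy label."""
    counts = {"pos": Counter(), "neg": Counter()}
    for text in texts:
        label = emoticon_label(text)
        if label is None:
            continue
        for tok in text.split():
            if tok not in POS_EMOTICONS | NEG_EMOTICONS:
                counts[label][tok] += 1
    return counts

tweets = [
    "loved the new phone :)",
    "what a great day :D",
    "my flight got cancelled :(",
    "terrible service again :(",
]
counts = word_polarity_counts(tweets)
print(counts["pos"]["loved"], counts["neg"]["terrible"])  # -> 1 1
```

From these counts you could derive per-word polarity scores, e.g. by comparing a word's frequency under the two labels.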
|
It is unclear to me why you would want to be fully unsupervised in the first place. For English, there are a lot of freely available polarity lexicons that are pretty good. Also, getting seed sets (or even labeled training instances) for other languages is relatively easy. Humans (even non-NLP experts) are pretty good at it, assuming that the sentiment is expressed fairly directly, as opposed to the sort of phenomena studied in Greene and Resnick (2009).

Of course, we also have a huge amount of supervised data in the form of rated user reviews for a number of domains. Though this supervision is not direct if your goal is to learn lexicons, it is easy to imagine a number of scenarios where you could generate domain-specific lexicons from it (e.g., vanilla mutual information with positive/negative reviews). Perhaps I am missing something obvious, but why would you not want to use all this easy-to-acquire knowledge? Even if you have a new domain with no resources, information from other domains should not be ignored, given that there are so many cross-domain sentiment terms (e.g., 'great', 'awesome', ...).

Once you have prior knowledge (via seed sets, some other polarity lexicon, or some auxiliary signal), you can do something like a joint sentiment-topic model, or any number of other ways to piggy-back on this prior knowledge. At some point, I think you will need to inject some weak prior knowledge, such as seed sets, into the model. Sentiment is often expressed via adjectives, which are distributionally similar to each other at a local level, e.g., "the shirt was great", "the shirt was bad", "the shirt was green". It might be possible to split off sentiment from non-sentiment adjectives, but positive-versus-negative adjectives will be tough, since every focus word that can be 'good' can also be 'bad'.
You may get something better by looking at document-level co-occurrence, but even then, without a little prior knowledge to push the model in the right direction (even just one or two words), I would imagine getting noisy results.

Ryan, you are correct that it is easy to specify seeds. For this reason, I am back and forth on whether I want to use them or not. The argument against using them is that my main interest is in purely unsupervised techniques for inducing word representations, such that all lexico-semantic information is contained in the word representation. Giving seed words for polarity does not lead to more general-purpose word representations. Does this make sense? Regarding the idea of looking at document-wide cooccurrence distributions, this is also what I had in mind. The question is whether a word representation based on document-level cooccurrence is too noisy. I believe that clustering these word representations would be too noisy, but I am cautiously optimistic that these word representations could be mapped to polarity by using just the seed set as supervision. Not sure, though.
(Jul 06 '10 at 19:03)
Joseph Turian ♦♦
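The "vanilla mutual information with positive/negative reviews" idea mentioned above can be sketched as a smoothed log-odds score, which (up to smoothing) equals the difference between a word's PMI with the positive class and its PMI with the negative class. The review snippets below are made up for illustration:

```python
import math
from collections import Counter

def polarity_lexicon(docs):
    """docs: list of (tokens, label) pairs, label in {'pos', 'neg'}.
    Score each word by smoothed log( P(w|pos) / P(w|neg) ), using
    document-level presence counts."""
    word_label = {"pos": Counter(), "neg": Counter()}
    for tokens, label in docs:
        word_label[label].update(set(tokens))  # presence, not frequency
    n_pos = sum(1 for _, label in docs if label == "pos")
    n_neg = len(docs) - n_pos
    scores = {}
    for w in set(word_label["pos"]) | set(word_label["neg"]):
        p = (word_label["pos"][w] + 0.5) / (n_pos + 1)  # add-0.5 smoothing
        q = (word_label["neg"][w] + 0.5) / (n_neg + 1)
        scores[w] = math.log(p / q)  # >0 positive-leaning, <0 negative
    return scores

# Hypothetical rated reviews, binarized to pos/neg.
docs = [
    ("great phone loved it".split(), "pos"),
    ("great battery great screen".split(), "pos"),
    ("awful support returned it".split(), "neg"),
    ("awful battery died".split(), "neg"),
]
scores = polarity_lexicon(docs)
print(scores["great"] > 0, scores["awful"] < 0)  # -> True True
```

Words that occur equally often in both classes (like "it" and "battery" here) score near zero, which is the behavior you want from a domain-specific lexicon builder.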
|
|
I have some questions about your suggestion:
|
|
I can describe an approach that I use in my work and link to a good paper on the topic. I first trained the Rainbow classifier on the standard movie reviews corpus that is included in the Python NLTK. Then I sorted the list of words in the classification corpus by info gain, i.e., I got a list of the words that were most polar between the positive and negative categories. Then I did a simple count on each word to see whether it occurred more often in the positive or negative reviews.

I used these words as seed lists and followed the procedure outlined in the paper "Large-Scale Sentiment Analysis for News and Blogs". Essentially, you start with your seed lists and then recursively query the WordNet database (also included with the Python NLTK) to find synonyms. The details of the process are outlined in the paper. It can be a bit involved, but it has produced decent results for me.

The polarity of words can change slightly between categories (e.g., for technology content "speedy" is positive, but for dating content it might not be), but the movie reviews provide a good general starting point, with 1000 documents in each category.
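The recursive synonym-expansion step can be sketched as follows. A toy synonym dict stands in for WordNet here so the example is self-contained; in practice you would plug in lookups via `nltk.corpus.wordnet` (e.g. collecting lemma names from each word's synsets), and the seed words are illustrative:

```python
def expand_seeds(seeds, synonyms, depth=2):
    """Recursively expand a seed list via a synonym lookup table.
    `synonyms` maps word -> set of synonyms; `depth` bounds how many
    hops of expansion are taken (deeper hops drift in meaning)."""
    lexicon = set(seeds)
    frontier = set(seeds)
    for _ in range(depth):
        new = set()
        for w in frontier:
            new |= synonyms.get(w, set())
        frontier = new - lexicon  # only expand words we haven't seen
        lexicon |= frontier
    return lexicon

# Toy synonym graph standing in for WordNet lookups.
SYNS = {
    "good": {"great", "fine"},
    "great": {"excellent"},
    "bad": {"awful"},
}
print(sorted(expand_seeds({"good"}, SYNS)))
# -> ['excellent', 'fine', 'good', 'great']
```

Bounding the recursion depth matters: with each hop, WordNet synonyms drift further from the seed's polarity, which is one reason the paper's procedure is more involved than this sketch.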