I am working on a document classification task. It seems tempting to try to integrate the recently released word2vec tool, either by training it on the documents themselves or by training it on a larger background collection (e.g. Wikipedia).

One approach might be to compute a tf-idf weighted average word vector for each document; another, to cluster the words in the document using k-means (à la bag-of-visual-words in computer vision).
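As a minimal sketch of the first approach: the snippet below uses plain numpy, with a toy corpus and random `word_vecs` standing in for actual word2vec output, to compute a tf-idf weighted average word vector per document.

```python
import math
from collections import Counter

import numpy as np

# Toy stand-ins: in practice word_vecs would come from a trained
# word2vec model and docs would be your tokenized corpus.
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["cat", "and", "dog"]]
dim = 8
rng = np.random.default_rng(0)
word_vecs = {w: rng.normal(size=dim) for doc in docs for w in doc}

# Inverse document frequency over the corpus.
df = Counter(w for doc in docs for w in set(doc))
idf = {w: math.log(len(docs) / c) for w, c in df.items()}

def doc_vector(doc):
    """tf-idf weighted average of the word vectors in one document."""
    vec, total = np.zeros(dim), 0.0
    for w, tf in Counter(doc).items():
        if w in word_vecs:
            weight = tf * idf.get(w, 0.0)
            vec += weight * word_vecs[w]
            total += weight
    return vec / total if total > 0 else vec

features = np.stack([doc_vector(d) for d in docs])  # one row per document
```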

However, I don't know of any successful attempts to use word vector representations for document classification. Are there any? Is there a good reason it should not work?

asked Sep 24 '13 at 01:40

Maarten

edited Sep 24 '13 at 09:52

I found this paper: http://ai.stanford.edu/~ang/papers/nipsdlufl10-ProbabilisticModelSemanticWordVectors.pdf

They use the mean word vector for sentiment analysis. However, they adapted their word model to work at the document level (like LDA).

(Sep 24 '13 at 05:36) Maarten

2 Answers:

It isn't nearly as easy to do this as one might imagine. Averaging a bunch of low-dimensional dense word vectors will perform poorly, and unless you binarize the word counts first, the average will be dominated by frequent words. If you have high-dimensional sparse vectors, then adding them makes more sense.
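To make the binarization point concrete, here is a small sketch (with a hypothetical `word_vecs` lookup) contrasting the raw average, where a frequent word contributes once per occurrence, with the binarized average, where each distinct word contributes once:

```python
import numpy as np

def average_vector(doc, word_vecs, binarize=True):
    """Average the word vectors of a tokenized document.

    With binarize=True each distinct word contributes once, so frequent
    words cannot dominate; with binarize=False a word contributes once
    per occurrence.
    """
    words = set(doc) if binarize else doc
    vecs = [word_vecs[w] for w in words if w in word_vecs]
    if not vecs:
        return np.zeros(len(next(iter(word_vecs.values()))))
    return np.mean(vecs, axis=0)

# Toy usage with placeholder vectors.
rng = np.random.default_rng(0)
word_vecs = {w: rng.normal(size=8) for w in ["the", "cat", "sat", "mat"]}
doc = ["the", "cat", "sat", "on", "the", "mat"]
print(average_vector(doc, word_vecs))         # each distinct word once
print(average_vector(doc, word_vecs, False))  # "the" counts twice
```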

You can run k-means on the word vectors from word2vec and then encode each word as the id of the k-means centroid it is closest to. Then you can build a histogram. This would be like the word count histogram, except using a k-means codebook built from the dense word vectors.
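A rough sketch of that codebook idea using scikit-learn's KMeans; the random `word_vecs` below is again just a placeholder for real word2vec vectors, and the toy `k = 3` would be far larger on a real vocabulary:

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder vocabulary vectors; substitute real word2vec output.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "dog", "sat", "ran", "and"]
word_vecs = {w: rng.normal(size=8) for w in vocab}

# Build the codebook: one centroid id per vocabulary word.
k = 3  # toy value; use many more clusters on a real vocabulary
X = np.stack([word_vecs[w] for w in vocab])
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
word2code = dict(zip(vocab, km.labels_))

def codeword_histogram(doc):
    """Histogram of centroid ids, analogous to a word count histogram."""
    hist = np.zeros(k)
    for w in doc:
        if w in word2code:
            hist[word2code[w]] += 1
    return hist / max(hist.sum(), 1.0)

print(codeword_histogram(["the", "cat", "sat", "and", "the", "dog"]))
```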

You could also fit a statistical model to each bag of vectors and characterize the bag by the parameters of the model if they are identifiable, or by the score the model assigns to each training document otherwise. This paper has more information about related approaches in the context of kernel methods: http://www.ai.mit.edu/projects/jmlr/papers/volume5/jebara04a/source/jebara04a.pdf
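One way to illustrate the identifiable-parameters case (a choice of mine, not something the answer prescribes) is to fit a diagonal Gaussian to each document's bag of vectors and use its mean and per-dimension variance as a fixed-length feature vector:

```python
import numpy as np

# Placeholder vectors; substitute real word2vec output.
rng = np.random.default_rng(0)
word_vecs = {w: rng.normal(size=8) for w in ["the", "cat", "dog", "sat", "ran"]}

def gaussian_bag_features(doc, word_vecs):
    """Fit a diagonal Gaussian to a document's bag of word vectors and
    return its parameters (mean and per-dimension variance) as features."""
    V = np.stack([word_vecs[w] for w in doc if w in word_vecs])
    return np.concatenate([V.mean(axis=0), V.var(axis=0)])

print(gaussian_bag_features(["the", "cat", "sat"], word_vecs))
```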

answered Sep 25 '13 at 15:20

gdahl ♦

edited Sep 25 '13 at 15:22

Has anybody tried this suggestion of using word vectors & k-means centroid clustering?

(Oct 25 '14 at 04:38) Aly

Take a look at the Socher et al. paper for EMNLP next month: http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf

They use a recursive compositional model that at least scales to phrases (e.g. movie reviews), but it may not work well for longer documents (e.g. news articles).

answered Sep 26 '13 at 14:04

Ben Gimpert

edited Sep 26 '13 at 14:08

Interesting. But a) this is about sentiment analysis, which has different characteristics than traditional document classification; b) this would be much more difficult to implement, compared to, say, just clustering the word vectors that come out of word2vec. I also doubt that this model scales beyond sentences.

(Sep 27 '13 at 05:59) Maarten

That said, I believe those guys are really onto something. I think there will be a real breakthrough in NLP in the next 10 years. It might be worth studying this paper.

(Sep 27 '13 at 06:03) Maarten