|
I am working on a document classification task. It seems tempting to try to integrate the recently released word2vec tool, either by training it on the documents themselves or on a larger background collection (e.g., Wikipedia). One approach might be to compute a tf-idf weighted average word vector for each document; another might be to cluster the words in each document using k-means (à la bag-of-visual-words in computer vision). However, I don't know of any successful attempts to use word vector representations for document classification. Are there any? Is there a good reason it should not work?
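To make the first idea concrete, here is a minimal sketch of the tf-idf weighted average, assuming gensim's word2vec implementation and scikit-learn for the idf weights (both library choices are my assumption, and the toy corpus is just for illustration):

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "dogs chase cats"]      # toy corpus
tokenized = [d.split() for d in docs]

# In practice one would train on the documents themselves or on Wikipedia.
w2v = Word2Vec(tokenized, vector_size=100, min_count=1)

tfidf = TfidfVectorizer()
tfidf.fit(docs)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def doc_vector(tokens):
    """idf-weighted average of a document's word vectors (repeats supply the tf part)."""
    vecs, weights = [], []
    for w in tokens:
        if w in w2v.wv and w in idf:
            vecs.append(w2v.wv[w])
            weights.append(idf[w])
    if not vecs:
        return np.zeros(w2v.vector_size)
    return np.average(vecs, axis=0, weights=weights)

X = np.vstack([doc_vector(t) for t in tokenized])  # feed X to any classifier
```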
|
It isn't nearly as easy to do this as one might imagine. Averaging a bunch of dense, low-dimensional word vectors performs poorly, and if you don't binarize the word counts first, the average will be dominated by frequent words. Adding vectors makes more sense when they are high-dimensional and sparse.

You can run k-means on the word vectors from word2vec and then encode each word as the id of the k-means centroid it is closest to, then build a histogram over those ids. This would be like the word-count histogram, except using a k-means codebook built from the dense word vectors.

You could also fit a statistical model to each bag of vectors and characterize the bag by the parameters of the model, if they are identifiable, or otherwise by the score the model assigns to each training document. This paper has more information about related approaches in the context of kernel methods: http://www.ai.mit.edu/projects/jmlr/papers/volume5/jebara04a/source/jebara04a.pdf

Has anybody tried this suggestion of using word vectors and k-means centroid clustering?
(Oct 25 '14 at 04:38)
Aly
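A minimal sketch of the k-means codebook suggestion from the answer above, reusing the hypothetical `w2v` model and `tokenized` corpus from the snippet in the question, and assuming scikit-learn's KMeans (the library choices are mine, not the answerer's):

```python
import numpy as np
from sklearn.cluster import KMeans

vocab = list(w2v.wv.index_to_key)
K = min(500, len(vocab))                  # codebook size is a free parameter
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(w2v.wv[vocab])
word_to_code = dict(zip(vocab, kmeans.labels_))    # word -> nearest centroid id

def codeword_histogram(tokens):
    """Histogram over k-means codewords, analogous to a word-count histogram."""
    hist = np.zeros(K)
    for w in tokens:
        if w in word_to_code:
            hist[word_to_code[w]] += 1
    return hist / max(hist.sum(), 1.0)    # normalize away document length

X = np.vstack([codeword_histogram(t) for t in tokenized])
```

And a sketch of the "fit a statistical model to each bag of vectors" idea, here using a diagonal Gaussian per document with its fitted parameters as features (the Gaussian is an illustrative choice; the answer does not specify a model):

```python
def gaussian_features(tokens):
    """Fit a diagonal Gaussian to a document's word vectors; use (mean, var) as features."""
    vecs = np.array([w2v.wv[w] for w in tokens if w in w2v.wv])
    if len(vecs) == 0:
        return np.zeros(2 * w2v.vector_size)
    return np.concatenate([vecs.mean(axis=0), vecs.var(axis=0)])
```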
|
|
Take a look at the Socher et al. paper for EMNLP next month: http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf They use a recursive compositional model that at least scales to phrases (e.g., movie reviews), but it may not work well for longer documents (e.g., news articles).

Interesting. But a) this is about sentiment analysis, which has different characteristics than traditional document classification; b) this would be quite different to implement compared to, say, just clustering the word vectors that come out of word2vec. I also doubt that this model scales beyond sentences.
(Sep 27 '13 at 05:59)
Maarten
That said, I believe that those guys are really onto something. I think in the next 10 years there will be a real breakthrough in NLP. Might be worth studying this paper.
(Sep 27 '13 at 06:03)
Maarten
|
I found this paper: http://ai.stanford.edu/~ang/papers/nipsdlufl10-ProbabilisticModelSemanticWordVectors.pdf
They use the mean word vector to do sentiment analysis. However, they adjusted their word vector model to work at the document level (similar to LDA).
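For concreteness, that mean word vector is just the unweighted version of the tf-idf average sketched in the question; a minimal version, again reusing the hypothetical `w2v` model (an assumption, not from the paper):

```python
import numpy as np

def mean_vector(tokens):
    """Unweighted mean of a document's word vectors."""
    vecs = [w2v.wv[w] for w in tokens if w in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)
```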