0
1

Several very useful word representation datasets are available at: http://metaoptimize.com/projects/wordreprs/ however these seem to be most applicable to sequence labeling tasks. Are there similar pre-compiled datasets more applicable to document-level classification tasks?

Looking for off-the shelf dataset existed which would be helpful for doc classification, such as one created with LSA or LDA. The Brown Clusters didn't help, because I don't believe they tend to encode much topical information. I believe this would be transductive learning, if I used this data for features in a supervised setting.

asked Sep 28 '11 at 18:43

John%20Lehmann's gravatar image

John Lehmann
1225

edited Sep 29 '11 at 16:25

If you are looking for POS tagged datasets, there is this question on StackOverflow: http://stackoverflow.com/questions/1053961/looking-for-any-free-tagged-english-corpuses

(Sep 29 '11 at 00:06) Robert Layton

Can you explain a bit more about what you are looking for? Are you looking for word representations or algorithms for creating document representations?

(Sep 29 '11 at 00:16) gdahl ♦

Added clarifying comments above.

(Sep 29 '11 at 16:26) John Lehmann

2 Answers:

First of all, have you actually tried using this data in document-level tasks? My intuition is that it will probably make things better, even if it was generated with more non-local properties in mind.

The key question then is how to aggregate individual word representations into a document representation you can use in your feature vector. If this is your main concern I suggest you use a probabilistic topic model (such as LDA) to generate topic features. Dan Ramage suggests that making the dot product between two documents being 0.8*(tf-idf dot product) + 0.2*(dot product of document/topic allocations, normalized) is a sweet spot that's usually hard to beat.

answered Sep 29 '11 at 08:25

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
2554154278421

edited Sep 29 '11 at 08:25

This paper uses distributed representations in document classification task. Although the method they used to induce distributed representations is not specifically intended for document classification, but these kinds of model can be modified to induced distributed representations for document classification.

In fact word representations are being used in supervised learning algorithms to generalize the parameters of unseen related words while testing in test set, these kinds of representations can improve the performance if they are appended as extra features in already existing supervised learning setup.

answered Feb 18 '13 at 18:29

Kuri_kuri's gravatar image

Kuri_kuri
293273040

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.