
Several very useful word representation datasets are available at http://metaoptimize.com/projects/wordreprs/; however, these seem most applicable to sequence labeling tasks. Are there similar precompiled datasets better suited to document-level classification tasks?

I am looking for an existing off-the-shelf dataset that would help with document classification, such as one created with LSA or LDA. The Brown clusters didn't help, I believe because they don't tend to encode much topical information. I believe this would be transductive learning if I used this data for features in a supervised setting.

asked Sep 28 '11 at 18:43


John Lehmann

edited Sep 29 '11 at 16:25

If you are looking for POS tagged datasets, there is this question on StackOverflow: http://stackoverflow.com/questions/1053961/looking-for-any-free-tagged-english-corpuses

(Sep 29 '11 at 00:06) Robert Layton

Can you explain a bit more about what you are looking for? Are you looking for word representations or algorithms for creating document representations?

(Sep 29 '11 at 00:16) gdahl ♦

Added clarifying comments above.

(Sep 29 '11 at 16:26) John Lehmann

One Answer:

First of all, have you actually tried using this data in document-level tasks? My intuition is that it will probably make things better, even if it was generated with more non-local properties in mind.

The key question then is how to aggregate individual word representations into a document representation you can use in your feature vector. If this is your main concern, I suggest you use a probabilistic topic model (such as LDA) to generate topic features. Dan Ramage suggests that defining the dot product between two documents as 0.8*(tf-idf dot product) + 0.2*(dot product of document/topic allocations, normalized) is a sweet spot that's usually hard to beat.
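To make the weighted combination concrete, here is a rough sketch in plain Python. It rests on assumptions of mine, not details from Dan Ramage: I read "normalized" as L2-normalizing each feature block before taking the dot product, and the function names (`l2_normalize`, `mean_word_vector`, `combined_similarity`) are made up for illustration. The mean-pooling helper is only one simple way to turn word representations into a document vector; the tf-idf and topic vectors are assumed to come from whatever tools you already use.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def mean_word_vector(tokens, word_vectors, dim):
    """Average the vectors of the tokens found in word_vectors.

    Assumption: mean pooling is just one possible aggregation scheme.
    Tokens without a vector are skipped.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def combined_similarity(tfidf_a, tfidf_b, topics_a, topics_b, tfidf_weight=0.8):
    """Weighted sum of a tf-idf dot product and a topic-allocation dot product.

    Assumption: both blocks are L2-normalized first, so each term is a cosine
    similarity. The 0.8 / 0.2 split is the weighting quoted in the answer.
    """
    tfidf_sim = dot(l2_normalize(tfidf_a), l2_normalize(tfidf_b))
    topic_sim = dot(l2_normalize(topics_a), l2_normalize(topics_b))
    return tfidf_weight * tfidf_sim + (1.0 - tfidf_weight) * topic_sim
```

If your classifier takes a precomputed kernel, you could fill the kernel matrix with `combined_similarity` for each document pair. An equivalent option is to concatenate the two normalized blocks scaled by sqrt(0.8) and sqrt(0.2), so that an ordinary linear model sees the same dot product.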

answered Sep 29 '11 at 08:25


Alexandre Passos ♦

edited Sep 29 '11 at 08:25


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.