|
Several very useful word representation datasets are available at http://metaoptimize.com/projects/wordreprs/ but these seem most applicable to sequence labeling tasks. Are there similar pre-compiled datasets better suited to document-level classification tasks? I'm looking for an existing off-the-shelf dataset that would help with document classification, such as one created with LSA or LDA. The Brown clusters didn't help, because I don't believe they encode much topical information. I believe this would count as transductive learning if I used this data for features in a supervised setting. |
|
First of all, have you actually tried using this data in document-level tasks? My intuition is that it will probably make things better, even if it was generated with more non-local properties in mind. The key question then is how to aggregate individual word representations into a document representation you can use in your feature vector. If this is your main concern, I suggest you use a probabilistic topic model (such as LDA) to generate topic features. Dan Ramage suggests that defining the dot product between two documents as 0.8*(tf-idf dot product) + 0.2*(dot product of the normalized document/topic allocations) is a sweet spot that's usually hard to beat. |
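To make the weighted combination above concrete, here is a minimal sketch of how it might be computed. It is not Dan Ramage's code: the function names, the sparse-dict representation, and the choice of L2 normalization for the topic allocations are all my assumptions (the suggestion only says "normalized"). The tf-idf vectors are used as given, and the topic allocations would come from whatever topic model you train, such as LDA.

```python
import math


def dot(u, v):
    """Dot product of two sparse vectors stored as {key: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(k, 0.0) for k, w in u.items())


def l2_normalize(vec):
    """Scale a sparse vector to unit length (assumed normalization)."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        return dict(vec)
    return {k: w / norm for k, w in vec.items()}


def combined_similarity(tfidf_a, tfidf_b, topics_a, topics_b, alpha=0.8):
    """alpha * (tf-idf dot product) + (1 - alpha) * (topic dot product).

    tfidf_*  : {term: tf-idf weight}, used without further scaling
    topics_* : {topic id: proportion}, L2-normalized here (an assumption)
    alpha    : weight on the tf-idf part; 0.8 is the value quoted above
    """
    lexical = dot(tfidf_a, tfidf_b)
    topical = dot(l2_normalize(topics_a), l2_normalize(topics_b))
    return alpha * lexical + (1.0 - alpha) * topical


# Hypothetical toy documents, just to show the call shape.
doc_a = {"lda": 0.6, "topic": 0.8}
doc_b = {"topic": 1.0}
topics_a = {0: 0.9, 1: 0.1}
topics_b = {0: 0.9, 1: 0.1}
score = combined_similarity(doc_a, doc_b, topics_a, topics_b)
```

If you want to use this inside a kernel method such as an SVM, you would evaluate this function for every pair of documents to build the kernel matrix. Since it is a non-negative weighted sum of two dot products, it should still be a valid kernel.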
If you are looking for POS tagged datasets, there is this question on StackOverflow: http://stackoverflow.com/questions/1053961/looking-for-any-free-tagged-english-corpuses
Can you explain a bit more about what you are looking for? Are you looking for word representations or algorithms for creating document representations?
Added clarifying comments above.