Could anyone point me to, or list, the general classes of feature representations for text? I'm having a hard time finding a good listing anywhere (and almost all tutorials stick to bag-of-words).

asked Dec 12 '11 at 17:35

The wordreps on the metaoptimize page (link) might be a place to start. There are also links to papers providing details.

answered Dec 12 '11 at 19:25

It is important to recognise why you are working on text. Is it for topic clustering, sentiment analysis, authorship attribution, language analysis, or some other task?

Once you have worked that out, google for those phrases rather than "feature representations".

To give a specific answer, feature types generally fall into four categories: syntactic features (including POS tags), structural features (e.g. sentence length, number of paragraphs), lexical features (including character n-grams), and content-specific features (e.g. email character encoding, which doesn't make sense in most other contexts). Which ones you use, and how you use them, depends on your application.
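
As a rough illustration of two of these categories, here is a minimal Python sketch using only the standard library. The function names (char_ngrams, bag_of_words, structural_features) are made up for this example, and the sentence splitter is a naive assumption (split after ., ! or ?). Syntactic features such as POS tags need a tagger from an NLP library, so they are left out here.

```python
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Lexical feature: counts of overlapping character n-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def bag_of_words(text):
    """Lexical feature: counts of lowercased word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def structural_features(text):
    """Structural features: paragraph count and mean sentence length."""
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words_per_sentence = [len(s.split()) for s in sentences]
    return {
        "num_paragraphs": len(paragraphs),
        "num_sentences": len(sentences),
        "mean_sentence_length": sum(words_per_sentence) / max(len(sentences), 1),
    }


sample = "The cat sat. The cat ran!\n\nDogs bark."
features = {
    "structural": structural_features(sample),
    "words": bag_of_words(sample),
    "char_trigrams": char_ngrams(sample.lower(), n=3),
}
print(features["structural"])
print(features["words"].most_common(2))
```

Each function returns counts or numbers that can be turned into a feature vector. Which of them is worth keeping depends on the task: character n-grams are common in authorship attribution, for example, while word counts are the usual starting point for topic clustering.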

answered Dec 19 '11 at 19:52

