|
I'm working on an unsupervised method for incrementally converting variable-length sequences, or portions of sequences, into a fixed-size vector representation. The idea is to use the method in combination with any standard supervised classifier (SVM, logistic regression, random forest, etc.) to perform problem-specific classification. I'm hoping to get some recommendations for datasets which might be good to benchmark against. Initially things like POS tagging and NER come to mind, but I would also like to expand past NLP problems. Suggestions for available datasets are appreciated. Even better would be papers which provide recent results for different methods (HMMs, CRFs, etc.) applied to these types of problems. Thanks. |
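To make the setup concrete, here is a rough sketch of the kind of pipeline described above. It is not the actual method, which the question does not specify. The encoder is a stand-in (hashed n-gram counts), and the classifier is a tiny nearest-centroid rule chosen only to keep the example dependency-free. The names `encode`, `fit_centroids`, `predict`, and the parameters `dim` and `n` are all hypothetical.

```python
import zlib
from collections import defaultdict


def encode(seq, dim=64, n=2):
    """Stand-in encoder (an assumption, not the method from the question):
    hash the n-grams of a variable-length sequence into a fixed-size,
    L1-normalised count vector of length `dim`."""
    vec = [0.0] * dim
    items = list(seq)
    for i in range(max(len(items) - n + 1, 0)):
        gram = "\x1f".join(str(x) for x in items[i:i + n])
        # crc32 is deterministic across runs, unlike the built-in hash() on str.
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    total = sum(vec)
    return [v / total for v in vec] if total else vec


def fit_centroids(vectors, labels):
    """Placeholder classifier: store the mean vector of each class."""
    sums, counts = {}, defaultdict(int)
    for v, y in zip(vectors, labels):
        if y not in sums:
            sums[y] = [0.0] * len(v)
        sums[y] = [a + b for a, b in zip(sums[y], v)]
        counts[y] += 1
    return {y: [a / counts[y] for a in s] for y, s in sums.items()}


def predict(centroids, v):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, v))
    return min(centroids, key=lambda y: dist(centroids[y]))
```

Every sequence, whatever its length, maps to a vector of length `dim`, so the output of `encode` could equally be fed to an off-the-shelf classifier such as an SVM or logistic regression in place of the nearest-centroid rule used here.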
|
Images are sequences, for one. Also, maybe tasks like sentiment analysis or document classification would work better in this formalism, as the output is one variable and not a whole sequence. There is also interest in classifiers that work on things like DNA sequences, EEG data, ECG data, etc. Can you point towards any specific datasets? And ideally, if possible, papers that use them?
(Oct 18 '11 at 14:47)
gdahl ♦
I don't think it is very useful to think of images as sequences. I think of images as being at least 2D, and "sequence" to me means 1D. Both are examples of variable-dimensionality data, and both can exhibit "dimension-hopping": if each pixel or position is an input to a classifier, then changes to the input that shouldn't change the class can change which input a given piece of information arrives on. Other than that, image data and sequence data are quite different.
(Oct 18 '11 at 14:51)
gdahl ♦
@gdahl I'm not sure I would agree with you. Images can definitely be viewed as sequences. Cursive handwriting recognition is a good example. Less obvious would be something like nips_eyebm.pdf, where they perform image classification using a sequence of fixations. When I said sequence, all I meant was structured in time, e.g. video, natural language, etc. @Alexandre, thanks for the suggestions. Sentiment analysis is something I've been thinking about; short text such as the Twitter data available here might be a good place to start. Looking around here I found the bouncing balls dataset mentioned in this thread. I trained a model and the samples don't look awful. edit: accidentally deleted comment
(Oct 19 '11 at 00:01)
alto
|
I have been working on something similar off and on for a while, and I have also been unable to find really compelling benchmark problems. Although I don't know much about the area, there has to be some biological sequence data.