I have been working on text classification for a year, and the technique that works best for me is a linear classifier: usually an SVM with a linear kernel, or SGD, with simple handcrafted features such as unigram and bigram combinations. The classifier operates on very sparse vectors in a high-dimensional space, which lets it capture even some fairly complicated syntactic patterns. But sometimes this approach fails to capture 'deeper' patterns. E.g. suppose you need to detect text messages from people who need medical treatment but do not say so directly. For that kind of problem my classifier gives very low scores. So I believe there should be some clever way to produce the 'right' features. I have tried PCA, latent semantic indexing, clustering and some other transformations, but nothing I have tried so far can beat the simple linear SVM + uni/bigram baseline. I feel that I am missing something. Maybe I need large scale.. Maybe I need some deep learning tricks for feature learning.. I got stuck and need some help :). So the questions are:
Any links to relevant publications are very welcome. Thanks! UPDATE I feel that I need to explain what I'm trying to do. I'm working on classification of short messages, e.g. Facebook comments or tweets, to capture people's sentiments or intentions. For example, when people ask for something on a specific topic - "what is the best smartphone to upgrade to?" - or express dislikes - "I hate this new iTunes!". Informally, I can distinguish three categories I usually work with:
Thanks.
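For reference, a minimal sketch of the baseline described in the question (a linear SVM over unigram and bigram features), assuming scikit-learn; the example texts and labels below are placeholders:

    # linear SVM + uni/bigram baseline (sketch, assuming scikit-learn)
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    texts = ["what is the best smartphone to upgrade to?", "I hate this new iTunes!"]
    labels = ["ask", "dislike"]

    baseline = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),  # unigrams + bigrams, sparse high-dim vectors
        LinearSVC(C=1.0),
    )
    baseline.fit(texts, labels)
    print(baseline.predict(["which laptop should I buy next?"]))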
|
I've had very good experiences with creating features from Latent Dirichlet Allocation topics. I used the LDA topics-as-features together with unigram word features, and this usually gives better classification than either the LDA or the unigram features alone. To get good-quality topics out of LDA you'll need a sizable corpus, though. Alternatively, you could find a large corpus that's relevant to your text classification task and train the LDA on that instead. Another thing I've found is that optimizing the LDA hyperparameters makes a big difference for the quality of the topics.
How many topics do you typically extract for building additional text classification features? 100, 1000, more?
(Jan 05 '13 at 08:53)
ogrisel
For my tasks I've extracted 100-150 topics, but the number of topics is a model parameter, so it really depends on how many topics you believe the text you want to classify actually contains. My approach has been to just guess a number of topics, look at the words in the resulting topics after training the LDA, and then increase the number of topics and retrain if I see a lot of topics that are obviously mixtures of several different topics. According to this paper, you can set the number of topics as high as you want without reducing topic quality if you use asymmetric priors and hyperparameter optimization, so you could theoretically set the number of topics very high and let the optimization sort it out, but that means you'll probably have to do some feature selection down the road for text classification tasks.
(Jan 05 '13 at 09:38)
Audun Mathias Øygard
Thanks. I have not tried that approach. I wonder if it differs significantly from using PCA- or k-means-transformed vectors as features.. Trying to get through the paper you pointed to.
(Jan 06 '13 at 07:37)
Konstantin
It's not too dissimilar to PCA and LSI, so you might try doing the same thing (topics + unigrams as features) with LSI first. I tried LSI as well, but got better results from LDA. The software I used was MALLET, by the way.
(Jan 06 '13 at 19:24)
Audun Mathias Øygard
Could you tell me more about how to represent topic features? Usually, with BOW features, a document is represented like "+1 0:12 1:2 2:10 3:12 ...", where the first value is the class label, followed by each word index and its count (just to make things clear, I'm sure you know this). And how exactly should I use topics and unigrams as features? Let's say I get a topic like this: 'word' * 0.012 + 'number' * 0.011 + 'computer' * 0.009 + ... How do I combine the information from K topics with the unigram features? Or should I just use the topic distribution as features and simply throw it into the classifier? Thank you very much!
(Mar 27 '13 at 14:58)
Zhibo Xiao
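For what it's worth, a minimal sketch of the second option raised in the comment above: append the per-document topic distribution to the unigram counts. This assumes scikit-learn (rather than MALLET, which the answer actually used); the texts and labels are placeholders:

    # unigram counts + K topic proportions as one feature matrix (sketch)
    from scipy.sparse import hstack
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    texts = ["the new phone has a great camera", "my laptop battery died again"]
    labels = [0, 1]

    vec = CountVectorizer()                       # unigram counts
    X_bow = vec.fit_transform(texts)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    X_topics = lda.fit_transform(X_bow)           # each row is a topic distribution

    X = hstack([X_bow, X_topics])                 # unigrams + topic proportions side by side
    clf = LinearSVC().fit(X, labels)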
|
I'd like to add an extra note. I implemented the PCA transformation and used the components as features, as Audun Mathias Øygard suggested, with an SVM classifier with a non-linear RBF kernel on top of those features. It gave me a 5% increase (from 70% to 75%) compared to the linear classifier in the higher-dimensional space. Quite interesting. I know that it's pointless to discuss an unknown dataset, but anyway, for whom it may be interesting:
My interpretation here is that the dataset is noisy (in terms of finding a decision boundary) and PCA reduces that noise. PS Audun, thanks.
Very interesting results, thanks for sharing!
(Jan 16 '13 at 11:36)
Alex Measure
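A minimal sketch of the pipeline described above, assuming scikit-learn. TruncatedSVD plays the role of the PCA step here because the bag-of-words matrix is sparse; the number of components and the SVM parameters are placeholder choices:

    # bag-of-words -> low-dimensional projection -> RBF SVM (sketch)
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    pipeline = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        TruncatedSVD(n_components=100),    # PCA-like reduction of the sparse BOW space
        SVC(kernel="rbf", C=1.0, gamma="scale"),
    )
    # usage: pipeline.fit(train_texts, train_labels); pipeline.predict(new_texts)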
|
Linear classifiers are amazing when it comes to text classification. Due to the high dimensionality of the data it is easy to find a separating hyperplane. I have been faced with the same problem in almost the same setting (sentiment analysis). As you said, I tried many things and nothing worked significantly better. As the text is informal and there are many slang expressions, I even tried using things like Metaphone to be able to capture expressions which sound similar. It didn't help. I also tried LSI and LDA - they didn't bring any improvement either (I didn't try using them in conjunction with the BOW, though). The only thing that brought somewhat better results was the trick Jacob mentioned: appending the part-of-speech tags to the words. I personally really like this idea. But don't use the classical POS taggers; I recommend you try something like this. POS tagging can be seen as a very low-level disambiguation of word senses, so my hypothesis is that this is the reason why it brings some improvement. Anyway, if you are aiming for a deeper understanding of the text, as you mentioned in the update, I think this kind of shallow feature will not help you much. Parsers and similar tools are one way to do this, but at least I don't know of any nice way to do machine learning for sentiment analysis on top of them (I don't like the idea of making hand-crafted rules). I would recommend you try using WordNet, but as you are dealing with informal text, this is almost impossible. In fact, the main problem you are facing is that deeper analysis requires tools which are trained on and work with formal text, and which are still not adapted to the kind of text found in social media. The POS tagger I cited above is one attempt to bridge this gap. BTW, if you haven't seen this tutorial, take a look, it may give you some ideas of what you might try next.
Martin, thanks for the comprehensive answer. Actually, I have tried different POS taggers and even written my own (SVM-based), though it didn't give me an improvement in quality. WordNet could help, but as you mentioned it doesn't work well for informal speech. And I don't really like the linguistic approach; it is too hard and very fragile, and from my experience it doesn't give an improvement over 'pure' statistical methods. Thanks for the tutorial.
(Jan 06 '13 at 07:29)
Konstantin
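A minimal sketch of the word+POS trick described in the answer above, assuming NLTK (and assuming the punkt and averaged_perceptron_tagger resources have already been downloaded); the classical NLTK tagger is used here only for illustration, whereas the answer recommends a tagger adapted to social-media text:

    # append the POS tag to each word token (sketch, assuming NLTK)
    import nltk

    def word_pos_tokens(text):
        """Turn 'I hate this' into tokens like 'i_PRP', 'hate_VBP', 'this_DT'."""
        tokens = nltk.word_tokenize(text)
        return ["%s_%s" % (w.lower(), tag) for w, tag in nltk.pos_tag(tokens)]

    print(word_pos_tokens("I hate this new iTunes!"))
    # These tokens can be fed to any bag-of-words vectorizer, e.g.
    # TfidfVectorizer(analyzer=word_pos_tokens) in scikit-learn.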
|
Can you say more about what your data look like? I've been working on text classification and similarly have the best results with a linear classifier and simple features (typically SGD-LR and some combination of word and character n-grams). One of the things I do is speech act classification, e.g. identifying whether a string is asking a question. People can do this in very indirect ways (not least by omitting obvious orthographic cues), but I still get quite acceptable results, even with a comparatively small training set. I've got a good amount of unlabeled data and will be investigating semi-supervised approaches in the near future. If anything good comes of it, I'll come back and update this answer.
fmailhot, take a look at the update. I have the same experience - it is possible to capture even indirect questions. But there are some categories for which this approach simply fails (for me). It would be great if you find something. Thanks.
(Jan 05 '13 at 04:56)
Konstantin
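A minimal sketch of the setup described in this answer (SGD logistic regression on a union of word and character n-grams), assuming scikit-learn; the exact n-gram ranges are placeholder choices:

    # word n-grams + character n-grams feeding an SGD logistic regression (sketch)
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import FeatureUnion, make_pipeline

    features = FeatureUnion([
        ("word", TfidfVectorizer(ngram_range=(1, 2))),
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
    ])
    clf = make_pipeline(features, SGDClassifier(loss="log_loss"))  # use loss="log" on older scikit-learn
    # usage: clf.fit(train_texts, train_labels)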
|
Not sure if this will help, but here are some other ideas for feature extraction:
From what you said, I'd guess that WordNet features and word n-grams are the most likely to be useful. But my experience is the same as yours - it's very difficult to beat basic unigrams + bigrams.
In my experience, character-level n-grams can actually degrade performance. (I tried them to compensate for noise due to sloppy typing and spelling variations in product names in short texts.)
(Jan 04 '13 at 18:44)
larsmans
Jacob, it is quite close to what I do when playing around with features. As you pointed out, sometimes prefixes and suffixes can help (but not dramatically, in my experience).
(Jan 05 '13 at 04:50)
Konstantin
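A minimal sketch of the prefix/suffix idea from the comment above: add the first and last few characters of each word as extra bag-of-words tokens. Purely illustrative; the affix length of 3 and the PRE_/SUF_ markers are arbitrary choices:

    # augment word tokens with prefix/suffix tokens (sketch)
    def with_affix_tokens(text, n=3):
        tokens = text.lower().split()
        extra = []
        for w in tokens:
            if len(w) > n:
                extra.append("PRE_" + w[:n])   # prefix feature
                extra.append("SUF_" + w[-n:])  # suffix feature
        return tokens + extra

    print(with_affix_tokens("unbelievably disappointing update"))
    # e.g. ['unbelievably', 'disappointing', 'update', 'PRE_unb', 'SUF_bly', ...]
    # This can be plugged into TfidfVectorizer(analyzer=with_affix_tokens).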
|
The first relevant paper that comes to mind here is Collobert et al.'s Natural Language Processing (Almost) from Scratch. They train a neural network architecture with (almost) no human-engineered features and (almost) no labels, and obtain very good results. It's also a relatively simple approach, at least as far as the math goes.
Those are not text classification tasks but higher-level, sequence-tagging-oriented tasks such as POS tagging, named entity detection, semantic role labeling and dependency parsing. Those tasks are more involved than text classification and indeed require more expressive power than a simple linear classification model. For text classification, though, Collobert et al. give no evidence that neural network embeddings can beat a pure linear model trained on simple word or bigram features.
(Jan 04 '13 at 15:17)
ogrisel
Thanks, yeastwars. They apply feature learning to POS/NER/chunking etc. tasks, as ogrisel said. It's a slightly different task, though the approach they use is the kind of thing I'm looking for. The paper proposes a way to capture deep relations using unlabeled texts. I'm not sure if it could be applied to text classification directly, but it is interesting anyway. Thanks.
(Jan 04 '13 at 15:38)
Konstantin
|
I've heard (informally) that features induced by topic models/dimensionality reduction in conjunction with ordinary (uni- or bigram) BOW features give better results than the BOW features by themselves. Did you try that? Also, document metadata (title/author fields etc.) might work, if it's not too noisy.
I'm curious about what you are trying to do... could you point me at some example paper?
SeanV, take a look at the update. Unfortunately, I'm not from the academic world..
larsmans, it is interesting. I haven't tried them in conjunction. I will. Thanks.
Maybe I'm old-fashioned, but I really think that if you want to handle those cases you need to use a true language model rather than memorising word associations.
Do you mean classical morpho-syntactic-semantic analysis? Linguistics is really hard and fragile; statistical methods are much more robust, imo.
They are not mutually exclusive; it's just the level of representation you do statistics on, just as you wouldn't do text classification at the pixel level. I am sure there is a lot of work combining traditional linguistics with ML, just as in computer vision. It might be worth finding out about the Watson system used to win Jeopardy - which is obviously overkill, but at least it is a proper functioning system using semantic/syntactic analysis to solve a real-world problem.
The fact that the simplest possible model (linear SVM with uni/bigram features) works best highlights that "statistics" alone is not the answer - you have to provide a deeper level of analysis. Ask yourself whether you would be able to extract the meaning of your "hard" sentences from the BOW representation.