I currently have a dataset: class 1 with about 8000 short text files and class 2 with about 3000 short text files. I applied LibSVM and tried a couple of parameter combinations in cross-validation experiments. Generally, class 1 precision falls in the range (85%, 90%), class 2 precision falls in (70%, 75%), and recall for both classes falls in (80%, 85%).

For text classification I built the feature space following the common approaches: tokenizing the documents, filtering stopwords, and building word vectors with tf-idf or binary frequency. I also tried an n-gram model to build the feature space, but these approaches did not improve the performance much.

Are there any other ways to tune LibSVM to improve performance? LibSVM provides a grid search for parameter selection, but it runs pretty slowly.

You should figure out what's going on first.

1) Look at the precision and recall on the training data. Do they get near 100%?

2) Look at the validation examples that the SVM is most wrong about. Literally sit down and read them. Is there a pattern? Is there a particular type of error it makes? This is very important to do.

3) Look at the precision and recall curves on the validation data as you vary the number of training instances from 10 to 11000 (a sketch of this is below). Do the curves level off? If so, you need better features. If not, you could try unsupervised learning on a larger corpus before the SVM step.

After this you will have a better sense of what's going on, and we can make more recommendations.

SVMs are usually somewhat sensitive to the regularization hyperparameter (C in LibSVM), so finding the best value with a grid search is worthwhile. I would also think hard about whether there are additional features that would be useful for your classification task; there almost certainly are. One way to identify them is to closely examine the classification errors your algorithm makes, and then ask what information the algorithm is missing or misunderstanding that causes those mistakes.

For example, if you're doing positive/negative sentiment classification, you will find that negating words can be very important. You might think words like "good" or "awesome" indicate a very positive sentiment, but that's not the case if the phrase they came from is "there is nothing good about this movie." One way to address this is to look for negating words in a sentence and modify the other words in that sentence to "not_good", etc., before feeding them to your algorithm, along the lines of the sketch below.