I am writing my own feature selection algorithm. The task is basically to eliminate some of the features I have from this data. The brute-force approach would be to use every single word in the document as a feature to classify it as either positive or negative. Now I want to write a feature selection step that would make the classifier do a better job. What do you guys think?
An alternative to feature engineering is to use a latent SVM (LSVM). Yessenalina et al. describe an application of the LSVM to sentiment analysis that essentially learns which sentences in a text are most relevant to the opinion (i.e., the class label). This isn't feature selection per se, but it could make your feature values less noisy. A link to the paper and their implementation can be found here: http://projects.yisongyue.com/svmsle/ This might be a more complex solution than what you had in mind, but I think it's worth pointing out.
Before you spend too much time on developing feature selection heuristics, I would suggest you train an L2-regularized linear SVM with binary features (word w_j occurs or does not occur in document d_i). Be careful to tune the regularization parameter of the SVM using cross validation. Once you have this baseline, you can consider different feature representations, feature selection, etc., but in my experience it is pretty hard to beat that baseline. You must also be really careful not to do too many experiments on the same data set. Even with cross-validation or bootstrapping, soon enough you will have fine-tuned everything to that particular data set.
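To make "binary features" concrete, here is a minimal sketch in plain Python. The tokenizer and the names `tokenize`, `build_vocabulary` and `binary_features` are made up for illustration and are not from any library; a real pipeline would likely use a better tokenizer. It only builds the sparse presence/absence representation, it does not train the SVM.

```python
import re

def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(documents):
    """Map each distinct word to a column index, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for word in tokenize(doc):
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def binary_features(text, vocab):
    """Return the sorted indices of vocabulary words present in the text.

    Only presence is recorded, so a word occurring five times yields the same
    feature as a word occurring once. Words outside the vocabulary are dropped.
    """
    return sorted({vocab[w] for w in tokenize(text) if w in vocab})

# Toy documents, invented for this example.
docs = ["A great, great movie", "A dull movie"]
vocab = build_vocabulary(docs)
rows = [binary_features(d, vocab) for d in docs]
```

Each row stores only the indices of the words that occur, which is the sparse form that linear SVM implementations can consume directly.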
I'd like to do this, but if you open the dataset at the link above, the number of binary features would be very large. Weka takes 2 hours to run the classification, which is why I would like to jump in and write the feature selection to avoid that 2-hour wait.
In my experience, good feature selection takes much longer than the actual training of the classifier. Shouldn't each instance be sparse? In that case, feature selection won't buy you that much. I would suggest that you use a fast linear SVM / logistic regression implementation such as liblinear (http://www.csie.ntu.edu.tw/~cjlin/liblinear/), or Vowpal Wabbit (http://hunch.net/~vw/). If the size of the parameter vectors is an issue, consider using feature hashing to save memory. This is already implemented in VW.
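As a rough illustration of feature hashing, the sketch below maps tokens straight to bucket indices without keeping a vocabulary. It is an assumption-laden toy: `hashed_features` and `num_bits` are hypothetical names, and `zlib.crc32` is used only because it is deterministic and in the standard library. It is not the hash function VW uses.

```python
import zlib

def hashed_features(tokens, num_bits=18):
    """Map tokens to sorted bucket indices in a space of 2**num_bits slots.

    No vocabulary is stored, so memory for the weight vector is fixed at
    2**num_bits entries no matter how many distinct words appear. Distinct
    tokens may collide in the same bucket; that is the price of the trick.
    """
    num_buckets = 1 << num_bits
    indices = set()
    for token in tokens:
        # crc32 is a stand-in hash chosen for illustration only.
        indices.add(zlib.crc32(token.encode("utf-8")) % num_buckets)
    return sorted(indices)

example = hashed_features(["a", "great", "great", "movie"])
```

With binary features, repeated tokens collapse into one index, so `example` has at most three entries here.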
Yes, the instances are very sparse; I have the data in the sparse ARFF format. You're saying that eliminating several unimportant features won't help? Why is that? I think what's causing it to be slow is the number of features. It has around 47k features.
Alex, what Oscar is suggesting is that to get a substantial reduction in computation time you're going to have to cut a large fraction of the features, and to do this without running a classifier is very difficult. 45k features per example is actually a very small number of features, and you should be able to train a linear SVM or logistic regression classifier in less than a minute on most hardware, using a proper implementation that directly supports sparse vectors. Doing feature selection to cut a substantial fraction of the features is likely to take far longer than just training a classifier and probably won't improve the results substantially.
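A small sketch of why sparsity matters more than the raw feature count: with a sparse representation, scoring an example under a linear model only touches the features that are actually present. The weights and indices below are invented for illustration, and `sparse_dot` is a hypothetical helper.

```python
def sparse_dot(weights, active_indices):
    """Score a binary sparse example: sum the weights of the features that occur.

    The cost is proportional to len(active_indices), not to len(weights).
    """
    return sum(weights[i] for i in active_indices)

num_features = 47000            # roughly the vocabulary size mentioned in the thread
weights = [0.0] * num_features  # made-up weight vector
weights[10] = 1.5
weights[4200] = -0.5

example = [10, 4200, 46999]     # indices of the words present in one document
score = sparse_dot(weights, example)
```

Here three additions are performed instead of 47,000, which is why removing a few unimportant features changes the running time very little for an implementation that supports sparse vectors.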
Are you sure that it can be done in less than a minute? Try to run this in Weka and see if you can do it in less than a minute: http://dl.dropbox.com/u/19680269/movie.arff I tried using J48 and it keeps running for an hour.
Decision trees are typically much slower to train compared to linear SVMs. Still, I don't know what goes on inside Weka. Training even a SimpleLogistic classifier seems to take a long time. However, when I dumped the data in libsvm format and ran liblinear from the command line, it took me 2.5 seconds to run a 5-fold cross validation with an accuracy of 82.05% (I didn't run any parameter tuning). If you need to use Weka, which is convenient for getting confusion matrices and so on, I suggest you add liblinear to your classpath and use the liblinear wrapper in Weka.
Did you try with my data? What do you mean by the libsvm format? Isn't everything in Weka based on the ARFF format? If I want to do parameter tuning in libsvm, which parameters are important to adjust?
Yes, I ran this on the movie.arff data you linked to. By "libsvm format" I mean the format used by the libsvm and liblinear software. You can dump your data to this format in Weka by using the "Save..." button in the Experimenter and selecting ".libsvm" as the file format. You can then use the cross validation capabilities in liblinear from the command line. However, you should also be able to run liblinear directly from Weka. I have very little experience with Weka, so I suggest you look at its documentation for how to set this up (it seems like you need to set some classpath).
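For reference, each line of a libsvm-format file is a label followed by `index:value` pairs, with 1-based indices in ascending order and zero entries left out. A minimal sketch of writing one such line, assuming the example is already held as a dict from 0-based feature index to value (`to_libsvm_line` is a hypothetical helper, not part of libsvm or Weka):

```python
def to_libsvm_line(label, features):
    """Format one example as '<label> <index>:<value> ...'.

    `features` maps 0-based feature indices to values. The libsvm format
    uses 1-based indices in ascending order and omits zero-valued entries.
    """
    parts = [str(label)]
    for index in sorted(features):
        value = features[index]
        if value != 0:
            parts.append(f"{index + 1}:{value:g}")
    return " ".join(parts)

# Two made-up documents: a positive one and a negative one.
lines = [
    to_libsvm_line(1, {4: 1, 0: 1}),
    to_libsvm_line(-1, {2: 0.5, 7: 0}),
]
```

Writing `"\n".join(lines)` to a file would give something liblinear's command-line tools can read.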
With liblinear you really only need to tune the C-parameter.
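One simple way to tune C is to try a grid of values spaced by powers of ten and keep the one with the best cross-validation accuracy. The sketch below only builds the command lines. It assumes the liblinear `train` binary is on your PATH and uses its `-c` (cost) and `-v` (n-fold cross validation) options. The file name `movie.libsvm` and the grid range are placeholders, not values from this thread.

```python
def c_grid(low_exp=-3, high_exp=3):
    """Candidate C values: powers of ten from 10**low_exp to 10**high_exp."""
    return [10.0 ** e for e in range(low_exp, high_exp + 1)]

def cv_commands(data_path, folds=5, train_binary="train"):
    """Build one liblinear cross-validation command per candidate C.

    Nothing is executed here; each command is returned as an argument list
    that could be passed to subprocess.run.
    """
    return [
        [train_binary, "-c", f"{c:g}", "-v", str(folds), data_path]
        for c in c_grid()
    ]

commands = cv_commands("movie.libsvm")  # hypothetical file name
```

Running each command and comparing the reported cross-validation accuracies gives a rough picture of how sensitive the result is to C. A finer grid around the best value is optional.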
Did you try running it straight from the ARFF file, without converting it to libsvm format? For me it breaks when evaluating the 3rd fold, even though I have increased the heap to a large amount.
In the Experimenter, when I click on save, it doesn't give me an option for .libsvm, just .exp.
I managed to run a 66% split with a VotedPerceptron inside Weka, but I didn't have time to wait for any other algorithms to finish.
Sorry, I meant the save button in the Explorer, not in the Experimenter.
When opening the libsvm file, I now get the "use a larger heap" problem. I have the max heap set to 1500m. How much memory do you have? I set maxheap in RunWeka.ini to 1500m and can't go higher than that.
See also: Text classification with very few labeled examples? (Overfitting rare features) and Text classifiers using accidental features