My feature space is large and includes free form text. I've had some success using a naive bayes classifier on it, but I'd like to try more sophisticated classifiers - likely neural networks and SVMs.

For the text I'm creating a feature for each word (eg. "surfing and skiing" results in the features: word:surfing=true, word:and=true, word:skiing=true).

I'm looking for a good way to prune my feature space and to pick out salient pieces from the free text as input for the NN/SVM/Other.

It occurred to me that I could train the naive bayes classifier, select the "most informative" (eg. as defined by nltk's most_informative_features), and use those as input to the NN/SVM/Other.

Is this a reasonable approach? Are there better approaches?

asked Dec 09 '10 at 22:35

Parand's gravatar image


7 Answers:

There's nothing particularly wrong with the approach that you outlined. However, there are many other options. What you're describing is some form of dimensionality reduction -- whether feature selection or a more general approach. The feature you're using now is a traditional bag-of-words representation of text. tf*idf is definitely something you should know about. It's a remarkably useful and transparent way to determine the discriminative value of words (or features generally) to documents within a corpus. (Naturally, you can stretch the definition of a "document" to be whatever group of text you like.)

As terrific as tf*idf is, it's really just the tip of the iceberg. You could easily use the conditional entropy of a document (or class variable) given a term H(c|w). You could use LSA or PLSA, if you're prepared to wrangle with singular value decomposition (SVD) of a potentially enormous matrix. Multi Dimensional Scaling (MDS) and Principle Components Analysis (PCA) are also popular options. @Aman's post describes forward selection, a very traditional feature selection algorithm. There's no requirement that forward selection start with only 5 features. You could easily start with 500 elements that have the highest tf*idf value, and incrementally add sets of 1, 5, 50, or 500, until you see the likelihood plateau.

That said, listen to Alexandre and give the SVM a shot without any explicit feature selection and see what happens.

answered Dec 15 '10 at 01:06

Andrew%20Rosenberg's gravatar image

Andrew Rosenberg

One way to solve the problem of feature selection is to start with a very small (say 5) features and then iteratively add one feature, train the model again, and check if the likelihood of the training samples increases by a significant percentage/threshold. If it does, then your feature is useful, other wise not. You could run this set up several times, starting with a high value of threshold (say, 30%) and then decreasing it with every complete run. As you decrease the threshold, the number of features being selected will increase. You can stop when you have acceptable number of features.

answered Dec 10 '10 at 00:23

Aman's gravatar image


I have free text, so my feature set is potentially all words. Selecting 5 features out of a pool of potentially tens of thousands and iterating seems daunting.

(Dec 10 '10 at 02:27) Parand

You shouldn't need to do feature selection for SVMs unless you've got pathological data that is really sparse, since the L2 regularization in the SVM formulation should take care of overfitting and most SVM implementations handle text data really well.

For neural networks you're better off doing dimensionality reduction instead of feature selection. I'd try random projections first and if you don't get good enough results maybe something like PCA.

answered Dec 10 '10 at 00:58

Alexandre%20Passos's gravatar image

Alexandre Passos ♦

What's a good way for me to get educated on handling textual data with SVM? How do I encode the text as a feature set?

(Dec 10 '10 at 02:28) Parand

The usual feature encoding for words is called tf-idf and is based on information retrieval techniques. If you're not familiar with it I highly recommend that you read Manning et al's book on information retrieval ( http://nlp.stanford.edu/IR-book/information-retrieval-book.html ), more specifically chapters 6, 14, and 15.

(Dec 11 '10 at 01:50) Alexandre Passos ♦

Thanks Alexandre, I'm not familiar with tf-idf so I'll do some reading.

(Dec 15 '10 at 00:12) Parand

I usually use the simplest binary representation (counting each word only once per document) with a L2-regularized SVM. This baseline is often very hard to beat with more exotic representations. It is also very scalable. If you get too many features, or need to do online learning where the number of features grows with each instance, just use a hash-function to map your feature indices to a fixed interval.

(Jan 28 '11 at 06:14) Oscar Täckström

Why would you need to do dimensionality reduction with a neural network? I don't believe that is necessary. One should just put all the features in the input layer.

(Mar 12 '11 at 19:55) Joseph Turian ♦♦

The problem with performing feature selection as a pre-processing step is that you fail to take the interaction of the feature representation and the learning algorithm into account. For example, if you use mutual information as a criterion, you assume conditional independence between features. This is also a problem with forward selection: assume you have features f1,f2,f3,f4 such that f1 and f2 are sightly discriminative, f3 and f4 together are the most discriminative features, but f3 or f4 alone are the least discriminative. Then the algorithm will select f1 and f2, but not f3 and f4. Backward elimination handles this better, since you start with all features active you take interactions into account, but it is still only locally optimal.

If you instead use L2-regularized SVM, you will take interactions between features into account and you will get a globally maximal optimum, though you will need to tune the regularization parameter. Still you should be careful about only adding features that you have good reason to believe will improve the results. As Alexandre points out, if you need to use feature selection, it probably means you added a poorly designed feature in the first place. In the case of small sample sizes and large number of features, L1-regularization seems to have a bit stronger theoretical guarantees, but I'm not aware of any practical results that support this in general.

Edit: If features for some reason are expensive to obtain, which is common in the medical domain, then feature selection might be very important. However, performing feature selection as a pre-processing step in order to reduce the number of words in a BoW representation for an SVM classifier does not seem like a good idea to me.

answered Jan 28 '11 at 06:33

Oscar%20T%C3%A4ckstr%C3%B6m's gravatar image

Oscar Täckström

edited Mar 11 '11 at 10:09

This has certainly been done in the past, at least with other algorithms as the selector (most often tree induction). It is a quick shortcut, but clearly doesn't guarantee optimal results. Consider that naïve Bayes, for instance, ignores variable interactions. If the model you ultimately use does pay attention to variable interactions, your selection is likely not well tuned.

If speed is your goal, you could do worse than to use this idea. For model performance, however, you will likely do much better with a wrapper solution. I have had good luck with genetic algorithms for variable selection, but there are plenty of other methods out there (forward/backward/stepwise, for instance).

answered Jan 28 '11 at 06:33

Will%20Dwinnell's gravatar image

Will Dwinnell

Some input from a non-expert here. I am developing a text classifier as well, with large dimensionality (many words) and large # of input documents.

Based on the advice of Alexandre Passos, I achieved a large jump in accuracy by using the 'maximum entropy' classifier in my ML lib, Python NLTK. This is sometimes known as logistic regression. The maxent classifier actually uses the same exact feature data structures as the naive bayesian so it was a slide-in replacement.

I struggled for a couple of months with naive Bayesian and selecting features (tokens, bi-grams, top 1000 more informative, etc, etc, etc. I could never really achieve better than 50% accuracy.

You might also want to build a separate classifier (maybe as an input or adjunct to your main classifier) that looks at file metadata - discretized creation date, file owner, file size, file type/extension, email headers, etc. I had surprisingly good results classifying on this type of data with naive bayesian. I still need to explore this idea further with other classifiers.

answered Feb 01 '11 at 15:22

mfhughes's gravatar image


edited Feb 01 '11 at 15:26


You only need more that Naive Bayesian if your data attributes have a non-linear correlation with outputs or themselves.

answered Feb 03 '11 at 07:15

Armando%20Vieira's gravatar image

Armando Vieira

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.