I am currently performing L2-penalized least squares regression with features generated by a variety of models (tf-idf bag-of-words, LDA document cluster probabilities, Wiener filter predictions) in order to predict which direction a stock price moves after a news article's publication. I'm finding that, regardless of normalization, I achieve significantly better test performance in mean squared error using the tf-idf features alone than when I place those same features alongside the ones generated by the other methods. I have done my best to ensure each feature has approximately the same scale, though I have avoided zeroing the mean for the sparse bag-of-words features. Can someone give me some intuition on why adding more features can hurt performance? Since lacking a feature is equivalent to giving it zero weight, it is unintuitive that more information would have a negative impact.
For the training-set error your intuition is exactly right: adding a new feature can only decrease training error (and in a linear model the decrease is always smaller when the feature is added to a full set of features than when it is added to just a subset of them). For test-set error, however, this is not true, which means your model is overfitting: it is leveraging patterns in these features that do not exist in the test data. If your training and test sets are IID, your model is regularized (with L2 regularization), and you have enough training examples, adding features shouldn't hurt. Remove any of those conditions, however, and adding features can hurt. If your test set is drawn from a different distribution than your training set, you might want to try covariate-shift methods to counteract the bias this induces; and if you don't have enough data points, you might want a sparsity-inducing regularizer (like the L1 norm) to learn a more compact model that will hopefully leverage only the good features (see the sketch after these comments).

The IID assumption is really important in my case, as sequential stock prices are strongly affected by seasonality. It's quite striking how much more movement there is from October through December than at any other time of the year. I believe I have enough data points (~300K), but not enough to make the problem full rank (550K features). I'll see if I can play with some other details to make things work.
(May 07 '12 at 02:42)
Daniel Duckwoth
On a more "specialist" note: we recently wrote a paper on portfolio management optimization, and a lot of people agree that this season is very atypical for trying to do something meaningful. Try training on large time windows, since the natural frequency of stocks seems to be on at least a 10-year time scale.
(May 07 '12 at 04:20)
Leon Palafox ♦
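
Here is a minimal sketch (not from the original thread) of the ridge-versus-L1 comparison suggested in the answer, assuming scikit-learn and SciPy; the toy sparse matrix and variable names are hypothetical stand-ins for the real 300K x 550K feature matrix.

```python
# Minimal sketch: comparing an L2 (ridge) penalty against a sparsity-inducing
# L1 penalty when the design matrix is wide and sparse (p >> n). The data here
# are synthetic placeholders for the tf-idf + LDA + Wiener-filter features.
import numpy as np
from scipy import sparse
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n_train, n_test, n_feat = 2000, 500, 10000        # toy stand-in for 300K x 550K

X_train = sparse.random(n_train, n_feat, density=0.01, format="csr", random_state=0)
X_test = sparse.random(n_test, n_feat, density=0.01, format="csr", random_state=1)
true_w = np.zeros(n_feat)
true_w[:50] = rng.normal(size=50)                 # only 50 features actually matter
y_train = X_train @ true_w + 0.1 * rng.normal(size=n_train)
y_test = X_test @ true_w + 0.1 * rng.normal(size=n_test)

ridge = Ridge(alpha=1.0).fit(X_train, y_train)    # L2: keeps every feature
lasso = Lasso(alpha=0.001).fit(X_train, y_train)  # L1: drives most weights to zero

print("ridge test MSE:", mean_squared_error(y_test, ridge.predict(X_test)))
print("lasso test MSE:", mean_squared_error(y_test, lasso.predict(X_test)))
print("nonzero lasso weights:", np.count_nonzero(lasso.coef_))
```

On real data the L1 model's nonzero weights also give a quick read on which feature blocks (tf-idf, LDA, Wiener) the model actually uses.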
Andrew Ng has a bit of explanation on this in some slides called "Advice on Applying Machine Learning"; they include some nice sanity tests to see whether your algorithm is working or not (a sketch of one such check follows the comment below).

I read all the course notes for Stanford's CS229 two years ago, but somehow I missed the contents of this lecture. Very useful advice that you'll never see in a theory course!
(May 07 '12 at 01:42)
Daniel Duckwoth
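
A minimal sketch, assuming scikit-learn and matplotlib, of one sanity check in the spirit of those slides: plot training and validation error against training-set size. The data and model below are toy placeholders.

```python
# Learning-curve sanity check: a large, persistent gap between the two curves
# suggests high variance (overfitting), while two curves that converge to a
# high error suggest high bias (underfitting).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 200))                        # toy feature matrix
y = X[:, :5].sum(axis=1) + 0.5 * rng.normal(size=1000)  # toy target

sizes, train_scores, val_scores = learning_curve(
    Ridge(alpha=1.0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    scoring="neg_mean_squared_error", cv=5)

plt.plot(sizes, -train_scores.mean(axis=1), label="train MSE")
plt.plot(sizes, -val_scores.mean(axis=1), label="validation MSE")
plt.xlabel("training-set size")
plt.ylabel("MSE")
plt.legend()
plt.show()
```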
Extra garbage features are especially troublesome in financial market prediction, because in practice good features are often only slightly better than random. That makes it really hard for feature selection algorithms to pick out a good feature when there are lots of irrelevant ones. Imagine, for example, that you're trying to predict the direction of the stock market tomorrow (up/down), and you have a feature that you know, a priori, is correct 55% of the time. If you now add 10 random features, on average 5 of them will look like good predictors in-sample, with accuracy > 50% (and the other 5 <= 50%). If your data set is small, the best random feature could easily show better in-sample performance than the 55% your true feature provides, so your fitted model would down-weight the true predictor in favor of some of the random ones, leading to worse out-of-sample performance.
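
A quick simulation of this argument (hypothetical numbers, assuming NumPy): one feature that matches tomorrow's direction 55% of the time, plus 10 coin-flip features, scored on a small sample.

```python
# How often does the best of 10 purely random features beat a genuinely
# informative (55% accurate) feature in-sample, when the sample is small?
import numpy as np

rng = np.random.default_rng(0)
n_days, n_trials, n_random = 100, 2000, 10
wins = 0
for _ in range(n_trials):
    direction = rng.choice([-1, 1], size=n_days)               # market up/down
    agree = rng.random(n_days) < 0.55                          # true feature agrees 55% of the time
    true_feat = np.where(agree, direction, -direction)
    true_acc = np.mean(true_feat == direction)
    rand_feats = rng.choice([-1, 1], size=(n_random, n_days))  # coin-flip features
    best_rand_acc = np.max(np.mean(rand_feats == direction, axis=1))
    wins += best_rand_acc > true_acc

print("fraction of small samples where a random feature wins in-sample:",
      wins / n_trials)
```

With only 100 observations the random winner comes out on top a substantial fraction of the time, which is exactly the failure mode described above.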