|
I asked this question in another post, but it got lost in the noise. Forgive me if this seems somewhat obvious or my hunches are dead wrong; I'm just trying to fill in gaps in my knowledge. I'm trying to investigate the relationship between dimensionality and the quality of predictions of a black-box predictive model (e.g. SVM, logistic regression, naive Bayes, C4.5, RIPPER, or whatever your favorite classifier is) as a function of the number of labeled instances. I would expect that a lower-dimensional representation has greater sample density and fewer parameters to tune, so given only a few labeled examples, I suspect it would be advantageous to project down and avoid variance-based errors. On the other hand, more data can overcome these errors for higher-dimensional representations. Is there a mistake in my thinking? The action to take, if I'm correct, would be to project down when the training set is small, and maybe do some feature expansion as the training size increases (e.g. quadratic combinations of features, etc.).
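To make that small-sample intuition concrete, here's a toy numpy sketch (the nearest-centroid classifier, the synthetic data, and all constants are my own illustrative choices, not from any model mentioned above): two Gaussian classes differ only along the first 5 of 200 dimensions, and we compare accuracy on the full representation vs. the data projected onto those informative coordinates (assuming, unrealistically, that we know them; in practice you'd use PCA or the like) as the training size grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 200, 5   # ambient dimensionality; number of informative dims (assumed known)

def make_data(n):
    """Two Gaussian classes, separated only along the first k dimensions."""
    y = np.arange(n) % 2                      # balanced labels
    X = rng.normal(size=(n, d))
    X[:, :k] += (2 * y[:, None] - 1)          # shift informative dims by +/-1 per class
    return X, y

def nearest_centroid_acc(Xtr, ytr, Xte, yte):
    """Fit a nearest-centroid classifier and return test accuracy."""
    c0 = Xtr[ytr == 0].mean(axis=0)
    c1 = Xtr[ytr == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - c1, axis=1) < np.linalg.norm(Xte - c0, axis=1))
    return (pred.astype(int) == yte).mean()

Xte, yte = make_data(4000)                    # large held-out test set
results = {}
for n in (10, 50, 500):
    Xtr, ytr = make_data(n)
    results[n] = (
        nearest_centroid_acc(Xtr, ytr, Xte, yte),                 # full 200-dim
        nearest_centroid_acc(Xtr[:, :k], ytr, Xte[:, :k], yte),   # projected to 5-dim
    )
    print(n, results[n])
```

With only 10 labeled points the projected representation wins clearly (the 195 noise dimensions swamp the centroid estimates), while at 500 points the full representation has essentially caught up, which matches the "more data can overcome these errors" intuition.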
|
With regularized linear models you don't really see a degradation of performance as you increase the dimensionality; most state-of-the-art discriminative natural language processing models have millions of features and only thousands of examples, or fewer. Naive Bayes and other density-based classifiers (anything based on a distance matrix) tend to suffer a lot with increased dimensionality, as (I expect) should C4.5. If you're considering this as a research project, unless you have a fundamentally new insight to bring to it, I'd say you're better off doing something else. See, for example, Section 10.7 of The Elements of Statistical Learning for an analysis of this sort.
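One quick way to see why distance-based methods suffer: as dimensionality grows, pairwise distances between random points concentrate, so the nearest and farthest neighbors of a query point become nearly indistinguishable. A small numpy sketch (synthetic Gaussian data; the sample sizes and dimensions are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ratio of farthest to nearest neighbor distance from one query point.
# As d grows this ratio approaches 1: distances lose their contrast,
# which is what hurts nearest-neighbor and other distance-based methods.
ratios = []
for d in (2, 20, 200, 2000):
    X = rng.normal(size=(500, d))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)   # distances to the first point
    ratios.append(dists.max() / dists.min())
    print(d, round(ratios[-1], 2))
```

In low dimensions the farthest point is many times farther than the nearest; by d = 2000 the ratio is close to 1.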
|
Thanks. I was actually thinking about this from the active learning perspective; it's not clear to me how problems should be represented while data is being gathered. Surely the representation influences the selections that are made, which in turn influence the final generalization performance. What's extra tricky is that I suspect selections made under one representation may not give the best performance when the features are changed at the final model-induction phase, which could lead to some erroneous conclusions about how the problem should be tackled. Furthermore, there isn't really a way to do cross-validation to determine the best representation here; the data set is heavily biased by the active selections. I would figure that the simplest (or lowest-dimensional) representations would lead to lower variance, and potentially more flexibility later on. I'll check out what Hastie has to say about this. And I'm really just trying to fill the holes in my knowledge; I'm not really in love with this as a research project.
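For concreteness, here's what the loop in question looks like: a toy pool-based uncertainty-sampling sketch, hand-rolled in numpy (the logistic-regression trainer, the 2-D Gaussian data, and every constant are illustrative assumptions on my part). The point is just that the margin used for selection is computed in whatever representation you query in, so changing the representation changes which points get labeled.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logreg(X, y, steps=500, lr=0.5, lam=0.1):
    """Toy L2-regularized logistic regression via gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * (X.T @ (p - y) / len(y) + lam * w)
    return w

# Unlabeled pool: two Gaussian classes in 2-D, means at (-1,-1) and (1,1).
n = 400
y_pool = rng.integers(0, 2, n)
X_pool = rng.normal(size=(n, 2)) + (2 * y_pool[:, None] - 1)

labeled = list(rng.choice(n, 10, replace=False))   # small random seed set
for _ in range(20):                                # 20 active queries
    w = fit_logreg(X_pool[labeled], y_pool[labeled])
    certainty = np.abs(X_pool @ w)                 # distance from decision boundary
    certainty[labeled] = np.inf                    # never re-query a labeled point
    labeled.append(int(np.argmin(certainty)))      # query the least certain point

w = fit_logreg(X_pool[labeled], y_pool[labeled])
acc = ((X_pool @ w > 0).astype(int) == y_pool).mean()
print(acc)
```

Projecting or expanding `X_pool` before the `certainty` computation would change the query sequence, which is exactly the representation-dependence worry above.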
This is known as the "bias-variance tradeoff".
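In case a simulation of that decomposition helps: fitting polynomials of increasing degree to many resampled noisy training sets and splitting test error into bias-squared and variance shows the low-degree fits dominated by bias and the high-degree fits dominated by variance (the sine target, noise level, and sample sizes are arbitrary choices of mine).

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

xs = np.linspace(0, 1, 50)                # fixed evaluation grid
results = {}
for degree in (1, 3, 9):
    preds = []
    for _ in range(200):                  # many small resampled training sets
        x = rng.uniform(0, 1, 15)
        y = true_f(x) + rng.normal(0, 0.3, 15)
        preds.append(np.polyval(np.polyfit(x, y, degree), xs))
    preds = np.array(preds)
    bias2 = ((preds.mean(axis=0) - true_f(xs)) ** 2).mean()   # squared bias
    var = preds.var(axis=0).mean()                            # variance over resamples
    results[degree] = (bias2, var)
    print(degree, round(bias2, 3), round(var, 3))
```

Degree 1 can't represent the sine (high bias, low variance); degree 9 chases the noise in each resample (low bias, high variance).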