|
What are some favorite techniques for handling feature vectors with missing variables? The simplest (and most universally applicable) seems to be to impute values by sampling from the training data distribution of the feature. But from a statistics point of view, I kind of hate imputing values -- you end up learning or making decisions using false information -- though it would certainly work. Decision trees have an easy out: they can let the decision pass through if a value is missing. But decision trees give garbage confidence scores and have fairly blah performance on a lot of tasks. Are there any other contenders? |
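For concreteness, here is a minimal sketch of the sampling-based imputation described in the question. The helper names (`fit_feature_pools`, `impute_by_sampling`) and the convention of marking missing entries with `None` are assumptions made for illustration only.

```python
import random

def fit_feature_pools(rows):
    """Collect the observed (non-None) training values for each feature column."""
    n_features = len(rows[0])
    pools = [[] for _ in range(n_features)]
    for row in rows:
        for j, value in enumerate(row):
            if value is not None:
                pools[j].append(value)
    return pools

def impute_by_sampling(row, pools, rng):
    """Replace each missing entry with a random draw from that feature's observed values."""
    return [rng.choice(pools[j]) if value is None else value
            for j, value in enumerate(row)]

# Toy training set; None marks a missing value (hypothetical convention).
train = [[1.0, 10.0], [2.0, None], [None, 30.0], [4.0, 40.0]]
pools = fit_feature_pools(train)
rng = random.Random(0)
completed = impute_by_sampling([None, 20.0], pools, rng)
```

Observed entries are left untouched; only the missing slot is filled, with a value that actually occurred in training for that feature.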
|
The right thing to do will depend on whether you believe that your data is what they call "missing at random" or not. This is a good assumption if, say, the physical experiment used to derive some observations fails with some probability p regardless of the values being measured; it is not a good assumption in, say, the collaborative filtering setting, where people are much more likely to rate movies that they either really like or really hate, thus skewing the observed score distribution in a non-uniform manner. Ben Marlin dealt extensively with this problem in his PhD thesis (available from his web page). |
|
While single imputation can be misleading, multiple imputation is probably the most statistically sound and generalizable approach to missing data. It is not always practical, given very sparse data, but it is theoretically justified and works across many types of problems. The key is to impute multiple times (often a few times will do) so that uncertainty is properly incorporated, and to specify your imputation model carefully, particularly if your missing data mechanism is not ignorable. If you're lucky, you can marginalize out the missing data directly. But this is rarely possible in models I've worked with, and even when it can be done it is typically not worth the effort. Multiple imputation effectively marginalizes out the missing data from the final analysis through simulation. |
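A rough sketch of the impute-several-times-then-pool idea, for the simplest possible analysis (estimating a mean). The function name `multiply_impute_mean` is hypothetical, and the imputation model here is an assumption: a normal distribution fitted to the observed values, which only makes sense if the missingness is ignorable. A proper multiple imputation would also redraw the imputation model's parameters on each round; that step is skipped here. Pooling uses Rubin's rules: total variance = within-imputation variance + (1 + 1/m) x between-imputation variance.

```python
import random
import statistics

def multiply_impute_mean(values, m=5, seed=0):
    """Estimate a mean from data with None gaps using m imputed copies."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    # Assumed imputation model: normal fitted to the observed values.
    mu, sigma = statistics.mean(observed), statistics.stdev(observed)
    estimates, variances = [], []
    for _ in range(m):
        completed = [rng.gauss(mu, sigma) if v is None else v for v in values]
        estimates.append(statistics.mean(completed))
        # Squared standard error of the mean for this completed data set.
        variances.append(statistics.variance(completed) / len(completed))
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)
    between = statistics.variance(estimates)
    total_var = within + (1 + 1 / m) * between  # Rubin's rules
    return pooled, total_var

pooled, total_var = multiply_impute_mean([1.0, 2.0, None, 4.0, None, 6.0])
```

The between-imputation term is what a single imputation throws away: it is the extra uncertainty that comes from not knowing the missing values.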
|
If you are working with a probabilistic model[1] you might be able to marginalize out the missing features, either in closed form or by sampling. David Knowles has an interesting series of blog posts on the subject, specifically applied to infinite factor analysis models (built on the Indian Buffet Process).
[1] A nice reference is Probabilistic Graphical Models by Daphne Koller and Nir Friedman. |
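As a small illustration of closed-form marginalization, take a Gaussian naive Bayes classifier (a swapped-in example, not the infinite factor analysis models mentioned above). Because features are conditionally independent given the class, integrating out a missing feature just removes its likelihood term. The function names and the toy parameters below are assumptions for illustration.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def class_log_posteriors(x, priors, means, variances):
    """Unnormalized log posterior per class, marginalizing features that are None."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for j, value in enumerate(x):
            if value is not None:
                score += log_gaussian(value, means[label][j], variances[label][j])
            # A missing feature integrates to 1, so it adds 0 in log space.
        scores[label] = score
    return scores

# Hypothetical two-class model with two features.
priors = {"a": 0.5, "b": 0.5}
means = {"a": [0.0, 0.0], "b": [3.0, 3.0]}
variances = {"a": [1.0, 1.0], "b": [1.0, 1.0]}
scores = class_log_posteriors([2.9, None], priors, means, variances)
prediction = max(scores, key=scores.get)
```

With every feature missing, the scores fall back to the log priors, which is the behavior you would want from a model that has seen no evidence.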
|
I tend to use linear classifiers trained via online learning algorithms (perceptron, passive aggressive, SGD). These algorithms are surprisingly robust to noise (approximations, missing variables, etc.) if that noise is present during training. One thing I found works well is to randomly drop a subset of the features (or treat their values as 0) for each instance considered during training. You can also view this as a type of regularization, or as related to feature bagging (à la http://homepages.inf.ed.ac.uk/csutton/publications/bags-hlt2006.pdf). This trick certainly helps with missing values, and because of the regularization it sometimes even results in improved performance.
This strategy, or something very similar anyway, is discussed at some length by Globerson & Roweis, 2006.
(Sep 29 '10 at 16:28) David Warde Farley ♦
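A minimal sketch of the random feature dropping trick, using a plain perceptron without a bias term to keep it short. The function names, the `drop_prob` parameter, and the toy data are all assumptions; this is one possible reading of the idea, not the answerer's actual code.

```python
import random

def train_perceptron_with_feature_dropout(data, n_features, drop_prob=0.3,
                                          epochs=20, seed=0):
    """Perceptron training where each feature is zeroed with probability drop_prob."""
    rng = random.Random(seed)
    weights = [0.0] * n_features
    for _ in range(epochs):
        for features, label in data:  # label is -1 or +1
            # Simulate missing values: a fresh random subset is dropped per instance.
            noisy = [0.0 if rng.random() < drop_prob else v for v in features]
            score = sum(w * v for w, v in zip(weights, noisy))
            if label * score <= 0:  # mistake (or zero margin): standard update
                weights = [w + label * v for w, v in zip(weights, noisy)]
    return weights

def predict(weights, features):
    """Classify, treating missing (None) features as 0, as during training."""
    score = sum(w * (0.0 if v is None else v) for w, v in zip(weights, features))
    return 1 if score > 0 else -1

# Toy data in which the two features are redundant.
data = [([1.0, 1.0], 1), ([-1.0, -1.0], -1), ([2.0, 1.5], 1), ([-1.5, -2.0], -1)]
weights = train_perceptron_with_feature_dropout(data, n_features=2)
```

Because the classifier is regularly forced to get by without one feature or the other, it ends up putting weight on both, so a call such as `predict(weights, [None, 1.0])` still has something to work with.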
|
|
If you really think the pattern of missing and non-missing features is in itself important, you can model it directly. Some people in the Netflix Prize found it useful to use a predictor that factorized a binary version of the user ratings matrix, where an entry was 1 if the user rated that movie and 0 otherwise. I recall reading somewhere that this nicely complemented the results one got from factorizing the actual user ratings matrix, where each cell holds the rating that user gave that movie. I didn't find the original reference I read at the time, but this paper seems to describe this. |
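The binary "who rated what" matrix is straightforward to build. Here is a small sketch, assuming ratings are stored as a list of per-user rows with `None` for unrated movies; the function name `rated_indicator` is made up for illustration. The factorization step itself is left out.

```python
def rated_indicator(ratings):
    """Return a 0/1 matrix: 1 where the user rated the movie, 0 otherwise."""
    return [[0 if r is None else 1 for r in row] for row in ratings]

# Rows are users, columns are movies; None means "not rated" (assumed convention).
ratings = [
    [5, None, 1],
    [None, None, 4],
]
indicator = rated_indicator(ratings)
```

The indicator matrix has no missing entries at all, so it can be fed to any matrix factorization method alongside the model fitted to the actual ratings.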