Replace with the mean for that feature? Not use those samples?
There are many possible answers to this question (imputation, setting the feature to zero, ignoring the example, etc.), but I think the right one is this: whatever you do to your training set must be as close as possible to what you will do on the test set, because as long as you are consistent you should generalize well. So if your test set will have the same missing features, any of these answers is fine, as long as you apply it consistently, and your classifier will generalize well. If, however, features are missing only in the training data but not in the test data, then you have a problem, and the best way to deal with it is to impute (guess) values for those features that mirror as closely as possible what you will see on the test data. A usually-good-enough way of doing this is to learn a classifier / regressor, trained on the training set, that predicts each missing feature from the other features, but I don't know of any theoretical guarantees for this type of procedure.
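To illustrate the consistency point, here is a minimal numpy sketch of mean imputation where the per-feature means are computed on the training set only and then applied identically to both splits. The data, the `impute_with_means` helper, and the missing positions are all made up for illustration:

```python
import numpy as np

# Hypothetical toy data: rows are samples, columns are features,
# np.nan marks a missing value.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(6, 3))
X_train[1, 2] = np.nan
X_test = rng.normal(size=(3, 3))
X_test[0, 2] = np.nan

# Per-feature means computed on the training set only, ignoring NaNs.
train_means = np.nanmean(X_train, axis=0)

def impute_with_means(X, means):
    """Fill NaNs in X with the supplied per-feature means."""
    X = X.copy()
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = means[cols]
    return X

# The same training-set means are used for both splits, so the
# train-time and test-time treatment of missing values is identical.
X_train_filled = impute_with_means(X_train, train_means)
X_test_filled = impute_with_means(X_test, train_means)
```

The key design choice is that no statistic is ever computed on the test set; the test data is transformed exactly the way the training data was.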
I want to answer with "impute the feature values," but suggesting that always gives me pause. Then I want to answer "use EM or another algorithm that plays well with unobserved data," but I realize I don't know what problem you're trying to solve. Can you provide some more information on the task, data set, or set of features? It's hard to answer this question without more information. The data is such that the missing features could probably be imputed from the data that isn't missing. What is the method typically used for this? Is it just another "sub" machine learning problem?
(May 21 '13 at 09:15)
C I
You did not answer the question: "Can you provide some more information on the task, data set, or set of features?" What kind of data? How many features? What is the nature of the features: continuous real values? Dates? Categorical identifiers, or boolean encodings of categorical features? If so, what is the cardinality or distribution of the category values? What is the rate of missing features? Is it homogeneous across the various features? Do you have any reason to expect the fact that a specific feature is missing to be correlated with the target outcome? What physical phenomenon is the cause of the missing features? What are you trying to predict? Is it a continuous variable? A discrete class assignment? If so, how many classes? Is it evolving over time?
(May 21 '13 at 13:38)
ogrisel
I can't provide the full details of the dataset, unfortunately, but to answer most of your questions: there are ~100 features; they are continuous real values for the most part, though there are a few dates, binary features, and enumerated features that have become boolean; there are only two categories; about 50% of the features are missing for 50% of the dataset, and it is mostly the same features that are missing for that 50%; whether the features are missing is not correlated with the target outcome (the missing features are just the result of less data being collected for a period of time); it is a classification problem with two classes; and yes, it does evolve over time to a degree, but I think that is probably a whole separate question in itself.
(May 21 '13 at 20:43)
C I
If it is mostly a single feature that is missing, it might be interesting to try to fit a regression model that predicts that feature's value from the other features, as Alexandre suggested.
(May 22 '13 at 18:41)
ogrisel
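The regression-imputation idea above can be sketched in a few lines of numpy. Everything here is synthetic and illustrative: the linear relationship between the features, the noise level, and the 50% missing split are all assumptions, and a real dataset would need a model chosen to fit its actual structure:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Synthetic data where feature 2 is a noisy linear function of features 0-1.
X = rng.normal(size=(n, 2))
feat2 = X @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=n)
data = np.column_stack([X, feat2])

# Hide feature 2 in half the rows, mirroring the ~50% scenario above.
missing = np.zeros(n, dtype=bool)
missing[: n // 2] = True
data_obs = data.copy()
data_obs[missing, 2] = np.nan

# Fit ordinary least squares on the complete rows (with an intercept term),
# treating the missing feature as the regression target.
complete = ~missing
A = np.column_stack([data_obs[complete, :2], np.ones(complete.sum())])
coef, *_ = np.linalg.lstsq(A, data_obs[complete, 2], rcond=None)

# Predict the missing feature from the observed ones and fill it in.
B = np.column_stack([data_obs[missing, :2], np.ones(missing.sum())])
data_obs[missing, 2] = B @ coef
```

Since roughly the same block of features is missing for the same rows, one such regressor per missing feature (fitted on the complete rows only) would be the natural extension of this sketch.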