|
Hi all, I have a machine learning problem with a well-defined train and test set. I'm trying to build models trained on the train set that perform well on the test set, but I'm hitting a brick wall. The train and test sets contain about the same number of records (~1200). The train set is 83% negative and 17% positive; the test set is 96% negative and 4% positive. I'm using an SVM for now, with random undersampling to balance the dataset. The train and test sets are created deterministically as a "simulation" of real-world usage of this model, where I will be training on data from before a certain point in time and then trying to predict data from after that point. What I care about are the recall and precision of the positives. Here's what I'm observing:
Any ideas on how I might go about debugging this problem? Thanks! |
|
The first thing I would check is what Leon commented on kungpaochicken's answer: ensure that you are picking your folds randomly. Don't reinvent the wheel here: every programming language has a random function, and most have a built-in function for shuffling. Use this to generate your folds. If you want to be able to run the experiment on the same folds again, look into saving the seed. Using the same seed will give you the same folds every time. The second thing is to ensure that your learning algorithms are suited to such a biased data set. There are algorithms for very biased datasets, but I don't think you need them for a 96/4 split. It may help to investigate these a bit more.

Thanks Robert! The folds for cross validation are indeed generated randomly, but the train and test sets are generated deterministically as a real-world simulation of how this model would be used. Do you have suggestions for learning algorithms that are more suited for biased data sets?
(Feb 09 '11 at 17:19)
Chung Wu
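The seeded-shuffle suggestion in the answer above can be sketched roughly as follows. This is only a minimal illustration using Python's standard library; the function name `make_folds`, the fold count of 5, and the seed value 42 are assumptions made for the example, not anything from the original poster's setup.

```python
import random

def make_folds(n_records, k, seed):
    """Return k lists of record indices, shuffled reproducibly by seed."""
    indices = list(range(n_records))
    # A private generator keeps the shuffle reproducible without
    # touching the global random state.
    rng = random.Random(seed)
    rng.shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]

# Hypothetical usage: ~1200 records, 5 folds, fixed seed.
folds_a = make_folds(1200, 5, seed=42)
folds_b = make_folds(1200, 5, seed=42)
```

Calling the function twice with the same seed gives identical folds, which is what makes a cross-validation run repeatable. Note that this only applies to the folds inside the train set; it does not change the deterministic, time-based train/test split described in the question.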
|
|
This looks a lot like a domain adaptation problem, where the distributions of the training and test sets are very different. Something's wrong with your numbers, though: if you're getting around 6% precision/recall on the test set, aren't you better off flipping the labels returned by your classifier? I'd take a look at domain adaptation methods. You can find very good information on the ICML 2010 domain adaptation tutorial website. |
|
Some early observations from your post: it seems that the data in your training set is not representative of the test set. For starters, the proportion of positive examples in your training set is about 4 times (roughly 300%) larger than in your test set. So the way to go about debugging this is to take a look at the misclassifications in the training set vs. the test set, and see what the difference is. Also, 72% precision on such a biased set doesn't mean much: your model can get away with predicting 0's 90% of the time and still get your numbers. One way to fix this problem is to add some sort of regularization (or prior) to your model which weights 1's more heavily than 0's.

Actually, he probably is separating the sets in a deterministic fashion, which is not usually the way to go.
(Feb 09 '11 at 02:00)
Leon Palafox
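As a rough illustration of weighting 1's more heavily than 0's, one common heuristic is to weight each class inversely to its frequency. The sketch below is plain Python; the helper name `balanced_class_weights` and the label counts (996 negatives, 204 positives, chosen to match an 83/17 split of 1200 records) are assumptions made for the example. To the best of my knowledge, this is the same formula scikit-learn documents for `class_weight="balanced"`, but treat that as an assumption and check the documentation for your library.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical train labels: 83% negative (0), 17% positive (1).
train_labels = [0] * 996 + [1] * 204
weights = balanced_class_weights(train_labels)
```

With these weights the two classes contribute equally to the total weighted loss, so no negative examples have to be thrown away as they are with random undersampling.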
Thanks! I'll take a deeper look at how the misclassifications differ. The 72% precision is only on positives (# positives correct / # positives predicted), so I think it's a meaningful measure. I'll also look into sample weighting instead of undersampling. The train and test sets are picked deterministically (as a simulation of real-world usage of the model), but the cross validation does pick folds randomly. Thanks!
(Feb 09 '11 at 17:17)
Chung Wu
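For reference, the positive-class precision defined in the comment above (# positives correct / # positives predicted), together with the matching recall, can be computed as in this small sketch. The function name `precision_recall` and the toy labels are made up for illustration, and returning 0.0 when there are no predicted (or actual) positives is a convention chosen here, not a universal rule.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive class only."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred)
                   if t == positive and p == positive)
    predicted_pos = sum(1 for p in y_pred if p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    # Convention assumed here: report 0.0 when a denominator is zero.
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall

# Toy example (not the poster's data): 3 actual positives, 3 predicted, 2 correct.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 0, 0, 1]
precision, recall = precision_recall(y_true, y_pred)

# A classifier that always predicts 0 scores zero on both measures.
all_neg_precision, all_neg_recall = precision_recall(y_true, [0] * len(y_true))
```

Running the same function on the train-set and test-set predictions separately is one way to put numbers on how the misclassifications differ between the two.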
|