|
I have a dataset with 2,800 patients and 80 variables. 1,400 patients are cases and 1,400 are controls. I want to split the data into "training" and "test" sets. How do I decide what proportion of the dataset should be allocated to "training" and what proportion to "test"? And what test should I perform to make sure the two sets are drawn from the same distribution? Thanks in advance.
|
I'm not quite clear on the problem setting, but since the case and control classes are usually known a priori, I'm assuming you're looking for which of the 80 variables are relevant (and what their effects are). If you know the treatment and control groups, there are standard statistical tests for detecting statistically significant treatment effects. Try googling any of these terms: ANOVA, MANOVA, ANCOVA, MANCOVA. These are probably a good place to start, since they are well understood by medical professionals and therefore easy to justify, and easy for the end user to interpret and compare with other results. Even if you plan to do something else with the data, it would be a good idea to have the classical results to compare against.

For machine learning approaches, there are rules of thumb, but no hard and fast rules, about the proportions into which you should split the data. Since you have plenty of data, a split of roughly 1,000 training points per class looks about right to me, with the remaining 400 per class either all used as a test set or, depending on the approach, divided into 200-point test and validation sets (or similar proportions if you're running multiple experiments). Your study already has equal-sized case and control groups; for interpretability as well as practical reasons, I would suggest you preserve that 50/50 balance within each split (i.e., stratify by class). At the very least, make sure each training set contains substantially more than 80 observations: inference with fewer examples than variables is generally hard, and difficult to justify when you have more than enough data to avoid it.

As for checking that the two halves come from the same distribution: if you split randomly (stratified by class), the training and test sets are drawn from the same distribution by construction, so no formal test is needed.
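To make the proportions concrete, here is a minimal sketch of a stratified 1,000/400-per-class split using scikit-learn's `train_test_split`. The arrays `X` and `y` are placeholders standing in for your real predictors and case/control labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 2,800 patients x 80 variables, 1,400 per class.
# Replace with your actual predictor matrix and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(2800, 80))
y = np.repeat([0, 1], 1400)  # 0 = control, 1 = case

# stratify=y keeps the 50/50 case/control balance in both halves,
# giving 1,000 training and 400 test points per class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=2000, test_size=800, stratify=y, random_state=0
)

print(np.bincount(y_train))  # 1,000 of each class
print(np.bincount(y_test))   # 400 of each class
```

If you also want a validation set, apply `train_test_split` a second time to `X_test`, `y_test` (again with `stratify`) to get the 200/200-per-class division mentioned above.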