I have a classification problem with 2 classes (positive and negative). Usually, in such classification problems, all the samples are labelled either 'positive' or 'negative'. In my dataset, however, some of the samples possess a combination of both positive and negative characteristics. Formally, if the dataset is $x$, then

$$x = x_1 \cup x_2 \cup x_3,$$

where $x_1$ is the set of all positive samples, $x_2$ is the set of all negative samples, and $x_3$ is the set of samples that contain the characteristics of both classes. As far as I could think of, this situation could be handled in 2 ways:

1. Ignore the samples in $x_3$ and train the classifier only on $x_1$ and $x_2$.
2. Treat $x_3$ as a third class, i.e. turn this into a three-class problem (a minimal sketch of this option follows the list).
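For concreteness, here is a minimal sketch of the second option in Python with scikit-learn; the feature matrices and the choice of LogisticRegression are placeholders standing in for my actual data and model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder feature matrices for the three sample sets.
x1 = rng.normal(size=(100, 5))  # purely positive samples
x2 = rng.normal(size=(100, 5))  # purely negative samples
x3 = rng.normal(size=(30, 5))   # samples with both characteristics

# Relabel the data as a three-class problem and fit a single classifier.
X = np.vstack([x1, x2, x3])
y = np.array(["positive"] * len(x1) + ["negative"] * len(x2) + ["both"] * len(x3))
clf = LogisticRegression().fit(X, y)
```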
I wish to follow the second option, as it is the more natural choice: ignoring some samples from the dataset feels like artificially manipulating it, which may hurt the classifier's performance in real-world scenarios. In this context, I have the following questions:
As is usually the case in ML, there are multiple ways to handle this. Your intuition that it is not good to simply omit the x3 examples is correct. Treating it as a three-class problem is one way to approach it. My first inclination, however, would be to treat it as two two-class problems: (x1 vs not x1) and (x2 vs not x2). With this approach, you would train the x1 classifier using (x1 U x3) as positive examples and x2 as negative examples, and then train the x2 classifier using (x2 U x3) as positive examples and x1 as negative examples. What works best will depend on the structure of the class distributions and on which classifier you're using.
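To make the two-classifier setup concrete, here is a minimal sketch in Python with scikit-learn; the feature matrices are hypothetical stand-ins for your x1, x2, and x3 sets, and LogisticRegression is just a placeholder for whichever classifier you actually use:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder feature matrices for the three sample sets.
X1 = rng.normal(size=(100, 5))  # purely positive samples (x1)
X2 = rng.normal(size=(100, 5))  # purely negative samples (x2)
X3 = rng.normal(size=(30, 5))   # samples with both characteristics (x3)

# Classifier A: "x1 vs not x1", trained on (x1 U x3) as positives, x2 as negatives.
X_a = np.vstack([X1, X3, X2])
y_a = np.concatenate([np.ones(len(X1) + len(X3)), np.zeros(len(X2))])
clf_a = LogisticRegression().fit(X_a, y_a)

# Classifier B: "x2 vs not x2", trained on (x2 U x3) as positives, x1 as negatives.
X_b = np.vstack([X2, X3, X1])
y_b = np.concatenate([np.ones(len(X2) + len(X3)), np.zeros(len(X1))])
clf_b = LogisticRegression().fit(X_b, y_b)

# A new sample can be flagged by one classifier, by both, or by neither.
x_new = rng.normal(size=(1, 5))
print("positive-like:", clf_a.predict(x_new)[0], "negative-like:", clf_b.predict(x_new)[0])
```

A sample flagged by both classifiers corresponds to your x3 case, so the mixed samples are handled naturally at prediction time. This decomposition is essentially the binary-relevance formulation used in multi-label classification.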