I have a dataset of N samples, of which 20% are labelled and 80% are unlabelled. There are C known classes, but more than C classes may exist in the data. Each sample is a 70-dimensional vector. That is:

X = {x1, ..., xN}, where each xi is a 70-dimensional vector.
Y = {y1, ..., yN}, where each yi takes a value in {0, 1, ..., C}; 0 means the label is unknown, and 1, ..., C is the class the sample belongs to.

A class can produce different patterns, i.e. all the samples that should carry the same label i may be explained by several distinct clusters in the space. For example, class 2 might be explained by a Gaussian distribution in one part of the space and by another Gaussian distribution in another part. I am assuming that if a cluster contains labelled samples, the class representing that cluster is the dominant label class among the samples inside it. Besides, there is the possibility of finding clusters formed exclusively by unlabelled samples; that is, not all the classes are known.

How can I tackle a problem like this? I would like to find both clusters with labelled samples and clusters of purely unlabelled samples (new patterns). I could ignore the labels and simply cluster the data, but I would rather use the information provided by the labelled samples to find the best clusters. Any help would be useful. Thanks.
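For concreteness, the setup described above might be represented as follows. This is a hypothetical sketch: the 70 dimensions, the 80/20 split, and the convention that 0 marks an unknown label come from the question; the concrete values of N and C and the random generation are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 1000, 70, 5                  # N, C chosen arbitrarily; D = 70 per the question

X = rng.normal(size=(N, D))            # samples: N x 70
y = rng.integers(1, C + 1, size=N)     # true classes 1..C
mask = rng.random(N) < 0.8             # ~80% of labels will be hidden
y = np.where(mask, 0, y)               # 0 marks an unknown label

print(X.shape, (y == 0).mean())        # shape and unlabelled fraction
```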
What problem are you trying to solve? Do you actually need the clusters, or are you trying to make multiclass predictions? By answering this question, you'll have a better sense of the loss you are trying to optimize. You can then define a loss function that mixes a supervised loss and an unsupervised loss (perhaps similar to k-means), and use EM to perform a semi-supervised variant of k-means clustering.

To choose the number of clusters, I use cluster stability as a heuristic. (Sorry, I forget the exact citation.) Basically, you take an 80% sample of the instances, run the clustering, and find what percentage of instances end up in the same cluster as they would with the full (100%) sample. Run this with multiple samples and average. You should see a sharp drop in stability at some point. You can perhaps choose the number of clusters by combining the supervised loss with the cluster stability: there might be an obvious inflection point where the supervised loss is minimized but cluster stability hasn't dropped yet. The main question is how to weight the supervised loss against the unsupervised loss. That will depend on your ability to define the problem you're trying to solve and to quantitatively measure your model's loss.

I'm trying to correctly classify as many samples of the training set as possible (transductive learning). Basically, I'm trying to enlarge my training set in order to train other classifiers.
(Oct 02 '14 at 13:52)
KoTy
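The semi-supervised k-means variant mentioned in the answer could be sketched roughly as follows. This is only one possible reading of the idea, not the answerer's exact method: the seeding of one centroid per known class, the extra unconstrained ("tag 0") centroids meant to catch unknown classes, and all names and parameters are my assumptions.

```python
import numpy as np

def seeded_kmeans(X, y, n_extra, n_iter=50, seed=0):
    """Semi-supervised k-means sketch. One centroid is seeded at the mean
    of each labelled class, plus n_extra unconstrained centroids for
    potential unknown classes. A labelled point (y > 0) may only join a
    centroid tagged with its own class or an untagged (tag 0) centroid;
    an unlabelled point (y == 0) may join any centroid."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y[y > 0])
    centroids = [X[y == c].mean(axis=0) for c in classes]   # seeded centroids
    tags = list(classes)
    # Extra centroids start at randomly chosen unlabelled points.
    extra_idx = rng.choice(np.flatnonzero(y == 0), size=n_extra, replace=False)
    centroids += [X[i] for i in extra_idx]
    tags += [0] * n_extra                                   # tag 0 = unconstrained
    centroids, tags = np.array(centroids), np.array(tags)

    for _ in range(n_iter):
        # E-step: squared distances from every point to every centroid.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        # Forbid labelled points from joining a differently-tagged centroid.
        forbid = (y[:, None] > 0) & (tags[None, :] > 0) & (tags[None, :] != y[:, None])
        d[forbid] = np.inf
        assign = d.argmin(axis=1)
        # M-step: move each non-empty centroid to the mean of its points.
        for k in range(len(centroids)):
            if np.any(assign == k):
                centroids[k] = X[assign == k].mean(axis=0)
    return assign, centroids, tags
```

After convergence, clusters with a class tag inherit that class, and tag-0 clusters that attracted mostly unlabelled points are candidates for the "new patterns" the question asks about.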
What would be a good metric for high-dimensional clustering?
(Oct 03 '14 at 11:48)
KoTy
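The cluster-stability heuristic from the answer could be sketched as below. This is an assumption-laden sketch, not a citation-backed implementation: it measures agreement via pairwise co-assignment (whether two points land in the same cluster) to avoid matching cluster labels between runs, and it uses scikit-learn's KMeans for the base clustering.

```python
import numpy as np
from sklearn.cluster import KMeans

def stability(X, k, n_runs=5, frac=0.8, seed=0):
    """Cluster-stability sketch: cluster the full data, then repeatedly
    cluster an 80% subsample and measure how often pairs of subsampled
    points agree on being co-clustered with the full clustering.
    Averaged over runs; a sharp drop as k grows suggests too many clusters."""
    rng = np.random.default_rng(seed)
    full = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    scores = []
    for r in range(n_runs):
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        sub = KMeans(n_clusters=k, n_init=10, random_state=seed + r).fit_predict(X[idx])
        same_full = full[idx][:, None] == full[idx][None, :]   # co-clustered in full run?
        same_sub = sub[:, None] == sub[None, :]                # co-clustered in subsample run?
        scores.append((same_full == same_sub).mean())
    return float(np.mean(scores))
```

One would compute stability(X, k) over a range of k and look for the point where the score drops sharply, optionally alongside the supervised loss as the answer suggests.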