|
Has anyone used the MDC framework listed at http://www.cs.umass.edu/~ronb/mdc.html? I have been using it to cluster some gene expression data.

FYI, this method uses a sequential information bottleneck but is designed for more than two variables. Earlier algorithms based on the information bottleneck were at best bi-modal (two variables): they maximize a mutual information measure by alternately clustering one variable against the other, for example words and docs. This one would do it for, say, words - docs - email subject, or genes - samples - Gene Ontology data. Anyway, I am still using it for bi-modal clustering of genes and samples, just to see how it does for starters.

I tried to compare its performance with the methods on the data sets listed here: http://algorithmics.molgen.mpg.de/Static/Supplements/CompCancer/datasets.htm . I analyzed many of these data sets; most have ~1500 genes and 50 to 150 samples. While running the framework, I keep the number of clusters the same as the number of classes in the true clustering. Performance was measured with the corrected Rand index (because I know the original clustering). I also measured accuracy using a max-weight, max-cardinality bipartite matching (accuracy for the original data sets was measured with the corrected Rand index). The best performing method was k-means with a mixture of Gaussians; it easily beat this new method.

Based on the experiments I ran, here are some questions I want to ask:

1) With MDC I have to configure some hyperparameters (apart from the number of clusters) that I have no clue about, namely top-down or bottom-up clustering of the variables. The two give different results, and the difference is large enough to be statistically significant. The paper mentions a schedule in which the target variable gets twice as many iterations as the variable it is clustered against. This works well.

2) Effect of noise: can this somehow be dealt with before we begin the clustering? Are mutual information methods robust to noise?

3) In some cases the accuracies vary by more than +/- 3%. What is the reason for this, and how can it be remedied?

4) Has anyone here run experiments with co-clustering information bottleneck frameworks on other types of data sets, specifically to compare them with other methods such as spectral clustering or k-means? |
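
For anyone who wants to score their own runs the same way, here is a minimal sketch of the two measures mentioned above: the corrected (Hubert-Arabie) Rand index and a matching-based accuracy. It is an illustration, not the exact code used for the experiments. The function names `corrected_rand_index` and `matching_accuracy` are made up for this post, at least two samples are assumed, and the matching is found by brute force over permutations, which only makes sense for a handful of clusters. For larger problems a Hungarian-style solver (for example `scipy.optimize.linear_sum_assignment`) would be the usual replacement.

```python
from collections import Counter
from itertools import permutations
from math import comb


def corrected_rand_index(true_labels, pred_labels):
    """Hubert-Arabie corrected Rand index between two labelings (needs >= 2 samples)."""
    n = len(true_labels)
    # Contingency counts: how many samples fall in each (class, cluster) pair.
    cell_counts = Counter(zip(true_labels, pred_labels))
    row_counts = Counter(true_labels)
    col_counts = Counter(pred_labels)

    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_rows = sum(comb(c, 2) for c in row_counts.values())
    sum_cols = sum(comb(c, 2) for c in col_counts.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def matching_accuracy(true_labels, pred_labels):
    """Fraction of samples matched under the best one-to-one cluster-to-class map.

    Brute force over permutations (hypothetical helper, small cluster counts only).
    """
    classes = list(set(true_labels))
    clusters = list(set(pred_labels))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")

    overlap = Counter(zip(pred_labels, true_labels))
    # Try every assignment of distinct classes to clusters, keep the best total overlap.
    best = max(
        sum(overlap[(cluster, cls)] for cluster, cls in zip(clusters, assignment))
        for assignment in permutations(classes, len(clusters))
    )
    return best / len(true_labels)


if __name__ == "__main__":
    truth = [0, 0, 0, 1, 1, 1]
    found = [1, 1, 0, 0, 0, 0]  # cluster ids are arbitrary, one sample misplaced
    print("corrected Rand index:", corrected_rand_index(truth, found))
    print("matching accuracy:  ", matching_accuracy(truth, found))
```

Both scores ignore the actual cluster ids, so a clustering that is perfect up to relabeling scores 1.0 on each. The corrected Rand index is additionally adjusted for chance agreement, which the matching accuracy is not.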