What are your best results for clustering on the 20-newsgroups dataset? I am looking for completely unsupervised learning methods, i.e. the class labels are used for testing only, not for training. By googling I found these results from Guillaume Pitel. He told me that the best precision-recall curve is semi-supervised, the others are unsupervised. I'll share my results soon. Are there more results?
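To be concrete about the protocol I mean, here is a minimal sketch: fit the clustering without labels, and use the newsgroup labels only for scoring. The TF-IDF + k-means baseline and the NMI/ARI scores are just illustrative placeholders, not one of the methods I'm asking about.

```python
# Rough sketch of the "labels for testing only" protocol on 20 newsgroups.
# The TF-IDF + k-means pipeline is only an illustrative placeholder method.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
X = TfidfVectorizer(max_features=20000, stop_words='english').fit_transform(data.data)

# Fit the clustering without ever looking at data.target ...
clusters = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(X)

# ... and use the class labels only afterwards, to score the result.
print('NMI:', normalized_mutual_info_score(data.target, clusters))
print('ARI:', adjusted_rand_score(data.target, clusters))
```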
Since my last experiments on this dataset, I've made several improvements to my algorithm, and since I believe my (unpublished, sorry) method has the best results, I will share it here. First, have a look at my latest results (don't be surprised if the results are worse than in my previous graph: this time the train/test split is the "official" split for 20 newsgroups, i.e. the temporally based one). I did this to compare my results to the Ranzato and Szummer ICML 2008 paper, which has, to the best of my knowledge, the best results on the 20 newsgroups dataset in the published state of the art. My previous experiments used random shuffling because I was comparing against Salakhutdinov's Semantic Hashing and Gehler's Rate Adapting Poisson.

There are two main improvements in my method. First, I have improved the last step of the algorithm, which is responsible for making one embedding compatible with the other (the embedding of words and the embedding of documents, for instance). Second, I have improved how injected knowledge is taken into account, which leads to an almost horizontal precision/recall curve when using document class knowledge.

More interesting is how generic word knowledge can improve the results. We used embeddings for an extended vocabulary, derived from the articles in 20 newsgroups as well as documents from the small Reuters dataset (I haven't checked whether the Reuters data actually helps). In other experiments I have shown that, unsurprisingly, the improvement brought by adding word knowledge grows as the training/testing ratio decreases. When only 40% of the corpus is used for training, the PR curve drops by 7 or 8% on average compared to the 80/20% split, unless you use injected word knowledge, in which case it only drops by 1-2%.
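For reference, the "official" temporally based split I refer to is the standard by-date split. Here is a minimal sketch of loading it with scikit-learn; this only sets up the data, it is not my method, which is unpublished.

```python
# The standard 20 newsgroups by-date split: the training posts come before
# the test posts in time. This is only the data setup, not the embedding method.
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset='train')   # earlier posts (~11k documents)
test = fetch_20newsgroups(subset='test')     # later posts (~7.5k documents)

print(len(train.data), 'training documents')
print(len(test.data), 'test documents')
print(train.target_names)                    # the 20 newsgroup labels
```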
Here are my results with the method I described here. I look into the proximity matrices to get some idea of what the clustering looks like. For example, classes 2-5 and classes 18-20 are not identified. That is not a flaw of the algorithm, though; those classes are just not that compact. The goal of unsupervised learning is to find good intrinsic features with respect to the data, not to recover some artificial labels, and this is hard to measure exactly. All in all, in the absence of better labels, these curves can only give a rough measure of quality. For example, the red curve is clearly better than the blue one.

1 - The fact that some classes of 20NG are ambiguous is well known, so it is expected that clustering will not reflect the newsgroup classes. I remember a few experiments on 20NG that use classes other than the newsgroups (they group thematically related newsgroups together); the problem with those versions is that they cannot be compared easily with other methods.
2 - Semi-supervised methods are not limited to using artificial labels. For instance, you can introduce general knowledge about word meanings (from a bigger corpus) so that you suffer less from scarce data on rarely used words. In the experiments you've pointed to, I used class labels as the semi-supervised information because it was easier :).
3 - Your method really shows impressive results; I really think they are the best to date.
(Dec 24 '10 at 03:55)
Guillaume Pitel
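A rough way to do the "look into the proximity matrices" inspection mentioned in the answer above is to build a cluster-vs-class contingency matrix. The sketch below assumes you already have cluster assignments and the true newsgroup labels (e.g. from the earlier sketch); it is not the specific proximity matrix used by either method in this thread.

```python
# Sketch: see which newsgroup classes a clustering fails to separate by
# looking at the cluster-vs-class contingency (co-occurrence) matrix.
# `labels` and `clusters` are assumed to be integer arrays of equal length.
from sklearn.metrics.cluster import contingency_matrix

def report_confused_classes(labels, clusters, class_names):
    C = contingency_matrix(labels, clusters)   # rows = classes, cols = clusters
    dominant = C.argmax(axis=1)                # cluster absorbing most of each class
    purity = C.max(axis=1) / C.sum(axis=1)     # fraction of the class in that cluster
    for name, cl, p in zip(class_names, dominant, purity):
        print(f'{name:30s} -> cluster {cl:2d}  ({p:.0%} of its documents)')
    # Classes that share the same dominant cluster are the ones the clustering
    # does "not identify" as separate groups.

# report_confused_classes(data.target, clusters, data.target_names)
```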
Semi-supervised methods rarely have lower results than unsupervised ones, so you should not compare an unsupervised method with a semi-supervised one. If you're looking for the best method for a given problem, then you should know whether you have access to extra information that can be used by semi-supervised methods. About precision and recall: if you want to assess the quality of an embedding and you have a dataset with classes, then you can use the area under the curve (AUC) of the precision/recall curve. I agree with Alexandre Passos when he says that cluster quality is, in general, not a good indicator. For other results and other datasets to test your method on, I've listed a few here: http://blog.guillaume-pitel.fr/index.php?post/2010/09/Performance-of-NC-ISC-on-Ohsumed
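To make the precision/recall suggestion concrete: one common protocol is to rank document pairs by embedding similarity and treat same-class pairs as relevant. The pairwise setup below is an assumption for illustration, not necessarily the exact protocol behind the curves linked in this thread.

```python
# Sketch: area under the precision/recall curve for an embedding, with
# "same class" as the relevance criterion for ranked document pairs.
# `emb` is an (n_docs, dim) array, `labels` the class labels; for the full
# 20 newsgroups this is ~1.8e8 pairs, so subsample documents in practice.
import numpy as np
from sklearn.metrics import precision_recall_curve, auc
from sklearn.preprocessing import normalize

def embedding_pr_auc(emb, labels):
    labels = np.asarray(labels)
    emb = normalize(emb)                        # cosine similarity via dot products
    sim = emb @ emb.T
    same = (labels[:, None] == labels[None, :]).astype(int)
    iu = np.triu_indices(len(labels), k=1)      # each unordered pair once
    precision, recall, _ = precision_recall_curve(same[iu], sim[iu])
    return auc(recall, precision)
```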
Precision and recall are very poor measures of cluster quality. They are brittle when you change the number of clusters, and they artificially depend on a mapping between clusters and classes in order to be evaluated. Are you really sure you want this? On the other hand, given the quality of the t-SNE embeddings of this dataset, you should get really good performance (almost as good as possible with word features) from t-SNE followed by a very simple single-linkage or spectral clustering algorithm.

Yes, you are right, precision and recall are only a rough measure. What would you use for this purpose?
(Dec 22 '10 at 06:28)
Oliver Mitevski
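A minimal sketch of the t-SNE-then-cluster idea suggested above. The TF-IDF/SVD preprocessing and all parameter choices are assumptions, and note the caveat further down in this thread that the published t-SNE maps of this dataset were built on features from a discriminative model, so raw word features will look less clean.

```python
# Sketch: t-SNE on document features, then a simple clustering of the 2-D map.
# Preprocessing and all parameter choices here are illustrative assumptions.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.cluster import SpectralClustering

data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
X = TfidfVectorizer(max_features=20000, stop_words='english').fit_transform(data.data)
X50 = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)  # densify first

emb = TSNE(n_components=2, random_state=0).fit_transform(X50)
clusters = SpectralClustering(n_clusters=20, affinity='nearest_neighbors',
                              random_state=0).fit_predict(emb)
```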
Andrew Rosenberg's V-measure makes a lot more sense for clustering and has been used to benchmark unsupervised learning for POS tags (for example), so I'd recommend that. Apart from that, keep in mind that clustering is ill-defined and what you really want is something that will improve the performance of an end-to-end system.
(Dec 22 '10 at 06:44)
Alexandre Passos ♦
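V-measure and its homogeneity/completeness components are available in scikit-learn, so scoring a clustering against the held-out newsgroup labels is a one-liner; the sketch below assumes `clusters` and `data.target` from the earlier sketches.

```python
# V-measure (harmonic mean of homogeneity and completeness), scored against
# the newsgroup labels that were held out from training.
from sklearn.metrics import homogeneity_completeness_v_measure

h, c, v = homogeneity_completeness_v_measure(data.target, clusters)
print(f'homogeneity={h:.3f}  completeness={c:.3f}  v-measure={v:.3f}')
```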
Those t-SNE embeddings were constructed using features obtained from a discriminative LDA model. Running it on the raw data wouldn't give nearly as nice results.
(Dec 23 '10 at 16:27)
Laurens van der Maaten
See also: "Are there any well-known databases to test clustering on?"