|
Hi all, I am currently working on Online News Clustering, basically just clustering news into events and evaluate different clustering method, K-Means, HAC, etc. with the use of Knowledge Base and different feature extraction variation. Since there isn't any golden dataset available on current news for Knowledge Based evaluation, can anyone recommend some appropriate evaluation measure? Thanks in advance. REgards, Andy. |
|
It depends what you want to use the clusters for. One could compare different clustering algorithms by using the clusters as input features to a supervised task. For example, Brown clusters are used as word features. See Koo et al (2008), to see how Brown clusters are used for dependency parsing. |
|
As its accuracy with clustering that you want, not something like 'dense and well separated clusters', you will want to use a training corpus with some manually labelled data. Run your algorithm and use precision and recall to determine how well your algorithm is doing. Precision answers the question "When it groups news stories together, is that grouping accurate?" and Recall answers the question "When there is a grouping in the training set, does my algorithm find it?". Depending on your application, one of these may be more important than the other. Its a pain to label a bunch of documents, so you may want an iterative approach with some buttons for your end user "This is part of a larger group" or "This isn't part of this group". Use this feedback to identify how your algorithm is doing, and to generate a larger dataset. Finally, don't forget to split your data so that you don't over train. Use something like 10 fold cross validation to get a good measure of the precision and recall. |
|
Precision and Recall are particularly ill defined when it comes to cluster evaluation. This is due to the fact that you don't necessarily have a one-to-one mapping from (gold standard) classes to (hypothesized) clusters. Check out the discussion here on evaluating spectral clustering. Beyond needing to find annotated material for evaluation, there aren't major differences between evaluation of text clustering and evaluation of spectral clustering, or anything else -- from an external evaluation perspective I have to disagree about precision and recall being ill-defined. They can be defined differently for different applications but if you set the definition wrt your application, it can be quite useful. From the quote that you linked to, its important to determine why clustering is taking place. Eg If the aim is to group all appropriate documents into a correct topic and fix errors later, than recall is what you want to focus on.
(Sep 22 '10 at 23:19)
Robert Layton
@Robert: Andrew's point is more subtle than that. Even if you have the correct labels for all elements in your data set, precision of the X class is defined as the ratio between elements in class X that were correctly classifierd and all elements classified as X by your algorithm, and recall is the ratio between elements in class X that were correctly classified and all elements in class X. This definition is always the same. so they are not ill-defined in the supervised setting. What Andrew is pointing out is that, in the unsupervised setting, you will get wildly different precision and recall values if you consider different clusters to correspond to different classes. How should you map things? One-to-one? Many-to-one? Many-to-many? Each decision of which cluster(s) correspond to class X gives you wildly different precision and recall values for class X, which is why Andrew said they were undefined.
(Sep 23 '10 at 03:04)
Alexandre Passos ♦
1
@Alexandre: Precisely. @Robert: Evaluating Precision and Recall in a clustering setting requires you to define a mapping from each cluster to a class. A common way to do this is to map the cluster to the most represented class in the data points that it contains (cf. [Fung, 2003][3]). Any cluster evaluation approach that needs to perform this mapping is flawed, and unless the mapping function is specified -- ill-defined. [Meila, 2007][2] described this issue as "the problem of matching", I repeated and slightly expanded on her discussion [here][1]. [1][http://acl.ldc.upenn.edu/D/D07/D07-1043.pdf] (Sorry, i guess comments don't process links the same way original posts do.
(Sep 23 '10 at 07:48)
Andrew Rosenberg
1
@Robert: and if you think this distinction is somewhat artificial, Haghighi and Klein's paper on prototype-driven learning for sequence models and Lee, Haghighi, and Barzilay's paper on Type-level unsupervised POS tagging show the difference in performance (measured in F1) one gets when treating unsupervised POS tagging as classification and using an unnatural mapping between classes instead of a one-to-one mapping. Sometimes it's over .15 f1, and sometimes it even inverts the order of which system is actually best in that case.
(Sep 23 '10 at 08:11)
Alexandre Passos ♦
|