
I'm interested in the relationship between unsupervised learning (clustering) and supervised learning (classifiers). If I represent a labeled dataset using some feature space, would a good classifier always find a better separation between the different classes than a good clustering algorithm, assuming they both use the same feature space?

As an example, say I have a collection of documents that have been tagged with semantic labels, so one document may be tagged as being about "automobiles" and another as being about "kittens". Say I then represent each document as a feature vector recording the number of times each word occurs, perhaps weighted by PMI, TF-IDF, or some other scheme. If I trained a Naive Bayes classifier on the dataset and also clustered the dataset with, say, K-Means, would the classifier find a better boundary between each set of documents? And if I later wanted to label new documents (for the classifier this is easy; with the clustering algorithm, I'd turn the new document into a feature vector and then label it based on the most similar cluster centroid; see the sketch below), would the classifier always generate a better labeling? Would this be true for any feature space?
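
For concreteness, here is a minimal sketch of the two pipelines I have in mind, assuming a tiny made-up corpus and the scikit-learn API (TfidfVectorizer, MultinomialNB, KMeans); the documents, labels, and the majority-vote naming of clusters are illustrative assumptions, not a claim about the best setup:

    # Hypothetical toy corpus; in practice these would be real tagged documents.
    from collections import Counter
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.cluster import KMeans

    docs = ["the car engine roared on the highway",
            "a new car model was released this year",
            "the kitten chased a ball of yarn",
            "a sleepy kitten curled up on the sofa"]
    labels = ["automobiles", "automobiles", "kittens", "kittens"]

    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)

    # Supervised route: Naive Bayes trained on the labels.
    nb = MultinomialNB().fit(X, labels)

    # Unsupervised route: K-Means on the same features; each cluster is then
    # named after the majority label of the training documents assigned to it.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    cluster_label = {c: Counter(l for l, a in zip(labels, km.labels_)
                                if a == c).most_common(1)[0][0]
                    for c in set(km.labels_)}

    # Labeling a new document with each method.
    X_new = vec.transform(["a fast new car hit the road"])
    print("classifier says:      ", nb.predict(X_new)[0])
    print("nearest centroid says:", cluster_label[km.predict(X_new)[0]])

The question is then whether the label produced by the classifier is systematically more reliable than the nearest-centroid label, for this and for other feature spaces.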

My intuition says yes, but I haven't read anything that actually proves it. Are there any papers on this relationship? And are there particular relationships between specific classifier/clustering-algorithm pairs?

asked Jan 25 '12 at 18:52


Keith Stevens


One Answer:

Slightly unrelated to the actual formulation of your question, but strongly related to the ideas that I think are behind it, are the recent papers on unsupervised supervised learning from Guy Lebanon's group: Unsupervised supervised learning I: estimating classification and regression errors without labels and Unsupervised supervised learning II: margin-based classification without labels. They probe the question of what assumptions are really necessary to distinguish supervised from unsupervised learning. The second paper (the one I read in detail) shows how even a small amount of information about the expected proportion of the labels can be enough for supervised learning.

Apart from that, and more closely related to your questions, here are my thoughts.

  1. A good classifier can often find a better separation than a good clustering algorithm, but it depends on the features. If the feature space is rich enough that the two classes are unambiguously well separated, then both methods should be roughly equivalent. In many real-world cases, however, this is not true. Sentiment analysis of product reviews is a nice example, especially if you want a classifier that generalizes across product domains: with simple word-based features, the reviews will cluster far more readily by domain than by sentiment. In such cases the best classifier may have to cut through a region of high density in feature space, and the results of clustering can easily be nonsensical (see the first sketch after this list).

  2. Naive Bayes is not an example of a good classifier: it has no notion of margin and represents each class with something like its mean. Moreover, it is known that there are datasets where Naive Bayes, when used in a semi-supervised fashion, can find arbitrarily bad solutions (google the term "concept drift").

  3. Making statements about any feature space is dangerous. However, in some settings, there are known classifiers which can achieve low regret (google "regret online learning" for more information), which means that they perform comparably to the best possible classifier a posteriori. These classifiers do not look much like clustering algorithms at all, as they all have to explicitly use the margin and control it carefully to achieve low regret. This makes me think that a regret bound for clustering-like things would be hard to get (it is even hard to define).

  4. Finally, there are generative models, such as Naive Bayes (but also hidden Markov models, probabilistic context-free grammars, and many others), which can be used either in clustering mode or in classifier mode; see the second sketch after this list. This is the closest analogue I can think of to your notion of classifier/clustering pairs. Generally, unless the data was generated by something really well represented by the generative model, there can be a large gap in performance between the same model used as a classifier and used for clustering. For an example of this, see this interesting talk by Dan Klein.
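
To make point 1 concrete, here is a small sketch on made-up synthetic data (numpy and scikit-learn; the data-generating choices are assumptions picked only for illustration). The labels depend on a weak shift along one feature, while the dominant density structure lies along the other, so K-Means splits along the wrong direction while the supervised classifier can still exploit the labels:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    rng = np.random.RandomState(0)
    n = 1000
    y = rng.randint(0, 2, n)
    # Feature 1 carries a weak class signal; feature 2 has strong bimodal
    # structure that is unrelated to the class (think "product domain").
    X = np.column_stack([y + 0.8 * rng.randn(n),
                         rng.choice([-5.0, 5.0], n) + rng.randn(n)])

    clf = LogisticRegression().fit(X, y)
    print("classifier accuracy:", accuracy_score(y, clf.predict(X)))

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    # Score the clustering against the labels under the best cluster-to-label map.
    clu = km.labels_
    print("clustering agreement:", max(accuracy_score(y, clu),
                                       accuracy_score(y, 1 - clu)))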
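
And a sketch of point 4, again on made-up data: the same diagonal-Gaussian model family fit in classifier mode (GaussianNB, using the labels) and in clustering mode (a two-component GaussianMixture fit by EM, ignoring the labels). The unsupervised fit latches onto whatever split gives the highest likelihood, which here is the high-variance but label-irrelevant direction, so the two fits can disagree badly:

    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score

    rng = np.random.RandomState(1)
    n = 400
    y = rng.randint(0, 2, n)
    # Weak class signal along x, large class-independent spread along y.
    X = np.column_stack([y + rng.randn(n), 3.0 * rng.randn(n)])

    # Classifier mode: class-conditional Gaussians estimated from the labels.
    nb = GaussianNB().fit(X, y)
    print("supervised fit accuracy:", accuracy_score(y, nb.predict(X)))

    # Clustering mode: a two-component mixture estimated by EM, ignoring labels.
    gmm = GaussianMixture(n_components=2, covariance_type='diag',
                          random_state=1).fit(X)
    z = gmm.predict(X)
    print("EM fit agreement:", max(accuracy_score(y, z), accuracy_score(y, 1 - z)))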

answered Jan 25 '12 at 19:35


Alexandre Passos ♦

edited Jan 25 '12 at 19:36

Thanks @alexandre, this gives me a lot of good readings to sift through and touches on some of my doubts about why classifiers would not always be better than clustering.

(Jan 25 '12 at 19:56) Keith Stevens