I'm interested in the relationship between unsupervised learning (clustering) and supervised learning (classifiers). If I represent a labeled dataset in some feature space, would a good classifier always find a better separation between the different classes than a good clustering algorithm, assuming both use the same feature space?

As an example, say I have a collection of documents that have been tagged with semantic labels, so one document may be tagged as being about "automobiles" and another as about "kittens". Say I represent each document as a feature vector recording the number of times each word occurs, perhaps weighted by PMI, TF-IDF, or some other scheme. If I then trained a Naive Bayes classifier on the dataset and also clustered the dataset using, say, K-Means, would the classifier have a better boundary between each set of documents? And if I later wanted to label new documents (for the classifier this is easy; with the clustering algorithm, I'd turn the new document into a feature vector and then label it based on the most similar cluster centroid), would the classifier always generate a better labeling? Would this hold for any feature space?

My intuition says yes, but I haven't read anything that actually proves this. Are there any papers on this relationship? And are there particular relationships between specific classifier and clustering pairs?
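To make the comparison concrete, here is a minimal sketch of the setup I have in mind, assuming scikit-learn's TfidfVectorizer, MultinomialNB, and KMeans; the tiny corpus and labels are placeholders, not real data:

```python
# Sketch: compare a Naive Bayes classifier against K-Means with
# nearest-centroid labeling, both using the same TF-IDF feature space.
# Assumes scikit-learn; the corpus and labels below are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cluster import KMeans

train_docs = ["the car engine and wheels", "a kitten purrs softly",
              "automobile tires and brakes", "cats and kittens playing"]
train_labels = ["automobiles", "kittens", "automobiles", "kittens"]
new_docs = ["brakes squeal on the old car", "a soft purring kitten"]

# Shared feature space: TF-IDF-weighted word counts.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_new = vectorizer.transform(new_docs)

# Supervised route: Naive Bayes trained on the labeled vectors.
nb = MultinomialNB().fit(X_train, train_labels)
print("Naive Bayes:", list(nb.predict(X_new)))

# Unsupervised route: cluster with K-Means, name each cluster by the
# majority label of its training members, then label new documents by
# the cluster whose centroid is most similar.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)
cluster_to_label = {}
for c in range(km.n_clusters):
    members = [train_labels[i] for i, lab in enumerate(km.labels_) if lab == c]
    cluster_to_label[c] = max(set(members), key=members.count)
print("K-Means:", [cluster_to_label[c] for c in km.predict(X_new)])
```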
Slightly unrelated to the actual formulation of your question, but strongly related to the ideas I think are behind it, are the recent papers on unsupervised supervised learning from Guy Lebanon's group: Unsupervised supervised learning I: estimating classification and regression errors without labels and Unsupervised supervised learning II: margin-based classification without labels. They probe the question of what assumptions are really necessary to distinguish supervised from unsupervised learning. The second paper (the one I read in detail) shows how even a small amount of information about the expected proportion of the labels can be enough for supervised learning. Apart from that, and more closely related to your questions, here I go.
Thanks @alexandre, this gives me a lot of good readings to sift through and touches on some of my doubts about why classifiers would not always be better than clustering.
(Jan 25 '12 at 19:56)
Keith Stevens