Hello, I have this data set:
I need to use a clustering algorithm on this data set. How do I apply one? For example, if I go for k-means, how do I find the mean? I am just trying to understand the real-world implications of clustering algorithms and how to use them. If I do:
Standard clustering is done using a vector space model. The easiest way to do this is to create a file like a spreadsheet, where each row is a document/instance and each column is a variable/feature. With your dataset, the standard starting point would be to make each feature "Does this set of tags contain X?", with a 1 if it does and a 0 if it doesn't. You can then apply k-means, for example through Weka, to the resulting dataset. With 0/1 features, the mean of a cluster is simply the per-column average, i.e. the fraction of its members that carry each tag.

What this does, in practice for your dataset, is group together sets of tags that are very similar, such as those that share 75% of their tags (depending, of course, on the parameters). You will probably get a result similar to your example.

Another area you can look at is graph-based clustering. This builds a graph and splits it into subgraphs based on some criterion, which achieves a similar grouping, potentially with better results.

Finally, once you have your initial results, you may want to experiment with what the features are, or with the method of calculating the distance between them. This gets a bit more advanced, though, and you may need to re-implement k-means to do it (someone please comment if they know of a good k-means implementation that takes an arbitrary distance metric!). One distance metric you could try is based on the ratio of the intersection of the tags to the union of the tags. E.g.
Have an intersection size of 3 (sharing C#, datetime and J#) and a union size of 5 (there are 5 different tags). The similarity would then be 3/5 = 0.6. This can be turned into a distance metric by subtracting it from 1, which gives 1 - 0.6 = 0.4.
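To make the steps above concrete, here is a minimal Python sketch of the binary "does this set contain tag X?" encoding, the per-column mean that k-means would use as a cluster centre, and the intersection-over-union distance. The two tag sets are hypothetical: only C#, datetime and J# come from the example, while `java` and `parsing` are made up so that the union has 5 tags. The function names are illustrative too, not taken from Weka or any other library.

```python
def binary_features(tag_sets):
    """Turn tag sets into 0/1 rows, one column per distinct tag."""
    vocabulary = sorted(set().union(*tag_sets))
    rows = [[1 if tag in tags else 0 for tag in vocabulary] for tags in tag_sets]
    return vocabulary, rows


def centroid(rows):
    """Per-column mean of equal-length rows: the 'mean' in k-means."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; taken to be 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical tag sets: "java" and "parsing" are invented for illustration.
first = {"C#", "datetime", "J#", "java"}
second = {"C#", "datetime", "J#", "parsing"}

vocabulary, rows = binary_features([first, second])
center = centroid(rows)
distance = jaccard_distance(first, second)

print(vocabulary)  # column order of the feature matrix
print(rows)        # one 0/1 row per tag set
print(center)      # fraction of the sets that contain each tag
print(round(distance, 2))
```

The centroid of 0/1 rows is no longer binary: each entry is the share of cluster members carrying that tag, which is what k-means compares new points against.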
To add a bit to what Robert said, the features you select may be any number of things. When you're working with documents, you might use counts of non-trivial words or word pairs, or meta information about the document that is easy to determine in advance (e.g., if it's code and the language is known ahead of time, you might have a feature for each language). -Brian
(Dec 01 '10 at 19:42)
Brian Vandenberg
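As a rough illustration of the word and word-pair counts mentioned in the comment above, here is a small Python sketch. The stop-word list, the sample sentence and the function names are all hypothetical, and word pairs are formed after stop words are dropped, which is only one of several reasonable choices.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny list of "trivial" words.
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}


def content_words(text):
    """Lower-case the text and keep only words that are not stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


def word_count_features(text):
    """Count each non-trivial word."""
    return Counter(content_words(text))


def word_pair_features(text):
    """Count adjacent pairs of non-trivial words (after stop-word removal)."""
    words = content_words(text)
    return Counter(zip(words, words[1:]))


sample = "The mean of the cluster is the mean of its points"
counts = word_count_features(sample)
pairs = word_pair_features(sample)

print(counts)
print(pairs)
```

Each distinct word or pair then becomes one column of the feature matrix, with the count (or just 0/1) as the value.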
I forgot to add this bit as well: when you add binary features to use in a typical data mining system, some features can get washed out because they are just one voice in a sea of data. For example, suppose you wanted to do the Netflix challenge. There were over 17,000 movies to rate, and more than 100 million ratings given by over 480,000 users. If all of that information gets boiled down to a set of binary values put in a single feature vector, as a first attempt you could potentially end up with 17,000 samples with 100,000,000 * 480,000 ≈ 48 trillion features (depending on how you arrange it). In this new feature space, you decide you want to add meta information: whether the movie was action, drama, comedy, chick flick, zombie movie, porn, etc. Even if you managed to come up with a million pieces of meta information to add to the feature vector, it would only account for approximately 0.000002% of the feature space. This example isn't entirely realistic, because the sample/feature space is extremely sparse, but it helps illustrate my point. You can't just add features on a whim and expect the system to find all the correlations and zero out the weights of features that are not valuable. That can work, but you shouldn't rely on it to save you from having to make good choices for a feature vector. Additionally, it might be useful either to weight some features artificially, or to use separate systems on smaller subsets of the features and combine them at a later stage (e.g., a DBN for each set of features, then combine the deeper representations into a single feature vector to use in another system). -Brian
(Dec 03 '10 at 12:34)
Brian Vandenberg
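The arithmetic in the comment above can be checked with a few lines of Python. The inputs are the rounded Netflix figures quoted there, the million meta features are the comment's own hypothetical, and the variable names are just for illustration.

```python
ratings = 100_000_000      # "more than 100 million ratings"
users = 480_000            # "over 480,000 users"
meta_features = 1_000_000  # hypothetical million pieces of meta information

# Naive binary encoding from the comment: one feature per (rating, user) pair.
base_features = ratings * users

# Share of the combined feature space taken up by the meta features, in percent.
share_percent = 100 * meta_features / (base_features + meta_features)

print(f"{base_features:,} base features")
print(f"{share_percent:.7f}% of features are meta information")
```

The share comes out at roughly two millionths of a percent, which is why such features are easily drowned out unless they are weighted or handled separately.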
What does the dataset mean? I imagine it's a set of tags/topics.
Yes, like MetaOptimize question tags.