|
In my application I'm gathering histogram representations of bags of words. I think this is essentially an unsupervised process, since I'm not manually assigning any bag to a class during training, but I'm not quite sure. Assume that I have an existing class of bags, and my application then receives a batch of novel bags, some of which share words with the existing ones. Through some classification mechanism, I get a match or no match between each bag and the class. Now assume that one of the novel bags is found to belong to the same class as the existing bags. I want to add that bag to the existing class, but the new bag contains some words that the existing class does not (not even as a 0 entry in the histogram). In other words, new class members can have more columns of values. I dabbled with SVMs a bit, and they do not handle this at all: as far as I can tell, nothing can be done short of adding the new columns to all members of the class. Can this be considered supervised data? If I get a match from whatever classifier model I use, I would know with some certainty that the bag belongs to a certain class, and thus I could label it with that class and train on it. I guess I'm a bit confused here... Also, as a side question, what is a good classification method for my application? I couldn't find any solution that deals with both incremental training and datasets with varying numbers of features. |
|
It sounds similar to an (unsupervised) clustering algorithm, such as DBSCAN. In DBSCAN, we start with no clusters (comparable to classes, but theoretically different), then go through the data samples (rows/samples/vectors; you call them "bags") one by one. If a sample doesn't fit an existing cluster, it forms a new cluster. We then go through the remaining samples (those not yet assigned to a cluster) and look for ones that belong to this cluster. This "builds" the clusters one by one. There are incremental forms of DBSCAN and many similar algorithms (keywords: incremental clustering). I don't have any experience with them, so I'll refrain from making suggestions on the side question. |
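
To make the "join an existing cluster or start a new one" idea concrete, here is a rough sketch. It is not DBSCAN itself, just a simplified threshold-based incremental clustering, and it is only one possible way to approach this. Each bag is kept as a sparse word-count dictionary, so a word that a cluster has never seen is implicitly a zero and no columns need to be added to existing members. The names `cosine_similarity` and `assign`, the use of cosine similarity, the summed-count cluster histogram, and the `threshold=0.5` value are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count histograms."""
    # A word missing from a bag counts as zero, so only shared words
    # contribute to the dot product.
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign(words, clusters, threshold=0.5):
    """Add a bag to its most similar cluster, or start a new cluster.

    `clusters` is a list of Counters, each the summed histogram of its
    members (an assumed, hypothetical representation). Returns the index
    of the cluster the bag ended up in.
    """
    bag = Counter(words)
    best_index, best_score = None, threshold
    for index, histogram in enumerate(clusters):
        score = cosine_similarity(bag, histogram)
        if score >= best_score:
            best_index, best_score = index, score
    if best_index is None:
        # No cluster is similar enough: this bag seeds a new one.
        clusters.append(bag)
        return len(clusters) - 1
    # Merge counts; previously unseen words simply become new keys.
    clusters[best_index].update(bag)
    return best_index


clusters = []
labels = [
    assign("the cat sat on the mat".split(), clusters),
    assign("stocks fell as markets closed".split(), clusters),
    assign("the cat chased the mouse".split(), clusters),
]
print(labels)
```

The third bag lands in the first cluster even though "chased" and "mouse" were never seen there before; the cluster's histogram just gains two keys. A real setup would need a threshold (and probably a similarity measure) tuned to the data.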