Hi All!

Consider the following (funny) classification problem.

The data is as follows. Each datapoint consists of a single class label and a bag of multiple input vectors, where the number N of input vectors differs per datapoint, but P(N) is known. Each input vector is binary and sparse.

Does anyone know models (and learning methods) for this type of data, where we need a mapping from multiple input vectors to a single class label?

(Further optional details.) Each datapoint is generated as follows. Some process with class C generates a bag (X) of N text messages. The process class has known distribution P(C). Also, the distribution over N is known (N ~ Poiss(r)). Each text message is converted into a sparse, binary feature vector. Two possible methods:

  • I could sum each datapoint's bag of input vectors X into a single vector. This converts the problem into a 'classical' classification problem, but it throws away some potentially important information, such as within-message co-occurrence of features.

  • I could do it the Bayesian way: P(C|X) = P(X|C)*P(C)/P(X). Learning such a model is feasible, but requires learning a generative model P(X|C). This seems like overkill, since learning such a generative model means modelling the whole input domain, instead of just a (discriminative) mapping from input to class.
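A minimal sketch of the first (sum-pooling) option, assuming each message has already been converted to a binary feature vector; the function name and toy data below are made up for illustration:

```python
import numpy as np

def pool_bag(bag):
    """Sum a bag of binary feature vectors into one fixed-length vector.

    bag: (N, D) array, one row per message; N varies per datapoint.
    Returns a (D,) count vector usable by any standard classifier.
    """
    return np.asarray(bag).sum(axis=0)

# Toy datapoint: N=3 messages over D=5 binary features.
bag = np.array([[1, 0, 1, 0, 0],
                [0, 1, 1, 0, 0],
                [1, 0, 0, 0, 1]])
print(pool_bag(bag))  # counts per feature: [2 1 2 0 1]
```

Note that the result no longer records which features fired together in the same message, which is exactly the information loss mentioned above.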

Any advice on how to tackle this problem is highly appreciated!

asked Dec 11 '11 at 05:56


Durk Kingma


3 Answers:

Have you thought about applying multiple instance learning? Even if the multiple instance learning assumption is not met in your problem, there are methods in that literature that map bags to single feature vectors.

You could also look at set kernels. For example, the multiple instance kernel of Gaertner et al. seems pretty directly applicable. It corresponds to summing the instances in feature space, which means it still loses some information.
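A sketch of the set-kernel idea in its generic sum-over-pairs form (names are illustrative, not a particular library's API); with a linear base kernel it reduces to an inner product of the summed bags, which is why it loses the same information as sum-pooling:

```python
import numpy as np

def set_kernel(X, Y, k=np.dot):
    """Sum-style set kernel: K(X, Y) = sum over all instance
    pairs (x, y) of a base kernel k(x, y)."""
    return sum(k(x, y) for x in X for y in Y)

# Toy bags of binary instance vectors.
X = np.array([[1, 0, 1], [0, 1, 0]])
Y = np.array([[1, 1, 0]])

# With a linear base kernel this equals <sum(X), sum(Y)>:
assert set_kernel(X, Y) == np.dot(X.sum(axis=0), Y.sum(axis=0))
```

A nonlinear base kernel k (e.g. an RBF) would retain more within-message structure than the linear case.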

A completely different approach would be to use bag-of-words representations. This is very popular in computer vision, where an image is represented as a "bag" of interesting features. Each feature (= instance) is vector quantized, and then a histogram over codeword occurrences is built. If you look up "bag of features" or "bag of words" you'll find a lot about this.
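A toy sketch of the vector-quantization step, assuming a codebook has already been learned (e.g. by k-means); all names and data here are illustrative:

```python
import numpy as np

def bow_histogram(instances, codebook):
    """Vector-quantize each instance to its nearest codeword and
    return a histogram of codeword occurrences."""
    instances = np.asarray(instances, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Squared Euclidean distance from every instance to every codeword.
    d = ((instances[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    codes = d.argmin(axis=1)
    return np.bincount(codes, minlength=len(codebook))

codebook = np.array([[1, 0, 0], [0, 1, 1]])   # 2 codewords
bag = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 1]])
print(bow_histogram(bag, codebook))  # -> [1 2]
```

The resulting fixed-length histogram can then be fed to any standard classifier, regardless of the bag size N.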


answered Dec 11 '11 at 06:28


Andreas Mueller

Assuming that all N vectors share the same (sparse) feature set, and taking into account that feature co-occurrence is potentially important, I would construct a new derived feature set by "aggregating" vectors while keeping track of feature co-occurrences. For each data point, this would produce a sparse feature set consisting of the following:

  1. Counts of each feature present in the data point (i.e. the main effects)
  2. Counts of two-feature combinations present in the data point (i.e. two-way interactions)
  3. Counts of three-feature combinations present in the data point (i.e. three-way interactions)
  4. And so on: one can continue as long as computational power and the dataset's sparsity allow (I suspect the point of diminishing returns will be reached pretty quickly)

Once this data transformation is done, one can train one's favorite discriminative learner on it.
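The aggregation described above can be sketched as follows, representing each sparse message as a set of active feature indices (function and variable names are illustrative):

```python
from collections import Counter
from itertools import combinations

def interaction_features(bag, max_order=2):
    """Count single features and within-message feature combinations
    across a bag of sparse binary vectors, each given as a set of
    active feature indices. Keys are sorted index tuples."""
    counts = Counter()
    for active in bag:
        for order in range(1, max_order + 1):
            for combo in combinations(sorted(active), order):
                counts[combo] += 1
    return counts

# Toy bag of two messages, features identified by index.
bag = [{0, 2}, {0, 1}]
feats = interaction_features(bag)
print(feats[(0,)])    # 2 : feature 0 appears in both messages
print(feats[(0, 2)])  # 1 : features 0 and 2 co-occur in one message
```

Because only combinations of features that actually co-occur are stored, the derived representation stays sparse, which is what keeps higher orders tractable.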

answered Dec 28 '11 at 16:13


Yevgeny

edited Dec 28 '11 at 17:26

You might try the paper Classification of Sets using Restricted Boltzmann Machines. I recently skimmed over it and I believe they address the exact problem you are discussing using a hybrid discriminative/generative approach.

answered Dec 28 '11 at 19:39


alto



User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.