I know that the title is not that clear, so I will explain the situation briefly.

My graduation project is about learning from distributed data while preserving privacy, so data cannot be copied off the machine that holds it. Suppose we have m machines holding m datasets with different numbers of observations but the same number of features (dimensions). I simply want to train the same type of classifier on each dataset and then combine these classifiers into a single one.

Is there a general rule for doing that, or does every classifier have its own rule, or do I combine them using their statistical properties?

thank you all

asked May 03 '11 at 13:46 by Omar Osama

retagged May 06 '11 at 15:33 by David Warde Farley ♦

What sort of classifier are you using? Could the models for each dataset be put into an ensemble with a voting scheme, perhaps weighted by size or accuracy, to create the final model? Or is there something about what you're doing that doesn't lend itself to that sort of approach?

(May 03 '11 at 16:11) Chris Simokat

I didn't completely follow you.

I will use classifiers such as linear SVMs (support vector machines), decision trees, etc.

Am I clear enough, or do you need more explanation?

(May 03 '11 at 17:40) Omar Osama

2 Answers:

It actually depends on the data you are using and on the classifier.

A quick example: suppose you run your SVM on the data of Machine 1, and then you run another SVM on Machine 2.

If Machine 1 has different classes than Machine 2, then you will have one classifier on Machine 1 and another on Machine 2 that are not interchangeable, so you cannot mix them.

If, on the other hand, Machine 1 and Machine 2 have the same classes, then you can use them interchangeably.

Parallel classification has the clear problem that, if you want to use parallel classifiers, the data need to lie on the same manifold and their topologies must be comparable.

Alternatively, you can do a serial classification:

Take Machine 1 and train your SVM, then go to Machine 2, update the SAME SVM (in case you have a new class), then go to Machine 3 and do the same; a sketch of this is given below.

Check for things like online classification, online SVMs, etc.
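A minimal sketch of this serial/online approach, assuming scikit-learn is available; SGDClassifier with hinge loss stands in for a linear SVM, and the per-machine datasets below are synthetic placeholders for your real data:

    # Serial/online training: one linear SVM (hinge-loss SGD) visits each
    # machine in turn and is updated with partial_fit. Only the model moves
    # between machines, never the raw data.
    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.RandomState(0)
    # Hypothetical stand-in for the per-machine datasets: same features, same classes.
    machine_datasets = [(rng.randn(200, 10), rng.randint(0, 2, 200)) for _ in range(3)]
    all_classes = np.array([0, 1])

    clf = SGDClassifier(loss="hinge")  # hinge loss makes this a linear SVM
    for X, y in machine_datasets:
        # Update the SAME model with each machine's local data.
        clf.partial_fit(X, y, classes=all_classes)

    print(clf.predict(rng.randn(5, 10)))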

answered May 03 '11 at 22:30 by Leon Palafox ♦

I think you got me totally wrong. The data on Machine 1 come from the same distribution as the data on Machine 2; both have the same features and the same classes. I just want to combine these two SVMs to generate a single, more powerful SVM.

I hope I am clear enough. :)

(May 03 '11 at 22:56) Omar Osama

Still, wouldn't it be better to train your first SVM on the first machine and then update it with the information from Machine 2, and so on? If you run n SVMs for n machines and then mix them, I'm not sure, but your complexity may be higher than if you update as you go: if the data lie on the same manifold, after the first 5 machines or so you would already have a really good SVM with little need for further optimization.

(May 03 '11 at 23:11) Leon Palafox ♦

But if I did it in parallel, it would be better.

Do you know of any approaches that combine several classifiers of the same type?

(May 03 '11 at 23:16) Omar Osama

Since the main issue here is privacy, wouldn't it be a bad idea to communicate support vectors from one machine to another? In a way, the support vectors are the most relevant examples.

Otherwise I agree with Chris: just bag the SVMs and do a majority vote. So when you get a new data point, run all of the SVMs and take the class that was predicted most often.

(May 04 '11 at 07:40) Andreas Mueller
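A minimal sketch of the bag-and-vote idea from the comments above, assuming scikit-learn; the per-machine datasets are synthetic stand-ins, and only the locally fitted models (not the data) leave each machine:

    # Train one SVM per machine on its local data, then take a majority vote
    # over the per-machine predictions at prediction time.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    machine_datasets = [(rng.randn(200, 10), rng.randint(0, 2, 200)) for _ in range(5)]

    local_models = [SVC(kernel="linear").fit(X, y) for X, y in machine_datasets]

    def majority_vote(models, X_new):
        votes = np.array([m.predict(X_new) for m in models])  # shape (n_models, n_samples)
        # For each sample, return the class label predicted by the most models.
        return np.array([np.bincount(col).argmax() for col in votes.T])

    print(majority_vote(local_models, rng.randn(4, 10)))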

I think the majority-vote solution is not that accurate. If I have 7 bad SVMs out of 10, they will not classify new data points correctly. I did think about that solution, but it is not accurate enough and it is very time consuming.

(May 04 '11 at 11:18) Omar Osama

Unless you do a lousy implementation, it is really hard to get a "bad" SVM, since training it is a convex optimization problem.

(May 06 '11 at 20:21) Leon Palafox ♦

You can also use a distributed SGD-SVM technique based on mini-batches, where each SVM is held on an independent computer. Each SVM draws and learns from X samples, and after those X steps the machines synchronize (e.g., by sending their support vectors and weights to everyone else), then draw and learn from X more samples, and so on until convergence. Each machine keeps its own data, but during the synchronization step it would need to send the selected support vectors to the other machines.

This would probably give you better performance while keeping the speed of a distributed algorithm. However, the method is better suited to large datasets than to small ones.
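A rough sketch of the periodic-synchronization idea, written in plain NumPy so the exchange step is explicit. The synthetic data, the hinge-loss SGD update, and the simple weight-averaging sync rule are illustrative assumptions, not the exact mini-batch scheme referenced above (which exchanges support vectors):

    # Each "machine" runs hinge-loss SGD on its own linear SVM for one local
    # mini-batch, then all machines synchronize by averaging their weight
    # vectors and continue from the shared parameters.
    import numpy as np

    rng = np.random.RandomState(0)
    n_machines, n_features = 4, 10
    true_w = rng.randn(n_features)
    datasets = []
    for _ in range(n_machines):                          # synthetic per-machine data
        X = rng.randn(500, n_features)
        y = np.sign(X @ true_w + 0.1 * rng.randn(500))   # labels in {-1, +1}
        datasets.append((X, y))

    weights = [np.zeros(n_features) for _ in range(n_machines)]
    lr, lam, batch, rounds = 0.01, 0.01, 50, 10

    for r in range(rounds):
        for m, (X, y) in enumerate(datasets):
            w = weights[m]
            start = (r * batch) % len(y)
            for xi, yi in zip(X[start:start + batch], y[start:start + batch]):
                # Sub-gradient step on the L2-regularized hinge loss.
                if yi * (xi @ w) < 1:
                    w -= lr * (lam * w - yi * xi)
                else:
                    w -= lr * lam * w
        # Synchronization step: average the weights and broadcast them back.
        avg = np.mean(weights, axis=0)
        weights = [avg.copy() for _ in range(n_machines)]

    print(np.sign(rng.randn(5, n_features) @ weights[0]))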

answered May 04 '11 at 11:02 by levesque (edited May 04 '11 at 11:03)
