Hi, I would like to ask for advice on whether separating outputs into individual networks will help identify patterns.

This is the problem description:

I have 50,000 records of training data and 5,000 records of test data. I train a 4-layer (2 hidden layers) ANN with backprop. There are 2,000 inputs and 4 outputs. Each output corresponds to a class label, so the network is trained to identify classes 1, 2, 3, and 4. There is effectively a fifth class as well: when the input belongs to none of the four classes, all outputs are inactive. It is not an image recognition problem, but it is similar to trying to identify, say, a square, triangle, circle, and line (4 classes), while also recognizing pictures where none of those shapes is present, which is by far the most common case.

The data is unbalanced; for example, among the 50,000 training records I have:

class 1: 2,166 records

class 2: 3,241 records

class 3: 2,176 records

class 4: 3,307 records

The fifth class, or 'class 0', has the remaining 39,110 records (all outputs inactive). The test data has a similar distribution. During training only one class can be active at a time, so exactly one output has the value 1 and the others are 0.
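To make the target encoding concrete, here is a tiny sketch in numpy (the values are illustrative, not my actual data):

    import numpy as np

    # One row of targets per record: one-hot for classes 1-4,
    # and all zeros for the implicit 'class 0' (no class present).
    targets = np.array([
        [1, 0, 0, 0],   # class 1
        [0, 1, 0, 0],   # class 2
        [0, 0, 1, 0],   # class 3
        [0, 0, 0, 1],   # class 4
        [0, 0, 0, 0],   # class 0: none of the four classes
    ])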

So, here comes the question:

When I train the network, for example on class #1, the first output is active and the other 3 are inactive; for class #2 the second output is active and all others are inactive. Will this sort of training penalize identification of the less probable classes? That is, when I teach the network that a pattern is class #3, it learns 0s for all other classes, so at test time, if it thinks a pattern is class #3, it will push all other outputs to 0, and I will never get a chance to observe that there was, say, a 30% probability it was actually class #1 or class #4.

Wouldn't it be better to train 4 different networks, each with a single output identifying one particular class?

For example, the first network would be trained on class #1: it would see 50,000 records, and its only output would be active 2,166 times. The second network would be trained on class #2: it would see 50,000 records, and its only output would be active 3,241 times. And so on. This way, if a test pattern looks like both class #2 and class #4, both the second and fourth networks would have outputs close to 1, and I could select the highest activation value across the four networks to decide the class. Training would take 4 times longer, but maybe I would get better classification, because I have not yet been able to achieve good results.
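To make the idea concrete, here is a rough sketch of what I mean, using scikit-learn's MLPClassifier as a stand-in for my backprop network (the layer sizes, the 0.5 threshold, and the random demo data are placeholders, not my real setup):

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def train_one_vs_rest(X, y):
        # One binary network per class; y holds labels 0..4 (0 = no class).
        models = []
        for c in (1, 2, 3, 4):
            net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
            net.fit(X, (y == c).astype(int))   # single output: class c vs rest
            models.append(net)
        return models

    def predict(models, X):
        # Positive-class probability from each of the 4 networks.
        scores = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
        best = scores.argmax(axis=1) + 1       # classes 1..4
        # If no network is confident, fall back to 'class 0'.
        return np.where(scores.max(axis=1) < 0.5, 0, best)

    # Smoke test with random stand-in data.
    rng = np.random.default_rng(0)
    X_demo, y_demo = rng.normal(size=(200, 10)), rng.integers(0, 5, size=200)
    print(predict(train_one_vs_rest(X_demo, y_demo), X_demo)[:10])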

What do you think?

asked Feb 04 '11 at 07:08

Nulik


One Answer:

There has long been the notion that the various outputs in a multiple-output neural network would somehow "share" the information gleaned in the hidden nodes. How well that works out in practice depends on several things, such as how similar the decision boundaries are across classes and how well the neural network's architecture and training algorithm can exploit that similarity. I am not aware of empirical evidence one way or the other, but individual cases will certainly vary.

Consider, too, that since the neural network will be trained iteratively, some output nodes will reach their optimal performance before others. Ultimately, this means that you will see some classes being overfit, while others are still underfit.

For these reasons, on multi-class problems I favor either separate neural networks or models which fit all classes in one shot (like linear or quadratic discriminant analysis).
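For illustration, here is a minimal sketch of the one-shot approach with linear discriminant analysis, which yields a posterior probability for every class on every case, so no output is forced to zero (scikit-learn names and random stand-in data assumed for convenience):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))        # stand-in for the 2,000-input records
    y = rng.integers(0, 5, size=500)      # labels 0..4, with 0 = 'no class'

    lda = LinearDiscriminantAnalysis().fit(X, y)   # fits all classes jointly
    posteriors = lda.predict_proba(X)     # one column per class; rows sum to 1
    print(posteriors[0])                  # every class keeps a nonzero posterior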

answered Feb 04 '11 at 07:28

Will Dwinnell

Thanks! My 4 classes could be split into 2 pairs, with the 2 classes inside each pair sharing similarities; I should probably train them in 2 networks.

Actually, in the training statistics I see classes #1 and #3 converge first, in about 500 epochs; after that, class #4 converges in about 2,000 epochs (about 80% of samples fitted); then class #2 starts to converge, but very slowly, fitting roughly 3 samples per 50 epochs. I have not yet had the patience to wait for class #2 to converge fully.

Maybe to avoid overfitting in a "fit all classes in one shot" network I could play with the architecture. For example, start with very few hidden nodes and see how far it converges; if it is underfit, add nodes, and repeat until I find the best architecture. That is the only option I see right now. The other is, of course, to change the algorithm, but I still want to exhaust the possibilities of backprop.
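Something like this loop is what I have in mind, again with scikit-learn's MLPClassifier standing in for my backprop network (sizes and data are placeholders; I score on a held-out split rather than on training fit, to catch overfitting):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(400, 10)), rng.integers(0, 5, size=400)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    best_size, best_score = None, -1.0
    for hidden in (4, 8, 16, 32, 64):     # start small, add nodes step by step
        net = MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=500,
                            random_state=0).fit(X_tr, y_tr)
        score = net.score(X_val, y_val)   # held-out accuracy
        if score > best_score:
            best_size, best_score = hidden, score
    print(best_size, best_score)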

(Feb 04 '11 at 10:01) Nulik