If I want to use a self-organizing map (SOM) for classification, do I only need two cells when setting up the grid, with each cell corresponding to one class? Is my understanding correct?
Why do you want to use SOMs for classification? Afaik they are usually used as unsupervised algorithms.
Andreas is right: SOM is not used for classification, since you are betting that clusters will form based on some metric and a dissipation function. Using 2 cells in a SOM would not make much sense.
Leon, can you elaborate on why SOM is not a good fit for supervised classification? I know SOM is designed for unsupervised clustering. Thanks a lot.
I think you just answered yourself there. Is there any particular reason you want to use SOM instead of, say, a linear model, an SVM, or KNN?
Andreas, I think there should always be a reason why one method works better for one problem and worse for another. I am curious why SOM is not a good fit for classification, either statistically or mathematically. Thanks.
OK then, how do you plan to train a SOM in a supervised fashion? The usual algorithm doesn't do anything with the labels. If you don't use the labels, there is really no reason why you should be able to learn anything, unless you make strong assumptions about the distribution of labels on your data.
If you plan to use the labels: how does the algorithm look?
For instance, I have 100 data points, each annotated with either +1 or -1. I set up just two cells when initializing the map. After running SOM, if most of the data points assigned to cell 1 are annotated with class -1, then I mark this cell as -1, and likewise for cell 2. Given a new data point, if it is nearer to cell 2 and cell 2 is marked +1, then I mark this new data point as +1.
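The procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a reference SOM implementation: the function names are made up for this example, the learning-rate schedule is an arbitrary choice, and the neighborhood update is left out, so with two cells it behaves like online k-means. The labels are used only after training, to name the cells.

```python
import random

def nearest(cells, p):
    """Index of the cell whose weight vector is closest to point p."""
    return min(range(len(cells)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(cells[i], p)))

def train_two_cell_som(points, epochs=20, lr=0.5, seed=0):
    """Online winner-take-all training of two cells (no neighborhood update)."""
    rng = random.Random(seed)
    cells = [list(p) for p in rng.sample(points, 2)]  # start from two data points
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)  # linearly decaying learning rate
        for p in rng.sample(points, len(points)):  # visit points in random order
            w = cells[nearest(cells, p)]
            for d in range(len(p)):
                w[d] += rate * (p[d] - w[d])  # pull the winning cell toward p
    return cells

def label_cells(cells, points, labels):
    """Mark each cell with the majority label (+1 / -1) of the points it wins."""
    votes = [0] * len(cells)
    for p, y in zip(points, labels):
        votes[nearest(cells, p)] += y
    return [1 if v >= 0 else -1 for v in votes]

def predict(cells, cell_labels, p):
    """Classify a new point by the label of its nearest cell."""
    return cell_labels[nearest(cells, p)]

# Two well-separated blobs, one per class (illustrative data only).
points = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
points += [(5 + 0.1 * i, 5 + 0.1 * j) for i in range(5) for j in range(5)]
labels = [-1] * 25 + [1] * 25

cells = train_two_cell_som(points)
cell_labels = label_cells(cells, points, labels)
print(predict(cells, cell_labels, (0.2, 0.2)), predict(cells, cell_labels, (5.2, 5.2)))
```

This works here only because the classes happen to coincide with the clusters; nothing in the training step looks at the labels.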
Well, that's what I meant by not using the labels during learning. This process is not a consistent classifier. Imagine this dataset in 2D: you have two circles of radius 1, one centered at (0,0) and one at (3,0). The upper half of each circle is +1, the lower half is -1. Your data is generated by sampling uniformly from the circles. Or using a Gaussian. Doesn't matter.
Your SOM will assign a node to each circle. Each circle has approximately as many +1 as -1 samples, so you assign the labels to the circles more or less randomly. If you classify new test data, you will get chance performance. This is true even if you had infinitely many training data.
Every sensible classifier will have zero error on this dataset with just a few examples, as the decision surface is linear without noise.
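To make the two-circle example concrete, here is a rough simulation. It is a sketch under assumptions: batch k-means with two prototypes stands in for the two-cell SOM, the prototypes are deliberately started at the leftmost and rightmost points so that each one settles on a circle (the case described above), and the "linear" baseline is simply the true rule sign(y) rather than a fitted model. All names are invented for this example.

```python
import math
import random

def sample_circles(n, rng):
    """Sample n points uniformly from two unit discs centred at (0,0) and (3,0)."""
    points, labels = [], []
    for _ in range(n):
        cx = rng.choice([0.0, 3.0])
        r, theta = math.sqrt(rng.random()), 2 * math.pi * rng.random()
        x, y = cx + r * math.cos(theta), r * math.sin(theta)
        points.append((x, y))
        labels.append(1 if y >= 0 else -1)  # upper half +1, lower half -1
    return points, labels

def nearest(cells, p):
    return min(range(len(cells)),
               key=lambda i: (cells[i][0] - p[0]) ** 2 + (cells[i][1] - p[1]) ** 2)

def fit_two_cells(points, iterations=10):
    """Batch k-means with two prototypes, standing in for a two-cell SOM."""
    cells = [min(points), max(points)]  # leftmost and rightmost points (assumption)
    for _ in range(iterations):
        groups = [[], []]
        for p in points:
            groups[nearest(cells, p)].append(p)
        cells = [(sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
                 for g in groups]
    return cells

def majority_labels(cells, points, labels):
    votes = [0, 0]
    for p, y in zip(points, labels):
        votes[nearest(cells, p)] += y
    return [1 if v >= 0 else -1 for v in votes]

def accuracy(predict, points, labels):
    return sum(predict(p) == y for p, y in zip(points, labels)) / len(points)

rng = random.Random(0)
train_x, train_y = sample_circles(2000, rng)
test_x, test_y = sample_circles(2000, rng)

cells = fit_two_cells(train_x)
cell_labels = majority_labels(cells, train_x, train_y)
cell_acc = accuracy(lambda p: cell_labels[nearest(cells, p)], test_x, test_y)
linear_acc = accuracy(lambda p: 1 if p[1] >= 0 else -1, test_x, test_y)
print("two cells:", cell_acc, "linear rule:", linear_acc)
```

The two-cell accuracy should hover around 0.5 no matter how many training points are drawn, since each cell covers a whole circle and each circle is half +1 and half -1.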
Not to mention that the basic function of SOM modifies the weights of the winning neuron as well as the weights of its neighbors (hence creating a map). SOM was first created to map an N-dimensional topology to a 2D map, and is more useful as a feature reduction technique than as a classification technique. What you are suggesting, associating each cell with one label and then modifying the weights, sounds a lot like an incremental k-means, where each centroid is defined by the weights of a neuron. So your approach is an unsupervised method (also called clustering).
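For reference, a single SOM update step of the kind described here might look like the sketch below. The Gaussian neighborhood and the parameter names `lr` and `sigma` are assumptions chosen for illustration; other neighborhood functions and schedules are common.

```python
import math

def som_step(weights, grid, x, lr=0.5, sigma=1.0):
    """One SOM update: move the best-matching unit and its grid neighbours toward x.

    weights[i] is the weight vector of neuron i, grid[i] its position on the map.
    """
    bmu = min(range(len(weights)),
              key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))
    for i, w in enumerate(weights):
        grid_dist2 = sum((a - b) ** 2 for a, b in zip(grid[i], grid[bmu]))
        h = math.exp(-grid_dist2 / (2 * sigma ** 2))  # equals 1.0 at the BMU itself
        for d in range(len(w)):
            w[d] += lr * h * (x[d] - w[d])
    return bmu

# A 2x2 map whose weights start at the grid positions (illustrative values).
grid = [(0, 0), (1, 0), (0, 1), (1, 1)]
weights = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
winner = som_step(weights, grid, [2.0, 2.0])
print(winner, weights)
```

As `sigma` shrinks toward zero, only the winning neuron moves, which is the incremental k-means behavior mentioned above.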
Leon, other feature reduction methods, like PCA, generate a reduced feature space by choosing the most important features according to some pre-existing rules. For instance, say we have 100 points, each with 10 features. PCA lets us choose the first 6 features, so we now have a 100×6 matrix. How should I understand the feature reduction utility of SOM in terms of this example? We originally have 100 data points, and the SOM has a 2×2 grid. So we transform a 100×10 matrix into a 4×100 matrix after SOM training. Then what are the new features in the newly built SOM space?
I think that your method of assigning each cell its most frequent label is the correct way of using unsupervised algorithms for supervised tasks. However, the problem, as pointed out by others, is that there are algorithms specifically designed for supervised learning that are going to be significantly better than SOM (or any unsupervised method).
If you are going for simple, start with a basic distance-based classifier. Get the centre point of each class (i.e. the mean of all points labeled -1 and the mean of all points labeled +1) and assign each data point to the nearest centre point.
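A minimal sketch of that distance-based classifier, with function names invented for this example:

```python
def fit_centroids(points, labels):
    """Return a dict mapping each class label to the mean of its points."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for d, v in enumerate(p):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict(centroids, p):
    """Assign p to the class whose centre point is nearest (squared Euclidean)."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], p)))

# Illustrative data: class -1 on the left, class +1 on the right.
points = [(0, 0), (0, 2), (4, 0), (4, 2)]
labels = [-1, -1, 1, 1]
centroids = fit_centroids(points, labels)
print(centroids, predict(centroids, (1, 1)), predict(centroids, (3, 0)))
```

Unlike the two-cell SOM, the centre points here are computed from the labels, so the classes drive where the prototypes end up.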
For comparable complexity to SOM, try a basic feed-forward neural network.
You are confusing the main concept of PCA. PCA does not select some of the features; instead it projects the features onto candidate vectors (the principal components), chosen so that the projection keeps as much of the variance as possible. To correctly use SOM to reduce features, you have to look at the effects of the different weights in the U-matrix; that way you can see how correlated certain features are with each other.
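To illustrate the projection point, the sketch below computes the first principal component of a tiny dataset and the resulting one-dimensional feature. Power iteration on the covariance matrix is used here only to keep the example dependency-free; it is an assumption of this sketch, and in practice a library routine would normally be used. Note that the new feature is a weighted combination of both original features, not a copy of one of them.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (direction, scores) for the first principal component of rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):  # power iteration toward the top eigenvector
        v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Points lying on the line y = 2x (illustrative data).
rows = [(t, 2 * t) for t in (-2, -1, 0, 1, 2)]
direction, scores = first_principal_component(rows)
print(direction, scores)
```

For this data the direction comes out proportional to (1, 2), so each score mixes the x and y values of its point.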
@Leon, can you elaborate on the feature reduction properties of SOM? The U-matrix stores the distances between neighboring neurons in the SOM space. For instance, say we have 4 cells on a two-dimensional grid. Then the U-matrix stores the distances between these 4 neurons. However, I am not clear on how to connect this fact with feature reduction. Specifically, what can be treated as the reduced features?
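For concreteness, one common way to compute a U-matrix is sketched below: each entry is the average distance between a cell's weight vector and the weight vectors of its direct grid neighbors. The 4-neighborhood and the averaging are assumptions of this sketch, since conventions vary, and the weights are made-up values for a 2×2 map.

```python
import math

def u_matrix(weights):
    """weights[r][c] is the weight vector of the cell at grid row r, column c.

    Assumes the grid has at least two cells, so every cell has a neighbour.
    """
    rows, cols = len(weights), len(weights[0])
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            result[r][c] = sum(dists) / len(dists)  # mean distance to neighbours
    return result

weights = [[[0, 0], [0, 1]],
           [[4, 0], [4, 4]]]
print(u_matrix(weights))
```

Large entries mark cells whose weights differ sharply from their neighbors, which is usually read as a boundary between clusters on the map.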