In Krizhevsky et al.:

"We did not pre-process the images in any other way, except for subtracting the mean activity over the training set from each pixel. So we trained our network on the (centered) raw RGB values of the pixels"

Assuming my data matrix X is n samples by p features, does this mean I center each column of X? What's the reasoning behind centering each pixel with respect to the dataset? I expect this would remove a bit of the correlation between neighboring pixels for each sample. Is this just to prevent the network from saturating? But it seems like demeaning the rows of X would have the same effect.

asked Apr 03 '13 at 01:12

cdrn


2 Answers:

There are numerous reasons why one might perform mean subtraction prior to training. For a neural network, subtracting the mean makes it easier to set the initial random weights and can reduce training time. Your quote mentions "raw RGB values", so they may be subtracting the mean to reduce the effects of unknown scene illumination. But note that subtracting the mean will not remove correlation between pixels (calculations of correlation and covariance already remove the mean).

For an n x p data matrix X (where p = 3 for RGB pixel data), you are correct that you would subtract the mean from each column. That is, you would subtract the mean of column i from each value in column i (for i = 1, 2, 3). Demeaning the rows would do something different: it would effectively subtract each pixel's gray-scale intensity from its RGB values. In that situation, all gray-scale pixels would become black, and your modified data matrix would hold each pixel's RGB deviation from gray. There may be situations where you would want to do something like that, but it is not the same as subtracting the RGB mean over the dataset.
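A minimal NumPy sketch of the two operations (the array and its shape here are my own illustration, not from the paper):

```python
import numpy as np

# X: n x p data matrix (n samples, p features), e.g. p = 3 for single RGB pixels
X = np.random.rand(1000, 3)

# Column-wise centering: subtract each feature's mean over the whole dataset
X_col_centered = X - X.mean(axis=0)

# Row-wise centering: subtract each sample's own mean across its features
# (for an RGB pixel, this subtracts its gray-scale intensity from each channel)
X_row_centered = X - X.mean(axis=1, keepdims=True)
```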

answered Apr 03 '13 at 09:19

bogatron

edited Apr 03 '13 at 09:22

Thanks for correcting my statement about correlations. So for a full RGB image representation, would your matrix be n x (p*k) where k is the total number of pixels?

What I meant to say was that the inherent smoothness/consistency between neighboring pixels would be lost for each individual image (i.e., the normalization would make it look much less like an image). It also seems like it would harm things like rotation/shift invariance, since the normalization assumes each pixel globally represents the same feature across the dataset. I can understand this centering being more useful in non-image tasks than in image-related ones.

(Apr 03 '13 at 11:21) cdrn

From just the quote you provided, I don't know if they're operating on individual pixels (ignoring neighbors) or using the neighborhood around each pixel during training/classification. If you are operating on independent pixels, then for an M x N RGB image, I would represent X as a k x p matrix (where k = M * N and p = 3). You could just as easily use the transpose of that representation. If you are using neighborhoods, then you probably would want to keep an M x N x p representation.

I'd be careful about using the term "normalization" to refer to mean subtraction, because normalization typically involves scaling (stretching or contracting) the data values, which is not happening with mean subtraction. From an algorithmic perspective, there is no loss of smoothness or consistency due to mean subtraction (vector differences between pixels in an image remain unchanged). Of course, the image would visually appear different if rendered with the mean subtracted.

If you use one image (or set of images) for classification and then try to classify pixels in a new image, you don't want your algorithm to perform poorly due to a mean offset between the training data and new data (e.g., due to the overall scene being brighter in the new image). By subtracting the RGB means separately from each image, you can mitigate that effect (to the extent that it can be represented by an additive offset).
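As a rough sketch of that per-image correction (the `img` array here is hypothetical, not from the quoted paper):

```python
import numpy as np

# img: an M x N x 3 RGB image as floats
img = np.random.rand(256, 256, 3)

# Subtract each channel's mean, computed over this image alone, to remove
# an additive per-channel brightness offset before classification
img_centered = img - img.mean(axis=(0, 1))  # broadcasts over the channel axis
```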

(Apr 03 '13 at 12:48) bogatron

I think we might be referring to different types of centering. To simplify things, let's refer to 300x300 grayscale images. For a dataset of 10000 images, my data structure would be a 10000x90000 matrix. Would I center the columns of this matrix by subtracting the same 90000-element vector from each row?

(Apr 03 '13 at 15:02) cdrn

If you are trying to reproduce the quoted experiment, given the example you just stated, it isn't clear whether you would subtract a scalar from your 10000x90000 matrix (mean pixel value subtraction) or subtract a common length-90000 vector from all rows of the matrix (mean image subtraction). I would need to know more about what the author did in his/her experiment (details are missing from the quote).
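To make the two interpretations concrete, here is a sketch (with a smaller stand-in matrix so it fits in memory):

```python
import numpy as np

# Stand-in for the 10000 x 90000 matrix described above
X = np.random.rand(100, 900)

# Interpretation 1: mean-image subtraction -- the same length-900
# (per-pixel mean) vector is subtracted from every row
X_mean_image = X - X.mean(axis=0)

# Interpretation 2: scalar subtraction -- one mean pixel value
# computed over the entire dataset
X_scalar = X - X.mean()
```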

(Apr 03 '13 at 20:26) bogatron

Both row-wise and column-wise normalisation (and I guess by extension subtracting the mean, which is normalisation without the scaling step) can be useful.

If each row is an example, as in your description, then normalising each row boils down to a form of brightness/contrast normalisation in the case of image data. I suppose that leaving out the scaling step affects the brightness only.

Normalising each column of the matrix amounts to 'feature-wise' (or in this case pixel-wise) normalisation, which I think is particularly useful for models trained with gradient descent, since that tends to work better if all the input features have the same scale.

Of course if the scaling step is left out, I suppose this only affects the bias terms in such models (which can also be useful, since you save time not having to learn the right biases first).
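A sketch of both variants, with and without the scaling step (the array shape is an arbitrary example):

```python
import numpy as np

X = np.random.rand(100, 9)  # rows are examples, columns are features

# Row-wise: per-example brightness (mean) and contrast (scale) normalisation
X_rows = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# Column-wise: feature-wise normalisation, putting all inputs on the same scale
X_cols = (X - X.mean(axis=0)) / X.std(axis=0)

# Centering only (no scaling step), as discussed above
X_centered = X - X.mean(axis=0)
```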

(Apr 04 '13 at 05:46) Sander Dieleman

In the quoted experiment they subtracted the mean pixel value for the batch.

answered Apr 07 '13 at 15:41

DwoaC
