I am looking to evaluate some of the deep learning methods on an image segmentation problem familiar to the team. I am very familiar with regular MLPs, but not so much with image processing.

The problem is very simple: given a set of k x k images (k ~= 500), decide for each pixel whether it belongs to class A or class B (e.g., sky vs. not sky). I have thousands of unlabeled images and a handful of labeled ones. If need be I can generate more labeled ones (with a different method), but these will not be perfect. The resolution of the segmented image should be the same as the original. I can downsample the images if need be to save computation time, but 100x100 would probably be the extreme limit.

So the goal is not unlike that in this paper by Alvarez, LeCun et al.

I am wrapping my head around Theano and its algorithms to see how this would work, but I'm not quite there yet. I was planning to start from the LeNet example in the tutorial.

My current plan:

  1. Take a sliding window (say 32x32) over an image, generating a series of overlapping patches for each image (and the associated ground truth). Question: I guess the overlap depends on the application; is there a heuristic? (See the patch-extraction sketch after this list.)

  2. Train the net in a supervised way on every patch of every labeled image. Thus the network will have 32*32 outputs (a logistic regression layer) and 32*32 inputs, each denoting a grayscale value. Question: how do people usually deal with color? Does that mean 32*32*3 inputs (for RGB)? (A sketch of such a net also follows the list.)

  3. Apply the trained net to a new image and stitch all the patches together (averaging on overlap) to get the final result. I will probably need to do some thresholding as post-processing to get sharp delineations. (See the stitching sketch below.)
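For concreteness, here is roughly what I have in mind for step 1 as a numpy sketch (the function name extract_patches and the half-patch-size stride are my own guesses, not from any library). It also answers my color question in the obvious way: an RGB image simply yields 32*32*3 values per flattened patch.

    import numpy as np

    def extract_patches(img, patch=32, stride=16):
        """Slide a patch x patch window over img with the given stride.

        img is H x W (grayscale) or H x W x 3 (RGB); each patch is
        flattened, so RGB gives patch*patch*3 inputs per example.
        Returns the patch array and the top-left corner of each patch.
        """
        patches, coords = [], []
        for y in range(0, img.shape[0] - patch + 1, stride):
            for x in range(0, img.shape[1] - patch + 1, stride):
                patches.append(img[y:y + patch, x:x + patch].ravel())
                coords.append((y, x))
        return np.asarray(patches), coords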
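And a minimal Theano sketch of the net in step 2, before adding any convolutional layers: a single logistic output layer with one sigmoid unit per pixel of the patch. The learning rate and initialization range are placeholders of mine, not values from the tutorial.

    import numpy as np
    import theano
    import theano.tensor as T

    rng = np.random.RandomState(0)
    n_in = 32 * 32 * 3   # flattened RGB patch; use 32 * 32 for grayscale
    n_out = 32 * 32      # one sigmoid output per pixel of the patch

    W = theano.shared(np.asarray(rng.uniform(-0.01, 0.01, (n_in, n_out)),
                                 dtype=theano.config.floatX), name='W')
    b = theano.shared(np.zeros(n_out, dtype=theano.config.floatX), name='b')

    x = T.matrix('x')   # minibatch of flattened patches, scaled to [0, 1]
    y = T.matrix('y')   # per-pixel ground truth in {0, 1}, same layout

    p = T.nnet.sigmoid(T.dot(x, W) + b)              # per-pixel probabilities
    cost = T.mean(T.nnet.binary_crossentropy(p, y))  # mean cross-entropy

    grads = T.grad(cost, [W, b])
    updates = [(v, v - 0.1 * g) for v, g in zip([W, b], grads)]
    train = theano.function([x, y], cost, updates=updates)
    predict = theano.function([x], p)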
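Finally, step 3 could look like this, averaging the per-pixel probabilities wherever patches overlap; the 0.5 threshold is just a crude stand-in for whatever post-processing ends up working.

    def stitch(preds, coords, shape, patch=32):
        """Average overlapping per-pixel predictions into one map.

        preds:  per-patch outputs of the net, one probability per pixel.
        coords: top-left corners as returned by extract_patches.
        shape:  (H, W) of the original image.
        """
        acc = np.zeros(shape)
        counts = np.zeros(shape)
        for p, (y, x) in zip(preds, coords):
            acc[y:y + patch, x:x + patch] += p.reshape(patch, patch)
            counts[y:y + patch, x:x + patch] += 1
        prob = acc / np.maximum(counts, 1)     # avoid /0 at uncovered borders
        return (prob > 0.5).astype(np.uint8)   # crude threshold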

So I have only a handful of high-quality labeled images, but I can generate an arbitrary number of noisy ones (which should be reasonably good, but not perfect).

  1. How should I deal with this? Should I do unsupervised training first, followed by supervised fine-tuning (on the noisy ones, the good ones, or both)? But then for unsupervised training I guess I need to switch to a different network type (e.g., a stacked RBM / CNN hybrid or something)?

  2. Could you also solve this problem with a stacked autoencoder which maps the full image onto the segmented one? Or would that require too many inputs/outputs and/or lose context?

Any guidance appreciated.

asked Aug 23 '13 at 09:45 by Dirk Gorissen

One Answer:

I am not very familiar with the problem of image segmentation, so I may not be able to give a full answer, but as far as I understand, for deep learning it is generally good to pre-train your network in an unsupervised manner to get a good initialization for the weights. If you don't perform any pre-training, you are likely to start with a bad random set of initial weights and may end up stuck in a poor local minimum. Your network can be a DBN, a stacked autoencoder, or any of the other common architectures used for pre-training a deep network. Since you have a lot of unlabelled data, unsupervised pre-training before any fine-tuning should be a big plus.

Now, coming to the issue of image segmentation: the CNN LeNet5 example from Theano may not be very helpful, as it demonstrates a completely supervised classification problem in which the label is assigned to the complete image, not to every pixel. You could look at something like the denoising autoencoder (dA) and stacked dA to start with the pre-training, and then put a regression or classification layer on top as required.
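To make that concrete, here is a minimal sketch of a single denoising autoencoder layer in Theano, loosely in the spirit of the tutorial's dA class (the mask-based corruption, the layer sizes, and the learning rate are my own placeholders, not the tutorial's exact code):

    import numpy as np
    import theano
    import theano.tensor as T

    rng = np.random.RandomState(0)
    n_visible, n_hidden = 32 * 32, 500   # e.g. flattened grayscale patches

    W = theano.shared(np.asarray(rng.uniform(-0.1, 0.1, (n_visible, n_hidden)),
                                 dtype=theano.config.floatX), name='W')
    b_h = theano.shared(np.zeros(n_hidden, dtype=theano.config.floatX), name='b_h')
    b_v = theano.shared(np.zeros(n_visible, dtype=theano.config.floatX), name='b_v')

    x = T.matrix('x')        # minibatch of patches, scaled to [0, 1]
    mask = T.matrix('mask')  # 0/1 corruption mask, same shape as x

    x_tilde = x * mask                               # corrupt the input
    h = T.nnet.sigmoid(T.dot(x_tilde, W) + b_h)      # encode
    z = T.nnet.sigmoid(T.dot(h, W.T) + b_v)          # decode with tied weights
    cost = T.mean(T.nnet.binary_crossentropy(z, x).sum(axis=1))

    params = [W, b_h, b_v]
    updates = [(p, p - 0.01 * g) for p, g in zip(params, T.grad(cost, params))]
    train = theano.function([x, mask], cost, updates=updates)

After this layer converges, the hidden code h becomes the input for the next dA in the stack, and the whole stack plus the output layer is fine-tuned with your labeled data.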

answered Aug 30 '13 at 18:09 by Ankit Bhutani
