I'm trying to train an autoencoder to represent images of a Pong implementation. The images are very simple but also quite big for the relatively sparse amount of information they contain. Here is an example:
I need a very efficient, small representation, because I want to use each code unit as a feature in a reinforcement learning setting. This approach is called DFQ, and it has been shown experimentally that, with a setup very similar to mine, a big image containing just one moving ball can be represented in two code units, each representing one dimension of the position. Thus, it should be possible to represent my Pong pictures in four values: the y-positions of the two paddles and the x,y-position of the ball (I will worry about its movement later). Actually, 10 code values would still be okay, so that is what I'm working with here.

Now, I have tried a wide range of parameters, but it seems almost impossible to get the representation down to this size. I downsample the image by a factor of 4 so that I don't lose too much information. In the first 1, 2 or 3 layers, I use convolution kernels of size 3 with shared weights. After that, I use regular layers that halve the size of the representation each, which already gives me at least 8 layers for the encoder part of the autoencoder. As the training algorithm, I use RPROP.

To handle overfitting, I use some weight decay, I have a huge training set (more than 20k images) drawn from a random distribution of positions, and I also apply some salt-and-pepper and slight Gaussian noise to the input data of each layer, as in a denoising autoencoder. I can't use much noise, though, as I fear the ball might be completely noised out.

The problem seems to be that it is always easier to reproduce the paddles, because their movement is limited to one axis and they are bigger than the ball. I already made the ball bigger, but the paddles still seem to be more interesting to the training procedure: only once it has learned the paddles perfectly does training start to focus on the ball. Since pretraining the second cascade is pointless if the first cascade only learned to reproduce the paddles, I need to make sure that each shallow autoencoder keeps the ball's information.
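For reference, the per-layer input corruption described above can be sketched like this. This is a minimal sketch assuming numpy; the function name `corrupt` and the noise levels (`sp_frac`, `gauss_sigma`) are placeholders of my own, not the values from the actual setup, which is exactly why the salt-and-pepper fraction is kept small so the ball is unlikely to be noised out:

```python
import numpy as np

def corrupt(images, sp_frac=0.02, gauss_sigma=0.05, rng=None):
    """Corrupt inputs as in a denoising autoencoder: slight Gaussian
    noise everywhere, plus a small fraction of pixels forced to 0 or 1
    (salt-and-pepper). Pixel values are assumed to lie in [0, 1]."""
    rng = np.random.default_rng(rng)
    noisy = images + rng.normal(0.0, gauss_sigma, images.shape)
    mask = rng.random(images.shape) < sp_frac   # pixels to overwrite
    salt = rng.random(images.shape) < 0.5       # half salt, half pepper
    noisy[mask] = salt[mask].astype(float)
    return np.clip(noisy, 0.0, 1.0)
```

With both noise parameters set to 0 the function just clips the input, so the corruption strength can be tuned continuously.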
Thus I included a finetuning step between the cascades, training the whole stacked autoencoder pretrained so far. This works great for the first few layers, but later on it gets harder and harder to recover the ball positions, until it seems the network has both overfitted the training data and failed to keep the ball information at all, effectively running into a local minimum.

I also tried to "binarize" the image into black and white pixels with a certain threshold, and then use cross-entropy as the training error. With this, I seem to get a bigger gradient, and thus much better and more confident convergence. However, I also lose the information contained in "grey" pixels after the downsampling, and so positions can only be accurate to the 4th pixel, as I downsample by a factor of 4.

Does anyone have an idea what else I could try to make this work? If you need to know more details of my parameters, I can provide them. Also note that I use almost the same setup as the experiment mentioned above, even the same code, which has been used successfully quite a few times. It's from a German dissertation, so it's highly probable that the fault is not in the software but in my setup or in the problem in general.

Lots of thanks for reading this far, even more if you can think of something that helps!

Max

Edit: Some more info on the experiment where this worked. The images used look like this:
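To make the binarize-plus-cross-entropy variant concrete, here is a minimal sketch, assuming numpy and pixel values in [0, 1]; the threshold of 0.5 is an assumption, not necessarily the one used:

```python
import numpy as np

def binarize(img, threshold=0.5):
    # Threshold grey pixels to {0, 1}. Information carried by
    # intermediate grey values (e.g. from downsampling) is lost here.
    return (img > threshold).astype(float)

def bce(target, reconstruction, eps=1e-7):
    # Binary cross-entropy between binarized targets and sigmoid
    # outputs. Unlike squared error, its gradient does not vanish
    # when the output saturates, which matches the "bigger gradient"
    # observation above.
    r = np.clip(reconstruction, eps, 1 - eps)
    return -np.mean(target * np.log(r) + (1 - target) * np.log(1 - r))
```

A perfect reconstruction gives a loss near zero, while a confidently wrong one is penalized heavily.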
It looks as though there is a grid, but the player (the square at bottom center) can be at any pixel position. The final goal of the dissertation, and of my project, is to have a video camera looking at a game while the computer learns which features are important for playing and then how to win the game. In the dissertation there was only one moving object (a racing car); for me, again, there are three. The experimenter managed to compress the representation of the player into just two hidden units over many (~10) encoder layers, each pretrained as a denoising autoencoder. I'm trying to do the same, only with more movable objects.


Maybe you could explain what the problem was for which the method worked?
And what are you trying to "prove"?
I am no expert, but why do you think a multi-layer conv net should work? I understand the weight sharing, to develop "ball detectors" at each position, but I don't see that anything beyond that level is useful: you are not trying to develop "higher-order feature detectors" for combinations of features. You only have a single ball (unlike letters, which can be viewed as constellations of strokes).
So if you just needed the ball's x,y position, it is (excluding the paddles) more or less a linear function of the ball detector output, so I would think training directly for x,y would be straightforward. The problem in the autoencoder setting is the decoding: given x,y, reconstructing the corresponding input image is nonlinear.
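The "linear function of the detector output" point can be sketched concretely. Assuming a hypothetical 2-D ball-detector activation map (not part of the original setup), the x,y position is just a weighted sum of fixed coordinate grids, i.e. linear in the detector outputs:

```python
import numpy as np

def linear_xy_readout(detector_map):
    """Read an (x, y) position from a 2-D detector activation map.
    The result is a fixed linear combination of the map's entries:
    a weighted average of pixel coordinates."""
    h, w = detector_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    total = detector_map.sum()
    x = (xs * detector_map).sum() / total
    y = (ys * detector_map).sum() / total
    return x, y

# A one-hot detector map with the "ball" at x=7, y=3:
m = np.zeros((10, 10))
m[3, 7] = 1.0
```

For a (near) one-hot map this recovers the ball position exactly; the hard, nonlinear part is the reverse direction, painting the image back from x,y.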
And why do you NEED to reduce it to a coordinate representation to feed into your reinforcement learning anyway? I would have thought that the feature detector output would be good enough, because it is effectively just a 1-of-n encoding, i.e. very sparse.
Thanks for your answer! I updated the question with more info on the experiment and also the final goal of my project. I'm trying to prove that we do not need any more supervision than the pure images in order to learn how to play the game with reinforcement learning methods. The final goal is to have a camera looking at a monitor, and a robot arm controlling the player via keyboard or gamepad.
I do not actually need the x,y position, I just need basically a descriptor of the whole image that is very sparse but includes all the information. Each code unit will be used as a feature in my RL algorithm (e.g. Q-learning), and thus this number should be as small as possible, as each code unit gives me another dimension to the learning problem. It does not need to be humanly readable, as I will extract the information I need with another, shallow neural net.
How would you propose I use the feature detector output to describe the whole input? I would have to check for the feature at every position, which would give me a very large code. Or maybe I misunderstand you.
So if I understand your image correctly, the conv layer architecture would be required to memorise the different maze configurations, i.e. it makes sense to me for that problem.
By sparse I mean that only a few inputs are ever active at any one time (not that there are only a few inputs). The curse of dimensionality basically depends on the possible input combinations, not on the actual number of inputs. E.g. if you have 100 inputs of which only the first ever changes, you do not have a 100-dimensional problem. Similarly here, if you have 100 binary inputs (10 by 10) but only one of the 100 is ever on, then there is no problem: your complete space is just 100 points, versus 2^100 if every combination can occur.
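The 1-of-n point above can be made concrete with a small sketch (numpy assumed; the helper name `to_one_hot` and the 10x10 grid are illustrative choices, not from the thread's actual setup):

```python
import numpy as np

def to_one_hot(x, y, w=10, h=10):
    """Encode a ball position on a w x h grid as a 1-of-n vector:
    exactly one of the w*h inputs is active at any time."""
    v = np.zeros(w * h)
    v[y * w + x] = 1.0
    return v

# The reachable input space is just w*h = 100 distinct points,
# not 2**100 arbitrary bit patterns over the same 100 inputs.
states = np.stack([to_one_hot(x, y) for y in range(10) for x in range(10)])
```

So a learner fed these sparse inputs only ever has to cover 100 states, which is why the high-dimensional but sparse encoding is not as bad as it looks.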
I think it might be helpful if you imagine hand-coding an NN to play Pong. I believe it would be easier to separately learn a movement for every combination of x,y (i.e. from the 100 inputs) than from the actual x,y values. In SVMs, that's what the kernel trick is all about: projecting into a high-dimensional space in which all you need is a linear classifier. Hopefully you have thought about the game more than me and know what the true response should be. To me, Pong seems quite nonlinear in the (x,y) ball position (because of all the bouncing off the walls), so getting the x,y representation would actually be a step backwards.