|
update (3) As requested, here is the code I'm currently using which appears to work for small toy grammars. The 'ae' object being passed is an object that has the weight matrices and biases. Additionally, ae has a 'forwardProp' method and a 'df' (derivative of the non-linear function) method. The 'node' object is a standard node in a tree. I'm using Numpy (imported as 'np') for the math. The code below is written in Python, but it should appear fairly similar to Matlab.
updated (2), see below Sorry, realized I made some typos below, as well as some clarification of notation. I'm trying to better understand the exact algorithm used by Socher et al.'s NIPS 2011 paper. For those unfamiliar with their work, they were able to train NNs to learn structured input (specifically CFGs). This paper is related to earlier work done by Pollack (1990; RAAM), but uses a different learning rule known as backpropagation through structure (BTPS) (Goller and Kuchler, 1996). The difficulty I'm having in understand their paper is that their method of training differs from the original formulation of BPTS in one important way. Normally with BPTS supervised training is employed, and you can calculate delta_o via the error based on the top node's output. This allows for just a single encoder layer to be used, as no reconstruction is required, and thus autoencoders (AE) is not considered. On the other end of the spectrum, RAAM uses an AE as its base unit. Similar to how BPTS has a single encoding weight matrix that is used for all the nodes of the tree, RAAM has an encoding and decoding weight matrix that appears for all the nodes of the tree. The error of the network is found by calculating the reconstruction error at each node, and then back propagating that error to the encoding layer. This allows the two weight matrices to be trained simultaneously for the entire tree. Thus, at first glance using an AE with BPTS seems incompatible. BPTS provides a mechanism to train just a single encoder, but the AEs have an additional decoder layer that needs to be trained. One hint as to what might be occurring is that the authors say the error of the tree is equal to the reconstruction error of all the subnodes. Thus, you can calculate delta_o for the top node and backpropagate it to the decoding layer of all its children. If I'm correct about this, then does this mean the next step is to backpropagate the error assigned to the decoding layers to the encoding layers? This would make the algorithm (I should note, I say 'encoding layer' and 'decoding layer' but really mean copies made per node):
I would greatly appreciate any hints, suggestions, or confirmations (or if you can point out where I'm wrong and/or ignorant about my understanding of these NNs, that would also be great). Also the author do provide source code (on socher.org) for previous papers. However, it's highly optimized Matlab, so making progressive through it is somewhat slow. UpdateI've started to implement my own version as well as going back over the author's code, and I think I mostly understand what's happening now. The BPTS occurs for the encoding layer just like the original paper. However, the error that occurs by reconstruction is added to the error as well. Thus:
Note: Two complications arise from notation for dealing with children. One is that there are actually two delta_p^{l+1} for each child. With the original BPTS, this was dealt with by just copying the delta_p^{l+1}. However, because of the dimensions of We (visiblexhidden; visible =2hidden), Wedelta_p has the size of visible ([left;right]). Likewise, when we update gradWe, we half to concatenate the children's values for the actual input to the node. One part the had me confused for awhile was the fact that the code had an additional term in it:
However, the algorithm used has a semi-supervised part where they put a softmax function on the 'p' layer. I haven't used softmax before, but it looks like its derivative is simply (x - hat{x}), thus explaining the last term. The one question I have left, is that according to the original BPTS paper (Goller and Kichler, 1996) they have:
which is the same as used above. However, the standard derivation of backpropagation uses activation of node and not its output (where output = f(activation)) inside the f'(). Unfortunately I can't find a copy of their 1995 paper where they actually derive the formulas they use. Does anyone know why this is? |
|
From what I get in the paper the tree is never created byt the algorithm; it's taken for granted to come from a parser. So what is done is: (1) from the words up compute the latent representation of each level in the tree; (2) go down the tree and deterministically reconstruct all intermediate nodes including the words; (3) do a gradient step minimizing the reconstruction error of the words given all the up and down weights in the tree (as, fixing the words, the whole tree is deterministic). To do this gradient step you need to backpropagate the erros first up in the down weights and then down in the up weights, in the reverse direction of how the values were computed. This gradient would probably look pretty hairy, and I think you're better off using automated tools like Theano to compute it. (1) I agree, (2) Yes, they unfold each node all the way (for the unfolding RAE) to get the reconstruction error for each node, (3) Here I disagree. They explicitly say that they do backpropagation through structure (section 2.5, second sentence). This makes it easy to compute the gradients. The confusion for me is that if you do backpropagation at this step, I believe you'll be solving for the decoder weight matrix at each node. To account for this, I suspect you need to additionally backpropagate the delta_D (error for decoding layer) to the encoding layer for each node.
(Nov 08 '11 at 13:53)
nop
I think conceptually it's easier to think in terms of gradients rather than which errors are being backpropagated through which node. The reconstructed words are a deterministic function of the original words and the tree structure, so what you want is the gradient of that w.r.t each parameter. I think that the parameters of the autoencoders are all tied, so it's one big encoding matrix and one big decoding matrix being used all over the place, and the gradient for each parameter then is very complex.
(Nov 08 '11 at 13:57)
Alexandre Passos ♦
You do use the exact same encoding matrix and decoding matrix for each node, but if you look at Goller and Kuchler (1996), you can actually treat each node as having a copy of these matrices (very similar to backpropagation through time). This let's you reframe the problem in terms of standard backpropagation, where the deltas (errors) of a higher layer are propagated to the layers below. The total gradient can be computed from the sum of all the deltas (equation 1 of Goller and Kuckler, 1996).
(Nov 08 '11 at 14:35)
nop
|
Thank you to Alexandre for clarifying which paper I had cited!
Nop, did you ever get a chance to implement your own version? I'm in the process of trying to reimplement this code as well to get a better understanding of what it does. The paper is fairly vague.
I have what I believe is a working version, and I don't think I could have implemented from the paper alone. I'll post more details when I get a chance, but the basic idea is that you standard RAAM with BPTS on top of it. For RAAM, the error is the standard reconstruction error for an autoencoder. To do BPTS, you take the error of the top parent, back propagate it to the encoding layer, and then use the left and right half of the resulting vector as error from the parent. I ran my implementation on a toy grammar and it appears to work fairly well. I'll try to provide a clearer explanation when I get time to post the details.
Nop, that would be great if you could post more details (or code :-) I've been digging through this thing for weeks trying to understand it and build something similar. If the code was cleaner that would have made it a lot easier.
@Ryan: sorry for taking so long to finally put something up .. things have been crazy here. Anyways, I ended up putting it in code form since this site doesn't do LaTeX formatting, so I thought it might be easier to read. It's in Python, but I can also provide Matlab code if you want. Please let me know if you have any questions. I haven't included the proportional weighting or normalization terms to keep the code simpler. I can add those in if you want. Also, I'm currently looking into the unfolding version of this algorithm, but haven't been able to figure out what's going on quite yet.