update (3)

As requested, here is the code I'm currently using which appears to work for small toy grammars. The 'ae' object being passed is an object that has the weight matrices and biases. Additionally, ae has a 'forwardProp' method and a 'df' (derivative of the non-linear function) method.

The 'node' object is a standard node in a tree. I'm using Numpy (imported as 'np') for the math. The code below is written in Python, but it should appear fairly similar to Matlab.

def calcGrad(ae, node, parentDelta):
    # forward prop input of node
    hiddenAct, encoding, act, output = ae.forwardProp(node.input)

   ##
   ## This part is the standard RAAM model
   ##

    # calculate output error
    error = node.input - output
    delta_r = np.multiply(-error, ae.df(act))

    # calculate gradient for output layer (weight matrix and bias)
    W2grad = delta_r*encoding.T
    b2grad = delta_r

     # backdrop error to hidden layer
    delta_e = np.multiply(ae.W2.T*delta_r, ae.df(hiddenAct)) + parentDelta

    # calculate gradient for hidden layer (weight matrix and bias)
    W1grad = delta_e*node.input.T
    b1grad = delta_e

   ##
   ## This adds back-propagation through structure (BPTS) in
   ##

    # calculate error due to parent node in tree
    delta_p = np.multiply(ae.W1.T*delta_e + error, ae.df(node.input))

    # roll all gradients up
    grad = pack(W1grad, W2grad, b1grad, b2grad)

    N = delta_p.shape[0]
    leftChild, rightChild = node.children

    # according to BPTS, we take the error from the parent and split it into two parts (left and right)
    # so if there is either a left or right child, split error, and call calcGrad recursively
    if not leftChild.isLeaf:
        grad += calcGrad(ae, leftChild, delta_p[:N/2])

    if not rightChild.isLeaf:
        grad += calcGrad(ae, rightChild, delta_p[(N/2):])

    return grad

updated (2), see below Sorry, realized I made some typos below, as well as some clarification of notation.

I'm trying to better understand the exact algorithm used by Socher et al.'s NIPS 2011 paper. For those unfamiliar with their work, they were able to train NNs to learn structured input (specifically CFGs). This paper is related to earlier work done by Pollack (1990; RAAM), but uses a different learning rule known as backpropagation through structure (BTPS) (Goller and Kuchler, 1996).

The difficulty I'm having in understand their paper is that their method of training differs from the original formulation of BPTS in one important way. Normally with BPTS supervised training is employed, and you can calculate delta_o via the error based on the top node's output. This allows for just a single encoder layer to be used, as no reconstruction is required, and thus autoencoders (AE) is not considered.

On the other end of the spectrum, RAAM uses an AE as its base unit. Similar to how BPTS has a single encoding weight matrix that is used for all the nodes of the tree, RAAM has an encoding and decoding weight matrix that appears for all the nodes of the tree. The error of the network is found by calculating the reconstruction error at each node, and then back propagating that error to the encoding layer. This allows the two weight matrices to be trained simultaneously for the entire tree.

Thus, at first glance using an AE with BPTS seems incompatible. BPTS provides a mechanism to train just a single encoder, but the AEs have an additional decoder layer that needs to be trained.

One hint as to what might be occurring is that the authors say the error of the tree is equal to the reconstruction error of all the subnodes. Thus, you can calculate delta_o for the top node and backpropagate it to the decoding layer of all its children. If I'm correct about this, then does this mean the next step is to backpropagate the error assigned to the decoding layers to the encoding layers?

This would make the algorithm (I should note, I say 'encoding layer' and 'decoding layer' but really mean copies made per node):

  1. Create a tree by recursively encoding input
  2. Unfold the tree to calculate the reconstruction cost
  3. Calculate the top node's delta_o by summing the reconstruction cost of all the nodes
  4. Backpropagate delta_o to the decoding layer of all the nodes of the unfolded tree
  5. Backpropagate the error of the decoding layer to the encoding layer of each node of the unfolded tree

I would greatly appreciate any hints, suggestions, or confirmations (or if you can point out where I'm wrong and/or ignorant about my understanding of these NNs, that would also be great).

Also the author do provide source code (on socher.org) for previous papers. However, it's highly optimized Matlab, so making progressive through it is somewhat slow.

Update

I've started to implement my own version as well as going back over the author's code, and I think I mostly understand what's happening now.

The BPTS occurs for the encoding layer just like the original paper. However, the error that occurs by reconstruction is added to the error as well. Thus:

We = encoding matrix
Wd = decoding matrix
f = some activation function
p = f(We^{T}x + b)

// decoding matrix almost follows standard gradient
delta_o = (x - \hat{x})f'(p)
gradWd = delta_o*p^{T}

// encoding matrix has two types of errors
delta_p^{l+1} = 0 for first layer, otherwise previously calculated delta_p
delta_p^{l} = f'(p)(We*delta_p^{l+1} + Wd*delta_o)
gradWe = delta_p^{l}*[leftChild.p;rightChild.p]

Note: Two complications arise from notation for dealing with children. One is that there are actually two delta_p^{l+1} for each child. With the original BPTS, this was dealt with by just copying the delta_p^{l+1}. However, because of the dimensions of We (visiblexhidden; visible =2hidden), Wedelta_p has the size of visible ([left;right]). Likewise, when we update gradWe, we half to concatenate the children's values for the actual input to the node.

One part the had me confused for awhile was the fact that the code had an additional term in it:

delta_p^{l} = f'(x)(We*delta_p^{l+1} + Wd*delta_o - (x - \hat{x}))

However, the algorithm used has a semi-supervised part where they put a softmax function on the 'p' layer. I haven't used softmax before, but it looks like its derivative is simply (x - hat{x}), thus explaining the last term.

The one question I have left, is that according to the original BPTS paper (Goller and Kichler, 1996) they have:

delta_p^{l} = f'(x)*We^{T}delta_p^{l+1}

which is the same as used above. However, the standard derivation of backpropagation uses activation of node and not its output (where output = f(activation)) inside the f'(). Unfortunately I can't find a copy of their 1995 paper where they actually derive the formulas they use. Does anyone know why this is?

asked Nov 07 '11 at 11:29

nop's gravatar image

nop
1062310

edited Jan 26 at 13:56

Thank you to Alexandre for clarifying which paper I had cited!

(Nov 08 '11 at 10:15) nop

Nop, did you ever get a chance to implement your own version? I'm in the process of trying to reimplement this code as well to get a better understanding of what it does. The paper is fairly vague.

(Jan 21 at 20:03) Ryan Stout

I have what I believe is a working version, and I don't think I could have implemented from the paper alone. I'll post more details when I get a chance, but the basic idea is that you standard RAAM with BPTS on top of it. For RAAM, the error is the standard reconstruction error for an autoencoder. To do BPTS, you take the error of the top parent, back propagate it to the encoding layer, and then use the left and right half of the resulting vector as error from the parent. I ran my implementation on a toy grammar and it appears to work fairly well. I'll try to provide a clearer explanation when I get time to post the details.

(Jan 21 at 22:45) nop

Nop, that would be great if you could post more details (or code :-) I've been digging through this thing for weeks trying to understand it and build something similar. If the code was cleaner that would have made it a lot easier.

(Jan 23 at 18:13) Ryan Stout

@Ryan: sorry for taking so long to finally put something up .. things have been crazy here. Anyways, I ended up putting it in code form since this site doesn't do LaTeX formatting, so I thought it might be easier to read. It's in Python, but I can also provide Matlab code if you want. Please let me know if you have any questions. I haven't included the proportional weighting or normalization terms to keep the code simpler. I can add those in if you want. Also, I'm currently looking into the unfolding version of this algorithm, but haven't been able to figure out what's going on quite yet.

(Jan 26 at 13:58) nop

One Answer:

From what I get in the paper the tree is never created byt the algorithm; it's taken for granted to come from a parser. So what is done is: (1) from the words up compute the latent representation of each level in the tree; (2) go down the tree and deterministically reconstruct all intermediate nodes including the words; (3) do a gradient step minimizing the reconstruction error of the words given all the up and down weights in the tree (as, fixing the words, the whole tree is deterministic). To do this gradient step you need to backpropagate the erros first up in the down weights and then down in the up weights, in the reverse direction of how the values were computed.

This gradient would probably look pretty hairy, and I think you're better off using automated tools like Theano to compute it.

answered Nov 08 '11 at 11:54

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
1899744214335

(1) I agree, (2) Yes, they unfold each node all the way (for the unfolding RAE) to get the reconstruction error for each node, (3) Here I disagree. They explicitly say that they do backpropagation through structure (section 2.5, second sentence). This makes it easy to compute the gradients.

The confusion for me is that if you do backpropagation at this step, I believe you'll be solving for the decoder weight matrix at each node. To account for this, I suspect you need to additionally backpropagate the delta_D (error for decoding layer) to the encoding layer for each node.

(Nov 08 '11 at 13:53) nop

I think conceptually it's easier to think in terms of gradients rather than which errors are being backpropagated through which node. The reconstructed words are a deterministic function of the original words and the tree structure, so what you want is the gradient of that w.r.t each parameter. I think that the parameters of the autoencoders are all tied, so it's one big encoding matrix and one big decoding matrix being used all over the place, and the gradient for each parameter then is very complex.

(Nov 08 '11 at 13:57) Alexandre Passos ♦

You do use the exact same encoding matrix and decoding matrix for each node, but if you look at Goller and Kuchler (1996), you can actually treat each node as having a copy of these matrices (very similar to backpropagation through time). This let's you reframe the problem in terms of standard backpropagation, where the deltas (errors) of a higher layer are propagated to the layers below. The total gradient can be computed from the sum of all the deltas (equation 1 of Goller and Kuckler, 1996).

(Nov 08 '11 at 14:35) nop
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.