Hi all,

For my thesis I want to implement a log-linear conditional random field and apply it in a relation-extraction context. For that I'm using this tutorial, as it seems to contain a clear and precise description. Following the tutorial, I decided to first try out an implementation of a regular log-linear model before attempting the log-linear CRF. In Python I wrote a really basic implementation of a log-linear model in which the feature weights are learned with gradient ascent. The implementation is mostly based on the derived formula at the top of page 18 of the tutorial mentioned above.

The problem I'm having is that the model only seems to learn for a single iteration: whether I train for one iteration or for 100, the final performance on the training set is always the same (tested on multiple datasets). During that single iteration the weights do seem to be updated correctly. After that the weights keep changing, but the changes no longer influence the final classifications during testing. I also tried adding an extra feature that is always identical to the class label of the instance, so the model should reach 100% training accuracy after enough iterations. That doesn't happen either, so I suspect there is either something wrong with my code (which I have checked many times) or I'm misinterpreting the formulas. Other things I've tried:
I posted some code below to show what I did. I realize this is a very naive implementation that is unlikely to scale well, but I think it serves the purpose of trying to understand how these models work. I stripped out the code for loading the data and building the features, and added some comments (the indentation is not shown 100% correctly):
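Roughly, the training loop looks like the sketch below. This is a simplification with placeholder names such as `train_set`, `features` and `lambdas` rather than my exact code; `features[y]` stands for the already-computed feature vector f(x, y) of an instance:

```python
import math

def dot(lambdas, f):
    return sum(w * v for w, v in zip(lambdas, f))

def probability(label, features, lambdas, labels):
    # p(y | x; lambda) = exp(lambda . f(x, y)) / sum_y' exp(lambda . f(x, y'))
    numerator = math.exp(dot(lambdas, features[label]))
    denominator = sum(math.exp(dot(lambdas, features[other])) for other in labels)
    return numerator / denominator

def train(train_set, labels, num_features, iterations=100, learning_rate=0.01):
    # train_set is a list of instances (features, true_label),
    # where features[y] is the feature vector f(x, y) for that instance
    lambdas = [0.0] * num_features
    for _ in range(iterations):
        for instance in train_set:                  # update per instance
            features, true_label = instance[0], instance[1]
            for j in range(num_features):           # one gradient step per feature
                observed = features[true_label][j]
                # expectation of feature j: a term is added only when label == instance[1]
                expected = 0.0
                for label in labels:
                    if label == instance[1]:
                        expected += probability(label, features, lambdas, labels) * features[label][j]
                lambdas[j] += learning_rate * (observed - expected)
    return lambdas
```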
For clarity, I named the variables pretty much according to the tutorial. Does anyone see what I'm doing wrong here? I suspect I'm missing something very simple, but I would greatly appreciate any help. If any other info is needed or anything is unclear, please ask. Thanks!
There are several things that are shady about this bit of code.

First, in your likelihood computation you are exponentiating the result of a dot product. That number can get pretty big, so you're likely to overflow a float after a few gradient steps.

The code is also going to be very slow, for several reasons, and might not converge. Note that you're doing a gradient step per feature, and in each such gradient step you're recomputing all the dot products. You can get the gradient for all features after a single pass of computing those dot products, so you're turning a linear-time iteration into a quadratic-time one. You're also doing this per instance.

In your feature-expectation code you only add a term to the expectation when label == instance[1], which means you're not averaging over all labels, so your gradient is wrong. The correct way to compute the feature expectations is to sum over all labels (including the incorrect ones). See the equation in the middle of page 7 of the tutorial, where the partial derivative is defined as sum_i (y_i - p_i) x_i: y_i is 1 only for the true label, but p_i is nonzero for every label and should be computed by your likelihood.

There are other minor issues, and I suggest you also look at other sources.
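To make the overflow and expectation points concrete, here is a rough sketch of a per-instance gradient computation in the multi-label setting of your code (placeholder names, not your variables): the probabilities for all labels are normalized with a log-sum-exp trick, the dot products are computed once per instance, and the expectation sums over every label.

```python
import math

def log_sum_exp(scores):
    # numerically stable log(sum(exp(s) for s in scores))
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def instance_gradient(lambdas, features, true_label, labels):
    # features[y] is the feature vector f(x, y) for this instance;
    # returns this instance's contribution to the gradient of the log-likelihood
    scores = {y: sum(w * v for w, v in zip(lambdas, features[y])) for y in labels}
    log_z = log_sum_exp(list(scores.values()))
    probs = {y: math.exp(scores[y] - log_z) for y in labels}   # p(y | x; lambda) for every label

    gradient = []
    for j in range(len(lambdas)):
        observed = features[true_label][j]
        # model expectation of feature j: sum over ALL labels, weighted by their probabilities
        expected = sum(probs[y] * features[y][j] for y in labels)
        gradient.append(observed - expected)
    return gradient
```

Summing these per-instance gradients over the training set (or a mini-batch) and taking one step per pass gives you plain batch gradient ascent; note that the expectation now includes the incorrect labels, weighted by their model probabilities.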