Hi all,

For my thesis I wish to implement a log-linear conditional random field and apply it in a relation extraction context. For that I'm using this tutorial, as it seems to contain a clear and precise description. Following the tutorial, I decided to first implement a regular log-linear model before attempting the log-linear CRF.

Using Python I did a really basic implementation of a log-linear model where I want to learn the feature weights using gradient ascent. This implementation is mostly based on the formula derived at the top of page 18 of the tutorial mentioned above. The problem I'm having is that for some reason the model only seems to learn during a single iteration. In other words, it doesn't matter whether I train for one iteration or for 100, the final performance on the training set is always the same (tested on multiple datasets). For that single iteration the weights do seem to be updated correctly. After that the weights keep changing, but they no longer influence the final classifications during testing. I tried adding an extra feature that is always identical to the class label of the instance, so the model should reach 100% classification accuracy after enough iterations. This doesn't happen either, so I suspect there is either something wrong with my code (which I have checked many times) or I'm misinterpreting the formulas. Other things I've tried:

  • Searching the Web and MetaOptimize for others with a similar problem, but I could not find anything.
  • Debugging the training procedure to check what is going wrong. The updates seem logical to me, but in the end the performance is still stuck at the same level.

I posted some code below to show what I did. I realize this is a very naive implementation that is unlikely to scale well, but I think it fits the purpose of trying to understand how these models work. I stripped out the code for loading the data and building the features, and added some comments:

import math

def train(instances, features, labels):
    # instances is a list, where each element of the list is an instance of the dataset.
    # each instance of this dataset is a tuple that contains:
    # 1) a dictionary containing the feature values for this particular instance of the dataset
    # 2) the label of this particular instance
    # features is a dictionary with all possible features. Two example features: (feature1=0 AND class=1) or (feature4=1 AND class=0)
    # labels is a list with the possible labels
    learningrate = 0.01
    iterations = 10
    weights = {}

    # init feature weights
    for feature in features:
        weights[feature] = 0

    for i in range(iterations):         # iterations
        for instance in instances:      # iterate over instances
            for feature in features:    # iterate over features

                # The first part of the formula: Fj(x,y)
                if feature[0] in instance[0] and feature[1] == instance[1]:
                    Fj_x_y = 1
                else:
                    Fj_x_y = 0

                # compute partition function
                Z = 0
                for label in labels:
                    l = likelihood(label, instance, weights, features)
                    Z += l

                # The second part of the formula: E_{y' ~ p(y'|x,w)}[Fj(x,y')]
                expectation = 0
                for label in labels: # iterate over all labels y'
                    if feature[0] in instance[0] and label == instance[1]:
                        f_j_x_yprime = 1
                    else:
                        f_j_x_yprime = 0

                    expectation += (f_j_x_yprime * (likelihood(label, instance, weights, features) / Z))

                # Compute the gradient and update weight
                gradient = Fj_x_y - expectation
                weights[feature] = weights[feature] + (learningrate * gradient)

    return weights

def likelihood(label, instance, weights, features):
    likelihood = 0
    for feature in features:
        if feature[0] in instance[0] and label == feature[1]:
            value = 1
        else:
            value = 0

        likelihood += (weights[feature] * value)
    return math.exp(likelihood)
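For reference, here is a tiny made-up example of the data format described in the comments above (the feature keys and values are just placeholders; only the keys of the features dictionary are used):

# each instance: (dictionary of active feature keys, label)
instances = [
    ({"feature1=0": 1, "feature4=1": 1}, 1),
    ({"feature1=1": 1, "feature4=1": 1}, 0),
]

# each feature: a (feature key, label) tuple, e.g. (feature1=0 AND class=1)
features = {
    ("feature1=0", 1): True, ("feature1=0", 0): True,
    ("feature1=1", 1): True, ("feature1=1", 0): True,
    ("feature4=1", 1): True, ("feature4=1", 0): True,
}

labels = [0, 1]

weights = train(instances, features, labels)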

For clarification, I named the variables pretty much according to the tutorial. Does anyone see what I'm doing wrong here? I suspect I'm missing something very simple, but I would greatly appreciate any help. If any other info is needed or anything is unclear, please ask. Thanks!

asked Mar 09 '12 at 07:26

RLager


One Answer:

There are quite a few things in this bit of code that look shady.

First, in your likelihood computation, you are exponentiating the result of a dot product. This can be pretty big, so you're likely to overflow a float after a few gradient steps.
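One standard way around this (just the usual numerical trick, not something specific to your code) is to keep the raw dot products and subtract their maximum before exponentiating, for example:

import math

def softmax_probs(scores):
    # scores: the raw dot products w . F(x, y), one per candidate label.
    # Subtracting the max keeps every exponent <= 0, so math.exp cannot overflow.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]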

This is also going to be very slow for many reasons, and might not converge. Note that you're doing a gradient step per feature, and in each such gradient step you're computing all the dot products all over again. You can get the gradient for all features after just one pass of computing those dot products, so you're turning a linear-time iteration into a quadratic-time one. You're also doing this per instance.
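As a rough sketch, not a drop-in replacement (I'm guessing that your features are (key, label) pairs as in your train function, and reusing the softmax_probs helper from the sketch above), the per-instance update could look like this:

def update_instance(instance, weights, labels, learningrate):
    # instance = (x, y): x is the dict of active feature keys, y is the true label
    x, y = instance

    # one pass of dot products: one score per candidate label
    scores = [sum(weights.get((key, label), 0.0) for key in x) for label in labels]
    probs = softmax_probs(scores)

    # one pass over the active features: the gradient of feature (key, label)
    # is Fj(x, y) - p(label | x), which is nonzero only for keys active in x
    for key in x:
        for label, p in zip(labels, probs):
            empirical = 1.0 if label == y else 0.0
            weights[(key, label)] = weights.get((key, label), 0.0) + learningrate * (empirical - p)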

In your feature expectations code, you're only adding a term in the expectation when label == instance[1], which means that you're not averaging over all labels, hence your gradient is wrong. The correct way of computing feature expectations is by summing over all labels (including incorrect ones). See the equation in the middle of page 7 in the tutorial, where the partial derivative is defined as sum_i (y_i - p_i) x_i (y_i here is 1 only in the "true label", but p_i is nonzero everywhere, and should be computed by your likelihood).
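Concretely, in your expectation loop the indicator for Fj(x,y') should depend on the candidate label y' and on the feature's label part, not on the gold label of the instance. Something along these lines (a minimal sketch using the variable names from your code):

expectation = 0
for label in labels:  # iterate over all labels y'
    # Fj(x,y') fires when the feature's label part matches the candidate label y',
    # not when y' happens to equal the gold label instance[1]
    if feature[0] in instance[0] and label == feature[1]:
        f_j_x_yprime = 1
    else:
        f_j_x_yprime = 0
    expectation += f_j_x_yprime * (likelihood(label, instance, weights, features) / Z)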

There are other minor issues, and I suggest you also look at other sources.

answered Mar 09 '12 at 09:33

Alexandre Passos ♦
