I have written some Java code to implement Baum-Welch algorithm to experiment with training hidden Markov models. Actually, the program only calculates alpha, beta, gamma and xi values for now as I have not written the estimation step.

I think, normally, given the correct initial state, state transition and emittance probabilities for the model, gamma values generated by the algorithm for t=0 should give the exact initial state probabilites. (Reminder: A gamma value for i equals to P(s_t = i | O, lambda) where O stands for the observation string and lambda stands for the model.) But the results I get are quite different from the initial state probabilities that I give to the algorithm. I have tried it with observation strings that have the length of 20000 with only three different possible observations, but it seems to consistently converge to something else.

Is this normal behavior? Is it possible to have a model that represents the observations better than the actual model? Why does the law of large numbers work differently here?

I hope the question is clear enough. Thanks in advance.

asked May 25 '11 at 00:53

Emre's gravatar image

Emre
36226

edited May 25 '11 at 00:58

If by "represents observations better than the original model" you mean overfits, then yes, it is possible. Otherwise pleasy clarify, as I don't understand your question.

(May 25 '11 at 02:44) Alexandre Passos ♦

One Answer:

Let me first state my understanding of your experimental setup:

  1. You have a HMM, complete with all necessary parameters defined.
  2. You simulate data from this model (up to 20,000 time steps in length, and I assume also over many independent chains?)
  3. You then use the Baum-Welch algorithm to estimate the values of alpha, beta, gamma and xi from this simulated data.

And your question is why don't your learned parameters -- gamma seems to be your (only?) problem here -- reflect the original parameters which generated your data.

If gamma is the only problem, then that is easily explained. The initial state probability reflects P(s(i) | lambda), essentially it is drawn conditional on nothing else (not even the first observation!). If you were to simulate many, many chains from your model, you should see that the distribution of s(1) would be essentially equal to the distribution described by your initial state parameter. Gamma(1), on the other hand, asks the probability of the initial state, given every observed value from t=0 until the end of your chain. Essentially, it includes information from all observations, as opposed to the initial state parameter which includes none of this information. Thus the two should not, in general, be the same value.

Hope this answers your question!

answered May 26 '11 at 20:36

Scott's gravatar image

Scott
1

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.