When the objective function is the log-likelihood, the negative of the inverse of the expected Hessian, divided by n, gives the approximate covariance matrix of the ML estimator. Are there similar results for other objective functions?
I think it's just because that's how the covariance (which is basically the second moment) is defined. The second moment of the score is just the Fisher information, which is the same as the negative of the expected Hessian of the log-likelihood, and its inverse gives the covariance. So I think this form of the estimate is specific to MLE. The Wikipedia derivation of the asymptotic normality of the MLE gets the result by relying on some regularity properties of the likelihood function, so it seems the same result should apply to other objective functions with those properties (for instance, the objective function must decompose over training examples in order for the central limit theorem to apply).
(Aug 19 '10 at 19:23)
Yaroslav Bulatov
According to the Cramér–Rao bound, the variance of any unbiased estimator $\hat{\theta}$ of $\theta$ is lower bounded by the inverse of the Fisher information (the Fisher information being the negative of the expected Hessian of the log-likelihood). The MLE is asymptotically efficient, which means that its asymptotic variance equals the inverse Fisher information, the best possible variance. I think any objective function which leads to an asymptotically efficient unbiased estimator will lead to a similar result (and it may require the regularity conditions for this to hold).
(Aug 19 '10 at 20:26)
spinxl39
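To make the two comments above concrete, here is a sketch of the standard asymptotic statements under the usual regularity conditions. The symbols $I$, $m$, $A$ and $B$ are notation introduced here for illustration and do not appear in the thread. For the MLE with true parameter $\theta_0$:

$$
I(\theta_0) = \mathbb{E}\!\left[\nabla \log p(x,\theta_0)\,\nabla \log p(x,\theta_0)^{\top}\right] = -\,\mathbb{E}\!\left[\nabla^2 \log p(x,\theta_0)\right],
\qquad
\operatorname{Cov}(\hat{\theta}) \approx \frac{1}{n}\, I(\theta_0)^{-1}.
$$

For a general objective that decomposes over examples, $\hat{\theta} = \arg\max_{\theta} \frac{1}{n}\sum_{i=1}^{n} m(x_i,\theta)$, the analogous result is the "sandwich" form:

$$
\operatorname{Cov}(\hat{\theta}) \approx \frac{1}{n}\, A^{-1} B\, A^{-1},
\qquad
A = \mathbb{E}\!\left[\nabla^2 m(x,\theta_0)\right],
\quad
B = \mathbb{E}\!\left[\nabla m(x,\theta_0)\,\nabla m(x,\theta_0)^{\top}\right].
$$

When $m$ is a correctly specified log-likelihood, $B = -A$, and the sandwich collapses to $\frac{1}{n}(-A)^{-1}$, which is the inverse-Fisher-information formula above.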
Isn't the covariance estimate equal to just the inverse of the negative of the expected Hessian?
And also, since it is the expected Hessian, it doesn't need to be divided by n.
The expectation is computed over what?
Over samples.
Suppose you are trying to learn p(x,t), where the true generating distribution is q(x)=p(x,t0). Then your MLE's asymptotic variance is -1/(nH), where H is the expected Hessian of the log-likelihood function evaluated at t0 (H is negative at a maximum, so the variance is positive), and the expectation is taken with respect to q. Sometimes people take the expectation with respect to q^n, which is the distribution over sequences of n IID points drawn from q, in which case you don't need the 1/n factor.
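A quick way to check the 1/n scaling is a small simulation. The sketch below is only an illustration under assumptions that are not in the thread: it takes p(x,t) to be an exponential density t·exp(-t·x), for which the per-sample expected Hessian at t0 is H = -1/t0^2, and the function name `simulate_mle_variance` and its parameters are made up for this example. Because H is negative, the predicted variance is written as -1/(nH).

```python
import numpy as np


def simulate_mle_variance(t0=2.0, n=400, trials=5000, seed=0):
    """Compare the empirical variance of the MLE with -1/(n*H).

    Assumed model (illustrative): p(x, t) = t * exp(-t * x), x >= 0.
    log p(x, t) = log(t) - t * x, so the second derivative in t is -1/t**2.
    """
    rng = np.random.default_rng(seed)

    # Each row is one dataset of n IID draws from q(x) = p(x, t0).
    x = rng.exponential(scale=1.0 / t0, size=(trials, n))

    # The MLE of the rate of an exponential is 1 / (sample mean).
    t_hat = 1.0 / x.mean(axis=1)
    empirical_var = t_hat.var(ddof=1)

    # Per-sample expected Hessian of the log-likelihood at t0,
    # with the expectation taken with respect to q (not q^n).
    H = -1.0 / t0**2
    theoretical_var = -1.0 / (n * H)

    return empirical_var, theoretical_var


empirical_var, theoretical_var = simulate_mle_variance()
print(f"empirical variance of MLE: {empirical_var:.5f}")
print(f"-1/(n*H) prediction:       {theoretical_var:.5f}")
```

The two printed numbers should be close but not identical, since the formula is asymptotic and the simulation is finite. If the expectation were instead taken over q^n, the Hessian of the full-sample log-likelihood would be n·H and the explicit 1/n factor would disappear.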