Understanding Maximum Likelihood Estimation


Let X_{1},X_{2},\cdots be iid random vectors with density or probability function f(x,\theta), where \theta is an unknown parameter, and suppose the true value of \theta is \theta_{0}.

The likelihood function: L(x,\theta)=\prod_{i=1}^{n}f(X_{i},\theta)
The log-likelihood function: l(x,\theta)=\sum_{i=1}^{n}\log f(X_{i},\theta)
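
As a concrete illustration (the chosen family, the function names, and the code are mine, not part of the original notes), here is a minimal Python sketch of these two functions, assuming an Exponential(\theta) model with density f(x,\theta)=\theta e^{-\theta x}:

```python
import numpy as np

def log_likelihood(x, theta):
    # l(x, theta) = sum_i log f(X_i, theta) for an assumed
    # Exponential(theta) density f(x, theta) = theta * exp(-theta * x)
    return np.sum(np.log(theta) - theta * x)

def likelihood(x, theta):
    # L(x, theta) = prod_i f(X_i, theta); in practice one works with
    # the log-likelihood, since the raw product underflows quickly
    return np.exp(log_likelihood(x, theta))
```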

Note that these two functions are actually the joint density (or probability) function of the iid data X_{1},\cdots,X_{n}; we call them "likelihood" functions rather than "probability" functions because we now view them as functions of the unknown parameter \theta.

The maximum likelihood estimator \hat{\theta}_{MLE} is found by maximizing the likelihood function; below we show the idea behind this procedure.

Jensen's inequality states that for a concave function g, E[g(X)]\leq g(E[X]) for any random variable X. Since g(x)=\log x is concave, under \theta_{0} we have E_{\theta_{0}}\log f(X,\theta)-E_{\theta_{0}}\log f(X,\theta_{0})=E_{\theta_{0}}\log\frac{f(X,\theta)}{f(X,\theta_{0})}\leq \log[E_{\theta_{0}}\frac{f(X,\theta)}{f(X,\theta_{0})}]=\log[\int\frac{f(x,\theta)}{f(x,\theta_{0})}f(x,\theta_{0})\,dx]=\log\int f(x,\theta)\,dx=\log 1=0.

That is to say, \max_{\theta}E_{\theta_{0}}\log f(X,\theta)=E_{\theta_{0}}\log f(X,\theta_{0}), since E_{\theta_{0}}\log f(X,\theta)\leq E_{\theta_{0}}\log f(X,\theta_{0}) for every \theta.
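
Here is a quick Monte Carlo check of this inequality, a sketch under the same assumed Exponential(\theta) model with \theta_{0}=2 (the model and the numbers are illustrative assumptions, not from the notes): approximate E_{\theta_{0}}\log f(X,\theta) by a sample average under \theta_{0} and compare a few candidate values of \theta.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0 = 2.0
x = rng.exponential(scale=1.0 / theta0, size=200_000)  # sample drawn under theta_0

def mean_log_f(theta):
    # Monte Carlo approximation of E_{theta_0}[log f(X, theta)]
    return np.mean(np.log(theta) - theta * x)

for theta in [0.5, 1.0, 2.0, 4.0]:
    print(theta, mean_log_f(theta))
# The average is largest (least negative) at theta = 2.0 = theta_0,
# matching E_{theta_0} log f(X, theta) <= E_{theta_0} log f(X, theta_0).
```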

Now a natural criterion for finding \theta_{0} is to look for the parameter that maximizes h(\theta)\equiv E_{\theta_{0}}\log f(X,\theta).

However, the function we want to maximize, h(\theta), is unknown, because it depends on the unknown \theta_{0} through the expectation E_{\theta_{0}}.

But note that \theta_{0} appears only in the expectation operator E_{\theta_{0}}, so we can overcome this problem by using the sample-mean operator \frac{1}{n}\sum_{i=1}^{n} instead.

Finally, we conclude that we can obtain an estimator by maximizing l(x,\theta)=\sum_{i=1}^{n}\log f(X_{i},\theta), which is exactly the maximum likelihood procedure.
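
As a sketch of how this maximization looks in practice (again assuming the illustrative Exponential(\theta) model, for which the MLE has the closed form \hat{\theta}=1/\bar{X}), one can compare the closed form with a generic numerical maximizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
theta0 = 2.0
x = rng.exponential(scale=1.0 / theta0, size=1_000)

def neg_log_likelihood(theta):
    # minimizing the negative log-likelihood == maximizing l(x, theta)
    return -np.sum(np.log(theta) - theta * x)

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
print("numerical MLE:", res.x)
print("closed-form MLE 1/mean(x):", 1.0 / np.mean(x))  # the two should agree
```

Working with the negative log-likelihood is only a convention of numerical optimizers, which minimize by default.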

There is some loss in replacing the true expectation by the sample mean, and the quality of the resulting estimator rests on the law of large numbers, which guarantees that \frac{1}{n}l(x,\theta) converges to h(\theta).
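
To see the law of large numbers at work, one can watch the estimator settle down around \theta_{0} as the sample size grows; a small sketch under the same assumed Exponential model:

```python
import numpy as np

rng = np.random.default_rng(2)
theta0 = 2.0

for n in [10, 100, 1_000, 10_000, 100_000]:
    x = rng.exponential(scale=1.0 / theta0, size=n)
    # for the Exponential family the MLE is 1 / sample mean
    print(n, 1.0 / np.mean(x))
# As n grows, (1/n) * l(x, theta) approaches h(theta) = E_{theta_0} log f(X, theta),
# and its maximizer approaches theta_0 = 2.0.
```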

(Update: 2012/Feb/17) We want to maximize \log P(X;\theta), or equivalently, to make

(\log P(X;\theta))'=\frac{P'(X;\theta)}{P(X;\theta)}\to 0,

where the derivative is taken with respect to \theta.

The above formula can be interpreted as saying that we want

1. P(X;\theta)\to1, or as large as possible,

2. P'(X;\theta)\to0, or as small as possible.

These two points imply that we hope our estimate \hat{\theta} makes the probability of the observed data X as large as possible, and at the same time we want some stability of that probability, i.e., we hope the rate of change of P(X;\theta), namely P'(X;\theta), is as small as possible.
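
To tie this back to the score equation above, here is a small sketch (same assumed Exponential model, so the derivative has a simple closed form) checking that the derivative of the log-likelihood with respect to \theta vanishes at \hat{\theta}_{MLE} but not, in general, at other values:

```python
import numpy as np

rng = np.random.default_rng(3)
theta0 = 2.0
x = rng.exponential(scale=1.0 / theta0, size=1_000)

def score(theta):
    # (d/d theta) l(x, theta) = n / theta - sum(x) for the Exponential model
    return len(x) / theta - np.sum(x)

theta_hat = 1.0 / np.mean(x)              # the MLE for this model
print("score at MLE:", score(theta_hat))  # ~ 0 (up to rounding)
print("score at theta0:", score(theta0))  # generally nonzero in a finite sample
```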

References:
A Course in Large Sample Theory (lecture notes), Xianyi Wu
A Course in Large Sample Theory, Thomas S. Ferguson
