## Understanding Maximum Likelihood Estimation

Let $X_{1},X_{2},\cdots$ be iid random vectors with density or probability mass function $f(x,\theta)$, where $\theta$ is the unknown parameter, and suppose the true value of $\theta$ is $\theta_{0}$.

The likelihood function: $L(x,\theta)=\prod_{i=1}^{n}f(X_{i},\theta)$
The log-likelihood function: $l(x,\theta)=\sum_{i=1}^{n}\log f(X_{i},\theta)$
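
For example (a standard illustration, not specific to these notes): if $X_{1},\ldots,X_{n}$ are iid $N(\mu,1)$, then $l(x,\mu)=-\frac{n}{2}\log(2\pi)-\frac{1}{2}\sum_{i=1}^{n}(X_{i}-\mu)^{2}$, which is maximized at $\hat{\mu}=\bar{X}$, the sample mean.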

Note that these two functions are actually the joint density or probability function of the iid data $X_{1},X_{2},\cdots$; we call them “likelihood” functions instead of “probability” functions because we now consider them as functions of the unknown parameter $\theta$.

The maximum likelihood estimator $\hat{\theta}_{MLE}$ is found by maximizing the likelihood function; we now show the idea behind this procedure.
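
As a minimal numerical sketch of this maximization (assuming a $N(\theta,1)$ model and simulated data, neither of which appears in the original notes), one can hand the negative log-likelihood to a generic optimizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Simulated iid sample from N(theta_0, 1); theta_0 = 2.0 is an
# arbitrary illustrative choice, not a value from the text.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=500)

def neg_log_likelihood(theta):
    # -l(x, theta) = -sum_i log f(X_i, theta) for the N(theta, 1)
    # density; negated because the optimizer minimizes.
    return -np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2)

res = minimize_scalar(neg_log_likelihood)
print(res.x)       # numerical MLE
print(x.mean())    # closed-form MLE for this model: the sample mean
```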

Jensen's inequality states that for a concave function $g$, $E[g(X)]\leq g(E[X])$ for any random variable $X$. Since $g(x)=\log x$ is concave, under $\theta_{0}$ we have $E_{\theta_{0}}\log f(X,\theta)-E_{\theta_{0}}\log f(X,\theta_{0})=E_{\theta_{0}}\log\frac{f(X,\theta)}{f(X,\theta_{0})}\leq\log\left[E_{\theta_{0}}\frac{f(X,\theta)}{f(X,\theta_{0})}\right]=\log\left[\int\frac{f(x,\theta)}{f(x,\theta_{0})}f(x,\theta_{0})\,dx\right]=\log\left[\int f(x,\theta)\,dx\right]=\log 1=0.$ (The left-hand side is exactly the negative Kullback-Leibler divergence from $f(\cdot,\theta_{0})$ to $f(\cdot,\theta)$.)
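
A quick Monte Carlo check of this inequality (again in the hypothetical $N(\theta,1)$ family, with $\theta_{0}=0$ and $\theta=0.7$ chosen arbitrarily):

```python
import numpy as np
from scipy.stats import norm

# Estimate E_{theta0} log[f(X, theta) / f(X, theta0)] by simulation;
# Jensen's inequality says this should be <= 0.
rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)  # draws under theta_0 = 0

log_ratio = norm.logpdf(x, loc=0.7) - norm.logpdf(x, loc=0.0)
print(log_ratio.mean())  # approx -(0.7 ** 2) / 2 = -0.245, indeed <= 0
```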

That is to say, since $E_{\theta_{0}}\log f(X,\theta)\leq E_{\theta_{0}}\log f(X,\theta_{0})$ for every $\theta$, we have $\max_{\theta}E_{\theta_{0}}\log f(X,\theta)=E_{\theta_{0}}\log f(X,\theta_{0})$.

Now a natural criterion for finding $\theta_{0}$ is to find the parameter that maximizes $h(\theta)\equiv E_{\theta_{0}}\log f(X,\theta)$.
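
To see what $h$ looks like in a concrete case (the $N(\theta,1)$ family again, an illustrative choice): $h(\theta)=E_{\theta_{0}}\left[-\frac{1}{2}\log(2\pi)-\frac{1}{2}(X-\theta)^{2}\right]=-\frac{1}{2}\log(2\pi)-\frac{1}{2}\left(1+(\theta-\theta_{0})^{2}\right)$, which is indeed uniquely maximized at $\theta=\theta_{0}$.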

However, the function $h(\theta)$ we want to maximize is unknown, because it depends on the unknown $\theta_{0}$ through the expectation.

But note that $\theta_{0}$ appears only in the expectation operator $E_{\theta_{0}}$; we can overcome this problem by using the sample mean operator $\frac{1}{n}\sum$ instead.

Finally, we conclude that we can find an estimator by maximizing $l(x,\theta)=\sum_{i=1}^{n}\log f(X_{i},\theta)$ (the factor $\frac{1}{n}$ does not affect the maximizer, so it can be dropped).

There is some loss in replacing the true expectation with the sample mean; the quality of this estimator rests on the law of large numbers, which guarantees that $\frac{1}{n}\sum_{i=1}^{n}\log f(X_{i},\theta)$ converges to $h(\theta)=E_{\theta_{0}}\log f(X,\theta)$.
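
A short simulation (same hypothetical $N(\theta,1)$ setup as above) showing the maximizer of the sample criterion drifting toward $\theta_{0}$ as $n$ grows:

```python
import numpy as np

# Maximize the sample criterion (1/n) sum_i log f(X_i, theta) over a
# grid of theta values; constants in log f are dropped since they do
# not affect the maximizer. True value theta_0 = 0 (illustrative).
rng = np.random.default_rng(2)
thetas = np.linspace(-1.0, 1.0, 201)

for n in (10, 100, 10_000):
    x = rng.normal(loc=0.0, scale=1.0, size=n)
    crit = np.array([-0.5 * np.mean((x - t) ** 2) for t in thetas])
    print(n, thetas[crit.argmax()])  # maximizer approaches theta_0 = 0
```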

(Update: 2012/Feb/17) We want to maximize $\log P(X;\theta)$, or equivalently (taking the derivative with respect to $\theta$), to have

$(\log P(X;\theta))'=\frac{P'(X;\theta)}{P(X;\theta)}\to 0.$

The above formula can be interpreted as saying that we want

1. $P(X;\theta)\to1$, or as large as possible,

2. $P'(X;\theta)\to0$, or as small as possible.

These two points imply that we hope our estimate $\hat{\theta}$ makes the probability of the observed data $X$ as large as possible, and at the same time gives some stability to that probability, i.e., we hope the rate of change of $P(X;\theta)$, namely $P'(X;\theta)$, is as small as possible.
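
As a standard worked example of this score equation (not from the original notes): for iid $X_{1},\ldots,X_{n}\sim\mathrm{Bernoulli}(\theta)$, $\log P(X;\theta)=\sum_{i=1}^{n}\left[X_{i}\log\theta+(1-X_{i})\log(1-\theta)\right]$, and setting its derivative $\frac{\sum_{i}X_{i}}{\theta}-\frac{n-\sum_{i}X_{i}}{1-\theta}$ to zero yields $\hat{\theta}=\bar{X}$, the sample proportion.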

References:
A Course in Large Sample Theory (lecture notes), Xianyi Wu
A Course in Large Sample Theory, Thomas S. Ferguson