Wednesday, March 11, 2015

Maximum Likelihood Estimation

Ø Maximum Likelihood Estimation-
Maximum likelihood estimation (MLE) is a method for finding the probability distribution, or equivalently the parameter values, that make the observed data most likely.
If the form of the distribution is known but its parameters (say, the mean and variance) are not, MLE can estimate those population parameters from a sample. For a fixed set of data and an underlying statistical model, the method of maximum likelihood selects the values of the model parameters that maximize the likelihood function.
Once a model is specified with its parameters, one is in a position to evaluate its goodness of fit, that is, how well it fits the observed data. Goodness of fit is assessed by finding the parameter values of the model that best fit the data, a procedure called parameter estimation.
There are two general methods of parameter estimation: least-squares estimation (LSE) and maximum likelihood estimation (MLE).

o   Probability density function-
The goal of data analysis is to identify the population that is most likely to have generated the sample. Each population is identified by a corresponding probability distribution, and each probability distribution is associated with a unique value of the model's parameter. Changing the model's parameters changes the probability distribution.
Let f(y|w) denote the probability density function (PDF) that specifies the probability of observing the data vector y given the parameter vector w. If the individual observations y_i are statistically independent of one another, then, according to the theory of probability, the PDF for the data y = (y_1, ..., y_m) given the parameter vector w can be expressed as the product of the PDFs for the individual observations:
$$f(y = (y_1, \ldots, y_m) \mid w) = f_1(y_1 \mid w)\, f_2(y_2 \mid w) \cdots f_m(y_m \mid w)$$
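The product rule for independent observations can be sketched in a few lines of Python. This is only an illustration: the normal model, the sample values, and the function names are assumptions chosen for the example, not part of the text above. It also shows that summing log-densities gives the same result as multiplying densities, which is why the log-likelihood is usually the more convenient quantity.

```python
import math

def normal_pdf(y, mean, sd):
    """Density of a single observation under a normal model (assumed for illustration)."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def joint_pdf(ys, mean, sd):
    """Joint density of independent observations: product of the individual densities."""
    return math.prod(normal_pdf(y, mean, sd) for y in ys)

def joint_log_pdf(ys, mean, sd):
    """Same quantity on the log scale: sum of the individual log-densities."""
    return sum(math.log(normal_pdf(y, mean, sd)) for y in ys)

ys = [4.8, 5.1, 5.6]  # made-up sample
print(joint_pdf(ys, mean=5.0, sd=0.5))
print(math.exp(joint_log_pdf(ys, mean=5.0, sd=0.5)))  # agrees up to rounding
```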
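Example- Consider the simplest case of one observation and one parameter, that is, m = k = 1, where m is the number of observations and k is the number of parameters.
Suppose that the data y represents the number of successes in a sequence of 10 Bernoulli trials and that the probability of a success on any one trial, represented by the parameter w, is 0.2. The PDF in this case is
$$f(y \mid n = 10, w = 0.2) = \frac{10!}{y!\,(10 - y)!}\,(0.2)^{y}\,(0.8)^{10 - y}, \qquad y = 0, 1, \ldots, 10,$$
which is known as the binomial distribution with parameters n = 10 and w = 0.2.
The general expression of the PDF of the binomial distribution for arbitrary values of w and n is
$$f(y \mid n, w) = \frac{n!}{y!\,(n - y)!}\,w^{y}\,(1 - w)^{n - y}, \qquad 0 \le w \le 1,\; y = 0, 1, \ldots, n.$$
The PDF changes as the parameter value changes.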
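To see how the binomial PDF shifts with the parameter, here is a minimal sketch that evaluates it for n = 10 at w = 0.2 and at a second value, w = 0.7, picked purely for comparison. The function name `binomial_pmf` is an assumption for this example.

```python
from math import comb

def binomial_pmf(y, n, w):
    """Probability of y successes in n Bernoulli trials with success probability w."""
    return comb(n, y) * w**y * (1 - w)**(n - y)

n = 10
for w in (0.2, 0.7):
    probs = [binomial_pmf(y, n, w) for y in range(n + 1)]
    mode = max(range(n + 1), key=lambda y: probs[y])  # most probable outcome
    print(f"w = {w}: most probable y = {mode}, P(y = {mode}) = {probs[mode]:.4f}")
```

With w = 0.2 the mass concentrates on small counts; with w = 0.7 it moves toward large counts.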
o   Likelihood function-
Given the observed data and a model of interest, the task is to find the one PDF, among all the probability densities that the model prescribes, that is most likely to have produced the data.
The likelihood function is defined by reversing the roles of the data vector y and the parameter vector w in f(y|w), i.e.
L(w|y) = f(y|w)
L(w|y) represents the likelihood of the parameter w given the observed data y and is therefore a function of w.
Example- For the one-parameter binomial example, the likelihood function for y = 7 and n = 10 is given by
$$L(w \mid n = 10, y = 7) = f(y = 7 \mid n = 10, w) = \frac{10!}{7!\,3!}\,w^{7}\,(1 - w)^{3}, \qquad 0 \le w \le 1.$$
The likelihood function tells us the likelihood ("un-normalized probability") of a particular parameter value for a fixed data set.
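The "un-normalized" remark can be checked numerically. The sketch below evaluates the binomial likelihood for y = 7, n = 10 on a grid of w values and approximates the area under the curve; the grid resolution and the names `likelihood`, `grid`, and `area` are assumptions for this example.

```python
from math import comb

def likelihood(w, y=7, n=10):
    """Binomial likelihood of parameter w for fixed data (y successes in n trials)."""
    return comb(n, y) * w**y * (1 - w)**(n - y)

grid = [i / 100 for i in range(101)]      # w = 0.00, 0.01, ..., 1.00
values = [likelihood(w) for w in grid]

for w in (0.3, 0.5, 0.7, 0.9):
    print(f"L(w = {w}) = {likelihood(w):.4f}")

# Crude Riemann sum over w in [0, 1]; a probability density would give 1 here.
area = sum(values) / 100
print(f"area under L(w) is roughly {area:.4f}")
```

The area comes out near 1/11 rather than 1, so the likelihood is a function of w but not a probability density over w.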

o   Calculating Maximum likelihood estimation
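The MLE estimate is obtained by maximizing the log-likelihood function, ln L(w|y). Because the logarithm is monotonically increasing, ln L(w|y) and L(w|y) are maximized at the same value of w. Two conditions must hold:
1-     Likelihood equation: assuming that the log-likelihood function ln L(w|y) is differentiable, if wMLE exists, it must satisfy the following equation, known as the likelihood equation:
$$\frac{\partial \ln L(w \mid y)}{\partial w_i} = 0 \quad \text{at } w_i = w_{i,\mathrm{MLE}}, \quad i = 1, \ldots, k.$$
2-     ln L(w|y) must be a maximum and not a minimum at wMLE. The log-likelihood must therefore be concave near wMLE (a peak, not a valley), which means its second partial derivative must be less than 0:
$$\frac{\partial^2 \ln L(w \mid y)}{\partial w_i^2} < 0 \quad \text{at } w_i = w_{i,\mathrm{MLE}}, \quad i = 1, \ldots, k.$$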
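Example- Consider the binomial likelihood from the example in the likelihood function section above, L(w | n = 10, y = 7) = 10!/(7! 3!) w^7 (1 - w)^3.
Taking the natural logarithm:
$$\ln L(w \mid n = 10, y = 7) = \ln\frac{10!}{7!\,3!} + 7 \ln w + 3 \ln(1 - w)$$
Taking the derivative with respect to w and setting it to zero:
$$\frac{d \ln L}{d w} = \frac{7}{w} - \frac{3}{1 - w} = \frac{7 - 10w}{w(1 - w)} = 0$$
We get wMLE = 0.7.
To confirm that this is a maximum and not a minimum, check that the second derivative is negative:
$$\frac{d^2 \ln L}{d w^2} = -\frac{7}{w^2} - \frac{3}{(1 - w)^2} \approx -47.62 < 0 \quad \text{at } w = 0.7$$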
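The two conditions for the binomial example (y = 7, n = 10) can be verified numerically. This is a minimal sketch; the function names `log_likelihood`, `score`, and `curvature` are labels chosen for this example.

```python
import math

def log_likelihood(w, y=7, n=10):
    """ln L(w | y) for the binomial model."""
    return math.log(math.comb(n, y)) + y * math.log(w) + (n - y) * math.log(1 - w)

def score(w, y=7, n=10):
    """First derivative of the log-likelihood with respect to w."""
    return y / w - (n - y) / (1 - w)

def curvature(w, y=7, n=10):
    """Second derivative of the log-likelihood with respect to w."""
    return -y / w**2 - (n - y) / (1 - w)**2

w_mle = 7 / 10
print(score(w_mle))      # essentially zero: the likelihood equation holds
print(curvature(w_mle))  # negative: the stationary point is a maximum
```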
When the model involves many parameters and its PDF is highly non-linear, it is usually difficult or impossible to obtain a closed-form (analytic) solution for the MLE estimate. In that case the estimate is found numerically with an iterative optimization algorithm that starts from initial, often randomly chosen, parameter values and improves them step by step. Depending on the choice of the initial parameter values, the algorithm could stop prematurely at a local maximum and return a sub-optimal set of parameter values. A common safeguard is to choose different starting values over multiple runs of the iteration procedure and then examine the results to see whether the same solution is obtained repeatedly. When that happens, one can conclude with some confidence that a global maximum has been found.
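The multiple-starting-values check can be sketched as follows. The local optimizer here is a deliberately simple step-halving search written for this example, not a recommended production routine, and it is applied to the one-parameter binomial log-likelihood (y = 7, n = 10) so the answer is known. The names `local_search`, `starts`, and `solutions`, the seed, and the number of restarts are all assumptions.

```python
import math
import random

def neg_log_likelihood(w, y=7, n=10):
    # The constant binomial coefficient is omitted; it does not move the optimum.
    return -(y * math.log(w) + (n - y) * math.log(1 - w))

def local_search(f, start, step=0.1, tol=1e-8, lo=1e-6, hi=1 - 1e-6):
    """Derivative-free descent: step to a better neighbor, otherwise halve the step."""
    x = start
    while step > tol:
        candidates = [min(hi, max(lo, x + d)) for d in (-step, step)]
        best = min(candidates, key=f)
        if f(best) < f(x):
            x = best
        else:
            step /= 2
    return x

random.seed(0)  # fixed seed so the run is repeatable
starts = [random.uniform(0.05, 0.95) for _ in range(5)]
solutions = [local_search(neg_log_likelihood, s) for s in starts]
print([round(s, 4) for s in solutions])
```

If every run lands on the same value, that is evidence (not proof) that the global maximum has been found. For a real multi-parameter model one would normally use an established optimizer rather than this toy search.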
o   Relation to least-squares estimation-
In MLE, we seek the parameter values that are most likely to have produced the data. In LSE, on the other hand, we seek the parameter values that provide the most accurate description of the data, measured in terms of how closely the model fits the data under the square-loss function.
In LSE, the sum of squares error (SSE) between observations and predictions is minimized:
$$SSE(w) = \sum_{i=1}^{m} \left( y_i - \mathrm{prd}_i(w) \right)^2$$
where prd_i(w) denotes the model's prediction for the i-th observation.
Minimization of the SSE is also subject to the local minima problem, especially when the model is non-linear with respect to its parameters.
MLE should be preferred to LSE unless the probability density function is unknown or difficult to obtain in an easily computable form.
There is one case in which the two methods coincide: when observations are independent of one another and normally distributed with a constant variance, the MLE estimates equal the LSE estimates.
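The agreement between MLE and LSE under independent, constant-variance normal errors can be illustrated with a small sketch. Everything specific here is an assumption made for the example: the one-parameter model y = b * x, the made-up data, the fixed sigma, and the helper `minimize_1d`. The least-squares estimate is computed in closed form, and the Gaussian negative log-likelihood is minimized numerically.

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up observations

def sse(b):
    """Sum of squared errors between observations and predictions b * x."""
    return sum((y - b * x) ** 2 for x, y in zip(xs, ys))

def neg_log_likelihood(b, sigma=1.0):
    """Negative Gaussian log-likelihood with constant, known sigma."""
    n = len(xs)
    return 0.5 * n * math.log(2 * math.pi * sigma**2) + sse(b) / (2 * sigma**2)

def minimize_1d(f, lo, hi, iters=200):
    """Ternary search for the minimum of a unimodal function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

# Closed-form least-squares slope for a line through the origin.
b_lse = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
b_mle = minimize_1d(neg_log_likelihood, 0.0, 5.0)
print(b_lse, b_mle)  # the two estimates agree up to numerical tolerance
```

The negative log-likelihood is the SSE scaled by a positive constant plus a term that does not depend on b, so both criteria are minimized by the same b.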

