Ø Maximum Likelihood Estimation-
Maximum likelihood estimation is a method for finding the probability distribution or parameter values that make the observed data most likely. If the form of the distribution is known but its parameters (say, the mean and variance) are not, MLE can estimate those population parameters from a sample. For a fixed set of data and an underlying statistical model, the method of maximum likelihood selects the set of model parameter values that maximizes the likelihood function.
Once a model is specified with its parameters, one is in a position to evaluate its goodness of fit. Goodness of fit is assessed by finding the parameter values of the model that best fit the data, a procedure called parameter estimation. There are two general methods of parameter estimation: least-squares estimation (LSE) and maximum likelihood estimation (MLE).
o Probability density function-
Data analysis is used to identify the population that is most likely to have generated the sample. Each population is identified by a corresponding probability distribution, and each probability distribution corresponds to a unique value of the model's parameters; a change in the model's parameters changes the probability distribution.
Let f(y|w) denote the probability density function (PDF) that specifies the probability of observing the data vector y given the parameter vector w. If the individual observations yi are statistically independent of one another, then by the theory of probability the PDF for the data y = (y1, ..., ym) given the parameter vector w can be expressed as a product of the PDFs of the individual observations.
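In symbols, the product form is:

f(y \mid w) = f(y_1 \mid w) \, f(y_2 \mid w) \cdots f(y_m \mid w) = \prod_{i=1}^{m} f(y_i \mid w)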
Example- Consider one observation and one parameter, that is, m = k = 1. Suppose that the data y represent the number of successes in a sequence of 10 Bernoulli trials and that the probability of a success on any one trial, represented by the parameter w, is 0.2. The general expression of the PDF of the binomial distribution for arbitrary values of w and n is given below.
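In standard form, the binomial PDF for y successes in n trials with success probability w is:

f(y \mid n, w) = \binom{n}{y} \, w^{y} (1 - w)^{n - y}, \qquad y = 0, 1, \ldots, n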
With n = 10 and w = 0.2, this is the binomial distribution with parameters n = 10, w = 0.2. The PDF changes with a change in the parameter value, as the short sketch below illustrates.
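A small numerical sketch of this, assuming scipy is available; it simply prints the PMF over y = 0..10 for two different values of w:

# How the binomial PMF over y = 0..10 shifts as the parameter w changes.
from scipy.stats import binom

n = 10
for w in (0.2, 0.7):
    pmf = [binom.pmf(y, n, w) for y in range(n + 1)]
    print("w =", w, [round(p, 3) for p in pmf])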
o Likelihood function- Given the observed data and a model of interest, the likelihood function finds the one PDF, among all the probability densities that the model prescribes, that is most likely to have produced the data.
The likelihood function is defined by reversing the roles of the data vector y and the parameter vector w in f(y|w), i.e.
L(w|y) = f(y|w)
L(w|y) represents the likelihood of the parameter w given the observed data y, and it is a function of w.
Example- For the one-parameter binomial example, the likelihood function for y = 7 and n = 10 is given by:
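Substituting y = 7 and n = 10 into the binomial PDF above and treating the result as a function of w gives:

L(w \mid y = 7) = \binom{10}{7} \, w^{7} (1 - w)^{3}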
The likelihood function tells us the likelihood (the "un-normalized probability") of a particular parameter value for a fixed data set.
o Calculating the maximum likelihood estimate –
Required:
1- The likelihood function L(w|y).
2- A check that Ln(L(w|y)) is at a maximum and not a minimum.
Requirement 2 means the log-likelihood should be concave at the estimate, which means its second partial derivative should be less than 0.
The MLE estimate is obtained by maximizing the log-likelihood function, i.e. Ln(L(w|y)). Assuming that the log-likelihood function Ln(L(w|y)) is differentiable, if wMLE exists it must satisfy the following partial differential equation, known as the likelihood equation:
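In standard notation, with the derivative evaluated at w = wMLE:

\frac{\partial \ln L(w \mid y)}{\partial w} = 0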
Example- Return to the likelihood function from the binomial example above (y = 7, n = 10).
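Taking the natural log of that likelihood (the binomial coefficient term does not depend on w):

\ln L(w \mid y) = \ln \binom{10}{7} + 7 \ln w + 3 \ln (1 - w)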
Next, take the derivative with respect to w and equate it to zero.
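This gives the likelihood equation for this example:

\frac{d \ln L(w \mid y)}{dw} = \frac{7}{w} - \frac{3}{1 - w} = 0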
Solving gives wMLE = 7/10 = 0.7.
To confirm that this represents a maximum and not a minimum, the second derivative must be < 0.
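Here the second derivative is negative for every w in (0, 1), so w = 0.7 is indeed a maximum:

\frac{d^{2} \ln L(w \mid y)}{dw^{2}} = -\frac{7}{w^{2}} - \frac{3}{(1 - w)^{2}} < 0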
When the model involves many parameters and its PDF is highly non-linear, it is difficult to obtain a closed-form solution for the MLE estimate. Instead, the log-likelihood is maximized numerically, starting the optimization algorithm from some initial parameter values. Depending upon the choice of the initial parameter values, the algorithm could stop prematurely and return a sub-optimal set of parameter values (a local maximum). As a safeguard, one may choose different starting values over multiple runs of the iteration procedure and then examine the results to see whether the same solution is obtained repeatedly. When that happens, one can conclude with some confidence that a global maximum has been found; a sketch of this multi-start procedure follows.
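A minimal sketch of the multi-start idea, assuming numpy and scipy are available and reusing the binomial example (y = 7 successes in n = 10 trials); the function and variable names are illustrative only:

# Multi-start numerical MLE for the binomial example (y = 7, n = 10).
# Minimizing the negative log-likelihood is equivalent to maximizing the likelihood.
import numpy as np
from scipy.optimize import minimize

y, n = 7, 10

def neg_log_likelihood(params):
    w = params[0]
    # The log binomial coefficient is constant in w, so it is dropped here.
    return -(y * np.log(w) + (n - y) * np.log(1 - w))

rng = np.random.default_rng(0)
estimates = []
for _ in range(5):  # several runs from different random starting values
    w0 = rng.uniform(0.05, 0.95)
    result = minimize(neg_log_likelihood, [w0], bounds=[(1e-6, 1 - 1e-6)])
    estimates.append(round(result.x[0], 4))

print(estimates)  # all runs converging to ~0.7 suggests a global maximum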
o Relation to least-squares estimation-
In MLE we seek the parameter values that are most likely to have produced the data. In LSE, on the other hand, we seek the parameter values that provide the most accurate description of the data, measured in terms of how closely the model fits the data under the squared-loss function.
In LSE, the sum of squared errors (SSE) between the observations and the model's predictions is minimized:
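With \hat{y}_i(w) denoting the model's prediction for observation i (notation assumed here, since the notes do not fix one), the criterion is:

\mathrm{SSE}(w) = \sum_{i=1}^{m} \left( y_i - \hat{y}_i(w) \right)^{2}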
Minimization of the SSE is also subject to the local-minima problem, especially when the model is non-linear with respect to its parameters.
MLE
should be preferred to LSE, unless the probability density function is unknown
or difficult to obtain in an easily computable form.
When the observations are independent of one another and normally distributed with constant variance, the MLE estimates and the LSE estimates coincide.
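A one-line sketch of why (using the same assumed prediction notation as above): for independent normal errors with constant variance sigma^2, the log-likelihood is

\ln L(w \mid y) = -\frac{m}{2} \ln (2 \pi \sigma^{2}) - \frac{1}{2 \sigma^{2}} \sum_{i=1}^{m} \left( y_i - \hat{y}_i(w) \right)^{2}

which, as a function of w, is maximized exactly when the SSE term is minimized.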
https://drive.google.com/file/d/0Bx3mfFH5R-y3VS1JWlJNVFUyWnM/view?usp=sharing