
Maximum Likelihood Estimation (MLE) is often the preferred method for estimating statistical models. It is preferred due to the "nice" properties of the resulting estimators (e.g. they are asymptotically normal).
This MLE method involves specifying the likelihood function. Given a random sample $x_1, \dots, x_n$ and the probability density function (pdf) $f(x; \theta)$, the likelihood function is given as
$$ L(\theta) = \prod_{i=1}^{n} f(x_i; \theta). $$
It is worth noting that this function is viewed as a function of the parameter $\theta$, given the data $x_1, \dots, x_n$.
To help with the optimization of $L(\theta)$, we usually consider the log-likelihood $\ell(\theta) = \log L(\theta)$ instead. There are various optimization algorithms that can be used to find the $\theta$ that maximizes $\ell(\theta)$. These include Particle Swarm Optimisation, Genetic Algorithms and the Expectation-Maximization (EM) algorithm, amongst others. However, the suitable optimization algorithm to use depends on the functional form of $\ell(\theta)$. One can use Fitness Landscape Analysis to fully understand the structure of $\ell(\theta)$. A more detailed explanation of Fitness Landscape Analysis can be found here.
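Before turning to the EM algorithm, it may help to see what numerically maximizing a log-likelihood looks like in practice. The following is a minimal R sketch; the exponential model, the simulated data and the variable names are illustrative assumptions, not part of the examples discussed later in this post.

```r
# Sketch: numerical MLE for the rate of an exponential distribution.
set.seed(1)
x <- rexp(500, rate = 2)  # simulated sample; the true rate 2 is an arbitrary choice

# Negative log-likelihood (optim minimizes by default)
neg_log_lik <- function(theta, x) {
  -sum(dexp(x, rate = theta, log = TRUE))
}

fit <- optim(par = 1, fn = neg_log_lik, x = x,
             method = "Brent", lower = 1e-6, upper = 100)
fit$par  # maximum likelihood estimate; should be close to 1 / mean(x)
```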
In this post, we focus on cases where the EM algorithm may be useful in finding the estimates of $\theta$. We will go through the theory of the EM algorithm, and then look at examples.

The EM algorithm was formally established by Arthur Dempster, Nan Laird, and Donald Rubin in their 1977 paper. The EM algorithm is useful in cases where we are analyzing a system with incomplete or missing data. The incomplete-data case occurs when we have a combination of data that we can observe and data that we cannot observe (i.e. "hidden data"). This means that we observe some data vector $x$, and we are unable to observe a vector $z$, but we believe that $x$ was generated in conjunction with $z$.
It is important to note that our hypothesis is that although we only observe $x$, we believe that the underlying model contains $z$, which we do not observe. This means that we have that
$$ f(x; \theta) = \sum_{z} f(x, z; \theta), $$
where $f(x; \theta)$ is the pdf of $x$.
So, if we were using the MLE approach, we would want to maximize
$$ \ell(\theta) = \log f(x; \theta) = \log \sum_{z} f(x, z; \theta). $$
In most instances, $\ell(\theta)$ will be difficult to maximize because we have a log of a sum. However, the EM algorithm provides a potential solution to this problem. Consider the following formulation:
$$ \log \sum_{z} f(x, z; \theta) = \log \sum_{z} q(z) \frac{f(x, z; \theta)}{q(z)} = \log E_{q}\!\left[ \frac{f(x, z; \theta)}{q(z)} \right] \geq E_{q}\!\left[ \log \frac{f(x, z; \theta)}{q(z)} \right], $$
where $q(z)$ is some distribution function over $z$ and $E_{q}[\cdot]$ is the expectation under the probability measure $q$. The inequality follows from Jensen's inequality and the concavity of the $\log$ function.
The fact that the above holds with "$\geq$" means that we have created a lower bound for $\ell(\theta)$. It is important to note that this lower bound exists for any choice of $q$. So we need to think carefully about our choice of $q$. In addition, we want the lower bound to be as tight as possible on $\ell(\theta)$. That is, if possible, we want to choose $q$ in such a way that
$$ \log \sum_{z} f(x, z; \theta) = E_{q}\!\left[ \log \frac{f(x, z; \theta)}{q(z)} \right]. $$
Given that $\log$ is strictly concave, one can show that equality holds if and only if the expectation is taken with respect to a constant. That is:
$$ \frac{f(x, z; \theta)}{q(z)} = c \quad \text{for some constant } c. $$
This means that
$$ q(z) \propto f(x, z; \theta). $$
However, since $q$ is a proper distribution and its probabilities must sum to one, we have that
$$ q(z) = \frac{f(x, z; \theta)}{\sum_{z} f(x, z; \theta)}, $$
thus
$$ q(z) = f(z \mid x; \theta). $$
The process of finding this $q$ is called the Expectation step. We are now going to focus on the Maximization step.
The Maximization step involves finding a $\theta$ that maximizes the lower bound
$$ E_{q}\!\left[ \log \frac{f(x, z; \theta)}{q(z)} \right] $$
at the $q(z)$ chosen in the Expectation step. We then use these new $\theta$ values as initial values for the Expectation step, and continue iterating between the Expectation and Maximization steps.
In summary, the EM algorithm is iterative and is implemented as follows (a generic code sketch of the loop is given after the list):
- Expectation step
  - Assume an initial value for $\theta$
  - Find the $q(z)$ that makes the lower bound tight, i.e. $q(z) = f(z \mid x; \theta)$
  - This $q(z)$ will then be our estimate of the distribution of the unobserved data $z$
- Maximisation step
  - Given the $q(z)$ estimated in the Expectation step, maximize the lower bound with respect to $\theta$
  - Use this estimated value of $\theta$ as the initial value for the Expectation step above, and repeat until convergence
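To make the iteration concrete, here is a generic skeleton of the loop in R. It is only a sketch: `e_step`, `m_step` and `log_lik` are hypothetical placeholder functions that would have to be written for the specific model at hand.

```r
# Generic EM skeleton. e_step, m_step and log_lik are model-specific
# functions supplied by the user; they are placeholders here, not
# implementations tied to any of the examples below.
em <- function(x, theta_init, e_step, m_step, log_lik,
               tol = 1e-8, max_iter = 1000) {
  theta  <- theta_init
  ll_old <- -Inf
  for (iter in seq_len(max_iter)) {
    q     <- e_step(x, theta)   # Expectation: distribution of the hidden data
    theta <- m_step(x, q)       # Maximization: update the parameter estimates
    ll    <- log_lik(x, theta)  # observed-data log-likelihood after the update
    if (abs(ll - ll_old) < tol) break  # stop once the likelihood stabilizes
    ll_old <- ll
  }
  list(theta = theta, log_lik = ll, iterations = iter)
}
```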
One can also show that after each iteration, the value of $\theta$ that we obtain produces a value of $\ell(\theta)$ that is at least as large as the value of $\ell(\theta)$ at the previous iteration. This means that the procedure converges to at least a local optimum. In addition, one can show that, under suitable regularity conditions, this procedure converges to the MLE that we are looking for. The proofs of the above claims can be found here.
Disadvantages of the EM algorithm
The EM algorithm tends to find local optima. So one needs to use many different starting values to increase the chances that the global optimum is attained. In addition, convergence to the optimum is slow, especially if the parameters we are trying to estimate are not separable.
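One simple way to act on this is to rerun the algorithm from several random starting values and keep the run with the highest log-likelihood. Below is a generic sketch, assuming a hypothetical `run_em` function (such as the `em` skeleton above) that returns a list with `$theta` and `$log_lik`.

```r
# Sketch: restart an EM implementation from several random starting values
# and keep the best fit by log-likelihood. run_em and init_fn are supplied
# by the user (placeholders, not tied to a specific model).
em_restarts <- function(x, run_em, n_starts = 20,
                        init_fn = function() runif(3)) {
  best <- NULL
  for (s in seq_len(n_starts)) {
    fit <- run_em(x, theta_init = init_fn())  # one EM run from a random start
    if (is.null(best) || fit$log_lik > best$log_lik) best <- fit
  }
  best
}
```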
In the next section, we look at some numerical examples in order to see how one would implement the EM algorithm in practice.
In this section we look at a couple of examples that illustrate how the EM algorithm is implemented in practice: a Coin Tossing example, and an example of estimating the parameters of a mixture of Gaussian distributions.
Coin Tossing Example

Consider the scenario where you have three possibly biased coins: Coin 1, Coin 2 and Coin 3. Assume that the probability of observing a head from Coin 1 is $\lambda$, from Coin 2 is $p_1$, and from Coin 3 is $p_2$. We first toss Coin 1:
- If it lands on Heads (H), we proceed to toss Coin 2, $m$ times
- If it lands on Tails (T), we proceed to toss Coin 3, $m$ times.
So, for $n$ repetitions of this procedure, say, what we end up observing is a set of $n$ head-and-tail sequences, but we restrict ourselves to the case where we don't know from which coin (Coin 2 or Coin 3) each sequence comes. So the data is "incomplete" in the sense that we are not able to observe the result of tossing Coin 1. Thus we can use the EM algorithm to estimate the parameter vector $\theta = (\lambda, p_1, p_2)$.
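To make the data-generating process concrete, here is a short R sketch that simulates such incomplete data. The true values of $\lambda$, $p_1$, $p_2$, $m$ and $n$ below are arbitrary illustrative choices, not the values used for the simulation results reported later.

```r
# Sketch: simulate the coin-tossing data (illustrative parameter values).
set.seed(42)
n <- 200; m <- 10                      # n sequences of m tosses each
lambda <- 0.3; p1 <- 0.8; p2 <- 0.4    # true parameters (arbitrary choices)

z <- rbinom(n, size = 1, prob = lambda)         # hidden: did Coin 1 land heads?
x <- ifelse(z == 1,
            rbinom(n, size = m, prob = p1),     # heads from m tosses of Coin 2
            rbinom(n, size = m, prob = p2))     # heads from m tosses of Coin 3
# Only x is kept; discarding z is what makes the data "incomplete".
```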
The Expectation and Maximization steps for this model are as follows.
Expectation Step
Let $x_i$ denote the number of heads observed in the $i$-th sequence of $m$ tosses, and let $z_i \in \{H, T\}$ denote the unobserved outcome of tossing Coin 1. Then
$$ q_i(H) = P(z_i = H \mid x_i; \theta) = \frac{\lambda \, p_1^{x_i} (1 - p_1)^{m - x_i}}{\lambda \, p_1^{x_i} (1 - p_1)^{m - x_i} + (1 - \lambda) \, p_2^{x_i} (1 - p_2)^{m - x_i}}, $$
and by the fact that $q_i$ is a proper distribution it follows that:
$$ q_i(T) = 1 - q_i(H). $$
Maximization Step
$$ \hat{\lambda} = \frac{1}{n} \sum_{i=1}^{n} \mu_i, \qquad \hat{p}_1 = \frac{\sum_{i=1}^{n} \mu_i x_i}{m \sum_{i=1}^{n} \mu_i}, \qquad \hat{p}_2 = \frac{\sum_{i=1}^{n} (1 - \mu_i) x_i}{m \sum_{i=1}^{n} (1 - \mu_i)}, $$
where:
$$ \mu_i = q_i(H) $$
is taken from the Expectation step.
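A minimal R implementation of these two steps, written against the formulas as reconstructed above (a sketch, not the code used to produce the results below), could look as follows:

```r
# Sketch: EM for the three-coin model.
coin_em <- function(x, m, theta = c(lambda = 0.5, p1 = 0.6, p2 = 0.4),
                    max_iter = 500, tol = 1e-10) {
  for (iter in seq_len(max_iter)) {
    lambda <- theta["lambda"]; p1 <- theta["p1"]; p2 <- theta["p2"]
    # E-step: posterior probability that Coin 1 landed heads for each sequence
    num <- lambda * dbinom(x, size = m, prob = p1)
    mu  <- num / (num + (1 - lambda) * dbinom(x, size = m, prob = p2))
    # M-step: closed-form parameter updates
    theta_new <- c(lambda = mean(mu),
                   p1 = sum(mu * x) / (m * sum(mu)),
                   p2 = sum((1 - mu) * x) / (m * sum(1 - mu)))
    if (max(abs(theta_new - theta)) < tol) { theta <- theta_new; break }
    theta <- theta_new
  }
  theta
}

coin_em(x, m = 10)  # assuming the simulated x from the previous sketch
```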
Results from simulations
We implemented the Coin Tossing example through simulations, generating data from fixed true values of $\lambda$, $p_1$ and $p_2$. The graph below depicts how the method converges through the iterations:

Note that the above diagram is just one possible output, as it depends on the starting values and the data vector simulated. Also, the density is an approximation, as the binomial distribution is a discrete distribution. The density approximation was used merely for illustration purposes.
Gaussian Mixture Example

Consider the case where data is generated from a mixture of Normals. Such cases include the problem of clustering data into different groups, amongst others. For a more detailed discussion on clustering, see Turing Finance's article.
Another application of Gaussian mixtures, which is of particular interest to us, is in quantitative finance. In particular, one can show that the probability density function of a Merton Jump Diffusion process is a mixture of Normal distributions, where the mixing probabilities are drawn from a Poisson distribution. That is, we only observe the data $x_1, \dots, x_n$, but we are not sure from which Normal distribution each observation $x_i$ was drawn. Let $z_i$ be the identifier of the Normal distribution from which $x_i$ was drawn. Note that $x_i$ is observed but $z_i$ is not, so the EM algorithm will assist us in estimating the parameters of this model. The figure above shows an example of what the density of a mixture of three Normals would look like.
In our example, we assume that the data is generated from a mixture of 3 normal distributions. Let the parameters of each normal distribution be given as follows:
- First normal has mean $\mu_1$ and variance $\sigma_1^2$, and is chosen with probability $\pi_1$
- Second normal has mean $\mu_2$ and variance $\sigma_2^2$, and is chosen with probability $\pi_2$
- Third normal has mean $\mu_3$ and variance $\sigma_3^2$, and is chosen with probability $\pi_3$
Note that $\pi_1 + \pi_2 + \pi_3 = 1$. This is because the sum of all the probabilities must equal 1. This means that the pdf of $x$ is given as:
$$ f(x; \theta) = \sum_{k=1}^{3} \pi_k \, \phi(x; \mu_k, \sigma_k^2), $$
where $\phi(x; \mu_k, \sigma_k^2)$ is the Normal pdf with mean $\mu_k$ and variance $\sigma_k^2$. Note that $\pi_3$ does not need to be explicitly estimated, as $\pi_3 = 1 - \pi_1 - \pi_2$.
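To see what this density looks like, the short R sketch below evaluates and plots the mixture pdf for an arbitrary illustrative choice of $\pi_k$, $\mu_k$ and $\sigma_k$ (these values are assumptions for the plot only, not the simulation settings used later):

```r
# Sketch: evaluate and plot a three-component Gaussian mixture density.
pi_k    <- c(0.3, 0.5, 0.2)   # mixing probabilities (sum to 1)
mu_k    <- c(-4, 0, 5)        # component means
sigma_k <- c(1, 1.5, 2)       # component standard deviations

mix_pdf <- function(x) {
  rowSums(sapply(1:3, function(k) pi_k[k] * dnorm(x, mu_k[k], sigma_k[k])))
}

x_grid <- seq(-10, 12, length.out = 500)
plot(x_grid, mix_pdf(x_grid), type = "l",
     xlab = "x", ylab = "density", main = "Mixture of three Normals")
```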
Applying the EM algorithm to estimate $\theta = (\pi_1, \pi_2, \mu_1, \mu_2, \mu_3, \sigma_1^2, \sigma_2^2, \sigma_3^2)$, we have that:
Expectation Step
$$ \gamma_{i1} = P(z_i = 1 \mid x_i; \theta) = \frac{\pi_1 \, \phi(x_i; \mu_1, \sigma_1^2)}{\sum_{k=1}^{3} \pi_k \, \phi(x_i; \mu_k, \sigma_k^2)}, $$
and by symmetry we have that:
$$ \gamma_{ik} = P(z_i = k \mid x_i; \theta) = \frac{\pi_k \, \phi(x_i; \mu_k, \sigma_k^2)}{\sum_{j=1}^{3} \pi_j \, \phi(x_i; \mu_j, \sigma_j^2)}, \qquad k = 2, 3. $$
Maximization Step
$$ \hat{\pi}_k = \frac{1}{n} \sum_{i=1}^{n} \gamma_{ik}, \qquad \hat{\mu}_k = \frac{\sum_{i=1}^{n} \gamma_{ik} x_i}{\sum_{i=1}^{n} \gamma_{ik}}, \qquad \hat{\sigma}_k^2 = \frac{\sum_{i=1}^{n} \gamma_{ik} (x_i - \hat{\mu}_k)^2}{\sum_{i=1}^{n} \gamma_{ik}}, $$
where $\phi(x; \mu_k, \sigma_k^2)$ is:
$$ \phi(x; \mu_k, \sigma_k^2) = \frac{1}{\sqrt{2 \pi \sigma_k^2}} \exp\!\left( -\frac{(x - \mu_k)^2}{2 \sigma_k^2} \right), $$
and $\gamma_{ik}$ is taken from the Expectation step.
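The sketch below implements these updates in R. It is a minimal illustration of the formulas as reconstructed above, not the code used to produce the simulation results that follow:

```r
# Sketch: EM for a K-component Gaussian mixture (K = 3 by default).
gmm_em <- function(x, K = 3, max_iter = 500, tol = 1e-8) {
  n      <- length(x)
  pi_k   <- rep(1 / K, K)                          # initial mixing probabilities
  mu_k   <- quantile(x, probs = (1:K) / (K + 1))   # crude initial means
  var_k  <- rep(var(x), K)                         # initial variances
  ll_old <- -Inf
  for (iter in seq_len(max_iter)) {
    # E-step: responsibilities resp[i, k] = P(z_i = k | x_i; theta)
    dens <- sapply(1:K, function(k) pi_k[k] * dnorm(x, mu_k[k], sqrt(var_k[k])))
    resp <- dens / rowSums(dens)
    # M-step: closed-form updates
    n_k   <- colSums(resp)
    pi_k  <- n_k / n
    mu_k  <- colSums(resp * x) / n_k
    var_k <- colSums(resp * (x - rep(mu_k, each = n))^2) / n_k
    # Observed-data log-likelihood at the parameters used in this E-step
    ll <- sum(log(rowSums(dens)))
    if (abs(ll - ll_old) < tol) break
    ll_old <- ll
  }
  list(pi = pi_k, mu = mu_k, var = var_k, log_lik = ll, iterations = iter)
}

# Example usage on simulated data (illustrative true values):
# x <- c(rnorm(300, -4, 1), rnorm(500, 0, 1.5), rnorm(200, 5, 2))
# gmm_em(x)
```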
Results from simulations
We implemented the mixture of Gaussian distributions example through simulations, generating data from fixed true values of the mixing probabilities, means and variances. The graph below depicts how the method converges.

In this section we provide the R code that was used to implement the two examples.