Understanding the EM Algorithm

Maximum Likelihood Estimation (MLE) is often the preferred method when it comes to the estimation of statistical models. It is preferred due to the "nice" properties that the estimators from this estimation algorithm have (e.g. estimators are asymptotically normal).

This MLE method involves specifying the likelihood function. Given a random sample $\{X_i, ..., X_N\},$ and the probability density function (pdf)  $P(X|\theta)$, the likelihood function is given as $L(\theta|X)=\prod_{i=1}^{N} P(X_i|\theta)$. It is worth noting that this function is viewed as a function of $\theta$, given $X$.

To help with the optimization of $L(\theta|X)$, we usually consider $l(\theta|X)=\ln(L(\theta|X))$ instead. There are various optimization algorithms that can be used to find $\theta$ that maximizes $l(\theta|X)$. These include Particle Swam Optimisation, Genetic Algorithm and the Expectation Maximization (EM) algorithm, amongst others. However, the suitable optimization algorithm to use depends on the functional form of $l(\theta|X)$. One can use Fitness Landscape Analysis to fully understand the structure of $l(\theta|X)$. A more detailed explanation of Fitness Landscape Analysis can be found here.

In this post, we will be focusing on cases where the EM algorithm may be useful in finding the estimates of $\theta$. We will go through the theory of the EM algorithm, and then look at examples.

The EM algorithm was formally established by Arthur Dempster, Nan Laird, and Donald Rubin in thier 1977 paper. THE EM algorithm is useful in cases where we are analyzing a system with incomplete or missing data.The incomplete data case occurs when we have a combination of data that we can observe, and data that we cannot not observe (i.e "hidden data"). This means that we observe some data vector $X$, and we are unable to observe $Z$, but we believe that $X$ was generated in-conjunction with $Z$.

It is important to note that our hypothesis is that although we only observe $X$, we believe that the underlying model contains $Z$, which we do not observe. This means that we have that $P(X|\theta)=\sum_z P(X,Z|\theta),$ where $P(X|\theta)$ is the pdf of $X$.

So, if we were using the MLE approach, we would want to maximize:$l(\theta|X)=l(\theta)=\sum_{i=1}^{N} \ln(P(X_i|\theta))=\sum_{i=1}^{N} \ln(\sum_z P(X_i,Z|\theta)).$ In most instances, $l(\theta)$ will be difficult to maximize because we have a log of a sum. However, the EM algorithm provides a potential solution to this problem. Consider the following formulation:

\begin{align*} l(\theta) & = \sum_{i=1}^{N} \ln(\sum_z P(X_i,Z=z|\theta)) \\ & = \sum_{i=1}^{N} \ln\left(\sum_z Q(z)_i \frac{P(X_i,Z=z|\theta)}{Q(z)_i}\right)\\ & = \sum_{i=1}^{N} \ln\left(E_{Q(z)_i}\left[ \frac{P(X_i,Z=z|\theta)}{Q(z)_i}\right]\right)\\ & \ge \sum_{i=1}^{N} E_{Q(z)_i}\left[\ln\left(\left[ \frac{P(X_i,Z=z|\theta)}{Q(z)_i}\right]\right)\right] \text{ (by Jensen's inequality)}\\ \implies l(\theta) &\ge{l^*(\theta)}\end{align*}

where $l^*(\theta)=\sum_{i=1}^{N} E_{Q(z)_i}\left[\ln\left(\left[ \frac{P(X_i,Z=z|\theta)}{Q(z)_i}\right]\right)\right],$ $Q(z)_i$ is some distribution function and $E_{Q(z)_i}\left[*\right]$ is the expectation of $*$ under the probability measure $Q(z)_i$.

The fact that $l(\theta) \ge{l^*(\theta)}$ means that we have created a lower bound for $l(\theta)$. It is important to note that this lower bound exists for different $Q(z)_i's$. So we need to think carefully about our choice of $Q(z)_i$. In addition, we want the lower bound to be as tight as possible on $l(\theta)$. That is, if possible, we want to chose $Q(z)_i$ in such a way that  $l(\theta)= l^*(\theta)$.

Given that $\ln(x)$ is strictly concave, one can show that  $l(\theta)= l^*(\theta)$ if and only if the expectation is with respect to a constant. That is:$\frac{P(X_i,Z=z|\theta)}{Q(z)_i} =c$. This means that $Q(z)_i=\frac{P(X_i,Z=z|\theta)}{c}$.

However, since $Q(z)_i$ is a proper distribution and its probabilities must sum to one, we have that $c=\sum_z P(X_i,Z=z|\theta)$, thus $Q(z)_i=\frac{P(X_i,Z=z|\theta)}{\sum_z Pr(X_i,Z=z|\theta)}$. The process of finding  $Q(z)_i$ is called the Expectation step.  We are now going to focus on the Maximization step.

The Maximization step involves finding a $\theta$ that maximizes $l^*(\theta)$ at the chosen $Q(z)_i$ from the Expectation step. Finding this optimal $\theta$ given  $Q(z)_i$ is called the Maximization step. We then use these new $\theta$ values as initial values for the Expectation step, and continue iterating between the Expectation and Maximization steps.

In summary, the EM algorithm  is iterative and is implemented as follows:

• Expectation step
• Assume an initial value for $\theta$
• Find  $Q(z)_i$ that ensures that  $l(\theta)= l^*(\theta)$..
• This $Q(z)_i$ will then be our estimate of the distribution of the un-observed data $Z.$
• Maximisation Step
• Given the $Q(z)_i$ estimated in the Expectation step, optimize $l^*(\theta)$ with respect to $\theta$
• Use this estimated value of $\theta$ as the initial values for  the Expectation step above

One can also show that after each iteration, the value of $\theta$ that we obtain produces a value for $l^*(\theta)$ that is bigger than the value of  $l^*(\theta)$  at the previous iteration. This means that the procedure converges to at least a local optima. In addition, one can also show that this procedure will converge to the MLE that we are looking for. The proofs for the above claims can be found here.

The EM algorithm tends to find local optima. So one needs to use many different starting values to increase the chances that the global optima is attained. In addition, the convergence to the optima is slow, more especially if the parameters we are trying to estimate are not separable.

In the next section, we will be looking at some numerical examples in order to see how one would implement the EM algorithm in practice.

In this section we will be looking  at a couple of examples to illustrate how the EM algorithm would be implemented in practice. We will be looking at a Coin Tossing example, and an example on estimating parameters from a mixture of Gaussian distributions.

Coin Tossing Example

Consider the scenario where you have three possibly biased coins: Coin 1, Coin 2 and Coin 3.  Assume that the probability of observing a head from Coin 1 is $\lambda$, from Coin 2  is $p_{2}$ and from Coin 3  is $p_{3}$.  We first toss Coin 1:

• If it lands on Heads (H), we proceed to toss Coin 2, $n$  times
• If it lands on Tails (T), we proceed to toss Coin 3, $n$ times.

So,  for $n=3$ say, what we end up observing is a set $\{HHH, HTH, TTH,...\}$, but we  restrict it to the case where we don't know from which coin (Coin 2 or Coin 3) each sequence comes from. So the data is "incomplete" in the sense that we are not able to observe the result of tossing Coin 1. Thus we can use the EM algorithm to estimate the parameter vector $\theta=\{\lambda,p_1,p_2\}$ as follows:

Expectation Step

\begin{align*}Q(z=H)_i & =\frac{P(X_i,Z=H|\theta)}{\sum_z P(X_i,Z=z|\theta)} \\ & = \frac{P(X_i|Z=H, \theta)P(Z=H|\theta)}{\sum_{z=\{H,T\}} P(X_i|Z=z,\theta)P(Z=z|\theta)}\\ & = \frac{\binom{n} {x_i} p_1^{x_i}(1-p_1)^{n-x_i} \lambda}{\binom{n} {x_i} p_1^{x_i}(1-p_1)^{n-x_i} \lambda+\binom{n} {x_i} p_2^{x_i}(1-p_2)^{n-x_i}(1- \lambda)}\end{align*}

and by the fact that $Q(z)_i$  is a proper distribution it follows that:

\begin{align*}Q(z=T)_i &= \frac{\binom{n} {x_i} p_2^{x_i}(1-p_2)^{n-x_i}(1- \lambda)}{\binom{n} {x_i} p_1^{x_i}(1-p_1)^{n-x_i} \lambda+\binom{n} {x_i} p_2^{x_i}(1-p_2)^{n-x_i}(1- \lambda)}\end{align*}

Maximization Step

$l^*(\theta) = \sum_{i=1}^{N} \sum_{z=\{T,H\}}Q(z)_i \ln\left(\frac{P(X_i,Z=z|\theta)}{Q(z)_i}\right)$ where:

$P(X_i,Z=z|\theta) = \begin{cases} \binom{n} {x_i} p_1^{x_i}(1-p_1)^{n-x_i} \lambda &\text{if z=H} \\ \binom{n} {x_i} p_2^{x_i}(1-p_2)^{n-x_i} (1- \lambda) &\text{if z=T} \end{cases}$

and  $Q(z)_i$ is from the Expectation step.

Results from simulations

We implemented the Coin Tossing example through simulations. We simulated data with $N= 10 000, \theta= \{\lambda=0.45, p_1=0.2,p_2=0.7\}$ and $n=10$. The graph below depicts how the method converges through iterations:

Note that the above diagram is just one possible output as it depends on the starting values and the the data vector $X$ simulated. Also, the density is an approximation as the binomial distribution is a discrete distribution. The density approximation was used merely for illustration purposes.

Gaussian Mixture Example

Consider the case where data is generated from a mixture of Normal's. These cases includes problem in clustering data into different groups, amongst others. For a more detailed discussion on clustering see Turing Finance's article.

Another application of Gaussian mixtures, which is of particular interest to us, is  in quantitative finance. In particular, one can show that the probability density function of a Merton Jump Diffusion process is a mixture of Normal distributions, where the mixing probabilities are drawn from a Poisson distribution. That is, we only observe the data $X=\{x_1,x_2,...\},$ but we are not sure from which Normal distribution each observation $x_i$ was drawn from. Let $Z$ be the identifier of the normal distribution from which $x_i$ was drawn from. Note that $X$ is observed but $Z$ is not, so the EM algorithm will assist us in estimating the parameters of this model. The Figure above shows an example of what the density of mixture of three Normal's would look like.

In our example, we assume that the data $X$ is generated from a mixture of 3 normal distributions. Let the parameters of each normal distribution be given as follows:

• First normal has mean $\mu_1$  and variance $\sigma_1^2$, and is chosen with probability $\omega_1$
• First normal has mean $\mu_2$  and variance $\sigma_2^2$, and is chosen with probability $\omega_2$
• Third normal has mean $\mu_3$  and variance $\sigma_3^2$, and is chosen with probability $\omega_3$

Note that $\omega_1+\omega_2+\omega_3=1$. This is because the sum of all the probabilities must equal 1. This means that the pdf of $X$ is given as:

\begin{align*}f(X|\theta)=P(X|\theta)& =\sum_{k=1}^{3} P(X, Z=k|\theta_k)\\ & = \sum_{k=1}^{3} P(X|Z=k,\theta_k) P(Z=k|\theta_k)\\ & = \sum_{k=1}^{3} \frac{\exp{(-0.5(\frac{X-\mu_k}{\sigma_k})^2)}}{\sqrt{2\pi\sigma_k^2}}\omega_k\end{align*}

where $\theta =\{\mu_1,\sigma_1,\mu_2,\sigma_2,\mu_3,\sigma_3, \omega_1, \omega_2\}$. Note that $\omega_3$  does not need to be explicitly estimated as  $\omega_3=1-\omega_1-\omega_2$.

Applying the EM algorithm  to estimate $\theta$ we have that :

Expectation Step

\begin{align*}Q(z=1)_i& =\frac{P(X_i,Z=1|\theta)}{\sum_{k=1}^{3} P(X_i,Z=k|\theta)}\\ & = \frac{P(X_i|Z=1, \theta)P(Z=H|\theta)}{\sum_{k=1}^{3}P(X_i|Z=k,\theta)P(Z=k|\theta))}\\ &= \frac{\frac{\exp{\left(-0.5(\frac{X-\mu_1}{\sigma_1})^2\right)}}{\sqrt{2\pi\sigma_1^2}}\omega_1}{\sum_{k=1}^{3} \frac{\exp{\left(-0.5(\frac{X-\mu_k}{\sigma_k})^2\right)}}{\sqrt{2\pi\sigma_k^2}}\omega_k}\end{align*}

and by symmetry we have that :

\begin{align*}Q(z=2)_i= \frac{\frac{\exp{\left(-0.5(\frac{X-\mu_2}{\sigma_2})^2\right)}}{\sqrt{2\pi\sigma_2^2}}\omega_2}{\sum_{k=1}^{3} \frac{\exp{\left(-0.5(\frac{X-\mu_k}{\sigma_k})^2\right)}}{\sqrt{2\pi\sigma_k^2}}\omega_k}\end{align*}

\begin{align*}Q(z=3)_i= \frac{\frac{\exp{\left(-0.5(\frac{X-\mu_3}{\sigma_3})^2\right)}}{\sqrt{2\pi\sigma_3^2}}\omega_3}{\sum_{k=1}^{3} \frac{\exp{\left(-0.5(\frac{X-\mu_k}{\sigma_k})^2\right)}}{\sqrt{2\pi\sigma_k^2}}\omega_k}\end{align*}

Maximization Step

$l^*(\theta) = \sum_{i=1}^{N} \sum_{k=1}^{3}Q(z=k)_i \ln(\frac{P(X_i,Z=k|\theta_k)}{Q(z=k)_i})$ where $P(X_i,Z=k|\theta_k)$ is :

$P(X_i,Z|\theta) = \begin{cases}\frac{\exp{(-0.5(\frac{(X-\mu_1)}{\sigma_1})^2)}}{\sqrt{(2\pi\sigma_1^2)}}\omega_2&\text{if k= 1} \\\frac{\exp{(-0.5(\frac{X-\mu_2}{\sigma_2})^2)}}{\sqrt{(2\pi\sigma_2^2)}}\omega_3&\text{if k=2}\\\frac{\exp{(-0.5(\frac{(X-\mu_3)}{\sigma_3})^2)}}{\sqrt{(2\pi\sigma_3^2)}}(1-\omega_1-\omega_2)&\text{if k= 3} \end{cases}$

and  $Q(z)_i$ is from the Expectation step.

Results from simulations

We implemented the mixture of Gaussian distributions example through simulations. We simulated data with $N= 10 000$ and with $\theta= \{\omega_1=0.2,\omega_2=0.6,\mu_1=-3,\sigma_1=1.5,$

$\mu_2=9.7,\sigma_2=1.2,\mu_3=4,\sigma_3=1.1\}$. The graph below depicts how the method converges.

In this section we provide the R-Code that was used to implement the two examples.