Intro to EM

What is the EM Algorithm?

Real-world data often follows complicated distributions. One way of handling this is to model the data as a mixture of simpler distributions, where the mixing is governed by the introduction of latent variables.

The fitting task can then be posed as a maximum likelihood estimation problem, but obtaining the parameters that best fit the data is not straightforward: there is no closed-form solution, and the likelihood surface suffers from singularities (e.g. a component collapsing onto a single data point).

A general technique for finding maximum likelihood estimators in latent variable models is the expectation-maximization (EM) algorithm:

  • E-step (expectation): Using the current parameter estimates, compute the posterior distribution of the latent variables (the responsibilities).
  • M-step (maximization): Obtain new parameter estimates by maximizing the expected log-likelihood computed with the responsibilities from the E-step.
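As a warm-up, here is a minimal, self-contained sketch of this loop on toy one-dimensional data. The data, the initial guesses and the variable names are invented purely for illustration; the rest of the notebook works through the two-dimensional case in detail.

import numpy as np
from scipy.stats import norm

# Toy data: two 1-D Gaussian clusters (made up purely for illustration)
rng_toy = np.random.default_rng(0)
x_toy = np.r_[rng_toy.normal(-2, 0.5, 100), rng_toy.normal(3, 1.0, 100)]

# Initial guesses for the mixing weights, means and standard deviations
w, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
  # E-step: responsibility of each component for each point
  dens = w * norm.pdf(x_toy[:, None], mu, sigma)          # shape (200, 2)
  resp = dens / dens.sum(axis=1, keepdims=True)
  # M-step: re-estimate the parameters from the responsibilities
  Nk = resp.sum(axis=0)
  w = Nk / len(x_toy)
  mu = (resp * x_toy[:, None]).sum(axis=0) / Nk
  sigma = np.sqrt((resp * (x_toy[:, None] - mu) ** 2).sum(axis=0) / Nk)

print(w.round(2), mu.round(2), sigma.round(2))   # should land close to the generating values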

We illustrate the workings of the EM algorithm on the Old Faithful dataset by fitting a 2-component Gaussian mixture model to it.

!wget https://raw.githubusercontent.com/data-8/materials-su19/master/materials/su19/hw/hw02/old_faithful.csv

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import multivariate_normal
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

plt.rcParams['figure.figsize'] = (8, 8)

# Eruption duration and waiting time, rescaled to [0, 1]
X = pd.read_csv('old_faithful.csv')[['eruptions', 'waiting']].to_numpy()
X = MinMaxScaler().fit_transform(X)
plt.scatter(X[:, 0], X[:, 1])

# Evaluation grid over the unit square (the data has been scaled into [0, 1])
# M, N = np.mgrid[min(X[:, 0]):max(X[:, 0]):0.01, min(X[:, 1]):max(X[:, 1]):0.01]
M, N = np.mgrid[0:1:0.01, 0:1:0.01]
grid = np.dstack((M, N))

Our problem is to estimate the parameters of the following \(K\)-component Gaussian mixture pdf:

\[ p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x} | \mathbf{\mu}_k, \mathbf{\Sigma}_k) \]

Where:

  • \(\pi_k\) = the probability that a random point belongs to the \(k\)th component (the mixing coefficients), with \(\sum_k \pi_k = 1\)
  • \(\mu_k, \Sigma_k\) and \(\mathcal{N}(\mathbf{x} | \mu_k, \Sigma_k)\) are the parameters and pdf of the \(k\)th component respectively.
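As a quick numerical illustration of this density (the component parameters below are hypothetical, chosen only for this snippet):

import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component mixture in 2-D (parameters chosen only for illustration)
pis = [0.4, 0.6]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2 * np.eye(2)]

x = np.array([1.0, 1.0])
p_x = sum(p_k * multivariate_normal(m_k, c_k).pdf(x) for p_k, m_k, c_k in zip(pis, mus, covs))
print(p_x)   # value of the mixture density p(x) at a single point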

Introduction of Latent Variable

We introduce a \(K\)-dimensional latent variable \(\mathbf{z}\) in which exactly one element equals \(1\) and the rest are \(0\) (a \(1\)-of-\(K\) encoding; equivalently, a standard basis vector):

\[\mathbf{z} = \mathbf{e}_k\]

Reformulating the problem in terms of the latent variable \(\mathbf{z}\)

We model the marginal distribution over \(\mathbf{z}\) using our mixing coefficients \(\pi_k\):

\[p(\mathbf{z}=\mathbf{e}_k) = \pi_k\]

The conditional distribution of \(\mathbf{x}\) given \(\mathbf{z}\) is modelled similarly:

\[p(\mathbf{x} | \mathbf{z}=\mathbf{e}_k) = \mathcal{N}(\mathbf{x}|\mu_k, \Sigma_k)\]

We then obtain \(p(\mathbf{x})\) by summing the joint distribution over all possible \(\mathbf{z}\) states:

\[p(\mathbf{x}) = \sum_\mathbf{z}p(\mathbf{z})p(\mathbf{x}|\mathbf{z}) = \sum_{k=1}^K\pi_k\mathcal{N}(\mathbf{x}|\mu_k, \Sigma_k)\]

And done! We have formulated our goal using this latent variable \(\mathbf{z}\). One more quantity we will need is the conditional probability of \(\mathbf{z}=\mathbf{e}_k\) given \(\mathbf{x}_i\) (the responsibility, which we will call \(\lambda_{ik}\)):

\[\lambda_{ik} \equiv p(\mathbf{z}=\mathbf{e}_k|\mathbf{x}_i) = \frac{p(\mathbf{z}=\mathbf{e}_k)p(\mathbf{x}_i|\mathbf{z}=\mathbf{e}_k)}{\sum_{j=1}^K p(\mathbf{z}=\mathbf{e}_j)p(\mathbf{x}_i|\mathbf{z}=\mathbf{e}_j)} = \frac{\pi_k\mathcal{N}(\mathbf{x}_i|\mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j\mathcal{N}(\mathbf{x}_i|\mu_j, \Sigma_j)}\]

With this reformulation in place, we can apply EM to our problem.
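To make the formula concrete, here is how the responsibilities could be computed for a single point with scipy. The parameters below are placeholders chosen only for this illustration; the `Exp` function defined later performs the same computation for the whole dataset.

import numpy as np
from scipy.stats import multivariate_normal

# Placeholder parameters for a 2-component mixture (illustration only)
pis = np.array([0.5, 0.5])
mus = [np.array([0.2, 0.3]), np.array([0.7, 0.8])]
covs = [0.05 * np.eye(2), 0.05 * np.eye(2)]

x_i = np.array([0.25, 0.35])
weighted = np.array([pis[k] * multivariate_normal(mus[k], covs[k]).pdf(x_i) for k in range(2)])
lam_i = weighted / weighted.sum()   # lambda_{ik} for k = 1..K
print(lam_i)                        # dominated by the first component, since x_i lies near mus[0]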

Let’s begin!

To get started, we perform the following initialization:

  • Pick 2 points at random as the cluster centres
  • Hard-assign each point to its nearest centre, and compute the initial mixing coefficients and covariances from this hard assignment
K = 2

rng = np.random.default_rng(seed=12)

# Pick K data points at random as the initial cluster centres
centres = X[rng.integers(low=0, high=len(X), size=K)]

# Hard-assign each point to its nearest centre (one-hot responsibility rows)
nearest = np.array([np.argmin([np.linalg.norm(p - centre) for centre in centres]) for p in X])
l = np.eye(K)[nearest]

# Initial mixing coefficients and weighted covariances from the hard assignment
pi = l.sum(axis=0)/l.sum()

cov = []
for i in range(K):
  cov.append(np.cov(X.T, ddof=0, aweights=l[:, i]))
cov = np.array(cov)

def plot():

  # Data points and current component means
  plt.scatter(X[:, 0], X[:, 1], color='black', marker='+')
  plt.scatter(centres[:, 0], centres[:, 1], color='red', marker='x')

  # Mixture density evaluated on the grid, overlaid as contours
  probability_grid = np.zeros(grid.shape[:2])
  for i in range(K):
    probability_grid += pi[i] * multivariate_normal(centres[i], cov[i]).pdf(grid)

  plt.contour(M, N, probability_grid)
  plt.show()

plot()

Posing a Maximum Likelihood problem:

The log-likelihood of the observed data is:

\[\ln p(\mathbf{X}|\pi, \mu, Σ) = \sum_{n=1}^N\ln \left\{{\sum_{k=1}^K\pi_k\mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k)}\right\}\]
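This quantity is easy to evaluate with the objects already defined above. The small helper below is introduced only for illustration and is not used by the rest of the notebook, but it is a handy way to monitor convergence:

# Helper (for illustration): log-likelihood of the data under the current mixture parameters
def log_likelihood(pi, centres, cov):
  dens = np.zeros(len(X))
  for k in range(K):
    dens += pi[k] * multivariate_normal(centres[k], cov[k]).pdf(X)
  return np.log(dens).sum()

print(log_likelihood(pi, centres, cov))   # log-likelihood of the initial configuration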

Maximization step

Taking the derivative of the above expression with respect to \(\mu_k\) and setting it to zero yields:

\[ 0 = -\sum_{n=1}^N \frac{\pi_k\mathcal{N}(\mathbf{x}_n|\mu_k, \Sigma_k)}{\sum_{j=1}^K\pi_j\mathcal{N}(\mathbf{x}_n|\mu_j, \Sigma_j)} \Sigma_k^{-1}(\mathbf{x}_n-\mu_k) \equiv -\sum_{n=1}^N \lambda_{nk}\Sigma_k^{-1}(\mathbf{x}_n-\mu_k) \]

Multiplying both sides by \(\Sigma_k\) (assumed non-singular) and rearranging, we get:

\[\mathbf{\mu}_k = \frac{1}{N_k}\sum_{n=1}^{N}\lambda_{nk}\mathbf{x}_n\]

Where we define \(N_k\) to be the effective number of points assigned to cluster k: \(N_k = \sum_{n=1}^N \lambda_{nk}\)

We follow a similar approach and derive the following ML estimates for \(\Sigma_k\) and \(\pi_k\):

\[\Sigma_k = \frac{1}{N_k}\sum_{n=1}^N\lambda_{nk}(\mathbf{x}_n-\mathbf{\mu}_k)(\mathbf{x}_n-\mathbf{\mu}_k)^T, \qquad \pi_k = \frac{N_k}{N} \]
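Written directly from these formulas, the M-step takes only a few vectorized NumPy lines. The helper name `m_step_direct` is introduced only for this sketch; the notebook's own `Max` function below relies on `np.cov` with `aweights` instead and should give the same result.

# Vectorized M-step updates, written directly from the formulas above
def m_step_direct(lam):                    # lam has shape (N, K)
  Nk = lam.sum(axis=0)                     # effective number of points per component
  pi = Nk / len(X)
  centres = (lam.T @ X) / Nk[:, None]      # weighted means, shape (K, d)
  cov = []
  for k in range(K):
    d = X - centres[k]                     # deviations from the k-th mean
    cov.append((lam[:, k, None] * d).T @ d / Nk[k])
  return pi, centres, np.array(cov)

Calling `m_step_direct(l)` on the hard assignments from the initialization should match `Max(l)` (defined below) up to floating-point error.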

Expectation step

Recompute the responsibilities \(\lambda_{ik}\) using the updated parameter values. The formula is repeated here for completeness:

\[\lambda_{ik} = \frac{\pi_k\mathcal{N}(\mathbf{x}_i|\mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j\mathcal{N}(\mathbf{x}_i|\mu_j, \Sigma_j)}\]

# Maximization step: re-estimate pi, the means and the covariances from the responsibilities
def Max(l):
  pi = l.sum(axis=0)/l.sum()
  centres, cov = [], []
  for i in range(K):
    centres.append(np.dot(l[:, i], X)/l[:, i].sum())
    cov.append(np.cov(X.T, ddof=0, aweights=l[:, i]))

  return pi, np.array(centres), np.array(cov)

# Expectation step: recompute the responsibilities lambda_{ik} for every point
def Exp(pi, centres, cov):
  l = []
  for i in X:
    p = np.array([pi[k] * multivariate_normal.pdf(i, mean=centres[k], cov=cov[k]) for k in range(K)])
    p = p/p.sum()
    l.append(p)

  return np.array(l)

# Convergence criterion: norm of the stacked parameter vector
# (a simple heuristic; iteration stops once this norm stops changing appreciably)
norm_theta = lambda pi, centres, cov: np.linalg.norm(np.r_[pi, centres.reshape(-1), cov.reshape(-1)])
from IPython.display import HTML
from matplotlib import animation

prev_norm = norm_theta(pi, centres, cov)

fig, ax = plt.subplots()
artists = []

while True:
  frame = []
  frame.append(ax.scatter(X[:, 0], X[:, 1], color='black', marker='+'))
  frame.append(ax.scatter(centres[:, 0], centres[:, 1], color='red', marker='x'))

  probability_grid = np.zeros(grid.shape[:2])
  for i in range(K):
    probability_grid += pi[i] * multivariate_normal(centres[i], cov[i]).pdf(grid)

  frame += list(ax.contour(M, N, probability_grid).collections)
  artists.append(frame)

  # E-step: recompute responsibilities; M-step: re-estimate parameters
  l = Exp(pi, centres, cov)
  pi, centres, cov = Max(l)

  curr_norm = norm_theta(pi, centres, cov)

  if abs(curr_norm-prev_norm) < 0.00001:
    break
  else:
    prev_norm = curr_norm

plt.close()
anim = animation.ArtistAnimation(fig, artists, interval=200, repeat=False, blit=False);
HTML(anim.to_jshtml())

Relation to \(K\)-Means?

The two are closely related. In fact, \(K\)-means can be seen as a restricted form of GMM clustering: initialize every \(\Sigma_k\) to \(\epsilon\mathbf{I}\), let \(\epsilon \to 0\), and do not update the covariances in the maximization step.

How does the above make it \(K\)-Means? We investigate \(\lambda_{ik}\):

\[\lambda_{ik} = \frac{\pi_k\mathcal{N}(\mathbf{x}_i|\mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j\mathcal{N}(\mathbf{x}_i|\mu_j, \Sigma_j)} = \frac{\pi_k\exp \{-||\mathbf{x}_i-\mathbf{\mu}_k||^2/2\epsilon\}}{\sum_{j=1}^K \pi_j\exp \{-||\mathbf{x}_i-\mathbf{\mu}_j||^2/2\epsilon\}}\]

(Since every component shares the covariance \(\epsilon\mathbf{I}\), the normalizing constants \((2\pi\epsilon)^{-d/2}\) cancel between numerator and denominator.)

Let \(\phi = \arg\min_j ||\mathbf{x}_i-\mathbf{\mu}_j||^2\). As \(\epsilon \to 0\), the \(\phi\)'th term in the denominator goes to zero the slowest, so \(\lambda_{i\phi} \to 1\) while all the other responsibilities \(\to 0\); in other words, we recover a hard clustering.

This is exactly the K-means assignment rule: each point is assigned to the cluster whose centre is nearest.
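As a quick illustrative check, we can reuse `Exp` together with the fitted `pi` and `centres` from the GMM run above, but with an arbitrarily chosen small shared covariance; the soft assignments should then agree with the nearest-centre rule for essentially every point.

# Illustrative check: shrink the shared covariance and compare soft vs. nearest-centre assignments
eps_check = 0.01
cov_check = np.array([eps_check * np.eye(X.shape[1]) for _ in range(K)])
soft = Exp(pi, centres, cov_check)    # responsibilities under eps * I covariances
hard = np.argmin(np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2), axis=1)
print((soft.argmax(axis=1) == hard).mean())   # fraction of agreement; approaches 1 as eps shrinks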

# (Restricted) Maximization Step: means and mixing coefficients are re-estimated,
# but every covariance stays fixed at eps * I
def KM_Max(l):
  pi = l.sum(axis=0)/l.sum()
  centres, cov = [], []
  for i in range(K):
    centres.append(np.dot(l[:, i], X)/l[:, i].sum())
    cov.append(eps*np.eye(X.shape[1]))   # X.shape[1] is the data dimension

  return pi, np.array(centres), np.array(cov)

K = 2
eps = 0.005

rng = np.random.default_rng(seed=72)

# Same initialization as before: random centres and hard assignment to the nearest centre
centres = X[rng.integers(low=0, high=len(X), size=K)]
nearest = np.array([np.argmin([np.linalg.norm(p - centre) for centre in centres]) for p in X])
l = np.eye(K)[nearest]
pi = l.sum(axis=0)/l.sum()

# Fixed spherical covariances eps * I for every component
cov = []
for i in range(K):
  cov.append(eps*np.eye(X.shape[1]))
cov = np.array(cov)

# plot() was defined earlier and reads the current global pi, centres, cov
plot()

# Responsibilities of the first five points under the restricted (K-means-like) model
a, b, c = KM_Max(l)
[[round(i, 5) for i in j] for j in Exp(a, b, c)[:5]]
[[0.99911, 0.00089], [0.0, 1.0], [0.00082, 0.99918], [0.0, 1.0], [1.0, 0.0]]

We see that the assignments are already close to a hard clustering; a smaller \(\epsilon\) would push them even closer to \(0\)/\(1\), as the quick check below illustrates.

For visual presentability, \(\epsilon\) is kept just large enough that the density contours remain visible in the plots.
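To see the effect of \(\epsilon\) numerically, a quick illustrative loop (using the current initial `pi` and `centres`) can track the average of the largest responsibility per point; it should move towards 1 as \(\epsilon\) shrinks.

# Illustrative only: smaller eps gives harder (closer to 0/1) assignments
for eps_try in (0.05, 0.01, 0.002):
  cov_try = np.array([eps_try * np.eye(X.shape[1]) for _ in range(K)])
  resp = Exp(pi, centres, cov_try)
  print(eps_try, round(resp.max(axis=1).mean(), 4))   # mean of the largest responsibility per point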

from IPython.display import HTML
from matplotlib import animation

prev_norm = norm_theta(pi, centres, cov)

fig, ax = plt.subplots()
artists = []

while True:

  frame = []
  frame.append(ax.scatter(X[:, 0], X[:, 1], color='black', marker='+'))
  frame.append(ax.scatter(centres[:, 0], centres[:, 1], color='red', marker='x'))

  probability_grid = np.zeros(grid.shape[:2])
  for i in range(K):
    probability_grid += pi[i] * multivariate_normal(centres[i], cov[i]).pdf(grid)

  frame += list(ax.contour(M, N, probability_grid).collections)
  artists.append(frame)

  # E-step with the current parameters, then the restricted M-step
  l = Exp(pi, centres, cov)
  pi, centres, cov = KM_Max(l)

  curr_norm = norm_theta(pi, centres, cov)

  if abs(curr_norm-prev_norm) < 0.0001:
    print(curr_norm-prev_norm)
    break
  else:
    prev_norm = curr_norm

plt.close()
anim = animation.ArtistAnimation(fig, artists, interval=1000, repeat=False, blit=False);
HTML(anim.to_jshtml())
-7.535850650119968e-05