Real-world data often comes from complicated distributions. One way of handling this is to view the distribution as a mixture of simpler distributions, where the mixing is governed by the introduction of latent variables.
The problem can then be posed as a Maximum Likelihood Estimation problem, but obtaining the parameters that best fit the data is not straightforward: there are no closed-form solutions, and the likelihood suffers from singularities (a component can collapse onto a single data point, sending the likelihood to infinity).
A general technique for finding maximum likelihood estimators in latent variable models is the expectation-maximization (EM) algorithm, which alternates two steps (a toy sketch of the full loop follows these steps):
E-step (expectation): use the current parameter estimates to compute the posterior distribution of the latent variables (the responsibilities).
M-step (maximization): obtain new parameter estimates by maximizing the expected log-likelihood under the latent-variable posterior from the E-step.
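Before turning to the two-dimensional example below, here is a minimal, self-contained toy version of this loop for a one-dimensional two-component Gaussian mixture; the synthetic data and all parameter choices are ours, purely for illustration:

# Toy EM for a 1-D, two-component Gaussian mixture on synthetic data
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 150)])  # two made-up clusters

pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    # E-step: responsibility of each component for each point
    dens = pi * norm(mu, sigma).pdf(data[:, None])        # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibilities
    Nk = resp.sum(axis=0)
    pi = Nk / len(data)
    mu = (resp * data[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((resp * (data[:, None] - mu) ** 2).sum(axis=0) / Nk)

print(pi.round(2), mu.round(2), sigma.round(2))  # recovers the mixing weights, means and std devs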
We illustrate the workings of the EM algorithm on the Old Faithful dataset by posing it as a two-component Gaussian mixture model.
!wget https://raw.githubusercontent.com/data-8/materials-su19/master/materials/su19/hw/hw02/old_faithful.csv

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
from scipy.stats import multivariate_normal
import pandas as pd
from matplotlib import mlab
from sklearn.preprocessing import MinMaxScaler

plt.rcParams['figure.figsize'] = (8, 8)

# Load the eruption duration and waiting time, and rescale both features to [0, 1]
X = pd.read_csv('old_faithful.csv')[['eruptions', 'waiting']].to_numpy()
X = MinMaxScaler().fit_transform(X)
plt.scatter(X[:, 0], X[:, 1])

# Grid over the rescaled feature space, used later for contour plots of the fitted density
# M, N = np.mgrid[min(X[:, 0]):max(X[:, 0]):0.01, min(X[:, 1]):max(X[:, 1]):0.01]
M, N = np.mgrid[0:1:0.01, 0:1:0.01]
grid = np.dstack((M, N))
A \(K\)-component Gaussian mixture model expresses the density as

\[p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x} | \mathbf{\mu}_k, \mathbf{\Sigma}_k)\]

where:

- \(\pi_k\) = probability that a random point belongs to the \(k\)th component (also called the mixing coefficients), with \(\sum_k \pi_k = 1\)
- \(\mathbf{\mu}_k, \mathbf{\Sigma}_k\) and \(\mathcal{N}(\mathbf{x} | \mathbf{\mu}_k, \mathbf{\Sigma}_k)\) are the parameters and pdf of the \(k\)th component respectively.
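To see the formula in action, it can be evaluated directly with scipy.stats.multivariate_normal; the mixing coefficients, means and covariances below are made up purely for illustration (they are not fitted to the Old Faithful data):

# Evaluate p(x) = sum_k pi_k * N(x | mu_k, Sigma_k) for an illustrative 2-component mixture
import numpy as np
from scipy.stats import multivariate_normal

pi_k = [0.4, 0.6]                                    # mixing coefficients (sum to 1)
mu_k = [np.array([0.3, 0.3]), np.array([0.7, 0.7])]  # component means (made up)
Sigma_k = [0.01 * np.eye(2), 0.02 * np.eye(2)]       # component covariances (made up)

x_query = np.array([0.5, 0.5])
p_x = sum(p * multivariate_normal(m, S).pdf(x_query) for p, m, S in zip(pi_k, mu_k, Sigma_k))
print(p_x)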
Introduction of a Latent Variable
We introduce a \(K\)-dimensional binary latent variable \(\mathbf{z}\); exactly one of the elements of \(\mathbf{z}\) is \(1\) and the rest are \(0\) (a \(1\)-of-\(K\) encoding, i.e. a standard basis vector):
\[\mathbf{z} = \mathbf{e}_k\]
Reformulating the problem in terms of the latent variable \(\mathbf{z}\)
We model the marginal distribution over \(\mathbf{z}\) using our mixing coefficients \(\pi_k\):
\[p(\mathbf{z}=\mathbf{e}_k) = \pi_k\]
The conditional distribution of \(\mathbf{x}\) given \(\mathbf{z}\) can be similarly modelled:

\[p(\mathbf{x} | \mathbf{z}=\mathbf{e}_k) = \mathcal{N}(\mathbf{x} | \mathbf{\mu}_k, \mathbf{\Sigma}_k)\]
And done! Summing the joint \(p(\mathbf{z})\,p(\mathbf{x} | \mathbf{z})\) over all \(K\) states of \(\mathbf{z}\) recovers exactly the mixture density above, so we have formulated our goal using this latent variable \(\mathbf{z}\). One more quantity of use is the conditional probability of \(\mathbf{z}=\mathbf{e}_k\) given \(\mathbf{x}_i\) (which we will call \(\lambda_{ik}\), the responsibility of component \(k\) for point \(i\)); by Bayes' theorem,

\[\lambda_{ik} = p(\mathbf{z}=\mathbf{e}_k | \mathbf{x}_i) = \frac{\pi_k \mathcal{N}(\mathbf{x}_i | \mathbf{\mu}_k, \mathbf{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(\mathbf{x}_i | \mathbf{\mu}_j, \mathbf{\Sigma}_j)}\]
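Written as code, and matching the way an Exp function is called further down, a minimal sketch of this E-step might look as follows (it assumes the rescaled data matrix X from the earlier cell and returns the \(n \times K\) matrix of responsibilities; the original implementation may differ):

# E-step (sketch): lambda_{ik} = pi_k * N(x_i | mu_k, Sigma_k) / sum_j pi_j * N(x_i | mu_j, Sigma_j)
def Exp(pi, centres, cov):
    dens = np.stack([pi[k] * multivariate_normal(centres[k], cov[k]).pdf(X)
                     for k in range(len(pi))], axis=1)  # unnormalized, shape (n, K)
    return dens / dens.sum(axis=1, keepdims=True)       # normalize each row to get responsibilities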
There is a very close similarity between clustering with a GMM and \(K\)-Means. In fact, \(K\)-Means is a restricted GMM clustering: initialize all \(\mathbf{\Sigma}_k\)'s to \(\epsilon\mathbf{I}\) with \(\epsilon \to 0\), and do not update them in the maximization step.
How does the above make it \(K\)-Means? We investigate \(\lambda_{ik}\) under these restrictions (the normalization constants of the Gaussians cancel in the ratio):

\[\lambda_{ik} = \frac{\pi_k \exp\left(-\|\mathbf{x}_i-\mathbf{\mu}_k\|^2 / 2\epsilon\right)}{\sum_{j} \pi_j \exp\left(-\|\mathbf{x}_i-\mathbf{\mu}_j\|^2 / 2\epsilon\right)}\]
Let \(\phi = \arg\min_j \|\mathbf{x}_i-\mathbf{\mu}_j\|^2\). As \(\epsilon \to 0\), every term in the denominator goes to zero, but the \(\phi\)'th term goes to zero the slowest; dividing numerator and denominator by it shows that \(\lambda_{i\phi} \to 1\) while all other \(\lambda_{ik} \to 0\) (note that this results in a hard clustering).
The above is exactly the \(K\)-Means clustering paradigm: assign each point to the cluster whose centre is nearest.
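A quick numerical check of this limiting behaviour (the squared distances and \(\epsilon\) values are arbitrary, and equal mixing coefficients are assumed so that the \(\pi_j\) factors cancel):

# As eps shrinks, the responsibilities collapse onto the nearest centre (hard assignment)
import numpy as np

d2 = np.array([0.10, 0.12, 0.30])        # made-up squared distances ||x_i - mu_j||^2 to three centres
for eps in [0.1, 0.01, 0.001]:
    w = np.exp(-d2 / (2 * eps))          # numerator terms, with equal pi_j factors cancelling
    print(eps, (w / w.sum()).round(3))   # responsibilities tend towards [1, 0, 0]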
# (Restricted) Maximization Step: update the mixing coefficients and centres,
# but keep every covariance fixed at eps * I (the K-Means restriction)
def KM_Max(l):
    pi = l.sum(axis=0)/l.sum()                            # fraction of (weighted) points per component
    centres, cov = [], []
    for i in range(K):
        centres.append(np.dot(l[:, i], X)/l[:, i].sum())  # responsibility-weighted mean
        cov.append(eps*np.eye(X.shape[1]))                # covariance held fixed at eps * I
    return pi, np.array(centres), np.array(cov)

K = 2
eps = 0.005
rng = np.random.default_rng(seed=72)

# Initialization: pick K random data points as centres and hard-assign each point to its nearest centre
centres = X[rng.integers(low=0, high=len(X), size=K)]
l = np.array([(lambda i: [int(i == j) for j in range(K)])(np.argmin([np.linalg.norm(p - centre) for centre in centres])) for p in X])
pi = l.sum(axis=0)/l.sum()
cov = []
for i in range(K):
    cov.append(eps*np.eye(X.shape[1]))
cov = np.array(cov)

def plot():
    plt.scatter(X[:, 0], X[:, 1], color='black', marker='+')
    plt.scatter(centres[:, 0], centres[:, 1], color='red', marker='x')
    probability_grid = np.zeros(grid.shape[:2])
    for i in range(K):
        probability_grid += pi[i] * multivariate_normal(centres[i], cov[i]).pdf(grid)
    plt.contour(M, N, probability_grid)
    plt.show()

plot()
a, b, c = KM_Max(l)
# Responsibilities (lambda_{ik}) of the first five points after one E-step
[[round(i, 5) for i in j] for j in Exp(a, b, c)[:5]]
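From here the restricted M-step and the E-step can simply be alternated; a possible continuation (using the Exp sketch above, with an arbitrarily chosen number of iterations) is:

# Alternate the (restricted) maximization and expectation steps, then plot the fitted components
for _ in range(10):
    pi, centres, cov = KM_Max(l)   # M-step: update mixing coefficients and centres
    l = Exp(pi, centres, cov)      # E-step: recompute (near-hard) responsibilities
plot()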