3F8 Inference Tripos Revision

Posted by Jingbiao on April 10, 2021, Reading time: 3 minutes.

\( \require{amstext} \require{amsmath} \require{amssymb} \require{amsfonts} \)



  • Maximum Likelihood Estimation of parameters: Given data $ y=\{ y_i \}$ and parameter $\theta$, the likelihood of the parameter or the probability of the data given parameter is: $p(y | \theta )$. The maximum likelihood estimate is to maximize the likelihood function with the $\theta_{\textrm{ML}}$ :

\( \theta_{\textrm{ML}} = \operatorname*{argmax} p(y|\theta) \)

and log likelihood is almost always used. Here the argmax if over $\theta$


  • Maximum a Posteriori (MAP) estimation:
    • Estimate the parameter $\theta$ from data $\mathcal{D}$.
    • It requires a likelihood function for the parameter $\theta$, $p(\mathcal{D} \theta)$
    • a prior distribution of the parameter before observing the dataset: $p(\theta)$
    • The posterior distribution could then be computed using Baye’s rule: $p(\theta \mathcal{D}) \prop p(\theta) p(y \theta)$
    • We found the $\theta$ that maximizes the posterior distribution: \( \theta_{\textrm{MAP}} = \operatorname{argmax} p(\theta | \mathcal{D}) \) By taking log, this is equivalent to: \( \theta_{\textrm{MAP}} = \operatorname{argmax} \log p(\theta) + \log p(\mathcal{D} | \theta) \) all the argmax is computed over $\theta$


They are both supervised learning problems with a training dataset of pairs of input $x=\{ x_i\}$ and output data $y=\{ y_i\}$ $\rightarrow$ $\mathcal{D}=\{x_i, y_i\}$ for $i=1,…,N$. A model is learnt from the training set to make prediction of the new output $y^\star$ given a new input $x^\star$. The difference is that regression predicts continuous real values, while classification predicts a discrete class label.


Clustering is an unsupervised learning algorithm with input of a dataset without label or output. It classifies similar or nearby datapoints into same cluster, dissimilar or far away dataset into different clusters. The number of clusters should be no larger than the number of datapoints.

Monte Carlo Methods

  • Monte Carlo methods approximate expectation $E_{p(x)}[f(x)]$ for an arbitrary function $f(x)$ by drawing N samples from the pdf and then compute an empirical average. \( E_{p(x)}[f(x)] = 1/N \sum_{i=0}^{N}f(X_i) \)
  • Advantage is the unbiased nature, more samples drawn, higher accuracy
  • Drawback is the slow convergence, and requires lots of samples.

Hidden Markov Model:

  • Latent variable: s , observed variable: y
  • A system with an initial state probability vector $p(s_1=k) = \Pi_k$ for $k=1,…,K$
  • Transition probability matrix: \( p(s_{t+1}=k | s_t=k^\prime ) = T_{k k^\prime} \)
  • Emission probability matrix: \( p(y_t|s_t) \)

  • Forward Algorithm could be used to compute the probability distribution. It has linear complexity overtime, so it is desireable for long sequence computation. The transition matrix’s dimension would be larger if the time of the sequence is large. The matrix multiplication should be written in bespoke form to use the sparsity or special structure in the matrix multiplication
Example calculation:
  • Compute the predictive distribution for the next observed variable y: $p(y_{t+1}|y_{1:t})$ \( \begin{align} p(y_{t+1}|y_{1:t}) =& \sum_{s_t, s_{t+1}} p(y_{t+1}, s_t, s_{t+1}|y_{1:t})
    =& \sum_{s_t, s_{t+1}} p(y_{t+1}|s_{t+1}) p(s_t, s_{t+1}|y_{1:t})
    =& \sum_{s_t, s_{t+1}} p(y_{t+1}|s_{t+1}) p(s_t|y_{1:t}) p(s_t+1|s_t) \end{algin} \)

  • Compute the likelihood $p(y_{1:t})$: \( \begin{align} p(Y_{1:T}) =& \sum_{X_T} p(Y_{1:T}, X_T) \ =& \sum_{X_T} \sum_{X_{T-1}} p(Y_{1:T}, X_T, X_{T-1})
    =& \sum_{X_T} \sum_{X_{T-1}} p(Y_{1:T-1}, X_{T-1}) p(Y_T, X_T | Y_{1:T-1}, X_{T-1})
    =& \sum_{X_T} \sum_{X_{T-1}} p(Y_{1:T-1}, X_{T-1}) p(Y_T| X_T, Y_{1:T-1}, X_{T-1}) p(X_T | Y_{1:T-1}, X_{T-1})
    =& \sum_{X_T} \sum_{X_{T-1}} p(Y_{1:T-1}, X_{T-1}) p(Y_T| X_T) p(X_T |X_{T-1})
    \end{align} \)