\require{amstext} \require{amsmath} \require{amssymb} \require{amsfonts} \newcommand{\eg}{\textit{e}.\textit{g}.} \newcommand{\xb}{\mathbf{x}} \newcommand{\yb}{\mathbf{y}} \newcommand{\xbtn}{\tilde{\xb}^{(n)}} \newcommand{\yn}{y^{(n)}} \newcommand{\betab}{\pmb{\beta}} \newcommand{\lap}{\mathcal{L}} \newcommand{\dydx}[2]{\frac{\partial{#1}}{\partial{#2}}} \newcommand{\dydxsmall}[2]{\partial{#1}/\partial{#2}}
Introduction
Logistic regression is investigated for a binary classification problem. Linear classification on the original input features and non-linear classification using an RBF expansion of the inputs are both applied and compared. The non-linear classifier outperforms the linear one, with classification accuracies of 90.5% and 69.5% respectively under maximum likelihood estimation. Having established the effectiveness of the RBF expansion, it is used by default in the comparison between maximum likelihood (ML), maximum a posteriori (MAP), and full Bayesian estimation.
Maximum Likelihood Learning
A dataset of N datapoints, \{ y^{(n)}, \xb^{(n)} \}_{n=1}^{N}, is used to investigate a binary logistic classification problem with outputs 0 or 1. Each input \xb^{(n)} is of dimension D, i.e. \xb^{(n)} \in \mathbb{R}^D. Furthermore, \tilde{\xb}^{(n)} is defined by augmenting \xb^{(n)} with a fixed unit input, i.e. \tilde{\xb}^{(n)} = (1, \xb^{(n)}), so that \tilde{\xb}^{(n)} \in \mathbb{R}^{D+1}. This allows the bias term to come directly from the bias weight \beta_0, i.e.
\betab^T \xbtn = \beta_0 + \sum_{d=1}^D \beta_d x_d^{(n)},
where \betab \in \mathbb{R}^{D+1}.
We can then define the probability of the positive/negative class label and, from it, the overall probability of the dataset:
P(\yn = 1 | \xbtn) = \frac{1}{1 +\exp(-\betab^T \xbtn)} = \sigma(\betab^T \xbtn)
P(\yn = 0 | \xbtn) = 1 - \sigma(\betab^T \xbtn) = \sigma(-\betab^T \xbtn)
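Before using these probabilities in code, \sigma must be implemented with some care, since \exp(-z) overflows for large negative z. A minimal NumPy sketch (the function name `sigmoid` and the sign-splitting trick are ours, not part of the text):

```python
import numpy as np

def sigmoid(z):
    """Numerically stable logistic function sigma(z) = 1 / (1 + exp(-z))."""
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1 and cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, rewrite as exp(z) / (1 + exp(z)) so exp never sees a large argument.
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out
```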
\begin{align} P(\mathbf{y}|\mathbf{X}, \betab) &= \prod_{n=1}^N P(\yn | \xbtn)\newline & = \prod_{n=1}^N \sigma(\betab^T \xbtn)^{\yn}(1 - \sigma(\betab^T \xbtn))^{1 - \yn} \end{align}
The log-likelihood function is the log of this probability:
\begin{align} \lap(\betab) & = \log P(\mathbf{y}|\mathbf{X}, \betab) \newline & = \sum_{n=1}^N \yn \log \sigma(x') + (1 - \yn)\log(1 - \sigma(x')) \end{align}
where x' = \betab^T \xbtn.
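The log-likelihood above can be evaluated directly. A NumPy sketch, assuming `X_tilde` is the N\times(D+1) design matrix of augmented inputs (the function and variable names are ours); `logaddexp` gives \log\sigma(z) = -\log(1 + e^{-z}) without overflow:

```python
import numpy as np

def log_likelihood(beta, X_tilde, y):
    """L(beta) = sum_n y_n log sigma(x') + (1 - y_n) log(1 - sigma(x')),
    with x' = beta^T x_tilde^(n) and X_tilde of shape (N, D+1)."""
    z = X_tilde @ beta  # x' for every datapoint at once
    # log sigma(z) = -logaddexp(0, -z); log(1 - sigma(z)) = -logaddexp(0, z)
    return np.sum(y * -np.logaddexp(0.0, -z) + (1 - y) * -np.logaddexp(0.0, z))
```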
The derivative of the log-likelihood w.r.t. each \beta_i is:
\begin{align} &\dydx{\lap(\betab)}{\beta_i} = \newline &\sum_{n=1}^N \yn \dydx{}{\beta_i}\log \sigma(x') + (1 - \yn)\dydx{}{\beta_i}\log(1 - \sigma(x')) \end{align}
where the two derivatives of the log terms can be calculated separately:
\begin{align} &\dydx{(\log \sigma(x'))}{\beta_i} = \dydx{(\log \sigma(\betab^T \xbtn))}{\beta_i}\newline & = \dfrac{\dydx{{[1 +\exp(-\betab^T \xbtn)]^{-1}}}{\beta_i}}{\sigma(\betab^T \xbtn)}\newline & = \dfrac{-[1 +\exp(-\betab^T \xbtn)]^{-2} \dydx{{\exp(-\betab^T \xbtn)}}{\beta_i} } {\sigma(\betab^T \xbtn)} \newline & = \dfrac{-\sigma^2(\betab^T \xbtn)\exp(-\betab^T \xbtn)(-x_i^{(n)})}{\sigma(\betab^T \xbtn)} \newline & = \dfrac{\exp(-\betab^T \xbtn)x_i^{(n)}}{1 + \exp(-\betab^T \xbtn)} \newline & = \dfrac{(1 + \exp(-\betab^T \xbtn) - 1)x_i^{(n)}}{1 + \exp(-\betab^T \xbtn)} \newline & = (1-\sigma(x'))x_i^{(n)} \end{align}
Similarly,
\begin{align} &\dydx{(\log(1- \sigma(x')))}{\beta_i} = \dydx{(\log (1 - \sigma(\betab^T \xbtn)))}{\beta_i}\newline & = \frac{1}{1-\sigma(x')} \sigma^2(x') \exp(-\betab^T \xbtn) (-x_i^{(n)}) \newline & = -\frac{1}{1-\sigma(x')} \sigma^2(x') \left(\frac{1}{\sigma(x')}-1\right) x_i^{(n)}\newline & = - \sigma(x')x_i^{(n)} \end{align}
Combining the two terms gives:
\begin{align} &\dydx{\lap(\betab)}{\beta_i} \newline &=\sum_{n=1}^N \yn (1-\sigma(x'))x_i^{(n)} + (1 - \yn)(- \sigma(x')x_i^{(n)}) \newline & =\sum_{n=1}^N (\yn - \sigma(\betab^T \xbtn)) x_i^{(n)}\newline \end{align}
where \sigma(\betab^T \xbtn) is the sigmoid-activated prediction for y^{(n)}.
In vectorized form, the gradient is:
\dydx{\lap(\betab)}{\betab} = \sum_{n=1}^N (\yn - \sigma(\betab^T \xbtn)) \xbtn
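The sum over datapoints collapses into a single matrix-vector product. A NumPy sketch, again assuming `X_tilde` has shape (N, D+1) with one augmented datapoint per row (the function name is ours):

```python
import numpy as np

def gradient(beta, X_tilde, y):
    """dL/dbeta = sum_n (y_n - sigma(beta^T x_n)) x_n, vectorized as X^T (y - p)."""
    p = 1.0 / (1.0 + np.exp(-(X_tilde @ beta)))  # sigma(beta^T x_n) for all n
    return X_tilde.T @ (y - p)                   # shape (D+1,), matching beta
```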
Therefore, gradient ascent can be used to estimate the weights \betab:
\betab = \betab + \eta \dydx{\lap(\betab)}{\betab}
which will be discussed in detail in the next section.
Gradient Ascent
The gradient ascent algorithm for estimating the parameters \betab is shown below. Note that a different convention for the rows and columns of \mathbf{X} would require a different transpose in the algorithm; always validate the matrix-vector multiplications and make sure the gradient has the same dimension as \betab.
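The update rule above can be sketched as batch gradient ascent in NumPy. The function name, zero initialization, and default hyperparameters are illustrative choices, not prescribed by the text:

```python
import numpy as np

def fit_logistic(X_tilde, y, eta=0.1, epochs=500):
    """Batch gradient ascent on the log-likelihood.
    X_tilde: (N, D+1) design matrix with a leading column of ones."""
    beta = np.zeros(X_tilde.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X_tilde @ beta)))  # sigma(beta^T x_n)
        grad = X_tilde.T @ (y - p)                   # same dimension as beta
        beta += eta * grad                           # ascent step
    return beta
```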
The learning rate \eta is not obvious to determine, so trial and error can be used. One approach is to plot the training loss against the epoch number: if the training loss oscillates strongly, \eta is too large and should be reduced; if the loss decreases very slowly, \eta can be increased slightly. In neural networks, a conventional starting choice for the learning rate is 0.02, and after some epochs learning-rate decay can be applied to reduce it by a factor of 10. A learning-rate scheduler can also be used to choose the rate automatically during training. The aim of choosing a proper learning rate is to have the training loss decrease in a smooth, steady fashion.
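The decay-by-a-factor-of-10 heuristic mentioned above can be written as a simple step schedule. A sketch (the helper name and the default values `eta0`, `drop`, and `every` are illustrative):

```python
def step_decay(eta0=0.02, drop=0.1, every=50):
    """Return a schedule that multiplies the learning rate by `drop`
    every `every` epochs, starting from `eta0`."""
    def eta(epoch):
        return eta0 * (drop ** (epoch // every))
    return eta
```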