\( \require{amstext} \require{amsmath} \require{amssymb} \require{amsfonts} \newcommand{\eg}{\textit{e}.\textit{g}.} \newcommand{\xb}{\mathbf{x}} \newcommand{\yb}{\mathbf{y}} \newcommand{\xbtn}{\tilde{\xb}^{(n)}} \newcommand{\yn}{y^{(n)}} \newcommand{\betab}{\pmb{\beta}} \newcommand{\lap}{\mathcal{L}} \newcommand{\dydx}[2]{\frac{\partial{#1}}{\partial{#2}}} \newcommand{\dydxsmall}[2]{\partial{#1}/\partial{#2}} \)
Introduction
Logistic regression is investigated for solving a binary classification problem. Linear classification using original input features and non-linear RBF expansion on input features are both used and compared here. Undoubtedly, the non-linear classifier outperforms the linear classifier with classification accuracy 90.5% and 69.5% respectively using maximum likelihood estimation. After examining the effectiveness of using RBF, RBF is used as default in comparison for between ML, full Bayesian and Maximum a posteriori estimation.
Maximum Likelihood Learning
A dataset with $n$ datapoints, ${ y^{(n)},\xb^{(n)}}_{n=1}^{N}$ is used to investigate a binary logistic classification problem with output being 0 or 1. With $\xb^{(n)}$ is of Dimension $D$, i.e. $\xb^{n} \in \mathbb{R}^D$. Furthermore, $\tilde{\xb}^{(n)}$ is defined by augmenting with a fixed unit input for $\xb^{(n)} $, i.e. $\tilde{\xb}^{(n)} = (1, \xb^{(n)})$, therefore $\tilde{\xb}^{n} \in \mathbb{R}^{D+1}$. This enables bias terms to be multiplied directly from the biased weight $\beta_0$. i.e.
\( \betab^T \tilde{\xb^{(n)}} = \beta_0 + \sum_{d=1}^D \beta_d x_d^{(n)}. \)
where $\betab \in \mathbb{R}^{D+1}$.
We can then define our probability of positive/negative class label and therefore, the overall probability of the dataset:
\( P(\yn = 1 | \xbtn) = \frac{1}{1 +\exp(-\betab^T \xbtn)} = \sigma(\betab^T \xbtn) \)
\( P(\yn = 0 | \xbtn) = 1 - \sigma(\betab^T \xbtn) = \sigma(-\betab^T \xbtn) \)
\( \begin{align} P(\mathbf{y}|\mathbf{X}, \betab) &= \prod_{n=1}^N P(\yn | \xbtn)\newline & = \prod_{n=1}^N \sigma(\betab^T \xbtn)^{\yn}(1 - \sigma(\betab^T \xbtn))^{1 - \yn} \end{align} \)
The log-likelihood function can be defined as the log of the probability function:
\( \begin{align} \lap(\betab) & = \log P(\mathbf{y}|\mathbf{X}, \betab) \newline & = \sum_{n=1}^N \yn \log \sigma(x’) + (1 - \yn)\log(1 - \sigma(x’)) \end{align} \)
where $x’ = \betab^T \xbtn$
The derivative of the cost function w.r.t each $\beta_i$ is:
\( \begin{align} &\dydx{\lap(\betab)}{\beta_i} = \newline &\sum_{n=1}^N \yn \dydx{}{\beta_i}\log \sigma(x’) + (1 - \yn)\dydx{}{\beta_i}\log(1 - \sigma(x’)) \end{align} \)
where we can calculate the two derivatives of log function separately:
\( \begin{align} &\dydx{(\log \sigma(x’))}{\beta_i} = \dydx{(\log \sigma(\betab^T \xbtn))}{\beta_i}\newline & = \dfrac{\dydx{{[1 +\exp(-\betab^T \xbtn)]^{-1}}}{\beta_i}}{\sigma(\betab^T \xbtn)}\newline & = \dfrac{-[1 +\exp(-\betab^T \xbtn)]^{-2} \dydx{{\exp(-\betab^T \xbtn)}}{\beta_i} } {\sigma(\betab^T \xbtn)} \newline & = \dfrac{-\sigma^2(\betab^T \xbtn)\exp(-\betab^T \xbtn)(-x_i^{(n)})}{\sigma(\betab^T \xbtn)} \newline & = \dfrac{\exp(-\betab^T \xbtn)x_i^{(n)}}{1 + \exp(-\betab^T \xbtn)} \newline & = \dfrac{(1 + \exp(-\betab^T \xbtn) - 1)x_i^{(n)}}{1 + \exp(-\betab^T \xbtn)} \newline & = (1-\sigma(x’))x_i^{(n)} \end{align} \)
Similarly,
\( \begin{align} &\dydx{(\log(1- \sigma(x’)))}{\beta_i} = \dydx{(\log (1 - \sigma(\betab^T \xbtn)))}{\beta_i}\newline & = \frac{1}{1-\sigma(x’)} \sigma^2(x’) \exp(-\betab^T \xbtn) (-x_i^{(n)}) \newline & = -\frac{1}{1-\sigma(x’)} \sigma^2(x’) (\frac{1}{\sigma(x’)}-1) (x_i^{(n)})\newline & = - \sigma(x’)x_i^{(n)} \end{align} \)
Combining gives:
\( \begin{align} &\dydx{\lap(\betab)}{\beta_i} \newline &=\sum_{n=1}^N \yn (1-\sigma(x’))x_i^{(n)} + (1 - \yn)(- \sigma(x’)x_i^{(n)}) \newline & =\sum_{n=1}^N (\yn - \sigma(\betab^T \xbtn)) x_i^{(n)}\newline \end{align} \)
where $\sigma(\betab^T \xbtn))$ is the sigmoid activated prediction value for $y^{(n)}$.
Vectorized implementation of the gradient is:
\( \dydx{\lap(\betab)}{\betab} = \sum_{n=1}^N (\yn - \sigma(\betab^T \xbtn)) \xbtn \)
Therefore, gradient ascent can be used to estimate the weights $\betab$
\( \betab = \betab + \eta \dydx{\lap(\betab)}{\betab} \)
which would be investigated in detail in the next paragraph.
Gradient Decent
The gradient ascent algorithm for estimating parameters $\betab$ is shown below. Note that different definition for row and column of $\mathbf{X}$ would result in different transpose in the algorithm, always validate the matrix vector multiplication and make sure the gradient dimension matches the dimension of $\betab$
The learning rate parameter $\eta$ is not obvious to determine, we can use trial and error. One way is to plot the training loss w.r.t epoch number. If training loss oscillates a lot, that means $\eta$ is too large and need to be reduced. If training loss decreases very slowly, we can increase the training rate slightly. In neural network, a conventional choice of learning rate would be 0.02 and after some epochs, learning rate decay could be used to reduce by a factor of 10. Also, learning rate scheduler could be used to choose learning rate automatically during the training process. The aim of choosing a proper learning rate is to have the training loss decrease in a gentle fashion.