\require{amstext} \require{amsmath} \require{amssymb} \require{amsfonts} \newcommand{\eg}{\textit{e}.\textit{g}.} \newcommand{\xb}{\mathbf{x}} \newcommand{\yb}{\mathbf{y}} \newcommand{\xbtn}{\tilde{\xb}^{(n)}} \newcommand{\yn}{y^{(n)}} \newcommand{\betab}{\pmb{\beta}} \newcommand{\lap}{\mathcal{L}} \newcommand{\dydx}[2]{\frac{\partial{#1}}{\partial{#2}}} \newcommand{\dydxsmall}[2]{\partial{#1}/\partial{#2}}
Introduction
Logistic regression is investigated for a binary classification problem. Linear classification on the original input features and non-linear classification using an RBF expansion of the inputs are both applied and compared. The non-linear classifier outperforms the linear one, with classification accuracies of 90.5% and 69.5% respectively under maximum likelihood estimation. Having established the effectiveness of the RBF expansion, it is used by default in the comparison between maximum likelihood (ML), maximum a posteriori (MAP), and full Bayesian estimation.
Maximum Likelihood Learning
A dataset of N datapoints, \{ y^{(n)}, \xb^{(n)} \}_{n=1}^{N}, is used to investigate a binary logistic classification problem with outputs 0 or 1. Each input \xb^{(n)} is of dimension D, i.e. \xb^{(n)} \in \mathbb{R}^D. Furthermore, \tilde{\xb}^{(n)} is defined by augmenting \xb^{(n)} with a fixed unit input, i.e. \tilde{\xb}^{(n)} = (1, \xb^{(n)}), so that \tilde{\xb}^{(n)} \in \mathbb{R}^{D+1}. This allows the bias term to come directly from the bias weight \beta_0, i.e.
\betab^T \xbtn = \beta_0 + \sum_{d=1}^D \beta_d x_d^{(n)},
where \betab \in \mathbb{R}^{D+1}.
We can then define the probability of the positive/negative class label and, from it, the overall probability of the dataset:
P(\yn = 1 | \xbtn) = \frac{1}{1 +\exp(-\betab^T \xbtn)} = \sigma(\betab^T \xbtn)
P(\yn = 0 | \xbtn) = 1 - \sigma(\betab^T \xbtn) = \sigma(-\betab^T \xbtn)
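Before using these probabilities in code, \sigma must be implemented with some care, since \exp(-z) overflows for large negative z. A minimal NumPy sketch (the function name `sigmoid` and the sign-splitting trick are ours, not part of the text):

```python
import numpy as np

def sigmoid(z):
    """Numerically stable logistic function sigma(z) = 1 / (1 + exp(-z))."""
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1 and cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, rewrite as exp(z) / (1 + exp(z)) so exp never sees a large argument.
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out
```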
\begin{align} P(\mathbf{y}|\mathbf{X}, \betab) &= \prod_{n=1}^N P(\yn | \xbtn)\newline & = \prod_{n=1}^N \sigma(\betab^T \xbtn)^{\yn}(1 - \sigma(\betab^T \xbtn))^{1 - \yn} \end{align}
The log-likelihood function is the log of this probability:
\begin{align} \lap(\betab) & = \log P(\mathbf{y}|\mathbf{X}, \betab) \newline & = \sum_{n=1}^N \yn \log \sigma(x') + (1 - \yn)\log(1 - \sigma(x')) \end{align}
where x' = \betab^T \xbtn.
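The log-likelihood above can be evaluated directly. A NumPy sketch, assuming `X_tilde` is the N\times(D+1) design matrix of augmented inputs (the function and variable names are ours); `logaddexp` gives \log\sigma(z) = -\log(1 + e^{-z}) without overflow:

```python
import numpy as np

def log_likelihood(beta, X_tilde, y):
    """L(beta) = sum_n y_n log sigma(x') + (1 - y_n) log(1 - sigma(x')),
    with x' = beta^T x_tilde^(n) and X_tilde of shape (N, D+1)."""
    z = X_tilde @ beta  # x' for every datapoint at once
    # log sigma(z) = -logaddexp(0, -z); log(1 - sigma(z)) = -logaddexp(0, z)
    return np.sum(y * -np.logaddexp(0.0, -z) + (1 - y) * -np.logaddexp(0.0, z))
```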
The derivative of the log-likelihood w.r.t. each \beta_i is:
\begin{align} &\dydx{\lap(\betab)}{\beta_i} = \newline &\sum_{n=1}^N \yn \dydx{}{\beta_i}\log \sigma(x') + (1 - \yn)\dydx{}{\beta_i}\log(1 - \sigma(x')) \end{align}
where the two derivatives of the log terms can be calculated separately:
\begin{align} &\dydx{(\log \sigma(x'))}{\beta_i} = \dydx{(\log \sigma(\betab^T \xbtn))}{\beta_i}\newline & = \dfrac{\dydx{{[1 +\exp(-\betab^T \xbtn)]^{-1}}}{\beta_i}}{\sigma(\betab^T \xbtn)}\newline & = \dfrac{-[1 +\exp(-\betab^T \xbtn)]^{-2} \dydx{{\exp(-\betab^T \xbtn)}}{\beta_i} } {\sigma(\betab^T \xbtn)} \newline & = \dfrac{-\sigma^2(\betab^T \xbtn)\exp(-\betab^T \xbtn)(-x_i^{(n)})}{\sigma(\betab^T \xbtn)} \newline & = \dfrac{\exp(-\betab^T \xbtn)x_i^{(n)}}{1 + \exp(-\betab^T \xbtn)} \newline & = \dfrac{(1 + \exp(-\betab^T \xbtn) - 1)x_i^{(n)}}{1 + \exp(-\betab^T \xbtn)} \newline & = (1-\sigma(x'))x_i^{(n)} \end{align}
Similarly,
\begin{align} &\dydx{(\log(1- \sigma(x')))}{\beta_i} = \dydx{(\log (1 - \sigma(\betab^T \xbtn)))}{\beta_i}\newline & = \frac{1}{1-\sigma(x')} \sigma^2(x') \exp(-\betab^T \xbtn) (-x_i^{(n)}) \newline & = -\frac{1}{1-\sigma(x')} \sigma^2(x') \left(\frac{1}{\sigma(x')}-1\right) x_i^{(n)}\newline & = - \sigma(x')x_i^{(n)} \end{align}
Combining the two terms gives:
\begin{align} &\dydx{\lap(\betab)}{\beta_i} \newline &=\sum_{n=1}^N \yn (1-\sigma(x'))x_i^{(n)} + (1 - \yn)(- \sigma(x')x_i^{(n)}) \newline & =\sum_{n=1}^N (\yn - \sigma(\betab^T \xbtn)) x_i^{(n)}\newline \end{align}
where \sigma(\betab^T \xbtn) is the sigmoid-activated prediction for y^{(n)}.
In vectorized form, the gradient is:
\dydx{\lap(\betab)}{\betab} = \sum_{n=1}^N (\yn - \sigma(\betab^T \xbtn)) \xbtn
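The sum over datapoints collapses into a single matrix-vector product. A NumPy sketch, again assuming `X_tilde` has shape (N, D+1) with one augmented datapoint per row (the function name is ours):

```python
import numpy as np

def gradient(beta, X_tilde, y):
    """dL/dbeta = sum_n (y_n - sigma(beta^T x_n)) x_n, vectorized as X^T (y - p)."""
    p = 1.0 / (1.0 + np.exp(-(X_tilde @ beta)))  # sigma(beta^T x_n) for all n
    return X_tilde.T @ (y - p)                   # shape (D+1,), matching beta
```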
Therefore, gradient ascent can be used to estimate the weights \betab:
\betab = \betab + \eta \dydx{\lap(\betab)}{\betab}
which will be discussed in detail in the next section.
Gradient Ascent
The gradient ascent algorithm for estimating the parameters \betab is shown below. Note that a different convention for the rows and columns of \mathbf{X} would require a different transpose in the algorithm; always validate the matrix-vector multiplications and make sure the gradient has the same dimension as \betab.
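The update rule above can be sketched as batch gradient ascent in NumPy. The function name, zero initialization, and default hyperparameters are illustrative choices, not prescribed by the text:

```python
import numpy as np

def fit_logistic(X_tilde, y, eta=0.1, epochs=500):
    """Batch gradient ascent on the log-likelihood.
    X_tilde: (N, D+1) design matrix with a leading column of ones."""
    beta = np.zeros(X_tilde.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X_tilde @ beta)))  # sigma(beta^T x_n)
        grad = X_tilde.T @ (y - p)                   # same dimension as beta
        beta += eta * grad                           # ascent step
    return beta
```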
The learning rate \eta is not obvious to determine, so trial and error can be used. One approach is to plot the training loss against the epoch number: if the training loss oscillates strongly, \eta is too large and should be reduced; if the loss decreases very slowly, \eta can be increased slightly. In neural networks, a conventional starting choice for the learning rate is 0.02, and after some epochs learning-rate decay can be applied to reduce it by a factor of 10. A learning-rate scheduler can also be used to choose the rate automatically during training. The aim of choosing a proper learning rate is to have the training loss decrease in a smooth, steady fashion.
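The decay-by-a-factor-of-10 heuristic mentioned above can be written as a simple step schedule. A sketch (the helper name and the default values `eta0`, `drop`, and `every` are illustrative):

```python
def step_decay(eta0=0.02, drop=0.1, every=50):
    """Return a schedule that multiplies the learning rate by `drop`
    every `every` epochs, starting from `eta0`."""
    def eta(epoch):
        return eta0 * (drop ** (epoch // every))
    return eta
```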