
Logistic Classification Comparing ML/MAP/Bayesian Approaches

With PCA and logistic regression

Posted by Jingbiao on March 14, 2021, Reading time: 3 minutes.

Introduction

Logistic regression is investigated for solving a binary classification problem. Linear classification on the original input features and non-linear classification using an RBF expansion of the input features are both used and compared here. Unsurprisingly, the non-linear classifier outperforms the linear one, with classification accuracies of 90.5% and 69.5% respectively under maximum likelihood estimation. Having established the effectiveness of the RBF expansion, it is used as the default feature map in the comparison between maximum likelihood (ML), maximum a posteriori (MAP) and full Bayesian estimation.
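As a rough illustration of what the RBF expansion looks like in code, the sketch below maps each input to a vector of Gaussian basis-function responses. The choice of centres (here, the training points themselves) and the lengthscale are assumptions for illustration only; the post does not fix them at this point.

```python
import numpy as np

def rbf_expand(X, centres, lengthscale=1.0):
    """Map raw inputs X (N, D) to RBF features (N, M), one feature per centre.

    Each feature is exp(-||x - c||^2 / (2 * lengthscale**2)).
    """
    # Squared Euclidean distance between every input and every centre
    sq_dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2))

# Example: expand 2-D inputs using the training points as centres (an assumption)
X_train = np.random.randn(200, 2)
Phi = rbf_expand(X_train, centres=X_train, lengthscale=0.5)  # shape (200, 200)
```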

Maximum Likelihood Learning

A dataset with $N$ datapoints, $\{y^{(n)}, \mathbf{x}^{(n)}\}_{n=1}^{N}$, is used to investigate a binary logistic classification problem with outputs $y^{(n)} \in \{0, 1\}$. Each input $\mathbf{x}^{(n)}$ is of dimension $D$, i.e. $\mathbf{x}^{(n)} \in \mathbb{R}^{D}$. Furthermore, $\tilde{\mathbf{x}}^{(n)}$ is defined by augmenting $\mathbf{x}^{(n)}$ with a fixed unit input, i.e. $\tilde{\mathbf{x}}^{(n)} = (1, \mathbf{x}^{(n)})$, so that $\tilde{\mathbf{x}}^{(n)} \in \mathbb{R}^{D+1}$. This allows the bias term to come directly from the bias weight $\beta_0$, i.e.

$$\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)} = \beta_0 + \sum_{d=1}^{D} \beta_d x_d^{(n)},$$

where $\boldsymbol{\beta} \in \mathbb{R}^{D+1}$.
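In code, this augmentation is simply prepending a column of ones to the design matrix, for example:

```python
import numpy as np

def add_bias(X):
    """Augment inputs (N, D) with a fixed unit column, giving (N, D + 1)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])
```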

We can then define the probability of the positive and negative class labels, and therefore the overall probability of the dataset:

$$P(y^{(n)} = 1 \mid \tilde{\mathbf{x}}^{(n)}) = \frac{1}{1 + \exp(-\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)})} = \sigma(\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)})$$

$$P(y^{(n)} = 0 \mid \tilde{\mathbf{x}}^{(n)}) = 1 - \sigma(\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)}) = \sigma(-\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)})$$

$$P(\mathbf{y} \mid X, \boldsymbol{\beta}) = \prod_{n=1}^{N} P(y^{(n)} \mid \tilde{\mathbf{x}}^{(n)}) = \prod_{n=1}^{N} \sigma(\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)})^{y^{(n)}} \left(1 - \sigma(\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)})\right)^{1 - y^{(n)}}$$

The log-likelihood function can be defined as the log of the probability function:

$$\mathcal{L}(\boldsymbol{\beta}) = \log P(\mathbf{y} \mid X, \boldsymbol{\beta}) = \sum_{n=1}^{N} y^{(n)} \log \sigma(x) + (1 - y^{(n)}) \log(1 - \sigma(x)),$$

where $x = \boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)}$.
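As a quick sanity check, the log-likelihood can be computed in vectorized form as follows; the clipping constant is an added assumption to avoid evaluating $\log(0)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X_tilde, y, eps=1e-12):
    """L(beta) = sum_n y log sigma(z) + (1 - y) log(1 - sigma(z)), z = X_tilde @ beta."""
    p = sigmoid(X_tilde @ beta)
    p = np.clip(p, eps, 1.0 - eps)   # numerical safety: eps is an added assumption
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```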

The derivative of the log-likelihood w.r.t. each $\beta_i$ is:

$$\frac{\partial \mathcal{L}(\boldsymbol{\beta})}{\partial \beta_i} = \sum_{n=1}^{N} y^{(n)} \frac{\partial}{\partial \beta_i} \log \sigma(x) + (1 - y^{(n)}) \frac{\partial}{\partial \beta_i} \log(1 - \sigma(x)),$$

where we can calculate the derivatives of the two log terms separately:

$$\begin{aligned}
\frac{\partial \log \sigma(x)}{\partial \beta_i}
&= \frac{\partial \log \sigma(\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)})}{\partial \beta_i}
= \frac{1}{\sigma(\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)})} \frac{\partial}{\partial \beta_i} \left[1 + \exp(-\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)})\right]^{-1} \\
&= \frac{1}{\sigma(\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)})} \left[1 + \exp(-\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)})\right]^{-2} \exp(-\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)})\, \tilde{x}_i^{(n)} \\
&= \frac{\sigma^2(\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)}) \exp(-\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)})\, \tilde{x}_i^{(n)}}{\sigma(\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)})}
= \frac{\exp(-\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)})\, \tilde{x}_i^{(n)}}{1 + \exp(-\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)})} \\
&= \frac{\left(1 + \exp(-\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)}) - 1\right) \tilde{x}_i^{(n)}}{1 + \exp(-\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)})}
= (1 - \sigma(x))\, \tilde{x}_i^{(n)}
\end{aligned}$$

Similarly,

$$\begin{aligned}
\frac{\partial \log(1 - \sigma(x))}{\partial \beta_i}
&= \frac{\partial \log\left(1 - \sigma(\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)})\right)}{\partial \beta_i}
= -\frac{1}{1 - \sigma(x)}\, \sigma^2(x) \exp(-\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)})\, \tilde{x}_i^{(n)} \\
&= -\frac{1}{1 - \sigma(x)}\, \sigma^2(x) \left(\frac{1}{\sigma(x)} - 1\right) \tilde{x}_i^{(n)}
= -\sigma(x)\, \tilde{x}_i^{(n)}
\end{aligned}$$

Combining gives:

$$\frac{\partial \mathcal{L}(\boldsymbol{\beta})}{\partial \beta_i} = \sum_{n=1}^{N} y^{(n)} (1 - \sigma(x))\, \tilde{x}_i^{(n)} + (1 - y^{(n)}) \left(-\sigma(x)\, \tilde{x}_i^{(n)}\right) = \sum_{n=1}^{N} \left(y^{(n)} - \sigma(\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)})\right) \tilde{x}_i^{(n)},$$

where $\sigma(\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)})$ is the sigmoid-activated prediction for $y^{(n)}$.

The vectorized form of the gradient is:

$$\frac{\partial \mathcal{L}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \sum_{n=1}^{N} \left(y^{(n)} - \sigma(\boldsymbol{\beta}^T \tilde{\mathbf{x}}^{(n)})\right) \tilde{\mathbf{x}}^{(n)}$$
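In code, this vectorized gradient reduces to a single matrix-vector product; a minimal sketch, assuming $\tilde{X}$ is stored with one augmented datapoint per row:

```python
import numpy as np

def gradient(beta, X_tilde, y):
    """dL/dbeta = sum_n (y_n - sigma(beta^T x_n_tilde)) x_n_tilde, vectorized.

    Assumes X_tilde has shape (N, D + 1), one augmented datapoint per row.
    """
    p = 1.0 / (1.0 + np.exp(-(X_tilde @ beta)))   # sigma(beta^T x_tilde) for every n
    return X_tilde.T @ (y - p)
```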

Therefore, gradient ascent can be used to estimate the weights $\boldsymbol{\beta}$:

$$\boldsymbol{\beta} \leftarrow \boldsymbol{\beta} + \eta\, \frac{\partial \mathcal{L}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}},$$

which is investigated in detail in the next section.

Gradient Descent

The gradient ascent algorithm for estimating the parameters $\boldsymbol{\beta}$ is shown below. Note that different conventions for the rows and columns of $X$ lead to different transposes in the algorithm; always check the matrix-vector multiplication and make sure the gradient has the same dimension as $\boldsymbol{\beta}$.
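A minimal NumPy sketch of such a batch gradient-ascent loop is given below; the fixed number of iterations and the default learning rate are illustrative assumptions rather than values from the experiments.

```python
import numpy as np

def fit_logistic_ml(X_tilde, y, eta=0.02, n_iters=2000):
    """Maximum-likelihood fit by batch gradient ascent on the log-likelihood.

    X_tilde: (N, D + 1) augmented inputs, y: (N,) labels in {0, 1}.
    eta and n_iters are illustrative choices, not values from the post.
    """
    beta = np.zeros(X_tilde.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X_tilde @ beta)))   # predictions sigma(beta^T x_tilde)
        grad = X_tilde.T @ (y - p)                    # dL/dbeta
        beta = beta + eta * grad                      # ascent step
    return beta
```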

The learning rate $\eta$ is not obvious to determine, so trial and error is typically used. One approach is to plot the training loss against the epoch number. If the training loss oscillates a lot, $\eta$ is too large and needs to be reduced; if the training loss decreases very slowly, the learning rate can be increased slightly. In neural networks, a conventional starting choice of learning rate is 0.02, and after some epochs learning-rate decay can be applied to reduce it by a factor of 10. Alternatively, a learning-rate scheduler can be used to adjust the learning rate automatically during training. The aim of choosing a proper learning rate is to have the training loss decrease in a gentle fashion. A small sketch of such a step-decay schedule is given below.
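This sketch implements the step-decay idea described above (cut the rate by a factor of 10 every so many epochs); the epoch boundaries and decay factor are illustrative assumptions.

```python
def step_decay(initial_eta, epoch, drop_every=50, factor=10.0):
    """Learning rate for a given epoch: start at initial_eta and divide by
    `factor` every `drop_every` epochs (illustrative values)."""
    return initial_eta / (factor ** (epoch // drop_every))

# e.g. step_decay(0.02, 0)  -> 0.02
#      step_decay(0.02, 75) -> 0.002
```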

Extension

References


