\( \require{amstext} \require{amsmath} \require{amssymb} \require{amsfonts} \newcommand{\eg}{\textit{e}.\textit{g}.} \newcommand{\xb}{\mathbf{x}} \newcommand{\yb}{\mathbf{y}} \newcommand{\xbtn}{\tilde{\xb}^{(n)}} \newcommand{\yn}{y^{(n)}} \newcommand{\betab}{\pmb{\beta}} \newcommand{\lap}{\mathcal{L}} \newcommand{\dydx}[2]{\frac{\partial{#1}}{\partial{#2}}} \newcommand{\dydxsmall}[2]{\partial{#1}/\partial{#2}} \)

## Summary

This post compares cross entropy (CE) and KL divergence. The two are defined differently, but minimising either one with respect to the predicted distribution gives the same result, because they differ only by a term that does not depend on the prediction.

## Cross Entropy

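Cross entropy measures how well a distribution $q$ describes outcomes drawn from a distribution $p$ over the same random variable: \( H(p,q) = - \sum_i p(i) \log q(i) \) Note that it is not symmetric: in general $H(p,q) \neq H(q,p)$.

In classification, $p$ usually denotes the target label distribution (ground truth) and $q$ denotes the predicted distribution. For binary classification, averaging over the training set gives \( CE = -\frac{1}{m}\sum_{i=1}^m \left[ y_{i} \log(p(\hat{y_{i}})) + (1 - y_{i}) \log(1- p(\hat{y_{i}})) \right] \) where $m$ is the total number of training examples, $y_i \in \{0, 1\}$ is the true label of example $i$, and $p(\hat{y_{i}})$ is the predicted probability that the label is 1 (it plays the role of $q$ in the general definition).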
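The two cross entropy formulas can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `cross_entropy` and `binary_cross_entropy` and all numeric values are illustrative assumptions, not part of any particular framework. It assumes natural logarithms and predicted probabilities strictly between 0 and 1.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p(i) * log q(i), in nats.

    Terms with p(i) == 0 are skipped, since they contribute nothing.
    """
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def binary_cross_entropy(labels, probs):
    """Mean binary cross entropy over m examples.

    labels: true labels y_i in {0, 1}
    probs:  predicted probabilities that the label is 1, each in (0, 1)
    """
    m = len(labels)
    total = 0.0
    for y, p_hat in zip(labels, probs):
        total += y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat)
    return -total / m  # note the leading minus sign

# Cross entropy is not symmetric: swapping the arguments changes the value.
p = [0.5, 0.5]
q = [0.9, 0.1]
h_pq = cross_entropy(p, q)
h_qp = cross_entropy(q, p)
print(h_pq, h_qp)

# Binary CE is the average of per-example cross entropies between the
# one-hot target [y, 1 - y] and the prediction [p_hat, 1 - p_hat].
labels = [1, 0]
probs = [0.9, 0.2]
bce = binary_cross_entropy(labels, probs)
print(bce)
```

Writing the binary case as an average of per-example `cross_entropy` calls makes it clear that the classification loss is just the general definition applied to two-class distributions.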

## KL divergence

KL divergence, also known as relative entropy, measures how much one distribution differs from another. Another interpretation is that it is the expected excess surprise incurred by using the distribution $q$ when the true distribution is $p$. Like cross entropy, it is not symmetric: \( KL(P || Q) = - \sum_i p(i) \log \frac{q(i)}{p(i)} \)
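A short numerical sketch of the KL divergence definition, again using only the standard library. The function name `kl_divergence` and the example distributions are illustrative assumptions; the code assumes natural logarithms and $q(i) > 0$ wherever $p(i) > 0$.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p(i) * log(p(i) / q(i)), in nats.

    This is the same as -sum_i p(i) * log(q(i) / p(i)).
    Terms with p(i) == 0 are skipped, since they contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)  # excess surprise of using q when p is true
kl_qp = kl_divergence(q, p)  # excess surprise of using p when q is true
kl_pp = kl_divergence(p, p)  # a distribution has zero divergence from itself

print(kl_pq, kl_qp, kl_pp)
```

The two directions give different values, which is the asymmetry noted above, and the divergence of a distribution from itself is zero.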

## Difference

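The entropy of $p$ is defined as \( H(p) = - \sum_i p(i) \log p(i) \) Expanding the logarithm of the ratio in the KL divergence gives \( KL(P || Q) = - \sum_i p(i) \log \frac{q(i)}{p(i)} = - \sum_i p(i) \log q(i) + \sum_i p(i) \log p(i) \) The first term is the cross entropy and the second is $-H(p)$. Therefore, \( H(p,q) = - \sum_i p(i) \log q(i) = KL(P || Q) + H(p) \)

$H(p)$ depends only on the target distribution and not on the prediction $q$, so it is a constant for a given dataset. Minimising CE over $q$ is therefore equivalent to minimising the KL divergence: both have the same minimiser, and their values differ by $H(p)$. With one-hot labels, $H(p) = 0$, so the two values coincide.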
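The relation between the two quantities can be verified numerically. The sketch below is an illustration under stated assumptions: the helper names, the target distribution `p`, and the grid of candidate predictions are all hypothetical choices made for this example. It checks that CE equals KL plus entropy, and that a simple grid search over two-class predictions finds the same minimiser for both objectives.

```python
import math

def entropy(p):
    """H(p) = -sum_i p(i) * log p(i), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_i p(i) * log q(i), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p(i) * log(p(i) / q(i)), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.3, 0.7]  # fixed target distribution (illustrative values)
q = [0.6, 0.4]  # one candidate prediction (illustrative values)

# Check the identity H(p, q) = KL(P || Q) + H(p) up to rounding error.
gap = cross_entropy(p, q) - (kl_divergence(p, q) + entropy(p))
print(abs(gap) < 1e-12)

# Grid search over predictions q = [k/100, 1 - k/100] for k = 1..99.
# Because the objectives differ only by the constant H(p),
# both are minimised by the same candidate.
def candidate(k):
    return [k / 100, 1 - k / 100]

best_k_ce = min(range(1, 100), key=lambda k: cross_entropy(p, candidate(k)))
best_k_kl = min(range(1, 100), key=lambda k: kl_divergence(p, candidate(k)))
print(candidate(best_k_ce), candidate(best_k_kl))
```

Both searches select the candidate that matches `p`, where the KL divergence is zero and the cross entropy equals $H(p)$.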