\( \require{amstext} \require{amsmath} \require{amssymb} \require{amsfonts} \newcommand{\eg}{\textit{e}.\textit{g}.} \newcommand{\xb}{\mathbf{x}} \newcommand{\yb}{\mathbf{y}} \newcommand{\xbtn}{\tilde{\xb}^{(n)}} \newcommand{\yn}{y^{(n)}} \newcommand{\betab}{\pmb{\beta}} \newcommand{\lap}{\mathcal{L}} \newcommand{\dydx}[2]{\frac{\partial{#1}}{\partial{#2}}} \newcommand{\dydxsmall}[2]{\partial{#1}/\partial{#2}} \)
Summary
This post compares cross entropy (CE) and the KL divergence. Although the two are defined differently, optimising either cost gives the same result.
Cross Entropy
Cross entropy measures the difference between two probability distributions over the same random variable. Note that it is not symmetric: \( H(p,q) = - \sum_i p(i) \log q(i) \) In the cross entropy used for classification, $p$ usually denotes the correct/target distribution (ground truth) and $q$ denotes the predicted distribution. For binary classification this becomes \( CE = -\frac{1}{m}\sum_{i=1}^m \left[ y_{i} \log(p(\hat{y_{i}})) + (1 - y_{i}) \log(1- p(\hat{y_{i}})) \right] \) where $m$ is the total number of training examples, $y_i$ is the binary label, and $p(\hat{y_{i}})$ is the predicted probability of the positive class.
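As a quick illustration, here is a minimal NumPy sketch of the binary cross entropy above; the function and array names (`binary_cross_entropy`, `y_true`, `y_pred`) are my own, and the clipping constant is just a common trick to avoid $\log 0$.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross entropy between target labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

# Example: three training examples with targets and predicted probabilities
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])
print(binary_cross_entropy(y_true, y_pred))  # ~0.2284
```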
KL divergence
Also known as relative entropy, the KL divergence measures how much one distribution differs from another. Another interpretation: it is the excess surprise incurred by using the distribution $q$ when the true distribution is $p$. Note that it is again not symmetric: \( KL(P || Q) = - \sum_i p(i) \log \frac{q(i)}{p(i)} \)
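The sketch below evaluates this definition for two small discrete distributions and shows the asymmetry; the function name and the example vectors are illustrative, and the convention $0 \log 0 = 0$ is handled by masking.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for two discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                      # convention: 0 * log(0 / q) = 0
    return np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, q))  # ~0.0851
print(kl_divergence(q, p))  # ~0.0920, a different value: KL is not symmetric
```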
Difference
Note that the entropy of $p$ is defined as \( H(p) = - \sum_i p(i) \log p(i) \) and \( KL(P || Q) = - \sum_i p(i) \log \frac{q(i)}{p(i)} = - \sum_i p(i) \log q(i) + \sum_i p(i) \log p(i) \) Therefore, \( H(p,q) = - \sum_i p(i) \log q(i) = KL(P || Q) + H(p) \) Since $H(p)$ is a constant for a given dataset (it does not depend on the model's predictions $q$), optimising the CE and optimising the KL divergence give the same result.
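A small numerical check of the identity, reusing the helper functions from the sketches above (again, the names are my own); for a fixed $p$ the two objectives differ only by the constant $H(p)$, so they share the same minimiser in $q$.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_i p(i) log p(i)."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

def cross_entropy(p, q):
    """H(p, q) = -sum_i p(i) log q(i)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl_divergence(p, q):
    """KL(P || Q) = -sum_i p(i) log(q(i) / p(i))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask] / p[mask]))

p = np.array([0.7, 0.2, 0.1])   # fixed ground-truth distribution
q = np.array([0.5, 0.3, 0.2])   # model prediction

print(cross_entropy(p, q))               # ~0.8869
print(kl_divergence(p, q) + entropy(p))  # same value: H(p,q) = KL(p||q) + H(p)
```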