Deriving Machine Learning Cost Functions using Maximum Likelihood Estimation (MLE)  Part II
In Part I of this article, we introduced Maximum Likelihood Estimation (MLE), Likelihood function, and derived Mean Squared Error (MSE) using Maximum likelihood estimation. In this article, we will use Maximum likelihood estimation to derive CrossEntropy cost function, which is commonly used for binary classification problems.
Background
Binary logistic regression is used to model the relationship between a categorical target variable $Y$ and a predictor vector $X = (X_1, X_2, \cdots, X_p)$. The target variable will have two possible values, such as whether a student passes an exam or not, or whether a visitor to a website subscribes to the website’s newsletter or not. The two possible categories are coded as ‘1’, called the positive class, and ‘0’, called the negative class. Binary logistic regression estimates the probability that the response variable $Y$ belongs to the positive class given $X$.
$$ \begin{align} p(X) = Pr(Y = 1  X) \end{align} $$
In linear regression, we model the expected value (the mean $\mu$) of the continuous target variable $Y$ as a linear combination of the predictor vector $X$ and estimate the weight parameters $\beta_1, \beta_2, \cdots, \beta_p$ using our training data.
$$ \begin{align} \mathbb{E}(YX) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p \label{eqn:trueregression} \end{align} $$
In this case where our target variable $Y$ is categorical and has two possible values coded as 0 and 1, the expected value or mean of $Y$ is the probability $p(X)$ of observing the positive class^{1}. It seems sensible then to model the expected value of our categorical $Y$ variable using equation \eqref{eqn:trueregression}, as in linear regression.
$$ \begin{align} \mathbb{E}(YX) = p(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p \label{eqn:probabilitylinear} \end{align} $$
The problem with modelling the probability $p(X)$ as a linear combination of the predictor variables is that probability $p(X)$ has a range $[0, 1]$, but the righthand side of the equation outputs values in the range $(\infty, +\infty)$. In other words, we will get meaningless estimates of the probability if we use that equation.
The solution is to use a function of probability $p(X)$ that provides a suitable relationship between the linear combination of the predictor variables $X$ and $p(X)$, the mean of the response variable. This function is called a link function, and it maps the probability range $[0, 1]$ to $(\infty, +\infty)$.
The most commonly used link function for binary logistic regression is the logit function (or logodds^{2}), given as:
$$ \begin{align} \text{logit}\bigg(p(X)\bigg) = \text{log}\bigg(\frac{p(X)}{1p(X)}\bigg) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p \label {eqn:logit} \end{align} $$
How do we then go from the logit function to getting the estimate of the probability p(X) of observing the positive class? Because logit is a function of probability, we can take its inverse to map arbitrary values in the range $(\infty, +\infty)$ back to the probability range $[0, 1]$.
Recall that the inverse function of the natural logarithm function is the exponential function, so if we take the inverse of equation \eqref{eqn:logit}, we get:
$$ \begin{align} \frac{p(X)}{1p(X)} = e^{(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)} \label{eqn:odds} \end{align} $$
If we solve for $p(X)$ in equation \eqref{eqn:odds}, we get^{3}:
$$ \begin{align} p(X) & = & \frac{e^{(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}}{e^{(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)} + 1} \nonumber \\ & = & \frac{1}{1 + e^{(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}} \label{eqn:logistic} \end{align} $$
Equation \eqref{eqn:logistic} is the logistic (or sigmoid) function, and it maps values in the logit range $(\infty, +\infty)$ back into the range $[0, 1]$ of probabilities.
Deriving Cost Entropy using MLE
Given a set of $n$ training examples $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(n)}, y^{(n)})\}$, binary crossentropy is given by:
$$ \begin{align} \text{CrossEntropy} = \bigg[\frac{1}{n} \sum_{i=1}^n\bigg(y^{(i)}\text{log} p^{(i)} + (1 y^{(i)})\text{log}(1  p^{(i)})\bigg)\bigg] \label{eqn:crossentropy} \end{align} $$
where $x^{(i)}$ is the feature vector, $y^{(i)}$ is the true label (0 or 1) for the $i^{th}$ training example, and $p^{(i)}$ is the predicted probability that the $i^{th}$ training example belongs to the positive class, that is, $Pr(Y = 1  X = x^{(i)})$.
In this section, we will derive crossentropy using MLE. If you are not already familiar with MLE and likelihood function, I will advise that you read the section that explains both concepts in Part I of this article.
The derivation of crossentropy follows from using MLE to estimate the parameters $\beta_0, \beta_1, \cdots, \beta_p$ of our logistic model on our training data.
We start by describing the random process that generated $y^{(i)}$.
$y^{(i)}$ is a realisation of the Bernoulli random variable^{4} $Y$. The Bernoulli distribution is parameterised by $p$, and its probability mass function (pmf) is given by:
$$ \begin{align} Pr(Y = y^{(i)}) = \begin{cases} p & \text{if } y^{(i)} = 1, \\ 1p & \text {if } y^{(i)} = 0. \end{cases} \end{align} $$
which can be written in the more compact form:
$$ \begin{align} Pr(Y = y^{(i)}) = p^{y^{(i)}}(1p)^{1  y^{(i)}} \ \text{for} \ y^{(i)} \in \{0,1\} \label{eqn:compact} \end{align} $$
We then define our Likelihood function. The estimates of $\beta_0, \beta_1, \cdots, \beta_p$ we choose will be the ones that maximise the likelihood function. The likelihood function is a function of our parameter $p$ given our training data:
$$ \begin{align} \mathcal{L}(p  (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(n)}, y^{(n)})) & = \prod_{i=1}^n f(y^{(i)}p) \nonumber \\ & = \prod_{i=1}^n p^{y^{(i)}}(1p)^{1  y^{(i)}} \label{eq:likelihood} \end{align} $$
It is easier to work with the log of the likelihood function^{5}, called the loglikelihood, so if we take the natural logarithm of equation \eqref{eq:likelihood}, we get:
$$ \begin{align} \log \bigg(\mathcal{L}(p  y^{(1)}, y^{(2)}, \cdots, y^{(n)})\bigg) & = \log\bigg( \prod_{i=1}^n p^{y^{(i)}}(1p)^{1  y^{(i)}} \bigg) \nonumber \\ & = \sum_{i=1}^n \log \bigg(p^{y^{(i)}}(1p)^{1  y^{(i)}}\bigg) \nonumber \\ & = \sum_{i=1}^n \bigg(y^{(i)}\text{log}p^{(i)} + (1y^{(i)})\text{log}(1p^{(i)})\bigg) \label{eq:loglikelihood} \end{align} $$
Recall that for our training data, $p^{(i)}$ in equation \eqref{eq:loglikelihood} is the predicted probability of the $i^{th}$ training example gotten from the logisitic function, so it is a function of the parameters $\beta_0, \beta_1, \cdots, \beta_p$. The maximum likelihood estimate $\hat \beta$ is therefore the value of the parameters that maximises the loglikelihood function.
$$ \begin{align} \hat \beta = \underset{\beta}{\operatorname{arg\,max}} \bigg[ \sum_{i=1}^n \bigg(y^{(i)}\text{log}p^{(i)} + (1y^{(i)})\text{log}(1p^{(i)})\bigg) \bigg] \end{align} $$
We also know that maximising a function is the same as minimising its negative.
$$ \begin{align} \hat \beta = \underset{\beta}{\operatorname{arg\,min}} \bigg[ \sum_{i=1}^n \bigg(y^{(i)}\text{log}p^{(i)} + (1y^{(i)})\text{log}(1p^{(i)})\bigg) \bigg] \end{align} $$
Taking the average across our $n$ training examples, we get:
$$ \begin{align} \hat \beta = \underset{\beta}{\operatorname{arg\,min}} \bigg[\color{red}\frac{1}{n}\sum_{i=1}^n \bigg(y^{(i)}\text{log}p^{(i)} + (1y^{(i)})\text{log}(1p^{(i)})\bigg) \color{black}\bigg] \end{align} $$
which is the crossentropy as defined in equation \eqref{eqn:crossentropy}.
Footnotes

$Y$ is a Bernoulli random variable. The Bernoulli distribution is a discrete probability distribution that describes processes that have only two possible outcomes: 1, with a probability of $p$ and 0, with a probability of $1  p$. The expectation (or mean) of the Bernoulli distribution is $p$. [Proof] ↩

The odds is defined as the ratio of the probability $p$ of observing an event to the probability $1p$ of not observing that event. $\text{odds} = \frac{p}{1p}$. If, for example, the odds of an event happening is $o$:1, it means that the event happened $o$ times out of a total of $o+1$ occurrences. The logodds is simply the natural logarithm of the odds. ↩

Bernoulli distribution is the discrete probability distribution of a random variable that takes on two possible values: 1 with probability $p$ and 0 with probability $1p$. An experiment modelled by the Bernoulli distribution is called a Bernoull trial. Examples of Bernoulli trials include: tossing a coin (head/tail), playing a game (winning/not winning). ↩

The reasons why this is the case is explained clearly in Part I. Check the ‘What is Maximum Likelihood Estimation?’ section. ↩