Deriving Machine Learning Cost Functions using Maximum Likelihood Estimation (MLE)  Part I
Motivation
In supervised machine learning, cost functions are used to measure a trained model’s performance. The most commonly used cost function for regression is the Mean Squared Error (MSE), and Cross Entropy for binary classification problems. For many people, the reasons for choosing these cost functions are not at all clear.
The purpose of this twopart article is to shed some light on the choice of these cost functions by deriving them using Maximum Likelihood Estimation (MLE). Part I will focus on deriving MSE while Part II will focus on deriving Cross Entropy.
Background
Given a training set of $n$ examples $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots (x^{(n)}, y^{(n)})$ where $x^{(i)}$ is the feature vector for the $i^{th}$ training example and $y^{(i)}$ is its target, the goal of supervised learning is to learn a model $f: \mathcal{X} \to \mathcal{Y}$, a mapping that given $x \in \mathcal{X}$ outputs a prediction $\hat y \in \mathcal{Y}$. $\mathcal X$ and $\mathcal Y$ are called the input space and output space respectively.
In order to measure how well the model fits our training data, we define a loss function. For training example $(x^{(i)}, y^{(i)})$, the loss $\mathcal{L(y^{(i)}, \hat y^{(i)})}$ measures how different the model’s prediction $\hat y^{(i)}$ is from the true label or value. The loss is calculated for all training examples, and its average taken. This value is called the cost function and is given by:
$$ \tag{1} \begin{aligned} Cost = \frac{1}{n}\sum_{i=1}^n\mathcal{L} \bigg( y^{(i)}, \hat y^{(i)} \bigg) \end{aligned} $$
The model $f$ usually has some unknown parameter $\theta$ (In general, $\theta$ is a vector of parameters) which we will try to estimate using the training set. We can frame supervised learning as an optimisation problem  that is, we estimate the value of $\theta$ by picking the value that minimises the cost function we chose for our problem. Let us call this parameter estimate $\hat \theta$.
$$ \tag{2} \begin{aligned} \hat \theta = \underset{\theta}{\operatorname{arg\,min}} \frac{1}{n}\sum_{i=1}^n\mathcal{L}\bigg(y^{(i)}, \hat y^{(i)}\bigg) \end{aligned} $$
What is Maximum Likelihood Estimation?
Maximum Likelihood Estimation (MLE) is a method of estimating the unknown parameter $\theta$ of a model, given observed data. It estimates the model parameter by finding the parameter value that maximises the likelihood function. The parameter estimate is called the maximum likelihood estimate $\hat{\theta}_{MLE}$.
Likelihood Function
Given a set of independent and identically distributed (i.i.d)^{1} random variables $X_1, X_2, \cdots, X_n$ from a probability distribution $P_\theta$ with parameter $\theta$. We assume the random variables have a joint probability density function (or joint probability mass function in the case of discrete variables)^{2} $f(x_1, x_2,\cdots, x_n\theta)$. Suppose $x_1, x_2,\cdots, x_n$ are the observed values of the random variables, we define the likelihood function, a function of parameter $\theta$ as:
$$ \tag{3} \begin{aligned} \mathcal{L}(\theta  x_1, x_2, \cdots, x_n) = f(x_1, x_2, \cdots, x_n\theta) \end{aligned} $$
The likelihood function measures how plausible it is that the observed data was generated by the model with a particular value of $\theta$. For example, if we use $\theta_1$ and $\theta_2$ as values of $\theta$ and find that $\mathcal{L(\theta_1  x_1, \cdots, x_n)}$ > $\mathcal{L(\theta_2  x_1, \cdots, x_n)}$, we can reasonably conclude that the observed data is more likely to have been generated by the model with its parameter being $\theta_1$.
Remember that we assumed that the observed data $x_1, x_2,\cdots, x_n$ were drawn i.i.d from the probability distribution. Also recall that for independent random variables $X_1$ and $X_2$, $f(x_1,x_2\theta) = f(x_1\theta) \cdot f(x_2\theta)$. Considering these, our likelihood function in equation (3) becomes:
$$ \tag{4} \begin{aligned} \mathcal{L(\theta  x_1, x_2, \cdots, x_n)} & = f(x_1\theta)\cdot f(x_2\theta)\cdots\cdot f(x_n\theta) \\ & = \prod_{i=1}^n f(x_i\theta) \end{aligned} $$
where $f(x_i\theta)$ is the probability density (or mass for discrete variables) function of the random variable $X_i$. Since all the random variables are drawn from the same distribution, their probability density (or mass) function will be the same.
It is often easier to work with the natural logarithm of the likelihood function, called the loglikelihood. Recalling the product rule of logarithms, $log(a \cdot b) = log(a) + log(b)$. If we apply the product rule to equation (4) by taking its natural logarithm, it becomes:
$$ \tag{5} \begin{aligned} \log \bigg(\mathcal{L(\theta  x_1, x_2, \cdots, x_n)}\bigg) & = \log \bigg( f(x_1\theta)\cdot f(x_2\theta)\cdots\cdot f(x_n\theta) \bigg) \\ & = \log(f(x_1\theta)) + \log(f(x_2\theta)) + \cdots + \log (f(x_n\theta)) \\ & = \sum_{i=1}^n \log \bigg(f(x_i\theta) \bigg) \end{aligned} $$
To find the value of $\theta$ that maximises the likelihood function, we find its critical point, the point at which the function’s derivative is $0$. That is:
$$ \tag{6} \begin{aligned} \frac{\partial \mathcal{L(\theta)}}{\partial \theta} = 0 \end{aligned} $$
In most cases, this derivative is easier to compute for the loglikelihood function in equation (5) than for the vanilla likelihood function in equation «a class=”link” id=”4”>(4)</a>. Since logarithm is a monotononically increasing function, that is, if $x_1 > x_2$, then $\log(x_1) > log(x_2)$, the value that maximises the likelihood is also the value that maximises the loglikelihood function.
The other advantage of using loglikehood over likelihood is that it avoids numerical precision issues. Likelihoods are very small numbers, and taking the product of small numbers creates even smaller numbers, which can cause arithmetic underflow.
In other words,
$$ \tag{7} \begin{aligned} \hat{\theta}_{MLE} = \underset{\theta}{\operatorname{arg\,max}}\ \mathcal L(\theta\, x_1, \cdots, x_n) = \underset{\theta}{\operatorname{arg\,max}}\ \log (\mathcal L(\theta\, x_1, \cdots, x_n)) \end{aligned} $$
Deriving Mean Squared Error(MSE) using MLE
Mean Squared Error is the cost function commonly used for regression. Given a set of $n$ traning examples of the form $(x^{(i)}, y^{(i)})$, MSE is given by:
$$ \tag{8} \begin{aligned} MSE = \frac{1}{n} \sum_{i=1}^n\bigg(y^{(i)}  \hat y^{(i)}\bigg)^2 \end{aligned} $$
where $x^{(i)}$ is the feature vector, $y^{(i)}$ is the true output value, and $\hat y^{(i)}$ is the regression model’s prediction for the $i^{th}$ training example.
In this section, we will derive MSE using maximum likelihood estimation.
Regression Model
Given a vector of predictor variables $X = (X_1, X_2, \cdots X_p)$, and quantitative outcome variable $Y$, linear regression assumes that there is a linear relationship between the population mean of the outcome and the predictor variables. This relationship can be expressed by:
$$ \tag{9} \begin{aligned} \mathbb{E}(YX) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p \end{aligned} $$
$\mathbb{E}(YX)$ measures the population mean of $Y$ given a particular value of $X$ and it is often called the true regression line. Now, our observed data will not lie exactly on this true regression line. They will be spread about the true regression line. For example, if we are trying to predict height from age of a population, we will find that people of the same age will have different heights. Each person's height will differ from the population mean $\mathbb{E}(YX)$ by a certain amount. We represent this mathematically as:
$$ \tag{10} \begin{aligned} Y & = \mathbb{E}(YX) + \epsilon \\ & = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon \end{aligned} $$
The spread about the true regression line is what the $\epsilon$ term captures. We assume it is independent of $X$ and is drawn from a Normal distribution with zero mean ($\mu$ = 0) and variance $\sigma^2$, i.e. $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
Estimating linear regression’s model parameters
The true regression line and its model parameters $\beta_0, \beta_1, \cdots \beta_p$ in equation (10) are unknown, so we will try to estimate them using training data $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots (x^{(n)}, y^{(n)})$, where $x^{(i)}$ is a vector of $p$ predictor variables $(x_1, x_2, \cdots, x_p)$ and $y^{(i)}$ is the target value. Our model (the estimate of the true regression line) is:
$$ \tag{11} \begin{aligned} y^{(i)} & = \beta_0 + \beta_1 x^{(i)}_1 + \cdots + \beta_p x^{(i)}_p + \epsilon \\ & = \beta^\intercal x^{(i)} + \epsilon \\ & = \hat y^{(i)} + \epsilon \end{aligned} $$
You will notice that $\hat y^{(i)}$ in equation (11) is the estimate of $\mathbb{E}(YX)$ in equation (10). The $\epsilon$ term is called the residual and it measures the difference between the observed value $y^{(i)}$ and the predicted value $\hat y^{(i)}$ from the regression equation.
Estimating $\beta_0, \beta_1, \cdots \beta_p$ using the training data is an optimisation problem that we can solve using MLE by defining a likelihood function. Since $\epsilon$ is normally distributed $\epsilon \sim \mathcal{N}(0, \sigma^2)$, our outcome variable $y$ will also be normally distributed. So, $y \sim \mathcal{N}(\beta^\intercal x, \sigma^2)$ The probability density function of the normal distribution (parameterised by $\mu$: mean, and $\sigma^2$: variance) is given by:
$$ \tag{12} \begin{aligned} f(y \mid \color{red}\mu, \color{black}\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2} } e^{ \frac{(y\color{red}\mu\color{black})^2}{2\sigma^2} } \end{aligned} $$
Therefore, if we replace the parameter $\mu$ with the mean of $y$, we get:
$$ \tag{13} \begin{aligned} f(y \mid \color{red}\beta^\intercal x \color{black}, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2} } e^{ \frac{(y\color{red}\beta^\intercal x\color{black})^2}{2\sigma^2} } \end{aligned} $$
The loglikelihood function will then be:
$$ \tag{14} \begin{aligned} \log \bigg(\mathcal{L(\beta  x^{(1)}, x^{(2)}, \cdots, x^{(n)})}\bigg) & = \sum_{i=1}^n \log \bigg( \frac{1}{\sqrt{2\pi\sigma^2} } e^{ \frac{(y\beta^\intercal x^{(i)})^2}{2\sigma^2} } \bigg) \\ & = n \log \bigg( \frac{1}{\sqrt{2\pi\sigma^2} } \bigg)  \sum_{i=1}^n \frac{(y\beta^\intercal x^{(i)})^2}{2\sigma^2} \end{aligned} $$
Since we need to take the derivative of loglikehood function with respect to $\beta$ to find the maximum likehood estimate of $\beta$, we can remove all the terms that do not contain our parameter $\beta$ as they do not have any effect on our optimisation ^{3}, so our equation becomes:
$$ \tag{15} \begin{aligned} \log \bigg(\mathcal{L(\beta  x^{(1)}, x^{(2)}, \cdots, x^{(n)})}\bigg) & =  \sum_{i=1}^n (y\beta^\intercal x^{(i)})^2 \\ & =  \sum_{i=1}^n (y\hat y^{(i)})^2 \end{aligned} $$
The value of $\beta$ that maximises equation (15) is the maximum likelihood estimate $\hat \beta_{MSE}$. That is,
$$ \tag{16} \begin{aligned} \hat \beta_{MSE} = \underset{\beta}{\operatorname{arg\,max}} \bigg[ \sum_{i=1}^n(y\hat y^{(i)})^2 \bigg] \end{aligned} $$
Recall that maximising a function is the same as minimising its negative, so we can rewrite equation (16) as
$$ \tag{17} \begin{aligned} \hat \beta_{MSE} & = \underset{\beta}{\operatorname{arg\,max}} \bigg[ \sum_{i=1}^n(y\hat y^{(i)})^2 \bigg] \\ & = \underset{\beta}{\operatorname{arg\,min}} \color{red}\sum_{i=1}^n(y\hat y^{(i)})^2 \end{aligned} $$
Taking the average across all $n$ training examples, we get:
$$ \tag{18} \begin{aligned} \hat \beta_{MSE} = \underset{\beta}{\operatorname{arg\,min}} \color{red} \frac{1}{n} \sum_{i=1}^n(y\hat y^{(i)})^2 \end{aligned} $$
which is exactly the mean squared error (MSE) cost function as defined in equation (8).
Conclusion
This article introduced the concept of maximum likehood estimation, likehood function, and showed the derivation of mean squared error (MSE). The second part will cover the derivation of cross entropy using maximum likelihood estimation.
Footnotes

A set of random variables $X_1, X_2, \cdots, X_n$ are said to be independent and identically distributed if they have the same probability distribution and are mutually independent. Wikipedia Article ↩

This note from the Introduction to Probability and Statistics class on MIT OpenCourseWare explains joint probability mass and density functions clearly. ↩

The terms without the parameter $\beta$ are regarded as constants in the loglikelihood function, and differentiating a constant gives us zero. ↩