Deriving Machine Learning Cost Functions using Maximum Likelihood Estimation (MLE) - Part I

Machine-Learning

Motivation

In supervised machine learning, cost functions are used to measure a trained model’s performance. The most commonly used cost function for regression is the Mean Squared Error (MSE), and Cross Entropy for binary classification problems. For many people, the reasons for choosing these cost functions are not at all clear.
The purpose of this two-part article is to shed some light on the choice of these cost functions by deriving them using Maximum Likelihood Estimation (MLE). Part I will focus on deriving MSE while Part II will focus on deriving Cross Entropy.

Background

Given a training set of nn examples (x(1),y(1)),(x(2),y(2)),(x(n),y(n))(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots (x^{(n)}, y^{(n)}) where x(i)x^{(i)} is the feature vector for the ithi^{th} training example and y(i)y^{(i)} is its target, the goal of supervised learning is to learn a model f:XYf: \mathcal{X} \to \mathcal{Y}, a mapping that given xXx \in \mathcal{X} outputs a prediction y^Y\hat y \in \mathcal{Y}. X\mathcal X and Y\mathcal Y are called the input space and output space respectively.

In order to measure how well the model fits our training data, we define a loss function. For training example (x(i),y(i))(x^{(i)}, y^{(i)}), the loss L(y(i),y^(i))\mathcal{L(y^{(i)}, \hat y^{(i)})} measures how different the model’s prediction y^(i)\hat y^{(i)} is from the true label or value. The loss is calculated for all training examples, and its average taken. This value is called the cost function and is given by:

Cost=1ni=1nL(y(i),y^(i))(1) \tag{1} \begin{aligned} Cost = \frac{1}{n}\sum_{i=1}^n\mathcal{L} \bigg( y^{(i)}, \hat y^{(i)} \bigg) \end{aligned}

The model ff usually has some unknown parameter θ\theta (In general, θ\theta is a vector of parameters) which we will try to estimate using the training set. We can frame supervised learning as an optimisation problem - that is, we estimate the value of θ\theta by picking the value that minimises the cost function we chose for our problem. Let us call this parameter estimate θ^\hat \theta.

θ^=arg minθ1ni=1nL(y(i),y^(i))(2) \tag{2} \begin{aligned} \hat \theta = \underset{\theta}{\operatorname{arg\,min}} \frac{1}{n}\sum_{i=1}^n\mathcal{L}\bigg(y^{(i)}, \hat y^{(i)}\bigg) \end{aligned}

What is Maximum Likelihood Estimation?

Maximum Likelihood Estimation (MLE) is a method of estimating the unknown parameter θ\theta of a model, given observed data. It estimates the model parameter by finding the parameter value that maximises the likelihood function. The parameter estimate is called the maximum likelihood estimate θ^MLE\hat{\theta}_{MLE}.

Likelihood Function

Given a set of independent and identically distributed (i.i.d)1 random variables X1,X2,,XnX_1, X_2, \cdots, X_n from a probability distribution PθP_\theta with parameter θ\theta. We assume the random variables have a joint probability density function (or joint probability mass function in the case of discrete variables)2 f(x1,x2,,xnθ)f(x_1, x_2,\cdots, x_n|\theta). Suppose x1,x2,,xnx_1, x_2,\cdots, x_n are the observed values of the random variables, we define the likelihood function, a function of parameter θ\theta as:

L(θx1,x2,,xn)=f(x1,x2,,xnθ)(3) \tag{3} \begin{aligned} \mathcal{L}(\theta | x_1, x_2, \cdots, x_n) = f(x_1, x_2, \cdots, x_n|\theta) \end{aligned}

The likelihood function measures how plausible it is that the observed data was generated by the model with a particular value of θ\theta. For example, if we use θ1\theta_1 and θ2\theta_2 as values of θ\theta and find that L(θ1x1,,xn)\mathcal{L(\theta_1 | x_1, \cdots, x_n)} > L(θ2x1,,xn)\mathcal{L(\theta_2 | x_1, \cdots, x_n)}, we can reasonably conclude that the observed data is more likely to have been generated by the model with its parameter being θ1\theta_1.

Remember that we assumed that the observed data x1,x2,,xnx_1, x_2,\cdots, x_n were drawn i.i.d from the probability distribution. Also recall that for independent random variables X1X_1 and X2X_2, f(x1,x2θ)=f(x1θ)f(x2θ)f(x_1,x_2|\theta) = f(x_1|\theta) \cdot f(x_2|\theta). Considering these, our likelihood function in equation becomes:

L(θx1,x2,,xn)=f(x1θ)f(x2θ)f(xnθ)=i=1nf(xiθ)(4) \tag{4} \begin{aligned} \mathcal{L(\theta | x_1, x_2, \cdots, x_n)} & = f(x_1|\theta)\cdot f(x_2|\theta)\cdots\cdot f(x_n|\theta) \\ & = \prod_{i=1}^n f(x_i|\theta) \end{aligned}

where f(xiθ)f(x_i|\theta) is the probability density (or mass for discrete variables) function of the random variable XiX_i. Since all the random variables are drawn from the same distribution, their probability density (or mass) function will be the same.

It is often easier to work with the natural logarithm of the likelihood function, called the log-likelihood. Recalling the product rule of logarithms, log(ab)=log(a)+log(b)log(a \cdot b) = log(a) + log(b). If we apply the product rule to equation by taking its natural logarithm, it becomes:

log(L(θx1,x2,,xn))=log(f(x1θ)f(x2θ)f(xnθ))=log(f(x1θ))+log(f(x2θ))++log(f(xnθ))=i=1nlog(f(xiθ))(5) \tag{5} \begin{aligned} \log \bigg(\mathcal{L(\theta | x_1, x_2, \cdots, x_n)}\bigg) & = \log \bigg( f(x_1|\theta)\cdot f(x_2|\theta)\cdots\cdot f(x_n|\theta) \bigg) \\ & = \log(f(x_1|\theta)) + \log(f(x_2|\theta)) + \cdots + \log (f(x_n|\theta)) \\ & = \sum_{i=1}^n \log \bigg(f(x_i|\theta) \bigg) \end{aligned}

To find the value of θ\theta that maximises the likelihood function, we find its critical point, the point at which the function’s derivative is 00. That is:

L(θ)θ=0(6) \tag{6} \begin{aligned} \frac{\partial \mathcal{L(\theta)}}{\partial \theta} = 0 \end{aligned}

In most cases, this derivative is easier to compute for the log-likelihood function in equation than for the vanilla likelihood function in equation «a class=”link” id=”4”>(4)</a>. Since logarithm is a monotononically increasing function, that is, if x1>x2x_1 > x_2, then log(x1)>log(x2)\log(x_1) > log(x_2), the value that maximises the likelihood is also the value that maximises the log-likelihood function.
The other advantage of using log-likehood over likelihood is that it avoids numerical precision issues. Likelihoods are very small numbers, and taking the product of small numbers creates even smaller numbers, which can cause arithmetic underflow.

In other words,

θ^MLE=arg maxθ L(θx1,,xn)=arg maxθ log(L(θx1,,xn))(7) \tag{7} \begin{aligned} \hat{\theta}_{MLE} = \underset{\theta}{\operatorname{arg\,max}}\ \mathcal L(\theta\,| x_1, \cdots, x_n) = \underset{\theta}{\operatorname{arg\,max}}\ \log (\mathcal L(\theta\,| x_1, \cdots, x_n)) \end{aligned}

Deriving Mean Squared Error(MSE) using MLE

Mean Squared Error is the cost function commonly used for regression. Given a set of nn traning examples of the form (x(i),y(i))(x^{(i)}, y^{(i)}), MSE is given by:

MSE=1ni=1n(y(i)y^(i))2(8) \tag{8} \begin{aligned} MSE = \frac{1}{n} \sum_{i=1}^n\bigg(y^{(i)} - \hat y^{(i)}\bigg)^2 \end{aligned}

where x(i)x^{(i)} is the feature vector, y(i)y^{(i)} is the true output value, and y^(i)\hat y^{(i)} is the regression model’s prediction for the ithi^{th} training example.

In this section, we will derive MSE using maximum likelihood estimation.

Regression Model

Given a vector of predictor variables X=(X1,X2,Xp)X = (X_1, X_2, \cdots X_p), and quantitative outcome variable YY, linear regression assumes that there is a linear relationship between the population mean of the outcome and the predictor variables. This relationship can be expressed by:

E(YX)=β0+β1X1++βpXp(9) \tag{9} \begin{aligned} \mathbb{E}(Y|X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p \end{aligned}

E(YX)\mathbb{E}(Y|X) measures the population mean of YY given a particular value of XX and it is often called the true regression line. Now, our observed data will not lie exactly on this true regression line. They will be spread about the true regression line. For example, if we are trying to predict height from age of a population, we will find that people of the same age will have different heights. Each person's height will differ from the population mean E(YX)\mathbb{E}(Y|X) by a certain amount. We represent this mathematically as:

Y=E(YX)+ϵ=β0+β1X1++βpXp+ϵ(10) \tag{10} \begin{aligned} Y & = \mathbb{E}(Y|X) + \epsilon \\ & = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon \end{aligned}

The spread about the true regression line is what the ϵ\epsilon term captures. We assume it is independent of XX and is drawn from a Normal distribution with zero mean (μ\mu = 0) and variance σ2\sigma^2, i.e. ϵN(0,σ2)\epsilon \sim \mathcal{N}(0, \sigma^2).

Estimating linear regression’s model parameters

The true regression line and its model parameters β0,β1,βp\beta_0, \beta_1, \cdots \beta_p in equation are unknown, so we will try to estimate them using training data (x(1),y(1)),(x(2),y(2)),(x(n),y(n))(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots (x^{(n)}, y^{(n)}), where x(i)x^{(i)} is a vector of pp predictor variables (x1,x2,,xp)(x_1, x_2, \cdots, x_p) and y(i)y^{(i)} is the target value. Our model (the estimate of the true regression line) is:

y(i)=β0+β1x1(i)++βpxp(i)+ϵ=βx(i)+ϵ=y^(i)+ϵ(11) \tag{11} \begin{aligned} y^{(i)} & = \beta_0 + \beta_1 x^{(i)}_1 + \cdots + \beta_p x^{(i)}_p + \epsilon \\ & = \beta^\intercal x^{(i)} + \epsilon \\ & = \hat y^{(i)} + \epsilon \end{aligned}

You will notice that y^(i)\hat y^{(i)} in equation is the estimate of E(YX)\mathbb{E}(Y|X) in equation . The ϵ\epsilon term is called the residual and it measures the difference between the observed value y(i)y^{(i)} and the predicted value y^(i)\hat y^{(i)} from the regression equation.

Estimating β0,β1,βp\beta_0, \beta_1, \cdots \beta_p using the training data is an optimisation problem that we can solve using MLE by defining a likelihood function. Since ϵ\epsilon is normally distributed ϵN(0,σ2)\epsilon \sim \mathcal{N}(0, \sigma^2), our outcome variable yy will also be normally distributed. So, yN(βx,σ2)y \sim \mathcal{N}(\beta^\intercal x, \sigma^2) The probability density function of the normal distribution (parameterised by μ\mu: mean, and σ2\sigma^2: variance) is given by:

f(yμ,σ2)=12πσ2e(yμ)22σ2(12) \tag{12} \begin{aligned} f(y \mid \color{red}\mu, \color{black}\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2} } e^{ -\frac{(y-\color{red}\mu\color{black})^2}{2\sigma^2} } \end{aligned}

Therefore, if we replace the parameter μ\mu with the mean of yy, we get:

f(yβx,σ2)=12πσ2e(yβx)22σ2(13) \tag{13} \begin{aligned} f(y \mid \color{red}\beta^\intercal x \color{black}, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2} } e^{ -\frac{(y-\color{red}\beta^\intercal x\color{black})^2}{2\sigma^2} } \end{aligned}

The log-likelihood function will then be:

log(L(βx(1),x(2),,x(n)))=i=1nlog(12πσ2e(yβx(i))22σ2)=nlog(12πσ2)i=1n(yβx(i))22σ2(14) \tag{14} \begin{aligned} \log \bigg(\mathcal{L(\beta | x^{(1)}, x^{(2)}, \cdots, x^{(n)})}\bigg) & = \sum_{i=1}^n \log \bigg( \frac{1}{\sqrt{2\pi\sigma^2} } e^{ -\frac{(y-\beta^\intercal x^{(i)})^2}{2\sigma^2} } \bigg) \\ & = n \log \bigg( \frac{1}{\sqrt{2\pi\sigma^2} } \bigg) - \sum_{i=1}^n \frac{(y-\beta^\intercal x^{(i)})^2}{2\sigma^2} \end{aligned}

Since we need to take the derivative of log-likehood function with respect to β\beta to find the maximum likehood estimate of β\beta, we can remove all the terms that do not contain our parameter β\beta as they do not have any effect on our optimisation 3, so our equation becomes:

log(L(βx(1),x(2),,x(n)))=i=1n(yβx(i))2=i=1n(yy^(i))2(15) \tag{15} \begin{aligned} \log \bigg(\mathcal{L(\beta | x^{(1)}, x^{(2)}, \cdots, x^{(n)})}\bigg) & = - \sum_{i=1}^n (y-\beta^\intercal x^{(i)})^2 \\ & = - \sum_{i=1}^n (y-\hat y^{(i)})^2 \end{aligned}

The value of β\beta that maximises equation is the maximum likelihood estimate β^MSE\hat \beta_{MSE}. That is,

β^MSE=arg maxβ[i=1n(yy^(i))2](16) \tag{16} \begin{aligned} \hat \beta_{MSE} = \underset{\beta}{\operatorname{arg\,max}} \bigg[ -\sum_{i=1}^n(y-\hat y^{(i)})^2 \bigg] \end{aligned}

Recall that maximising a function is the same as minimising its negative, so we can rewrite equation as

β^MSE=arg maxβ[i=1n(yy^(i))2]=arg minβi=1n(yy^(i))2(17) \tag{17} \begin{aligned} \hat \beta_{MSE} & = \underset{\beta}{\operatorname{arg\,max}} \bigg[ -\sum_{i=1}^n(y-\hat y^{(i)})^2 \bigg] \\ & = \underset{\beta}{\operatorname{arg\,min}} \color{red}\sum_{i=1}^n(y-\hat y^{(i)})^2 \end{aligned}

Taking the average across all nn training examples, we get:

β^MSE=arg minβ1ni=1n(yy^(i))2(18) \tag{18} \begin{aligned} \hat \beta_{MSE} = \underset{\beta}{\operatorname{arg\,min}} \color{red} \frac{1}{n} \sum_{i=1}^n(y-\hat y^{(i)})^2 \end{aligned}

which is exactly the mean squared error (MSE) cost function as defined in equation .

Conclusion

This article introduced the concept of maximum likehood estimation, likehood function, and showed the derivation of mean squared error (MSE). The second part will cover the derivation of cross entropy using maximum likelihood estimation.

Footnotes

  1. A set of random variables X1,X2,,XnX_1, X_2, \cdots, X_n are said to be independent and identically distributed if they have the same probability distribution and are mutually independent. Wikipedia Article 

  2. This note from the Introduction to Probability and Statistics class on MIT OpenCourseWare explains joint probability mass and density functions clearly. 

  3. The terms without the parameter β\beta are regarded as constants in the log-likelihood function, and differentiating a constant gives us zero.