Demystifying Metrics and Evaluation in Machine Learning

Arthur Adinayev
11 min read · Apr 27, 2023


This article covers metrics and evaluation in machine learning, including the difference between regression and classification metrics. It also introduces the concept of loss functions, which measure the error of a model’s predictions and can be used for gradient-based training. While loss functions are designed to be optimized with gradient descent, metrics prioritize providing an intuitive understanding of a model’s effectiveness.

To put it simply, although loss functions and metrics serve similar purposes, they are different tools. Some metrics cannot be used as loss functions because they are not differentiable, and while some loss functions are easy for humans to interpret, that is not their goal: metrics prioritize an intuitive explanation of model performance, while loss functions are designed to be optimized by gradient descent.
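
As a minimal illustration with hypothetical numbers (the labels and scores below are made up), nudging a model’s raw scores slightly changes a smooth loss like MSE, while a thresholded metric like accuracy stays exactly the same, which is why accuracy gives no useful gradient signal:

import numpy as np

# Hypothetical labels and model scores
y_true = np.array([1, 0, 1, 1])
scores = np.array([0.9, 0.4, 0.6, 0.7])
nudged = scores + 1e-3  # a tiny "parameter update"


def mse_loss(s):
    return np.mean((y_true - s) ** 2)


def accuracy_metric(s):
    return np.mean((s >= 0.5) == y_true)  # thresholding makes this non-differentiable


print(mse_loss(scores), mse_loss(nudged))                # changes smoothly with the scores
print(accuracy_metric(scores), accuracy_metric(nudged))  # identical: flat almost everywhere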

Mean Absolute Error

The Mean Absolute Error (MAE) is a commonly used regression metric that measures the average absolute difference between a model’s predicted values and the true values. It provides a simple way to evaluate how accurate a model’s predictions are by taking into account the absolute differences between the predicted and true values.

MAE = (1/n) · Σ |y_true − y_pred|
import numpy as np


def mean_absolute_error(y_true, y_pred):
    # Average absolute difference between the true values and the predictions
    assert y_pred.shape == y_true.shape
    return np.sum(np.abs(y_true - y_pred)) / len(y_pred)

Imagine a class of 30 students taking an exam that is scored out of 100 points. The teacher predicted the exam scores for each student based on their past performance in class. However, when the actual exam scores were returned, the teacher realized that their predictions were not accurate.

To assess the accuracy of their predictions, the teacher calculated how far each predicted score was from the student’s actual score. Since they wanted a more general idea of how wrong their predictions were overall, they took the average of the absolute differences between the predicted and actual scores. This is the Mean Absolute Error: a single value that represents the average difference between the predicted and actual scores across all students in the class.
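
Here is a small sketch with hypothetical scores for five of the students, just to make the calculation concrete:

import numpy as np

# Hypothetical predicted vs. actual scores for 5 of the 30 students
predicted = np.array([78, 85, 62, 90, 70])
actual = np.array([74, 88, 55, 95, 71])

absolute_errors = np.abs(actual - predicted)  # [4, 3, 7, 5, 1]
print(absolute_errors.mean())                 # MAE = 4.0, i.e. off by 4 points on average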

This method is straightforward and easily understood: it tells us the average difference between the labels and the model’s predictions. However, one major issue is that MAE does not single out outlier predictions; it treats all prediction errors equally, regardless of their magnitude. This can be a problem because a large prediction error can hurt the usefulness of the model far more than a small one. For example, a model that is off by 50–55% or more is probably not very useful, while a model that is off by only 1–6% may still be considered reasonably accurate.

Implementation in one line using the sklearn library:

from sklearn.metrics import mean_absolute_error


mean_absolute_error(y_true, y_pred)

Mean Squared Error

As the name suggests, the Mean Squared Error (MSE) raises the difference between the model’s predictions and the ground-truth labels to the second power instead of taking the absolute value:

MSE = (1/n) · Σ (y_true − y_pred)²

Squaring the errors penalizes large mistakes more heavily, which addresses MAE’s blind spot for outliers. However, we lose some intuition about how exactly the model is incorrect: because we squared the differences instead of taking absolute values, the score no longer lives on the scale of the data. An average error of 127.8, for example, is impossible for a test that is scored between 0 and 100.

import numpy as np


def mean_squared_error(y_true, y_pred):
    # Average squared difference between the true values and the predictions
    assert y_pred.shape == y_true.shape
    return np.sum((y_true - y_pred) ** 2) / len(y_pred)

To counter this issue, we can take the square root of the result to “reverse” the squaring, putting the error back on the scale of the data while still emphasizing outlier predictions. This metric is the Root Mean Squared Error (RMSE):

RMSE = √( (1/n) · Σ (y_true − y_pred)² )
import numpy as np


def root_mean_squared_error(y_true, y_pred):
    # Square root of the MSE, reusing the mean_squared_error function defined above
    assert y_pred.shape == y_true.shape
    return mean_squared_error(y_true, y_pred) ** 0.5

Implementation in one line using the sklearn library:

from sklearn.metrics import mean_squared_error


mean_squared_error(y_reg_true, y_reg_pred)         # MSE

mean_squared_error(y_reg_true, y_reg_pred) ** 0.5  # RMSE
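
As a quick sanity check, here is a sketch with hypothetical exam scores on the 0–100 scale (the values are made up; the variable names simply mirror the snippets above) showing how the three metrics relate:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical exam scores on a 0-100 scale
y_reg_true = np.array([74, 88, 55, 95, 71])
y_reg_pred = np.array([60, 85, 80, 90, 71])

print(mean_absolute_error(y_reg_true, y_reg_pred))        # MAE  = 9.4
print(mean_squared_error(y_reg_true, y_reg_pred))         # MSE  = 171.0, off the 0-100 scale
print(mean_squared_error(y_reg_true, y_reg_pred) ** 0.5)  # RMSE ~ 13.1, back on the original scale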

MAE, MSE, and RMSE are among the most commonly used metrics for regression tasks.

Confusion Matrix

Just as regression metrics build on the differences between predictions and labels, many classification metrics build on the Confusion Matrix. A Confusion Matrix provides a useful visualization of how much error is made and what type of error occurs in a classification task. By using it, we can better understand the performance of a classification model and identify the specific areas where it may need to be improved.

A Confusion Matrix is a tool used to evaluate the performance of a model by comparing its predictions to the actual labels. It consists of four components: true positives, false positives, true negatives, and false negatives. Each component is described by two words. The second word refers to the model’s prediction (“positive” for a predicted 1 and “negative” for a predicted 0), while the first word indicates whether that prediction matches the ground truth (“true” if the prediction is correct and “false” if it is incorrect).

  • When the model prediction is 1 and the ground truth is 1, it’s a true positive.
  • When the model prediction is 1 and the ground truth is 0, it’s a false positive.
  • When the model prediction is 0 and the ground truth is 0, it’s a true negative.
  • When the model prediction is 0 and the ground truth is 1, it’s a false negative.
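
Here is a minimal sketch with hypothetical predictions and labels showing how to count these four cases in numpy; sklearn’s confusion_matrix function returns the same counts as a 2×2 array:

import numpy as np

# Hypothetical ground-truth labels and model predictions
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = ((y_pred == 1) & (y_true == 1)).sum()  # predicted 1, actually 1
fp = ((y_pred == 1) & (y_true == 0)).sum()  # predicted 1, actually 0
tn = ((y_pred == 0) & (y_true == 0)).sum()  # predicted 0, actually 0
fn = ((y_pred == 0) & (y_true == 1)).sum()  # predicted 0, actually 1

print(tp, fp, tn, fn)  # 3 1 3 1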

Accuracy

One of the simplest methods for assessing the performance of a classification model is by calculating its accuracy. Accuracy measures the percentage of values that the model has correctly predicted. In technical terms, accuracy is calculated by dividing the sum of true positives and true negatives by the total number of predictions.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy is a popular metric for evaluating classification models because it’s easy to understand and calculate. However, accuracy has a drawback when dealing with imbalanced data. Imbalanced data is when there are significantly more samples of one class than the other. In such cases, accuracy can be misleading because it doesn’t consider the imbalance in the data.

import numpy as np


def accuracy(y_true, y_pred):
    # Fraction of predictions that match the ground-truth labels
    assert y_true.shape == y_pred.shape
    return np.mean(y_true == y_pred)

For instance, if there are 100 samples, out of which 90 belong to the positive class and 10 to the negative class, a model that simply predicts all samples as positive will achieve an accuracy of 90%, even though it’s not a good model. On the other hand, a well-developed model that correctly predicts 81 out of 90 positive samples and 9 out of 10 negative samples is far superior. Therefore, accuracy alone is not a sufficient metric for evaluating classification models when dealing with imbalanced data.
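
A small sketch with hypothetical data mirrors the scenario above: a model that always predicts the positive class and a genuinely better model end up with exactly the same accuracy:

import numpy as np

# Hypothetical imbalanced data: 90 positive samples, 10 negative samples
y_true = np.array([1] * 90 + [0] * 10)

always_positive = np.ones_like(y_true)     # predicts 1 for every sample
print(np.mean(always_positive == y_true))  # 0.9

better_model = y_true.copy()
better_model[:9] = 0    # misses 9 of the 90 positives
better_model[90] = 1    # misses 1 of the 10 negatives
print(np.mean(better_model == y_true))     # also 0.9, despite being far more useful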

Implementation in one line using the sklearn library:

from sklearn.metrics import accuracy_score


accuracy_score(y_class_true, y_class_pred)

Precision

Precision is a metric that takes into account the problem of imbalanced classes in data. It measures the accuracy of the positive predictions made by the model, by calculating the ratio of true positives to the sum of true positives and false positives. Precision is useful in penalizing models that perform poorly on the positive class in datasets that are imbalanced and have a large number of negative values.

Precision = TP / (TP + FP)
import numpy as np


def precision(y_true, y_pred):
    # True positives divided by all positive predictions (TP + FP)
    assert y_true.shape == y_pred.shape
    return ((y_pred == 1) & (y_true == 1)).sum() / y_pred.sum()

It is particularly useful in datasets where the number of negative samples greatly outweighs the number of positive samples, such as in disease diagnosis. Precision provides meaningful insights into how accurate the model is when predicting the rare positive class.

Implementation in one line using the sklearn library:

from sklearn.metrics import precision_score


precision_score(y_class_true, y_class_pred)

Recall

Recall is a metric that focuses on the correctness of the positive class predictions made by a model. It measures the percentage of true positives (correctly predicted positive samples) across all positive labels. Recall is particularly useful in scenarios where the positive class is rare or critical to predict accurately, such as in medical diagnosis where false negatives (failing to identify a disease) can be life-threatening. In these cases, optimizing for high recall is crucial, even if it means accepting a higher rate of false positives.

When resources are limited and a perfect model is out of reach, avoiding false negatives is far more crucial than avoiding false positives, so we optimize for higher recall to make the model as reliable as possible on the positive class.

Recall = TP / (TP + FN)
import numpy as np


def recall(y_true, y_pred):
    # True positives divided by all positive labels (TP + FN)
    assert y_true.shape == y_pred.shape
    return ((y_pred == 1) & (y_true == 1)).sum() / y_true.sum()

Implementation in one line using the sklearn library:

from sklearn.metrics import recall_score


recall_score(y_class_true, y_class_pred)
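
To make the trade-off concrete, here is a small sketch on a hypothetical disease-screening dataset: a cautious model that flags too many patients has poor precision, but its perfect recall means no sick patient is missed:

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical screening data: 5 sick patients out of 100
y_true = np.zeros(100, dtype=int)
y_true[:5] = 1

# A cautious model that flags 20 patients, including all 5 who are sick
cautious = np.zeros(100, dtype=int)
cautious[:20] = 1

print(precision_score(y_true, cautious))  # 0.25, lots of false alarms
print(recall_score(y_true, cautious))     # 1.0, no sick patient is missed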

F1 Score

The F1 score combines precision and recall into a single metric that provides a more complete picture of the model’s performance. The harmonic mean is used because precision and recall are both rates, and the harmonic mean stays low whenever either of them is low, so a model cannot score well by excelling at one while neglecting the other. The F1 score gives equal weight to precision and recall, which is important when we want to balance the trade-off between them. For example, if we are diagnosing a rare disease, we may want to optimize for a high recall, but not at the expense of precision. The F1 score helps us find a balance between these two metrics.

F1 = 2 · Precision · Recall / (Precision + Recall)


def f1_score(y_true, y_pred):
    # Harmonic mean of precision and recall, using the functions defined above
    num = 2 * precision(y_true, y_pred) * recall(y_true, y_pred)
    denom = precision(y_true, y_pred) + recall(y_true, y_pred)
    return num / denom

Compared to looking at precision or recall alone, the F1 score gives a fuller picture of how well the positive class is being classified.

The F1 score can be extended to the F-beta score, which allows adjusting the importance given to precision and recall. The value of beta determines the weight or emphasis placed on each of these metrics.

F_beta = (1 + β²) · Precision · Recall / (β² · Precision + Recall)

Setting beta lower than 1 gives more weight to precision, meaning the precision score counts more heavily in the weighted harmonic mean. Setting beta to a value greater than 1 emphasizes the recall score instead. This makes the combination of precision and recall more flexible, since the emphasis can be adjusted to the situation.
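
As a rough sketch, the generalized score can be written on top of the precision and recall functions defined earlier (the name f_beta_score is just illustrative):

def f_beta_score(y_true, y_pred, beta=1.0):
    p = precision(y_true, y_pred)
    r = recall(y_true, y_pred)
    # beta < 1 weights precision more heavily, beta > 1 weights recall more heavily
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

sklearn also provides a ready-made fbeta_score function in sklearn.metrics that accepts the same beta argument.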

Implementation in one line using the sklearn library:

from sklearn.metrics import f1_score


f1_score(y_class_true, y_class_pred)

The F1 score is a useful metric for evaluating the accuracy of a model’s positive class predictions, but it has some limitations. One limitation is that it only measures the positive class and doesn’t provide information about how well the model is performing on the negative class. Another limitation is that it relies on a binary classification threshold, which may not be the optimal threshold for a given problem. This can result in the loss of important information about the model’s confidence in its predictions. Therefore, it’s important to consider these limitations when using the F1 score and other binary evaluation metrics.

ROC-AUC

ROC-AUC is the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate for various classification thresholds.

The true positive rate (TPR), also known as recall, is the proportion of actual positives that the model correctly classifies as positive. The false positive rate (FPR) is the proportion of actual negatives that the model incorrectly classifies as positive; it measures how often the model raises a false alarm. By computing the TPR and FPR at many different thresholds of the model’s predicted probabilities and plotting them against each other, we obtain the ROC curve. This curve helps us choose a suitable threshold and visualize the trade-off between the true positive rate and the false positive rate.

def get_tpr_fpr(y_pred, y_true):
    # Confusion-matrix components for hard (0/1) predictions
    tp = (y_pred == 1) & (y_true == 1)
    tn = (y_pred == 0) & (y_true == 0)
    fp = (y_pred == 1) & (y_true == 0)
    fn = (y_pred == 0) & (y_true == 1)

    tpr = tp.sum() / (tp.sum() + fn.sum())  # true positive rate (recall)
    fpr = fp.sum() / (fp.sum() + tn.sum())  # false positive rate

    return tpr, fpr

To plot the ROC curve, we need to compute the TPR and FPR for various threshold values. The function below selects a fixed number of thresholds, turns the model’s predicted probabilities into hard labels at each threshold, computes the TPR and FPR using the function above, and collects the results in two lists.

Once we have the TPR and FPR values for different thresholds, we can plot the ROC curve, where the x-axis represents FPR and the y-axis represents TPR. However, it is difficult to calculate the area under the curve (AUC) using integrals since we do not have an exact function that represents the curve for all TPR and FPR values. Instead, we can estimate the area by dividing the curve into rectangular sections and summing their areas. As we use more rectangles, the estimate gets closer and closer to the true area under the curve. The AUC provides a measure of how well the model distinguishes between classes, with a value of 1 indicating perfect separation and a value of 0.5 indicating random predictions.


def roc_curve(y_pred, y_true, n_thresholds=100):
    fpr_thresh = []
    tpr_thresh = []

    for i in range(n_thresholds + 1):
        # Turn the predicted probabilities into hard labels at this threshold
        y_pred_binary = (y_pred >= i / n_thresholds)
        tpr, fpr = get_tpr_fpr(y_pred_binary, y_true)
        fpr_thresh.append(fpr)
        tpr_thresh.append(tpr)

    return tpr_thresh, fpr_thresh



def area_under_roc_curve(y_true, y_pred):
    tpr, fpr = roc_curve(y_pred, y_true)
    auc = 0
    for k in range(len(fpr) - 1):
        # Each rectangle: width is the drop in FPR between thresholds, height is the TPR
        auc += (fpr[k] - fpr[k + 1]) * tpr[k]
    return auc
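
A short usage sketch with hypothetical scores (generated randomly so that they correlate with the labels) ties the two functions together and plots the resulting curve with matplotlib:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical labels and predicted probabilities that are correlated with them
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_pred = np.clip(y_true * 0.4 + rng.random(200) * 0.6, 0, 1)

tpr_thresh, fpr_thresh = roc_curve(y_pred, y_true)
print(area_under_roc_curve(y_true, y_pred))  # well above 0.5 (random guessing)

plt.plot(fpr_thresh, tpr_thresh, label="ROC curve")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()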

Implementation in one line using the sklearn library:

from sklearn.metrics import roc_auc_score


roc_auc_score(y_true, y_pred)  # y_pred should be predicted probabilities or scores, not hard labels

This article discussed various metrics and evaluation techniques used in machine learning, including MAE, MSE, RMSE, accuracy, precision, recall, the F1 score, the ROC curve, and AUC. It explained the intuition behind each metric, how to calculate it, and where its limitations lie. The aim was to demystify metrics and evaluation in machine learning, making them more accessible to beginners and practitioners alike.

Written by Arthur Adinayev

Physicist | Deep Learning Engineer
