How to evaluate your classification machine learning algorithm
Machine learning is a subset of artificial intelligence (AI) that allows computers to learn and improve independently without explicit programming. The global machine learning market is forecast to reach $8.81 billion in 2022 as algorithms continue to automate operations and improve efficiencies.
One of the most challenging aspects of machine learning is evaluating a model's performance. We need to understand what success looks like and when to call it a day on training and evaluation. Moreover, teams must use the right metrics to evaluate the model correctly.
This article looks at the most common metrics to review the performance of the machine and deep learning models.
Classification parameters
Four types of outcomes could occur when you build a classification model.
True positives occur when you anticipate that an observation belongs to a particular class, and it does.
You have a true negative when you forecast that an observation does not belong to a class, and it truly does not.
False positives arise when you incorrectly forecast that an observation belongs to a particular class when it does not.
False negatives occur when you incorrectly forecast that an observation does not belong to a particular class when it does.
A confusion matrix helps to visualize those parameters, and we can then look at evaluation metrics based on this binary problem.
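As a minimal sketch of how these four counts can be extracted in practice, the snippet below uses scikit-learn's confusion_matrix on a small set of made-up binary labels (y_true and y_pred are purely illustrative):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=4, FP=2, FN=1
```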
Model accuracy
The percentage of correct predictions on the test data is known as accuracy. It's simple to calculate: divide the number of correct predictions by the total number of predictions. Accuracy works well when your classes are balanced and the numbers of false positives and false negatives are similar. Otherwise, you have to use other measures for a more holistic evaluation of model performance.
Accuracy = (True positives + True negatives) / (True positives + False positives + False negatives + True negatives)
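As a quick sketch, assuming scikit-learn is available and reusing the made-up labels from the confusion-matrix example above:

```python
from sklearn.metrics import accuracy_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# (TP + TN) / (TP + TN + FP + FN) = (3 + 4) / 10
print(accuracy_score(y_true, y_pred))  # 0.7
```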
Model efficiency
Model efficiency is better known as recall or sensitivity. It measures how many of the actual positive instances the model correctly identifies. If we train our model to label pictures of fruit, recall tells us how many of the genuine fruit images we correctly labelled as fruit.
Recall = True positives / (True positives + False negatives)
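A minimal sketch using the same made-up labels, this time with scikit-learn's recall_score:

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# TP / (TP + FN) = 3 / (3 + 1)
print(recall_score(y_true, y_pred))  # 0.75
```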
Precision
The precision metric is the ratio of correctly predicted positive observations to the total predicted positive observations. In the fruit example, precision tells you how many of the images labelled as fruit are actually fruit. When precision is high, it points to a low false positive rate.
Precision = True positives / (True positives + False positives)
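Again as a sketch with the same made-up labels, using scikit-learn's precision_score:

```python
from sklearn.metrics import precision_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# TP / (TP + FP) = 3 / (3 + 2)
print(precision_score(y_true, y_pred))  # 0.6
```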
Specificity
Specificity is like recall but focuses on negative instances. It is the ratio of correctly predicted negative instances to the total number of actual negative instances. For example, it tells you, of all the patients who genuinely do not have cancer, how many the model correctly identifies as cancer-free.
Specificity = True negatives / (True negatives + False positives)
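scikit-learn has no dedicated specificity function, so one option, sketched with the same made-up labels as above, is to derive it from the confusion matrix:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
# TN / (TN + FP) = 4 / (4 + 2)
print(tn / (tn + fp))  # ~0.67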
F1 Score
The F1 score combines recall and precision by taking their harmonic mean, so it accounts for both false positives and false negatives. While the F1 score is less intuitive than accuracy, it is far more insightful when you have uneven class distributions. Accuracy is best when false positives and false negatives carry a similar cost; precision and recall are most useful when those costs differ significantly.
F1 Score = 2*(Recall * Precision)/(Recall + Precision)
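One more sketch, again assuming scikit-learn and the same illustrative labels:

```python
from sklearn.metrics import f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# 2 * (recall * precision) / (recall + precision) = 2 * (0.75 * 0.6) / 1.35
print(f1_score(y_true, y_pred))  # ~0.67
```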
PR Curve vs ROC Curve
A precision-recall (PR) curve plots precision on the y-axis and recall on the x-axis. It visualizes the trade-off between the true positive rate (recall) and positive predictive power (precision).
Assume you have 100 records in your dataset, 90 of which belong to class 0 and only 10 to class 1. Even a haphazard guess of the majority class 0 would then be 90% accurate. That is why we use precision- and recall-related metrics to evaluate the minority class properly: both metrics focus on the positive class, which we declare to be the minority class while training the model.
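As a rough sketch of how the curve is computed, scikit-learn's precision_recall_curve takes the true labels and the model's scores; the labels and scores below are made up to mimic an imbalanced problem:

```python
from sklearn.metrics import precision_recall_curve

# Made-up imbalanced labels (mostly class 0) and made-up model scores
y_true   = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.70, 0.60, 0.80, 0.90]

# One precision/recall pair per decision threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
for p, r in zip(precision, recall):
    print(f"precision={p:.2f}, recall={r:.2f}")
```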
The receiver operating characteristic (ROC) curve plots the true positive rate (y-axis) against the false positive rate (x-axis).
The curve shows the true positive rate against the false positive rate at different classification thresholds. When you lower the threshold, you classify more items as positive, which increases both true positives and false positives. The area under the ROC curve (AUC) provides an aggregate measure of performance across all classification thresholds.
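A minimal sketch of both quantities, assuming scikit-learn and the same made-up labels and scores as in the PR example:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true   = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.70, 0.60, 0.80, 0.90]

# False positive rate and true positive rate at each threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
# Single aggregate score across all thresholds
print(roc_auc_score(y_true, y_scores))  # ~0.95 with these made-up scores
```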
Both the PR and ROC curves are useful, depending on the problem. Because true negatives do not appear in the precision or recall formulas, PR curves work well for imbalanced classes. Because true negatives do factor into the ROC calculation, the ROC curve is more insightful when both classes are important.
Summary
Machine learning evaluation metrics depend on the context of your task. It is essential to consider the parameters outlined in this article to understand how a model performs. Despite their significant impact, machine learning models are frequently deployed without adequate evaluation or a proper grasp of their capabilities and limits.
A deep understanding of evaluating and testing machine learning models is integral for constructing practical and insightful applications that operate as expected.