A classification task predicts a class label. Most models output a probability, which is turned into a label with a threshold, and most of the time that threshold is 0.5: a probability larger than 0.5 maps to one class (1), and a probability smaller than 0.5 maps to the other class (0).
However, some classification problems have severe class imbalance, and the default threshold of 0.5 can then result in terrible performance. In that case, we need to optimise the threshold to improve the model's performance.
Reasons the default threshold may be a poor choice:
- The predicted probabilities are not calibrated
- The metric used to train the model is different from the metric used to evaluate it
- The class distribution is severely skewed
- One type of misclassification is more costly than the other
Probability vs classes
Machine learning and deep learning models are able to output probabilities. However, most of the time we care about the actual class/label. Therefore, a decision boundary (a threshold on the probability) is required to turn one into the other.
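As a minimal sketch of that decision boundary, the snippet below turns positive-class probabilities into 0/1 labels. The helper name `to_labels` and the example probabilities are made up for illustration, and treating a probability exactly equal to the threshold as class 1 is an assumption, since the text does not say how ties are handled.

```python
import numpy as np

def to_labels(probs, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels.

    Assumption: a probability equal to the threshold counts as class 1.
    """
    return (np.asarray(probs) >= threshold).astype(int)

# Illustrative probabilities, not taken from any real model.
probs = np.array([0.10, 0.40, 0.55, 0.90])

labels_default = to_labels(probs)              # threshold 0.5
labels_low = to_labels(probs, threshold=0.3)   # lower threshold, more positives
```

With the default threshold only the last two samples are labelled 1. Lowering the threshold to 0.3 also moves the 0.40 sample into class 1, which shows how the threshold alone changes the predicted labels while the probabilities stay the same.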
There are different ways to optimise for the best threshold. Say, you can optimise for the best accuracy. This is simple. All you need to do is sweep over a set of candidate thresholds, count the true positives and false positives at each one, and calculate the accuracy from those counts, which gives a list of accuracies. Finally, pick the threshold whose tp and fp counts give the maximum accuracy.
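The sweep can be sketched as follows. This is one possible implementation, not a definitive one: the function name `best_accuracy_threshold`, the grid of 101 evenly spaced thresholds, and the toy data are all assumptions made for illustration. Accuracy is computed from the tp and fp counts, using the fact that the true negatives are the negatives that were not flagged as false positives.

```python
import numpy as np

def best_accuracy_threshold(y_true, probs, thresholds=None):
    """Return (threshold, accuracy) for the candidate with the highest accuracy.

    Assumptions: labels are 0/1, probs are positive-class probabilities,
    and the default candidate grid is 0.00, 0.01, ..., 1.00.
    """
    y_true = np.asarray(y_true)
    probs = np.asarray(probs)
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)
    thresholds = np.asarray(thresholds, dtype=float)

    n_neg = np.sum(y_true == 0)
    accuracies = []
    for t in thresholds:
        pred_pos = probs >= t
        tp = np.sum(pred_pos & (y_true == 1))   # true positives at this threshold
        fp = np.sum(pred_pos & (y_true == 0))   # false positives at this threshold
        tn = n_neg - fp                         # negatives that were not flagged
        accuracies.append((tp + tn) / len(y_true))

    accuracies = np.array(accuracies)
    best = int(np.argmax(accuracies))           # first maximum wins on ties
    return float(thresholds[best]), float(accuracies[best])

# Toy imbalanced data: four negatives, two positives with low scores.
y_true = [0, 0, 0, 0, 1, 1]
probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45]
threshold, accuracy = best_accuracy_threshold(y_true, probs)
```

On this toy data the default threshold of 0.5 labels every sample as class 0 and gets 4 of 6 correct, while a threshold just above 0.30 classifies all six samples correctly. In practice the threshold should be chosen on held-out data rather than the training set, and on heavily skewed classes accuracy can be swapped for another metric inside the same loop.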