Adjusting Model Probabilities for Imbalanced Datasets
Requirements Clarification & Assessment
Understanding the Problem:
Nature of Data: Binary classification with a highly imbalanced dataset where 99.8% of the samples have an outcome of 0, and only 0.2% have an outcome of 1.
Objective: Adjust model probabilities to reflect the original class distribution after training on a down-sampled dataset.
Key Assumptions:
The down-sampling strategy involves retaining all positive samples and only 1% of the negative samples.
The business objective requires accurate probability estimates in the context of the original imbalanced data.
Precision and Recall are weighted equally when evaluating the model (for example, via the F1 score).
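To make the effect of these assumptions concrete, the sketch below works out the class balance of the down-sampled training set from the figures stated above (0.2% positives, all positives kept, 1% of negatives kept). The function and parameter names are illustrative only, and the calculation assumes negatives are dropped uniformly at random.

```python
def downsampled_positive_rate(pos_rate, neg_keep_rate, pos_keep_rate=1.0):
    """Fraction of positives in the training set after down-sampling.

    pos_rate      -- positive fraction in the original data (0.002 here)
    neg_keep_rate -- fraction of negatives retained (0.01 here)
    pos_keep_rate -- fraction of positives retained (1.0: all kept)
    """
    kept_pos = pos_rate * pos_keep_rate
    kept_neg = (1.0 - pos_rate) * neg_keep_rate
    return kept_pos / (kept_pos + kept_neg)


rate = downsampled_positive_rate(pos_rate=0.002, neg_keep_rate=0.01)
print(f"Positive rate after down-sampling: {rate:.4f}")  # roughly 0.167
```

Under these assumptions the model is trained on data where about one sample in six is positive, rather than one in five hundred, which is why its raw probabilities overstate the positive class on the original data.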
Constraints:
Predicted probabilities must be recalibrated to reflect the original data distribution.
The solution must be computationally feasible and practical to implement.
Clarifying Questions:
What is the business impact of false positives versus false negatives?
Which performance metrics, if any, are prioritized?
Is domain-specific misclassification cost information available?