Introduction to SMOTE (Synthetic Minority Oversampling Technique)
If you are not familiar with KNN, read this article first:
https://elginsi.substack.com/publish/posts/detail/148169614?referrer=%2Fpublish%2Fposts
Why use SMOTE?
You have a highly imbalanced dataset, where the number of minority-class samples is much smaller than the number of majority-class samples (minority ≪ majority). If the dataset is left unmodified, training on it can produce biased models that perform poorly when classifying the minority class.
Intuition
The simplest way to oversample is to randomly sample minority-class points, duplicate them, and add them back to the dataset. However, this can lead to overfitting, where the model becomes too specific to the training data and generalizes poorly to new data. The reason is that random oversampling adds no new information to the dataset.
Hence, SMOTE. Instead of duplicating points, SMOTE finds the K-nearest minority-class neighbors of each minority point and creates new synthetic points by interpolating between the point and one of those neighbors, then adds the synthetic points back to the dataset.
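The interpolation step can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not a production implementation; the function name `smote_sample` and its parameters are assumptions for this example.

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, rng=None):
    """Generate synthetic minority samples by interpolating between
    a randomly chosen minority point and one of its k nearest
    minority-class neighbors (the core SMOTE idea)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class only
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]    # indices of the k nearest neighbors
    synth = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(n)              # pick a random minority point
        b = nn[a, rng.integers(k)]       # pick one of its k neighbors
        gap = rng.random()               # interpolation fraction in [0, 1)
        synth[i] = X_min[a] + gap * (X_min[b] - X_min[a])
    return synth
```

Because every synthetic point lies on the segment between two real minority points, the new points stay inside the region the minority class already occupies, rather than being exact copies.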
Guidelines:
Perform the train-test split first, then apply SMOTE only to the training set to balance it. Otherwise, your test set may contain synthetically generated points that are not representative of real data, which leaks information and inflates your evaluation scores.