Monday, May 11, 2020

AI - Imbalanced data

  • Data

  • Metrics

    • Confusion matrix, precision, recall, F1

    • ROC

    • AUC PR

      • Maximize the area under the precision-recall curve for the less common class

    • Normalized Gini Coefficient

    • Business impact

      • E.g. reducing false positives may be more important than reducing false negatives
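
  • The metrics above can be sketched with scikit-learn; the labels and scores below are hypothetical toy values, just to show the calls:

    ```python
    from sklearn.metrics import (confusion_matrix, precision_score,
                                 recall_score, f1_score, roc_auc_score,
                                 average_precision_score)

    # Toy imbalanced labels/scores (hypothetical values for illustration)
    y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
    y_pred  = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
    y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.6, 0.4, 0.9, 0.45]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    p  = precision_score(y_true, y_pred)          # TP / (TP + FP)
    r  = recall_score(y_true, y_pred)             # TP / (TP + FN)
    f1 = f1_score(y_true, y_pred)                 # harmonic mean of p and r
    roc_auc = roc_auc_score(y_true, y_score)      # area under the ROC curve
    pr_auc  = average_precision_score(y_true, y_score)  # AUC of the PR curve
    ```

    With very skewed classes, accuracy is misleading, which is why these metrics (and especially the PR-based ones) are preferred.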

  • Algorithms

  • Resampling

    • Over/up-sampling minority class

      • Note

        • Always split into test and train sets before trying oversampling techniques

      • Options

        • Adding more copies of the minority class

        • Synthetic sampling

          • from imblearn.over_sampling import SMOTE

        • Adaptive sampling

          • Borderline-SMOTE

            • An improved variant of SMOTE that focuses on minority samples near the class boundary

        • ADASYN (adaptive synthetic sampling)
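
      • A minimal sketch of the oversampling options above, on a synthetic dataset from make_classification (the 9:1 split and random seeds are assumptions):

        ```python
        from collections import Counter
        from sklearn.datasets import make_classification
        from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN

        # Synthetic dataset with a roughly 9:1 class imbalance
        X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                                   random_state=42)

        # SMOTE synthesizes new minority samples along lines between
        # existing minority neighbors, balancing the classes
        X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

        # Borderline-SMOTE and ADASYN share the same interface:
        # BorderlineSMOTE(random_state=42).fit_resample(X, y)
        # ADASYN(random_state=42).fit_resample(X, y)
        ```

        Note this should only ever be applied to the training set, per the warning above.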

    • Under/down-sampling majority class

      • Options

        • Removing examples from the majority class

        • Tomek links

          • from imblearn.under_sampling import TomekLinks

        • Cluster centroids

          • from imblearn.under_sampling import ClusterCentroids
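
      • A minimal sketch of the two undersampling options above, again on a synthetic imbalanced dataset (dataset parameters are assumptions):

        ```python
        from collections import Counter
        from sklearn.datasets import make_classification
        from imblearn.under_sampling import TomekLinks, ClusterCentroids

        X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                                   random_state=42)

        # Tomek links: drop majority samples that form cross-class
        # nearest-neighbor pairs, cleaning the class boundary
        X_tl, y_tl = TomekLinks().fit_resample(X, y)

        # Cluster centroids: replace the majority class with the
        # centroids of a k-means clustering, down to the minority count
        X_cc, y_cc = ClusterCentroids(random_state=42).fit_resample(X, y)
        ```

        Tomek links only removes the ambiguous boundary samples, so the classes stay unbalanced; cluster centroids balances them exactly.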

    • Over-sampling followed by under-sampling

      • Options

        • SMOTE and Tomek links

          • from imblearn.combine import SMOTETomek
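
        • A sketch of the combined approach, under the same assumed synthetic dataset:

          ```python
          from collections import Counter
          from sklearn.datasets import make_classification
          from imblearn.combine import SMOTETomek

          X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                                     random_state=42)

          # First oversample the minority class with SMOTE, then remove
          # Tomek links to clean up the boundary between the classes
          X_res, y_res = SMOTETomek(random_state=42).fit_resample(X, y)
          ```

          The result is roughly balanced: SMOTE equalizes the counts and the Tomek-link cleaning then removes a small number of boundary samples.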

    • Stratified sampling

      • Removes the variance in class proportions across batches/splits

      • Can be useful when batch training a classifier
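
      • A minimal sketch with scikit-learn's train_test_split; the stratify argument keeps the class ratio identical in both splits (the 9:1 toy data is an assumption):

        ```python
        import numpy as np
        from sklearn.model_selection import train_test_split

        # Toy 9:1 imbalanced labels
        y = np.array([0] * 900 + [1] * 100)
        X = np.arange(1000).reshape(-1, 1)

        # stratify=y preserves the 9:1 ratio in train and test
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=42)
        ```

        For batch training, StratifiedKFold applies the same idea to cross-validation folds.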

  • Outliers

    • Interquartile Range (IQR)

      • Calculated as the difference between the 75th and the 25th percentile. The aim is to set thresholds some distance beyond the 75th and 25th percentiles; any instance falling outside these thresholds is treated as an outlier and removed.

  • Features

    • If the majority-class and minority-class data are well separated, the imbalance itself is not a problem

      • from sklearn.manifold import TSNE 

      • from sklearn.decomposition import PCA, TruncatedSVD
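
      • A sketch of checking separability by projecting the features to 2D with the imports above (the synthetic dataset is an assumption; in practice you would plot the projections colored by class and inspect the overlap):

        ```python
        from sklearn.datasets import make_classification
        from sklearn.decomposition import PCA, TruncatedSVD
        from sklearn.manifold import TSNE

        X, y = make_classification(n_samples=500, n_features=20,
                                   weights=[0.9, 0.1], random_state=42)

        # Linear projections to 2D
        X_pca = PCA(n_components=2).fit_transform(X)
        X_svd = TruncatedSVD(n_components=2).fit_transform(X)
        # Nonlinear embedding to 2D (slower, but often separates better)
        X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X)
        # Plot each, e.g. plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
        ```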

    • Adding

      • If the classes are not well separable, we may be able to find a new additional feature that helps distinguish between the two classes and so improves the classifier's accuracy

      • Compared with the resampling approaches above, which alter the data's underlying distribution, enriching the data with more real-world information is a far better option whenever it is possible

    • Deleting

      • Drop all of the features that have very similar distributions across the two classes (e.g. fraudulent vs. normal transactions)

  • Hyperparameters

    • Classification threshold

      • Use the Precision-Recall curve to choose the threshold
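
      • A sketch of choosing the threshold from the precision-recall curve; picking the F1-maximizing threshold is one possible criterion, and the dataset/classifier here are assumptions:

        ```python
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import precision_recall_curve
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                                   random_state=42)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, stratify=y, random_state=42)

        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        proba = clf.predict_proba(X_te)[:, 1]

        precision, recall, thresholds = precision_recall_curve(y_te, proba)
        # precision/recall have one more entry than thresholds; align with [:-1]
        f1 = 2 * precision * recall / (precision + recall + 1e-12)
        best = thresholds[np.argmax(f1[:-1])]

        y_pred = (proba >= best).astype(int)  # instead of the default 0.5
        ```

        If the business impact weights false positives and false negatives differently, replace the F1 criterion with a cost-weighted one.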

  • Cross-validation

    • from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline

      • Resampling is applied only to the training folds inside each CV split, so the validation folds are never contaminated
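
    • A minimal sketch of that pipeline with SMOTE and a logistic regression (the dataset, estimator, and scoring choice are assumptions):

      ```python
      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import StratifiedKFold, cross_val_score
      from imblearn.over_sampling import SMOTE
      from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline

      X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                                 random_state=42)

      # SMOTE runs inside each fold, on the training portion only
      pipe = imbalanced_make_pipeline(
          SMOTE(random_state=42),
          LogisticRegression(max_iter=1000))

      scores = cross_val_score(pipe, X, y, scoring='f1',
                               cv=StratifiedKFold(n_splits=5))
      ```

      Resampling X and y once before cross-validation would leak synthetic samples into the validation folds and inflate the scores; the imblearn pipeline avoids this.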
