Data
https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data
https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset
Metrics
Confusion matrix, precision, recall, F1
ROC
AUC PR
Maximize the area under the precision-recall curve of the less common class
Normalized Gini Coefficient (equal to 2 * ROC AUC - 1)
Business impact
E.g., reducing false positives may be more important than reducing false negatives
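A minimal sketch of the metrics above; the toy imbalanced dataset and the plain logistic regression are illustrative assumptions, not taken from the datasets linked above:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, classification_report,
                             roc_auc_score, average_precision_score)

# Toy imbalanced dataset: roughly 1% positive class
X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
pred = (proba >= 0.5).astype(int)              # default 0.5 threshold

print(confusion_matrix(y_te, pred))            # [[TN, FP], [FN, TP]]
print(classification_report(y_te, pred))       # precision, recall, F1 per class

roc_auc = roc_auc_score(y_te, proba)           # area under the ROC curve
pr_auc = average_precision_score(y_te, proba)  # area under the PR curve
gini = 2 * roc_auc - 1                         # normalized Gini = 2 * AUC - 1
print(f"ROC AUC={roc_auc:.3f}  AUC-PR={pr_auc:.3f}  Gini={gini:.3f}")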
Algorithms
Decision tree
Random forest, XGBoost, LightGBM
from sklearn.ensemble import IsolationForest  # usage sketch after this list
Autoencoders
Use latent representations as inputs for another model
GAN
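A minimal sketch of the IsolationForest import above, used to flag minority-class instances as anomalies; the toy data and the contamination value are assumptions to tune per dataset:

from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

X, y = make_classification(n_samples=5_000, weights=[0.99], random_state=0)

# contamination = expected fraction of anomalies (the minority class here)
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
pred = iso.predict(X)          # +1 = inlier, -1 = anomaly
anomalies = X[pred == -1]
print(len(anomalies), "instances flagged as anomalies")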
Resampling
Over/up-sampling minority class
Note
Always split into train and test sets before applying oversampling techniques, so that no synthetic samples leak into the test set
Options
Adding more copies of the minority class
Synthetic sampling
from imblearn.over_sampling import SMOTE
Adaptive sampling
Borderline-SMOTE
An improved variant of SMOTE that oversamples near the class boundary
ADASYN (adaptive synthetic sampling)
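A minimal sketch comparing the three over-samplers above on toy data (per the note above, resample the training split only):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN

X, y = make_classification(n_samples=5_000, weights=[0.95], random_state=0)
print("original:", Counter(y))

for sampler in (SMOTE(random_state=0),
                BorderlineSMOTE(random_state=0),  # focuses on the class border
                ADASYN(random_state=0)):          # adapts to local density
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))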
Under/down-sampling majority class
Options
Removing examples from the majority class
Tomek links
from imblearn.under_sampling import TomekLinks
Cluster centroids
from imblearn.under_sampling import ClusterCentroids
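A minimal sketch of the two under-samplers above on toy data:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks, ClusterCentroids

X, y = make_classification(n_samples=5_000, weights=[0.9], random_state=0)

# Tomek links: removes majority samples forming cross-class nearest-neighbor pairs
X_tl, y_tl = TomekLinks().fit_resample(X, y)

# Cluster centroids: replaces the majority class with k-means centroids
X_cc, y_cc = ClusterCentroids(random_state=0).fit_resample(X, y)

print(Counter(y), Counter(y_tl), Counter(y_cc))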
Over-sampling followed by under-sampling
Options
SMOTE and Tomek links
from imblearn.combine import SMOTETomek
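A minimal sketch of the combined sampler above, which applies SMOTE and then cleans the result with Tomek links:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

X, y = make_classification(n_samples=5_000, weights=[0.9], random_state=0)
X_res, y_res = SMOTETomek(random_state=0).fit_resample(X, y)
print(Counter(y), Counter(y_res))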
Stratified sampling
Removing the variance of class proportions across batches
Can be useful when batch training a classifier
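One way to build such batches, as a sketch, is to reuse StratifiedKFold: each fold shares the full dataset's class ratio. The toy data is an assumption:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=1_000, weights=[0.9], random_state=0)

# Each held-out fold of a StratifiedKFold is one batch whose class
# proportions match the full dataset
batches = [idx for _, idx in StratifiedKFold(n_splits=10).split(X, y)]
for b in batches[:3]:
    print(len(b), np.bincount(y[b]))   # same class ratio in every batch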
Outliers
Interquartile Range (IQR)
The IQR is the difference between the 75th and 25th percentiles. The aim is to set thresholds some distance beyond the 75th and 25th percentiles; any instance falling outside these thresholds is deleted as an outlier.
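A minimal numpy sketch of this rule; the conventional 1.5 * IQR fence is an assumption to tune per dataset:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                          # interquartile range
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # 1.5 * IQR is the usual fence
x_clean = x[(x >= lo) & (x <= hi)]     # drop instances beyond the fences
print(len(x) - len(x_clean), "outliers removed")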
Features
If the majority-class and minority-class data are well separated, imbalance is not a problem in itself
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD
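A minimal sketch using the imports above to project toy data to 2-D and eyeball how separable the two classes are:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = make_classification(n_samples=2_000, weights=[0.9], random_state=0)

for name, emb in (("PCA", PCA(n_components=2).fit_transform(X)),
                  ("t-SNE", TSNE(n_components=2, random_state=0).fit_transform(X))):
    plt.figure()
    plt.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap="coolwarm")
    plt.title(f"{name}: are the two classes separable?")
plt.show()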
Adding
If the classes are not well separable, we may be able to find a new additional feature that helps distinguish between the two classes and thus improve the classifier's accuracy
Compared with the approaches in the previous subsection, which change the reality of the data, enriching the data with more information from reality is a far better idea whenever it is possible.
Deleting
Drop all of the features that have very similar distributions across the two classes (e.g., fraudulent vs. normal transactions)
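One possible sketch of this idea, using a Kolmogorov-Smirnov test to keep only features whose class-conditional distributions actually differ; the 0.1 cut-off is an arbitrary assumption:

from scipy.stats import ks_2samp
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5_000, n_features=10, weights=[0.95],
                           random_state=0)

# Keep a feature only if its distribution differs between the two classes
# (KS statistic above an arbitrary 0.1 cut-off)
keep = [j for j in range(X.shape[1])
        if ks_2samp(X[y == 0, j], X[y == 1, j]).statistic > 0.1]
X_reduced = X[:, keep]
print(f"kept {len(keep)} of {X.shape[1]} features")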
Hyperparameters
Classification threshold
Use the Precision-Recall curve to choose the threshold
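A minimal sketch that sweeps the thresholds returned by precision_recall_curve and, as one possible criterion, keeps the one maximizing F1 (a business-driven criterion can be swapped in):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_te, proba)

# F1 at every candidate threshold; precision/recall have one extra entry
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = thresholds[np.argmax(f1[:-1])]
print("chosen threshold:", best)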
Cross-validation
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
Resampling must happen inside each CV fold, on the training folds only, not before splitting
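A minimal sketch with the import above: wrapping the sampler and the classifier in an imblearn pipeline makes cross_val_score re-fit the sampler inside every training fold, so no synthetic samples leak into the validation folds (SMOTE and logistic regression are illustrative choices):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline

X, y = make_classification(n_samples=5_000, weights=[0.95], random_state=0)

# SMOTE runs only on the training folds; validation folds stay untouched
pipe = imbalanced_make_pipeline(SMOTE(random_state=0),
                                LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=StratifiedKFold(5), scoring="f1")
print(scores.mean())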