Monday, May 11, 2020

AI - Imbalanced data

  • Data

  • Metrics

    • Confusion matrix, precision, recall, F1

    • ROC

    • AUC PR

      • Maximize the area under the precision-recall curve for the less common class

    • Normalized Gini Coefficient

    • Business impact

      • E.g. reducing false positives may be more important than reducing false negatives
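
  • The metrics above can be sketched with scikit-learn; the labels and scores below are hypothetical toy values, just to show the calls:

    ```python
    from sklearn.metrics import (confusion_matrix, precision_score,
                                 recall_score, f1_score, roc_auc_score,
                                 average_precision_score)

    # Toy imbalanced labels/scores (hypothetical values for illustration)
    y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
    y_pred  = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
    y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.6, 0.4, 0.9, 0.45]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    p  = precision_score(y_true, y_pred)          # TP / (TP + FP)
    r  = recall_score(y_true, y_pred)             # TP / (TP + FN)
    f1 = f1_score(y_true, y_pred)                 # harmonic mean of p and r
    roc_auc = roc_auc_score(y_true, y_score)      # area under the ROC curve
    pr_auc  = average_precision_score(y_true, y_score)  # AUC of the PR curve
    ```

    With very skewed classes, accuracy is misleading, which is why these metrics (and especially the PR-based ones) are preferred.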

  • Algorithms

  • Resampling

    • Over/up-sampling minority class

      • Note

        • Always split into test and train sets before trying oversampling techniques

      • Options

        • Adding more copies of the minority class

        • Synthetic sampling

          • from imblearn.over_sampling import SMOTE

        • Adaptive sampling

          • Borderline-SMOTE

            • An improved variant of SMOTE that focuses on minority samples near the class boundary

        • ADASYN (adaptive synthetic sampling)
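
      • A minimal sketch of the oversampling options above, on a synthetic dataset from make_classification (the 9:1 split and random seeds are assumptions):

        ```python
        from collections import Counter
        from sklearn.datasets import make_classification
        from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN

        # Synthetic dataset with a roughly 9:1 class imbalance
        X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                                   random_state=42)

        # SMOTE synthesizes new minority samples along lines between
        # existing minority neighbors, balancing the classes
        X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

        # Borderline-SMOTE and ADASYN share the same interface:
        # BorderlineSMOTE(random_state=42).fit_resample(X, y)
        # ADASYN(random_state=42).fit_resample(X, y)
        ```

        Note this should only ever be applied to the training set, per the warning above.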

    • Under/down-sampling majority class

      • Options

        • Removing examples from the majority class

        • Tomek links

          • from imblearn.under_sampling import TomekLinks

        • Cluster centroids

          • from imblearn.under_sampling import ClusterCentroids
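
      • A minimal sketch of the two undersampling options above, again on a synthetic imbalanced dataset (dataset parameters are assumptions):

        ```python
        from collections import Counter
        from sklearn.datasets import make_classification
        from imblearn.under_sampling import TomekLinks, ClusterCentroids

        X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                                   random_state=42)

        # Tomek links: drop majority samples that form cross-class
        # nearest-neighbor pairs, cleaning the class boundary
        X_tl, y_tl = TomekLinks().fit_resample(X, y)

        # Cluster centroids: replace the majority class with the
        # centroids of a k-means clustering, down to the minority count
        X_cc, y_cc = ClusterCentroids(random_state=42).fit_resample(X, y)
        ```

        Tomek links only removes the ambiguous boundary samples, so the classes stay unbalanced; cluster centroids balances them exactly.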

    • Over-sampling followed by under-sampling

      • Options

        • SMOTE and Tomek links

          • from imblearn.combine import SMOTETomek
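
        • A sketch of the combined approach, under the same assumed synthetic dataset:

          ```python
          from collections import Counter
          from sklearn.datasets import make_classification
          from imblearn.combine import SMOTETomek

          X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                                     random_state=42)

          # First oversample the minority class with SMOTE, then remove
          # Tomek links to clean up the boundary between the classes
          X_res, y_res = SMOTETomek(random_state=42).fit_resample(X, y)
          ```

          The result is roughly balanced: SMOTE equalizes the counts and the Tomek-link cleaning then removes a small number of boundary samples.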

    • Stratified sampling

      • Removes the variance in class proportions across batches/splits

      • Can be useful when batch training a classifier
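
      • A minimal sketch with scikit-learn's train_test_split; the stratify argument keeps the class ratio identical in both splits (the 9:1 toy data is an assumption):

        ```python
        import numpy as np
        from sklearn.model_selection import train_test_split

        # Toy 9:1 imbalanced labels
        y = np.array([0] * 900 + [1] * 100)
        X = np.arange(1000).reshape(-1, 1)

        # stratify=y preserves the 9:1 ratio in train and test
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=42)
        ```

        For batch training, StratifiedKFold applies the same idea to cross-validation folds.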

  • Outliers

    • Interquartile Range (IQR)

      • Calculated as the difference between the 75th and the 25th percentile. The aim is to set thresholds some distance beyond the 75th and 25th percentiles; any instance falling outside these thresholds is treated as an outlier and removed.

  • Features

    • If the majority-class and minority-class data are well separated, the imbalance itself is not a problem

      • from sklearn.manifold import TSNE 

      • from sklearn.decomposition import PCA, TruncatedSVD
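
      • A sketch of checking separability by projecting the features to 2D with the imports above (the synthetic dataset is an assumption; in practice you would plot the projections colored by class and inspect the overlap):

        ```python
        from sklearn.datasets import make_classification
        from sklearn.decomposition import PCA, TruncatedSVD
        from sklearn.manifold import TSNE

        X, y = make_classification(n_samples=500, n_features=20,
                                   weights=[0.9, 0.1], random_state=42)

        # Linear projections to 2D
        X_pca = PCA(n_components=2).fit_transform(X)
        X_svd = TruncatedSVD(n_components=2).fit_transform(X)
        # Nonlinear embedding to 2D (slower, but often separates better)
        X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X)
        # Plot each, e.g. plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
        ```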

    • Adding

      • If the classes are not well separable, we may be able to find a new additional feature that helps distinguish between the two classes and so improves the classifier's accuracy

      • Compared with the resampling approaches above, which alter the data's underlying distribution, enriching the data with more real-world information is a far better option whenever it is possible

    • Deleting

      • Drop all of the features that have very similar distributions across the two classes (e.g. fraudulent vs. normal transactions)

  • Hyperparameters

    • Classification threshold

      • Use the Precision-Recall curve to choose the threshold
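
      • A sketch of choosing the threshold from the precision-recall curve; picking the F1-maximizing threshold is one possible criterion, and the dataset/classifier here are assumptions:

        ```python
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import precision_recall_curve
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                                   random_state=42)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, stratify=y, random_state=42)

        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        proba = clf.predict_proba(X_te)[:, 1]

        precision, recall, thresholds = precision_recall_curve(y_te, proba)
        # precision/recall have one more entry than thresholds; align with [:-1]
        f1 = 2 * precision * recall / (precision + recall + 1e-12)
        best = thresholds[np.argmax(f1[:-1])]

        y_pred = (proba >= best).astype(int)  # instead of the default 0.5
        ```

        If the business impact weights false positives and false negatives differently, replace the F1 criterion with a cost-weighted one.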

  • Cross-validation

    • from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline

      • Resampling is applied only to the training folds inside each CV split, so the validation folds are never contaminated
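
    • A minimal sketch of that pipeline with SMOTE and a logistic regression (the dataset, estimator, and scoring choice are assumptions):

      ```python
      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import StratifiedKFold, cross_val_score
      from imblearn.over_sampling import SMOTE
      from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline

      X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                                 random_state=42)

      # SMOTE runs inside each fold, on the training portion only
      pipe = imbalanced_make_pipeline(
          SMOTE(random_state=42),
          LogisticRegression(max_iter=1000))

      scores = cross_val_score(pipe, X, y, scoring='f1',
                               cv=StratifiedKFold(n_splits=5))
      ```

      Resampling X and y once before cross-validation would leak synthetic samples into the validation folds and inflate the scores; the imblearn pipeline avoids this.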
