Wednesday, August 22, 2018

AI - Preprocessing

  • Missing data
  • Text data
    • Replace data
      • Specific data to specific characters/symbols
  • Feature engineering
    • Create new features
    • Bin/bucket
      • When should you bucketize a numerical column?

        • When the raw numbers are not meaningful as magnitudes (e.g., latitude and longitude values)

        • When you want to capture a nonlinear relationship between a numeric feature and the label

        • When you are trying both wide and deep features

      • TensorFlow

        import numpy as np
        import tensorflow as tf

        def get_quantile_based_boundaries(feature_values, num_buckets):
          # Equally spaced quantiles give every bucket roughly the same
          # number of examples.
          boundaries = np.arange(1.0, num_buckets) / num_buckets
          quantiles = feature_values.quantile(boundaries)
          return [quantiles[q] for q in quantiles.keys()]

        # Numeric source columns for the bucketized columns below.
        longitude = tf.feature_column.numeric_column("longitude")
        latitude = tf.feature_column.numeric_column("latitude")

        # Divide longitude into 10 buckets.
        bucketized_longitude = tf.feature_column.bucketized_column(
          longitude, boundaries=get_quantile_based_boundaries(
            training_examples["longitude"], 10))

        # Divide latitude into 10 buckets.
        bucketized_latitude = tf.feature_column.bucketized_column(
          latitude, boundaries=get_quantile_based_boundaries(
            training_examples["latitude"], 10))


      • Python
        • df['price-binned'] = pd.cut(df['price'], np.linspace(df['price'].min(), df['price'].max(), 4), labels=['low', 'medium', 'high'], include_lowest=True)  # 4 edges -> 3 bins
    • Interaction
    • Crosses
      • TensorFlow

        # Cross the bucketized longitude and latitude so the model can learn
        # neighborhood-level location effects.
        long_x_lat = tf.feature_column.crossed_column(
          [bucketized_longitude, bucketized_latitude], hash_bucket_size=1000)


    • One hot encoding, dummy coding, effect coding, label encoding
    • Transformation
      • A log transform is a powerful tool for dealing with positive numbers with a heavy-tailed distribution
        • np.log10(biz_df['review_count'])
        • scipy.stats.boxcox(biz_df['review_count'], lmbda=0)
      • The Box-Cox formulation only works when the data is positive
        • For nonpositive data, one could shift the values by adding a fixed constant
        • stats.boxcox(biz_df['review_count'])
          • Finds the optimal transform parameter
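
      • A minimal sketch of the transforms above, reusing the biz_df / review_count names from the snippets; the +1 shift for nonpositive data is an illustration, not part of Box-Cox itself

        import numpy as np
        from scipy import stats

        # Log transform compresses the heavy right tail.
        log_counts = np.log10(biz_df['review_count'])

        # Box-Cox with lmbda=0 is the natural log transform.
        log_bc = stats.boxcox(biz_df['review_count'], lmbda=0)

        # With the default lmbda=None, boxcox also returns the optimal lambda.
        bc_counts, optimal_lambda = stats.boxcox(biz_df['review_count'])

        # Box-Cox needs strictly positive input; shift nonpositive data first.
        shifted = biz_df['review_count'] - biz_df['review_count'].min() + 1
        bc_shifted, _ = stats.boxcox(shifted)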
    • Scaling / normalization
      • Feature scaling is useful in situations where a set of input features differs wildly in scale.
        • Use caution when performing min-max scaling and standardization on sparse features
      • Min-Max
        • Squeezes (or stretches) all feature values to be within the range of [0, 1]
        • E.g. x1 = (x1 - min(x1)) / (max(x1) - min(x1))
        • sklearn.preprocessing.minmax_scale(df[['n_tokens_content']])
        • This can hurt some models as it takes away weight from outliers
      • (z-score) standardization / variance scaling / mean normalization
        • Scaled feature has a mean of 0 and a variance of 1
        • E.g. x1 = (x1 - avg(x1)) / standard deviation of the x1
        • sklearn.preprocessing.StandardScaler().fit_transform(df[['n_tokens_content']])
        • Useful for algorithms based on Euclidean distance, such as KNN
      • L2 / Euclidean
        • Scales the feature column to have unit L2 (Euclidean) norm
        • sklearn.preprocessing.normalize(df[['n_tokens_content']], axis=0)
        • This comes in handy, especially when working with text data or clustering algorithms
      • Robust
        • RobustScaler is less sensitive to outliers because it centers on the median and scales by the IQR
        • from sklearn.preprocessing import RobustScaler
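
        • A one-line usage sketch to match the other scalers above, reusing the n_tokens_content column from the earlier snippets

          from sklearn.preprocessing import RobustScaler

          # Centers on the median and scales by the IQR, so outliers have
          # far less influence than with min-max or z-score scaling.
          df['n_tokens_content_robust'] = RobustScaler().fit_transform(df[['n_tokens_content']])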

      • Spark
        • StandardScaler

      • Pandas

        import math

        def linear_scale(series):
          # Scale values to the range [-1, 1].
          min_val = series.min()
          max_val = series.max()
          scale = (max_val - min_val) / 2.0
          return series.apply(lambda x: ((x - min_val) / scale) - 1.0)

        def log_normalize(series):
          # log(x + 1) handles zero values gracefully.
          return series.apply(lambda x: math.log(x + 1.0))

        def clip(series, clip_to_min, clip_to_max):
          # Cap values at the given minimum and maximum.
          return series.apply(lambda x: min(max(x, clip_to_min), clip_to_max))

        def z_score_normalize(series):
          # Scale to mean 0 and standard deviation 1.
          mean = series.mean()
          std_dv = series.std()
          return series.apply(lambda x: (x - mean) / std_dv)

        def binary_threshold(series, threshold):
          # Map to 1 if above the threshold, else 0.
          return series.apply(lambda x: 1 if x > threshold else 0)


      • Try alternate normalizations for various features to further improve performance.
        • Pandas
          • normalized_training_examples.hist(bins=20, figsize=(18, 12), xlabelsize=10)
      • Note: you cannot apply a log transform after standardization, because about half of the standardized values will be zero or negative and therefore have no (real) logarithm
      • Example
    • Clipping
      • roomsPerPerson = min(totalRooms / population, 4)
        • Clipping the feature value at 4.0 doesn't mean that we ignore all values greater than 4.0. Rather, it means that all values that were greater than 4.0 now become 4.0. This explains the funny hill at 4.0. Despite that hill, the scaled feature set is now more useful than the original data.
    • Hashing
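      • A minimal sketch using the same TF feature-column API as the bucketing examples above; the "city" column name is hypothetical

        # Hash a high-cardinality categorical column into a fixed number of
        # buckets; collisions are possible, but the vocabulary stays bounded.
        hashed_city = tf.feature_column.categorical_column_with_hash_bucket(
          "city", hash_bucket_size=1000)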
  • Feature Selection
    • Simple approach
      • Train the model with one feature at a time, select the best one, add it to the model, and repeat (greedy forward selection)
    • In modern deep learning, when data is plentiful, there has been a shift away from explicit feature selection; we are now more likely to give the algorithm all the features we have and let it sort out which ones to use based on the data
    • Rules of thumb
      • If your features are mostly categorical, start by trying a SelectKBest with a chi2 ranker or a tree-based model selector (see the sketch after this list)
      • If your features are largely quantitative, using linear models as model-based selectors and relying on correlations tends to yield better results
      • If you are solving a binary classification problem, a Support Vector Classification model along with a SelectFromModel selector will probably fit nicely, as the SVC tries to find coefficients optimized for binary classification tasks
      • A little bit of EDA can go a long way in manual feature selection. The importance of domain knowledge about where the data originated cannot be overstated
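
      • A sketch of the SelectKBest and SVC-based selectors named above; X, y and the parameter values are placeholders

        from sklearn.feature_selection import SelectKBest, SelectFromModel, chi2
        from sklearn.svm import LinearSVC

        # Filter: keep the 10 features that score best on a chi2 test
        # (chi2 requires non-negative feature values).
        X_kbest = SelectKBest(chi2, k=10).fit_transform(X, y)

        # Model-based: keep features with nonzero coefficients from an
        # L1-penalized linear SVC (L1 drives weak coefficients to zero).
        svc = LinearSVC(C=0.01, penalty='l1', dual=False)
        X_svc = SelectFromModel(svc).fit_transform(X, y)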
    • Filter methods
      • correlation coefficient
      • ANOVA test
      • chi-square test
      • variance threshold
    • Wrapper methods
      • recursive feature elimination (RFE; sketched after this list)
      • sequential feature selection algorithms
      • genetic algorithms
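
      • A minimal sketch of recursive feature elimination; the estimator and the target feature count are placeholders

        from sklearn.feature_selection import RFE
        from sklearn.linear_model import LogisticRegression

        # Repeatedly fit the model and drop the weakest feature
        # until only 5 remain.
        rfe = RFE(LogisticRegression(), n_features_to_select=5)
        X_rfe = rfe.fit_transform(X, y)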
    • Embedded methods
      • Decision tree
      • L1 regularizer
        • Linear model
      • Embedding layer
        • How to choose the number of dimensions for an embedding layer? (see the sketch after this list)

          • Try starting from the 4th root of the total number of possible values

          • When hyperparameter tuning the dimension, a practical maximum is around 35

          • Higher dimensions -> higher chance of overfitting, slower training
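
          • A sketch of the fourth-root rule above, using the same feature-column API; the vocabulary size is hypothetical

            num_categories = 10000  # hypothetical number of possible values
            embedding_dim = int(round(num_categories ** 0.25))  # 4th root -> 10

            category_column = tf.feature_column.categorical_column_with_hash_bucket(
              "category", hash_bucket_size=num_categories)
            embedded_category = tf.feature_column.embedding_column(
              category_column, dimension=embedding_dim)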

        • multi-sense embeddings

          • They do not always improve results

      • Weight
    • Spark
      • ChiSqSelector
    • Python 3
    • Example
  • Feature Extraction
    • Feature transformation
      • TSNE
        • from sklearn.manifold import TSNE
      • PCA
      • SVD
        • TruncatedSVD returns the same components as PCA if the data is scaled, but different components when run on the raw, unscaled data
        • from sklearn.decomposition import TruncatedSVD
      • LDA
        • Linear Discriminant Analysis (LDA) is a feature transformation technique as well as a supervised classifier, commonly used as a preprocessing step in classification pipelines. Like PCA, LDA extracts a new coordinate system and projects the data onto a lower-dimensional space; the main difference is that instead of maximizing the variance of the data as a whole, LDA optimizes the lower-dimensional space for the best class separability. The new coordinate system is therefore more useful for finding decision boundaries, which is perfect when building classification pipelines. Separating on class separability also helps avoid overfitting (mitigating the curse of dimensionality) and reduces computational cost.
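
        • A side-by-side sketch in scikit-learn; X and y are placeholders, and LDA's n_components is bounded by n_classes - 1 (so this assumes at least three classes)

          from sklearn.decomposition import PCA
          from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

          # PCA: unsupervised, maximizes explained variance.
          X_pca = PCA(n_components=2).fit_transform(X)

          # LDA: supervised, maximizes class separability, so it needs y.
          X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)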
      • LSA
        • Latent semantic analysis (LSA) is a feature extraction tool for text data. It is essentially a series of three steps (which chain into the pipeline sketched after this list)
          • A TF-IDF vectorization
          • A PCA (SVD, in this case, to account for the sparsity of text)
          • Row normalization
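
        • A sketch of the three steps as a scikit-learn pipeline; docs is a placeholder list of raw text documents and n_components is arbitrary

          from sklearn.pipeline import make_pipeline
          from sklearn.feature_extraction.text import TfidfVectorizer
          from sklearn.decomposition import TruncatedSVD
          from sklearn.preprocessing import Normalizer

          # TF-IDF -> TruncatedSVD (PCA-like, but handles sparse input)
          # -> row normalization.
          lsa = make_pipeline(
            TfidfVectorizer(), TruncatedSVD(n_components=100), Normalizer())
          doc_vectors = lsa.fit_transform(docs)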
      • Nonlinear Featurization via K-Means Model Stacking
        • https://github.com/mungeol/feature-engineering-book/blob/master/07.03-05_K-means_featurization.ipynb
        • With cluster features, the linear classifier performs just as well as nonlinear classifiers
        • K-means featurization is useful for real-valued, bounded numeric features that form clumps of dense regions in space
        • k-means cannot handle feature spaces where the Euclidean distance does not make sense—i.e., weirdly distributed numeric variables or categorical variables. If the feature set contains those variables, then there are several ways to handle them: 
          • Apply k-means featurization only on the real-valued, bounded numeric features
          • Define a custom metric to handle multiple data types and use the k-medoids algorithms. (k-medoids is analogous to k-means but allows for arbitrary distance metrics.)
          • Convert categorical variables to binning statistics (see “Bin Counting” on page 87), then featurize them using k-means
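
        • A minimal sketch of k-means featurization as described above; X is a placeholder matrix of bounded, real-valued features

          import numpy as np
          from sklearn.cluster import KMeans
          from sklearn.preprocessing import OneHotEncoder

          # Use each point's cluster ID as a new categorical feature.
          kmeans = KMeans(n_clusters=10, random_state=42).fit(X)
          cluster_ids = kmeans.predict(X).reshape(-1, 1)

          # One-hot encode the IDs and append them to the original features,
          # giving a linear model a nonlinear view of the space.
          cluster_onehot = OneHotEncoder(sparse=False).fit_transform(cluster_ids)
          X_augmented = np.hstack([X, cluster_onehot])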
      • Example
    • Feature learning
      • RBM
        • A Restricted Boltzmann Machine (RBM) is a simple deep learning architecture set up to learn a set number of new dimensions based on a probabilistic model of the data. RBMs are a family of algorithms, only one of which is implemented in scikit-learn. The BernoulliRBM may be a nonparametric feature learner, but, as the name suggests, it expects the cells of the dataset to hold values in the range [0, 1].
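
        • A sketch of that scikit-learn implementation; X is a placeholder whose values must be binary or scaled to [0, 1] (e.g., pixel intensities divided by 255)

          from sklearn.neural_network import BernoulliRBM

          # Learn 100 new features from [0, 1]-valued input data.
          rbm = BernoulliRBM(n_components=100, learning_rate=0.01,
                             n_iter=20, random_state=42)
          X_new = rbm.fit_transform(X)  # shape: (n_samples, 100)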
      • Word embeddings
        • Likely one of the biggest contributors to the recent deep-learning-fueled advances in natural language processing/understanding/generation is the ability to project strings (words and phrases) into an n-dimensional feature space that captures context and subtleties in wording.
        • Approaches
      • Example
  • Imbalanced data / skewed classes
    • Reference
      • 2017 Mastering Machine Learning with Python in Six Steps
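
    • One common remedy, sketched here as an assumption rather than taken from the referenced book: upsample the minority class; the df and 'label' names are placeholders

      import pandas as pd
      from sklearn.utils import resample

      # Upsample the minority class until it matches the majority class size.
      majority = df[df['label'] == 0]
      minority = df[df['label'] == 1]
      minority_upsampled = resample(
        minority, replace=True, n_samples=len(majority), random_state=42)
      df_balanced = pd.concat([majority, minority_upsampled])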
  • Outlier
    • Plot it
      • Box plot
    • Collect more outlier data
    • Keep it
      • Anomaly detection
    • Replace it with reasonable minimum or maximum value
    • Remove it
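
    • A sketch of the replace-with-min/max option using the IQR fences a box plot draws; the 'price' column is a placeholder

      # Tukey fences: values beyond 1.5 * IQR from the quartiles are outliers.
      q1, q3 = df['price'].quantile([0.25, 0.75])
      iqr = q3 - q1
      lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

      # Replace outliers with the fence values instead of dropping rows.
      df['price'] = df['price'].clip(lower=lower, upper=upper)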
  • Shuffling
    • Pandas
      • df = df.reindex(np.random.permutation(df.index))
      • df = df.sample(frac=1)
      • df = df.sample(frac=1).reset_index(drop=True)
      • from sklearn.utils import shuffle
        • df = shuffle(df)
  • Image augmentation
  • Training, validation/dev, Test set
    • Your dev and test sets should come from the same distribution
    • Choose dev and test sets from a distribution that reflects what data you expect to get in the future and want to do well on. This may not be the same as your training data’s distribution
    • When you should train and test on different distributions
      • 2018 Machine learning yearning
        • P71
    • How to decide whether to use all your data (which have different distributions)
      • 2018 Machine learning yearning
        • P73
    • How to decide whether to include inconsistent data
      • 2018 Machine learning yearning
        • P75
    • How large do the dev/test sets need to be?
      • The old heuristic of a 70%/30% train/test split does not apply for problems where you have a lot of data; the dev and test sets can be much less than 30% of the data
      • The dev set should be large enough to detect differences between algorithms that you are trying out. For example, if classifier A has an accuracy of 90.0% and classifier B has an accuracy of 90.1%, then a dev set of 100 examples would not be able to detect this 0.1% difference. Compared to other machine learning problems I’ve seen, a 100 example dev set is small. Dev sets with sizes from 1,000 to 10,000 examples are common. With 10,000 examples, you will have a good chance of detecting an improvement of 0.1%
      • There is no need to have excessively large dev/test sets beyond what is needed to evaluate the performance of your algorithms
    • Eyeball and BlackBox dev set
      • 2018 Machine learning yearning
        • P36, P38
    • Training dev set
      • 2018 Machine learning yearning
        • Generalizing from the training set to the dev set
          • P77
