Wednesday, August 22, 2018

AI - Preprocessing

  • Missing data
  • Text data
    • Replace data
      • Specific data to specific characters/symbols
  • Feature engineering
    • Create new features
    • Bin/bucket
      • When should you bucketize a numerical column?

        • When the raw numbers are not meaningful as magnitudes (e.g., latitude and longitude values)

        • When you want to capture a nonlinear relationship between a numeric feature and the label

        • When you are trying both wide and deep features

      • TensorFlow

        import numpy as np
        import tensorflow as tf

        def get_quantile_based_boundaries(feature_values, num_buckets):
          # Equally spaced quantiles give every bucket roughly the same
          # number of examples.
          boundaries = np.arange(1.0, num_buckets) / num_buckets
          quantiles = feature_values.quantile(boundaries)
          return [quantiles[q] for q in quantiles.keys()]

        # Numeric source columns for the bucketized columns below.
        longitude = tf.feature_column.numeric_column("longitude")
        latitude = tf.feature_column.numeric_column("latitude")

        # Divide longitude into 10 buckets.
        bucketized_longitude = tf.feature_column.bucketized_column(
          longitude, boundaries=get_quantile_based_boundaries(
            training_examples["longitude"], 10))

        # Divide latitude into 10 buckets.
        bucketized_latitude = tf.feature_column.bucketized_column(
          latitude, boundaries=get_quantile_based_boundaries(
            training_examples["latitude"], 10))


      • Python
        • df['price-binned'] = pd.cut(df['price'], np.linspace(df['price'].min(), df['price'].max(), 4), labels=['low', 'medium', 'high'], include_lowest=True)  # 4 edges -> 3 bins
    • Interaction
    • Crosses
      • TensorFlow

        # Cross the bucketized longitude and latitude so the model can learn
        # neighborhood-level location effects.
        long_x_lat = tf.feature_column.crossed_column(
          [bucketized_longitude, bucketized_latitude], hash_bucket_size=1000)


    • One hot encoding, dummy coding, effect coding, label encoding
    • Transformation
      • A log transform is a powerful tool for dealing with positive numbers with a heavy-tailed distribution
        • np.log10(biz_df['review_count'])
        • scipy.stats.boxcox(biz_df['review_count'], lmbda=0)
      • The Box-Cox formulation only works when the data is positive
        • For nonpositive data, one could shift the values by adding a fixed constant
        • stats.boxcox(biz_df['review_count'])
          • Finds the optimal transform parameter
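
      • A minimal sketch of the transforms above, reusing the biz_df / review_count names from the snippets; the +1 shift for nonpositive data is an illustration, not part of Box-Cox itself

        import numpy as np
        from scipy import stats

        # Log transform compresses the heavy right tail.
        log_counts = np.log10(biz_df['review_count'])

        # Box-Cox with lmbda=0 is the natural log transform.
        log_bc = stats.boxcox(biz_df['review_count'], lmbda=0)

        # With the default lmbda=None, boxcox also returns the optimal lambda.
        bc_counts, optimal_lambda = stats.boxcox(biz_df['review_count'])

        # Box-Cox needs strictly positive input; shift nonpositive data first.
        shifted = biz_df['review_count'] - biz_df['review_count'].min() + 1
        bc_shifted, _ = stats.boxcox(shifted)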
    • Scaling / normalization
      • Feature scaling is useful in situations where a set of input features differs wildly in scale.
        • Use caution when performing min-max scaling and standardization on sparse features
      • Min-Max
        • Squeezes (or stretches) all feature values to be within the range of [0, 1]
        • E.g. x1 = (x1 - min(x1)) / (max(x1) - min(x1))
        • sklearn.preprocessing.minmax_scale(df[['n_tokens_content']])
        • This can hurt some models as it takes away weight from outliers
      • (z-score) standardization / variance scaling / mean normalization
        • Scaled feature has a mean of 0 and a variance of 1
        • E.g. x1 = (x1 - avg(x1)) / standard deviation of the x1
        • sklearn.preprocessing.StandardScaler().fit_transform(df[['n_tokens_content']])
        • Useful for algorithms based on Euclidean distance, such as KNN
      • L2 / Euclidean
        • Scales the feature column to have unit L2 (Euclidean) norm
        • sklearn.preprocessing.normalize(df[['n_tokens_content']], axis=0)
        • This comes in handy, especially when working with text data or clustering algorithms
      • Robust
        • RobustScaler is less sensitive to outliers because it centers on the median and scales by the IQR
        • from sklearn.preprocessing import RobustScaler
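
        • A one-line usage sketch to match the other scalers above, reusing the n_tokens_content column from the earlier snippets

          from sklearn.preprocessing import RobustScaler

          # Centers on the median and scales by the IQR, so outliers have
          # far less influence than with min-max or z-score scaling.
          df['n_tokens_content_robust'] = RobustScaler().fit_transform(df[['n_tokens_content']])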

      • Spark
        • StandardScaler

      • Pandas

        import math

        def linear_scale(series):
          # Scale values to the range [-1, 1].
          min_val = series.min()
          max_val = series.max()
          scale = (max_val - min_val) / 2.0
          return series.apply(lambda x: ((x - min_val) / scale) - 1.0)

        def log_normalize(series):
          # log(x + 1) handles zero values gracefully.
          return series.apply(lambda x: math.log(x + 1.0))

        def clip(series, clip_to_min, clip_to_max):
          # Cap values at the given minimum and maximum.
          return series.apply(lambda x: min(max(x, clip_to_min), clip_to_max))

        def z_score_normalize(series):
          # Scale to mean 0 and standard deviation 1.
          mean = series.mean()
          std_dv = series.std()
          return series.apply(lambda x: (x - mean) / std_dv)

        def binary_threshold(series, threshold):
          # Map to 1 if above the threshold, else 0.
          return series.apply(lambda x: 1 if x > threshold else 0)


      • Try alternate normalizations for various features to further improve performance.
        • Pandas
          • normalized_training_examples.hist(bins=20, figsize=(18, 12), xlabelsize=10)
      • Note: you cannot apply a log transform after standardization, because about half of the standardized values will be zero or negative and therefore have no (real) logarithm
      • Example
    • Clipping
      • roomsPerPerson = min(totalRooms / population, 4)
        • Clipping the feature value at 4.0 doesn't mean that we ignore all values greater than 4.0. Rather, it means that all values that were greater than 4.0 now become 4.0. This explains the funny hill at 4.0. Despite that hill, the scaled feature set is now more useful than the original data.
    • Hashing
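      • A minimal sketch using the same TF feature-column API as the bucketing examples above; the "city" column name is hypothetical

        # Hash a high-cardinality categorical column into a fixed number of
        # buckets; collisions are possible, but the vocabulary stays bounded.
        hashed_city = tf.feature_column.categorical_column_with_hash_bucket(
          "city", hash_bucket_size=1000)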
  • Feature Selection
    • Simple approach
      • Train the model with one feature at a time, select the best one, add it to the model, and repeat (greedy forward selection)
    • In modern deep learning, when data is plentiful, there has been a shift away from explicit feature selection; we are now more likely to give the algorithm all the features we have and let it sort out which ones to use based on the data
    • Rules of thumb
      • If your features are mostly categorical, start by trying a SelectKBest with a chi2 ranker or a tree-based model selector (see the sketch after this list)
      • If your features are largely quantitative, using linear models as model-based selectors and relying on correlations tends to yield better results
      • If you are solving a binary classification problem, a Support Vector Classification model along with a SelectFromModel selector will probably fit nicely, as the SVC tries to find coefficients optimized for binary classification tasks
      • A little bit of EDA can go a long way in manual feature selection. The importance of domain knowledge about where the data originated cannot be overstated
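
      • A sketch of the SelectKBest and SVC-based selectors named above; X, y and the parameter values are placeholders

        from sklearn.feature_selection import SelectKBest, SelectFromModel, chi2
        from sklearn.svm import LinearSVC

        # Filter: keep the 10 features that score best on a chi2 test
        # (chi2 requires non-negative feature values).
        X_kbest = SelectKBest(chi2, k=10).fit_transform(X, y)

        # Model-based: keep features with nonzero coefficients from an
        # L1-penalized linear SVC (L1 drives weak coefficients to zero).
        svc = LinearSVC(C=0.01, penalty='l1', dual=False)
        X_svc = SelectFromModel(svc).fit_transform(X, y)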
    • Filter methods
      • correlation coefficient
      • ANOVA test
      • chi-square test
      • variance threshold
    • Wrapper methods
      • recursive feature elimination (RFE; sketched after this list)
      • sequential feature selection algorithms
      • genetic algorithms
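
      • A minimal sketch of recursive feature elimination; the estimator and the target feature count are placeholders

        from sklearn.feature_selection import RFE
        from sklearn.linear_model import LogisticRegression

        # Repeatedly fit the model and drop the weakest feature
        # until only 5 remain.
        rfe = RFE(LogisticRegression(), n_features_to_select=5)
        X_rfe = rfe.fit_transform(X, y)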
    • Embedded methods
      • Decision tree
      • L1 regularizer
        • Linear model
      • Embedding layer
        • How to choose the number of dimensions for an embedding layer? (see the sketch after this list)

          • Try starting from the 4th root of the total number of possible values

          • When hyperparameter tuning the dimension, a practical maximum is around 35

          • Higher dimensions -> higher chance of overfitting, slower training
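
          • A sketch of the fourth-root rule above, using the same feature-column API; the vocabulary size is hypothetical

            num_categories = 10000  # hypothetical number of possible values
            embedding_dim = int(round(num_categories ** 0.25))  # 4th root -> 10

            category_column = tf.feature_column.categorical_column_with_hash_bucket(
              "category", hash_bucket_size=num_categories)
            embedded_category = tf.feature_column.embedding_column(
              category_column, dimension=embedding_dim)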

        • multi-sense embeddings

          • They do not always improve results

      • Weight
    • Spark
      • ChiSqSelector
    • Python 3
    • Example
  • Feature Extraction
    • Feature transformation
      • TSNE
        • from sklearn.manifold import TSNE
      • PCA
      • SVD
        • TruncatedSVD returns the same components as PCA if the data is scaled, but different components when run on the raw, unscaled data
        • from sklearn.decomposition import TruncatedSVD
      • LDA
        • Linear Discriminant Analysis (LDA) is a feature transformation technique as well as a supervised classifier, commonly used as a preprocessing step in classification pipelines. Like PCA, LDA extracts a new coordinate system and projects the data onto a lower-dimensional space; the main difference is that instead of maximizing the variance of the data as a whole, LDA optimizes the lower-dimensional space for the best class separability. The new coordinate system is therefore more useful for finding decision boundaries, which is perfect when building classification pipelines. Separating on class separability also helps avoid overfitting (mitigating the curse of dimensionality) and reduces computational cost.
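
        • A side-by-side sketch in scikit-learn; X and y are placeholders, and LDA's n_components is bounded by n_classes - 1 (so this assumes at least three classes)

          from sklearn.decomposition import PCA
          from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

          # PCA: unsupervised, maximizes explained variance.
          X_pca = PCA(n_components=2).fit_transform(X)

          # LDA: supervised, maximizes class separability, so it needs y.
          X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)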
      • LSA
        • Latent semantic analysis (LSA) is a feature extraction tool for text data. It is essentially a series of three steps (which chain into the pipeline sketched after this list)
          • A TF-IDF vectorization
          • A PCA (SVD, in this case, to account for the sparsity of text)
          • Row normalization
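
        • A sketch of the three steps as a scikit-learn pipeline; docs is a placeholder list of raw text documents and n_components is arbitrary

          from sklearn.pipeline import make_pipeline
          from sklearn.feature_extraction.text import TfidfVectorizer
          from sklearn.decomposition import TruncatedSVD
          from sklearn.preprocessing import Normalizer

          # TF-IDF -> TruncatedSVD (PCA-like, but handles sparse input)
          # -> row normalization.
          lsa = make_pipeline(
            TfidfVectorizer(), TruncatedSVD(n_components=100), Normalizer())
          doc_vectors = lsa.fit_transform(docs)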
      • Nonlinear Featurization via K-Means Model Stacking
        • https://github.com/mungeol/feature-engineering-book/blob/master/07.03-05_K-means_featurization.ipynb
        • With cluster features, the linear classifier performs just as well as nonlinear classifiers
        • K-means featurization is useful for real-valued, bounded numeric features that form clumps of dense regions in space
        • k-means cannot handle feature spaces where the Euclidean distance does not make sense—i.e., weirdly distributed numeric variables or categorical variables. If the feature set contains those variables, then there are several ways to handle them: 
          • Apply k-means featurization only on the real-valued, bounded numeric features
          • Define a custom metric to handle multiple data types and use the k-medoids algorithms. (k-medoids is analogous to k-means but allows for arbitrary distance metrics.)
          • Convert categorical variables to binning statistics (see “Bin Counting” on page 87), then featurize them using k-means
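
        • A minimal sketch of k-means featurization as described above; X is a placeholder matrix of bounded, real-valued features

          import numpy as np
          from sklearn.cluster import KMeans
          from sklearn.preprocessing import OneHotEncoder

          # Use each point's cluster ID as a new categorical feature.
          kmeans = KMeans(n_clusters=10, random_state=42).fit(X)
          cluster_ids = kmeans.predict(X).reshape(-1, 1)

          # One-hot encode the IDs and append them to the original features,
          # giving a linear model a nonlinear view of the space.
          cluster_onehot = OneHotEncoder(sparse=False).fit_transform(cluster_ids)
          X_augmented = np.hstack([X, cluster_onehot])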
      • Example
    • Feature learning
      • RBM
        • A Restricted Boltzmann Machine (RBM) is a simple deep learning architecture set up to learn a set number of new dimensions based on a probabilistic model of the data. RBMs are a family of algorithms, only one of which is implemented in scikit-learn. The BernoulliRBM may be a nonparametric feature learner, but, as the name suggests, it expects the cells of the dataset to hold values in the range [0, 1].
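
        • A sketch of that scikit-learn implementation; X is a placeholder whose values must be binary or scaled to [0, 1] (e.g., pixel intensities divided by 255)

          from sklearn.neural_network import BernoulliRBM

          # Learn 100 new features from [0, 1]-valued input data.
          rbm = BernoulliRBM(n_components=100, learning_rate=0.01,
                             n_iter=20, random_state=42)
          X_new = rbm.fit_transform(X)  # shape: (n_samples, 100)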
      • Word embeddings
        • Likely one of the biggest contributors to the recent deep-learning-fueled advances in natural language processing/understanding/generation is the ability to project strings (words and phrases) into an n-dimensional feature space that captures context and subtleties in wording.
        • Approaches
      • Example
  • Imbalanced data / skewed classes
    • Reference
      • 2017 Mastering Machine Learning with Python in Six Steps
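
    • One common remedy, sketched here as an assumption rather than taken from the referenced book: upsample the minority class; the df and 'label' names are placeholders

      import pandas as pd
      from sklearn.utils import resample

      # Upsample the minority class until it matches the majority class size.
      majority = df[df['label'] == 0]
      minority = df[df['label'] == 1]
      minority_upsampled = resample(
        minority, replace=True, n_samples=len(majority), random_state=42)
      df_balanced = pd.concat([majority, minority_upsampled])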
  • Outlier
    • Plot it
      • Box plot
    • Collect more outlier data
    • Keep it
      • Anomaly detection
    • Replace it with reasonable minimum or maximum value
    • Remove it
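
    • A sketch of the replace-with-min/max option using the IQR fences a box plot draws; the 'price' column is a placeholder

      # Tukey fences: values beyond 1.5 * IQR from the quartiles are outliers.
      q1, q3 = df['price'].quantile([0.25, 0.75])
      iqr = q3 - q1
      lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

      # Replace outliers with the fence values instead of dropping rows.
      df['price'] = df['price'].clip(lower=lower, upper=upper)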
  • Shuffling
    • Pandas
      • df = df.reindex(np.random.permutation(df.index))
      • df = df.sample(frac=1)
      • df = df.sample(frac=1).reset_index(drop=True)
      • from sklearn.utils import shuffle
        • df = shuffle(df)
  • Image augmentation
  • Training, validation/dev, Test set
    • Your dev and test sets should come from the same distribution
    • Choose dev and test sets from a distribution that reflects what data you expect to get in the future and want to do well on. This may not be the same as your training data’s distribution
    • When you should train and test on different distributions
      • 2018 Machine learning yearning
        • P71
    • How to decide whether to use all your data (which have different distributions)
      • 2018 Machine learning yearning
        • P73
    • How to decide whether to include inconsistent data
      • 2018 Machine learning yearning
        • P75
    • How large do the dev/test sets need to be?
      • The old heuristic of a 70%/30% train/test split does not apply for problems where you have a lot of data; the dev and test sets can be much less than 30% of the data
      • The dev set should be large enough to detect differences between algorithms that you are trying out. For example, if classifier A has an accuracy of 90.0% and classifier B has an accuracy of 90.1%, then a dev set of 100 examples would not be able to detect this 0.1% difference. Compared to other machine learning problems I’ve seen, a 100 example dev set is small. Dev sets with sizes from 1,000 to 10,000 examples are common. With 10,000 examples, you will have a good chance of detecting an improvement of 0.1%
      • There is no need to have excessively large dev/test sets beyond what is needed to evaluate the performance of your algorithms
    • Eyeball and BlackBox dev set
      • 2018 Machine learning yearning
        • P36, P38
    • Training dev set
      • 2018 Machine learning yearning
        • Generalizing from the training set to the dev set
          • P77
