Mungeol Heo: 2018-08-19

Wednesday, August 22, 2018

Help 4 GCP

Access Denied: Table X:Y.Z: The user 123-compute@developer.gserviceaccount.com does not have permission to query table X:Y.Z (bigQuery)
- Python
  - from google.cloud import bigquery
  - from google.oauth2 import service_account
  - credentials = service_account.Credentials.from_service_account_file('path/to/file.json')
  - project_id = 'my-bq'
  - client = bigquery.Client(credentials=credentials, project=project_id)
Cannot connect to the instance using SSH since the disk is full (GCE)
- Check if your operating system supports automatic resizing: If so, using Cloud Console you can edit VM's root disk and increase its size. Your virtual machine instance can automatically resize the partition to recognize the additional space after you restart the instance.
- Use the Interactive Serial Console feature to log in to your VM, then clean up the disk or copy the files to other storage if you will need them later.
- If you know what data you want to delete, you can configure a startup script to remove the files and reboot your VM to run the script (e.g. rm /tmp/*).
- You can detach the persistent disk and attach it to another machine as an additional disk. On the temporary machine, mount it and clean up your data, or copy the data to other storage if you will need it later. Finally, recreate the original instance with the same boot disk. You can follow the same steps described in this video to add your disk to another Linux VM, but add your existing boot disk instead of creating a new disk.
- Check if your operating system supports automatic resizing: If yes, then create a snapshot of your persistent disk, create a new persistent disk with larger size from the snapshot. Finally, recreate the original instance with this larger boot disk.
No scalar data was found (tensorboard)
- Use gcloud command to train the model.
prediction_lib.PredictionError: Failed to load model: Cloud ML only supports TF 1.0 or above and models saved in SavedModel format. (Error code: 0) (ml engine)
- Check the model path which is the value of "--model-dir" flag.
- Note:
  - Do not use the model location in the log info.
  - E.g.
    - INFO:tensorflow:SavedModel written to: b"output/export/census/temp-b'1531882849'/saved_model.pb"
    - However, you should use "output/export/census/1531882849"
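The path fix in the note above can be sketched as a small helper. The function name `export_dir_from_log_path` is hypothetical, and the sketch assumes the logged path always has the `temp-b'<timestamp>'/saved_model.pb` shape shown in the example.

```python
import re

def export_dir_from_log_path(logged_path):
    """Map a logged temporary path to the final export directory (assumed layout)."""
    # Drop the trailing file name, e.g. "/saved_model.pb".
    directory = logged_path.rsplit("/", 1)[0]
    # Rewrite the last path component "temp-b'1531882849'" to "1531882849".
    return re.sub(r"temp-b'(\d+)'$", r"\1", directory)

logged = "output/export/census/temp-b'1531882849'/saved_model.pb"
print(export_dir_from_log_path(logged))
```

The result is the directory to pass as the model path, not the path printed in the log.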
"error": "Prediction failed: unknown error." (ml engine)
- This is because the model doesn't support the specified instance format.
- E.g.
  - The model supports JSON instance for prediction.
  - However, a CSV instance has been specified for prediction.
- If the error still happens, then try to specify the version of the model which supports instances for prediction.
ERROR: (gcloud.ml-engine.jobs.submit.training) Could not copy [/tmp/.../output/trainer-0.0.0.tar.gz] to [.../trainer-0.0.0.tar.gz]. Please retry: HTTPError 404: Not Found (ml engine)
- Check the bucket name.
The schema of pandas dataframe created from read_gbq is different from bigQuery table (bigQuery)
- Use from google.cloud import bigquery instead.
- E.g. client.query('SELECT * FROM `projectId.dataset.table` limit 1').result().schema
java.net.UnknownHostException: metadata (general)
- Set one of the configurations below.
  - google.cloud.auth.service.account.json.keyfile
  - fs.gs.auth.service.account.json.keyfile
java.io.IOException: Error accessing: bucket: null (hadoop)
- Set "mapred.bq.gcs.bucket" configuration.
java.lang.NullPointerException: Required parameter projectId must be specified (hadoop)
- Set "mapred.bq.project.id" configuration.
org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: Failed to create load job with id prefix ${ID prefix}, reached max retries: 3, last failed load job (bigQuery)
- Make sure to use the right data type for each related column when creating the TableRow.
Error detected while parsing row starting at position: 556531513. Error: Bad character (ASCII 0) encountered (bigQuery)
- Find the character causing the problem.
  - less +556531513P test.csv
- There will be a character like Ctrl-@ which is ^@.
- Avoid or remove it before producing the CSV file, or from the CSV file.
  - gsutil cp gs://bucket/test.csv - | tr -d '\000' | gsutil cp - gs://bucket/test2.csv
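For a file that is already local, the same cleanup can be sketched in Python. This is a minimal sketch that works on bytes held in memory; the helper names `find_nul_offsets` and `strip_nul` and the sample data are made up for illustration.

```python
def find_nul_offsets(data, limit=10):
    """Return the byte offsets of the first `limit` NUL bytes in `data`."""
    offsets = []
    start = 0
    while len(offsets) < limit:
        pos = data.find(b"\x00", start)
        if pos == -1:
            break
        offsets.append(pos)
        start = pos + 1
    return offsets

def strip_nul(data):
    """Remove every NUL byte, similar to the tr command above."""
    return data.replace(b"\x00", b"")

sample = b"id,name\n1,foo\x00bar\n2,baz\n"
print(find_nul_offsets(sample))
print(strip_nul(sample))
```

For very large files, reading and cleaning in chunks would be the safer design, since this version loads everything into memory.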

AI - Preprocessing

Missing data
- Remove data point
- Remove feature
- Major value
- Mean value
- Median value
- Predict
- Example
  - https://github.com/mungeol/Feature-Engineering-Made-Easy/blob/master/Chapter03/Ch_3_pima.ipynb
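The removal and fill strategies listed above can be sketched without any dependencies. The function names and the sample values are hypothetical; with pandas the same idea is usually written with `dropna` and `fillna`.

```python
from statistics import mean, median, mode

def drop_missing(values):
    """Remove data points that are missing (None)."""
    return [v for v in values if v is not None]

def fill_missing(values, strategy="mean"):
    """Replace None entries with a statistic of the observed values."""
    observed = drop_missing(values)
    # "mode" corresponds to the "major value" strategy in the list above.
    statistic = {"mean": mean, "median": median, "mode": mode}[strategy]
    fill = statistic(observed)
    return [fill if v is None else v for v in values]

ages = [20, None, 30, 30, 100]
print(drop_missing(ages))
print(fill_missing(ages, "mean"))
print(fill_missing(ages, "median"))
print(fill_missing(ages, "mode"))
```

Note how the single large value pulls the mean fill well above the median and mode fills, which is one reason to compare strategies.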
Text data
- Replace data
  - Specific data to specific characters/symbols
    - E.g. 123 -> $NUMBER
    - E.g. example@gmail.com -> $EMAIL
    - E.g. 010-1234-5678 -> $PHONE
- Stopwords
- Frequency-based filtering
  - Frequent/rare words
- Stemming
  - Could hurt more than it helps
  - News and new are different
- Lowercase
  - Capitalization matters sometimes
- Bag-of-Words / Bag-of-n-Grams
- TF-IDF
- Chunking and part-of-speech tagging
  - https://github.com/mungeol/feature-engineering-book/blob/master/03.02_Chunking_and_POS_Tagging.ipynb
- Examples
  - https://github.com/mungeol/Feature-Engineering-Made-Easy/blob/master/Chapter04/Ch_4.ipynb
- Lib
  - Universal Sentence Encoder
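The "Replace data" item above can be sketched with regular expressions. The patterns are assumptions chosen to match the three examples (a simple email shape, a phone number like 010-1234-5678, and any run of digits); real data will likely need stricter or broader rules.

```python
import re

# Order matters: emails and phone numbers can contain digits, so they are
# replaced before the generic number rule runs.
RULES = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "$EMAIL"),
    (re.compile(r"\b\d{2,3}-\d{3,4}-\d{4}\b"), "$PHONE"),
    (re.compile(r"\b\d+\b"), "$NUMBER"),
]

def replace_tokens(text):
    """Replace emails, phone numbers, and numbers with placeholder tokens."""
    for pattern, token in RULES:
        text = pattern.sub(token, text)
    return text

print(replace_tokens("call 010-1234-5678 or mail example@gmail.com, code 123"))
```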

Feature

Create new features
- x ^ 2, x ^ 3, ...
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html

Bin/bucket

When to bucketize a numerical column?
- When the numbers themselves are not meaningful
- When you suspect a nonlinear relationship with your numeric values
- When you try both wide and deep features

Tensorflow

import numpy as np
import tensorflow as tf

# feature_values is expected to be a pandas Series.
def get_quantile_based_boundaries(feature_values, num_buckets):
  boundaries = np.arange(1.0, num_buckets) / num_buckets
  quantiles = feature_values.quantile(boundaries)
  return [quantiles[q] for q in quantiles.keys()]

# Divide longitude into 10 buckets.
bucketized_longitude = tf.feature_column.bucketized_column(
  longitude, boundaries=get_quantile_based_boundaries(
    training_examples["longitude"], 10))
  
# Divide latitude into 10 buckets.
bucketized_latitude = tf.feature_column.bucketized_column(
  latitude, boundaries=get_quantile_based_boundaries(
    training_examples["latitude"], 10))

Python
- df['price-binned'] = pd.cut(df['a'], np.linspace(min(df.a), max(df.a), 4) , labels=['l', 'm', 'h'], include_lowest=True)
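As a rough NumPy counterpart to quantile-based bucketing, the sketch below computes interior quantile boundaries and assigns each value to a bucket. The helper name `quantile_boundaries` and the sample values are made up for illustration.

```python
import numpy as np

def quantile_boundaries(values, num_buckets):
    """Interior cut points that split `values` into roughly equal-count buckets."""
    fractions = np.arange(1, num_buckets) / num_buckets
    return np.quantile(values, fractions)

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1000])
boundaries = quantile_boundaries(values, 4)
# np.digitize returns, for each value, the index of the bucket it falls into.
bucket_ids = np.digitize(values, boundaries)
print(boundaries)
print(np.bincount(bucket_ids))
```

Unlike the equal-width bins produced by np.linspace, quantile boundaries keep the outlier (1000) from leaving most buckets empty.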

Interaction
- Linear model
- X2 = sklearn.preprocessing.PolynomialFeatures(include_bias=False).fit_transform(X)
- Example
  - https://github.com/mungeol/Feature-Engineering-Made-Easy/blob/master/Chapter04/Ch_4.ipynb

Crosses

Tensorflow

tf.feature_column.crossed_column(
  set([bucketized_longitude, bucketized_latitude]), hash_bucket_size=1000)

One hot encoding, dummy coding, effect coding, label encoding
- Encoding at the
  - nominal level
  - ordinal level
- Tensorflow
  - tf.keras.utils.to_categorical
- Python
  - pd.get_dummies(df['a'])
- Example
  - https://github.com/mungeol/feature-engineering-book/blob/master/05.01-02_Regression_on_Categorical_Variable.ipynb
  - https://github.com/mungeol/Feature-Engineering-Made-Easy/blob/master/Chapter04/Ch_4.ipynb
Transformation
- A log transform is a powerful tool for dealing with positive numbers with a heavy-tailed distribution
  - np.log10(biz_df['review_count'])
  - scipy.stats.boxcox(biz_df['review_count'], lmbda=0)
- The Box-Cox formulation only works when the data is positive
  - For nonpositive data, one could shift the values by adding a fixed constant
  - stats.boxcox(biz_df['review_count'])
    - Finds the optimal transform parameter

Scaling / normalization

Feature scaling is useful in situations where a set of input features differs wildly in scale.
- Use caution when performing min-max scaling and standardization on sparse features
Min-Max
- Squeezes (or stretches) all feature values to be within the range of [0, 1]
- E.g. x1 = (x1 - min(x1)) / (max(x1) - min(x1))
- sklearn.preprocessing.minmax_scale(df[['n_tokens_content']])
- This can hurt some models as it takes away weight from outliers
(z-score) standardization / variance scaling / mean normalization
- Scaled feature has a mean of 0 and a variance of 1
- E.g. x1 = (x1 - avg(x1)) / std(x1)
- sklearn.preprocessing.StandardScaler().fit_transform(df[['n_tokens_content']])
- Useful for algorithms that rely on Euclidean distance, such as KNN
L2 / Euclidean
- the feature column has norm 1
- sklearn.preprocessing.normalize(df[['n_tokens_content']], axis=0)
- This comes in handy, especially when working with text data or clustering algorithms
Robust
- RobustScaler is less sensitive to outliers
- from sklearn.preprocessing import RobustScaler
Spark
- StandardScaler
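The Min-Max, standardization, and L2 formulas above can be checked by hand with a dependency-free sketch. The function names and the sample feature are made up; the standardization uses the population standard deviation, which is also what scikit-learn's StandardScaler uses.

```python
from math import sqrt
from statistics import mean, pstdev

def min_max_scale(xs):
    """Map values into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Shift to mean 0 and scale to variance 1 (population standard deviation)."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def l2_normalize(xs):
    """Scale the feature column so that its Euclidean norm is 1."""
    norm = sqrt(sum(x * x for x in xs))
    return [x / norm for x in xs]

feature = [1.0, 2.0, 3.0, 4.0, 100.0]
print(min_max_scale(feature))
print(standardize(feature))
print(l2_normalize(feature))
```

With the outlier 100.0 in the column, min-max scaling squeezes the other four values close to 0, which illustrates the caveat about outliers noted above.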

Pandas

import math

def linear_scale(series):
  min_val = series.min()
  max_val = series.max()
  scale = (max_val - min_val) / 2.0
  return series.apply(lambda x:((x - min_val) / scale) - 1.0)
 
def log_normalize(series):
  return series.apply(lambda x:math.log(x+1.0))

def clip(series, clip_to_min, clip_to_max):
  return series.apply(lambda x:(
    min(max(x, clip_to_min), clip_to_max)))

def z_score_normalize(series):
  mean = series.mean()
  std_dv = series.std()
  return series.apply(lambda x:(x - mean) / std_dv)

def binary_threshold(series, threshold):
  return series.apply(lambda x:(1 if x > threshold else 0))

Try alternate normalizations for various features to further improve performance.
- Pandas
  - normalized_training_examples.hist(bins=20, figsize=(18, 12), xlabelsize=10)
Note: a logarithm transformation cannot be applied after standardization, because about half of the standardized values will be 0 or negative and therefore have no logarithm
Example
- https://github.com/mungeol/Feature-Engineering-Made-Easy/blob/master/Chapter03/Ch_3_pima.ipynb

Clipping
- roomsPerPerson = min(totalRooms / population, 4)
  - Clipping the feature value at 4.0 doesn't mean that we ignore all values greater than 4.0. Rather, it means that all values that were greater than 4.0 now become 4.0. This explains the funny hill at 4.0. Despite that hill, the scaled feature set is now more useful than the original data.
Hashing
- https://github.com/mungeol/feature-engineering-book/blob/master/05.05_Feature_Hashing.ipynb

Feature Selection
- Simple approach
  - Train a model on each candidate feature one at a time, then select the best feature and add it to the model. Repeat this process with the remaining features.
- In modern deep learning, when data is plentiful, there has been a shift away from feature selection, and we are now more likely to give all the features we have to the algorithm and let the algorithm sort out which ones to use based on the data
- Rules of thumb
  - If your features are mostly categorical, you should start by trying to implement a SelectKBest with a Chi2 ranker or a tree-based model selector
  - If your features are largely quantitative, using linear models as model-based selectors and relying on correlations tends to yield greater results
  - If you are solving a binary classification problem, using a Support Vector Classification model along with a SelectFromModel selector will probably fit nicely, as the SVC tries to find coefficients to optimize for binary classification tasks
  - A little bit of EDA can go a long way in manual feature selection. The importance of knowledge of the domain from which the data originated cannot be overstated
- Filter methods
  - correlation coefficient
  - ANOVA test
  - chi-square test
  - variance threshold
- Wrapper methods
  - recursive feature elimination
  - sequential feature selection algorithms
  - genetic algorithms
- Embedded methods
  - Decision tree
  - L1 regularizer
    - Linear model
  - Embedding layer
    - How to choose the number of neurons of an embedding layer?
    - multi-sense embeddings
  - Weight
    - High weight means high importance
- Spark
  - ChiSqSelector
- Python 3
  - PymRMR: https://github.com/fbrundu/pymrmr
- Example
  - https://github.com/mungeol/Feature-Engineering-Made-Easy/blob/master/Chapter05/Ch_5.ipynb
Feature Extraction
- Feature transformation
  - TSNE
    - from sklearn.manifold import TSNE
  - PCA
    - https://github.com/mungeol/feature-engineering-book/blob/master/06.01_PCA_on_MNIST_digits.ipynb
    - De-correlating features
    - Try both scaled and un-scaled data
      - StandardScaler
    - It is best not to apply PCA to the data that has large outliers.
    - from sklearn.decomposition import PCA
  - SVD
    - Singular value decomposition module will return the same components as PCA if our data is scaled, but different components when using the raw unscaled data
    - from sklearn.decomposition import TruncatedSVD
  - LDA
    - Linear Discriminant Analysis (LDA) is a feature transformation technique as well as a supervised classifier. It is commonly used as a preprocessing step for classification pipelines. The goal of LDA, like PCA, is to extract a new coordinate system and project datasets onto a lower-dimensional space. The main difference between LDA and PCA is that instead of focusing on the variance of the data as a whole like PCA, LDA optimizes the lower-dimensional space for the best class separability. This means that the new coordinate system is more useful in finding decision boundaries for classification models, which is perfect when building classification pipelines. The reason that LDA is extremely useful is that separating based on class separability helps us avoid overfitting in our machine learning pipelines. This is also known as preventing the curse of dimensionality. LDA also reduces computational costs.
  - LSA
    - Latent semantic analysis (LSA) is a feature extraction tool that is helpful for text. It is a series of these three steps
      - A TF-IDF vectorization
      - A PCA (SVD, in this case, to account for the sparsity of text)
      - Row normalization
  - Nonlinear Featurization via K-Means Model Stacking
    - https://github.com/mungeol/feature-engineering-book/blob/master/07.03-05_K-means_featurization.ipynb
    - With cluster features, the linear classifier performs just as well as nonlinear classifiers
    - K-means featurization is useful for real-valued, bounded numeric features that form clumps of dense regions in space
    - k-means cannot handle feature spaces where the Euclidean distance does not make sense—i.e., weirdly distributed numeric variables or categorical variables. If the feature set contains those variables, then there are several ways to handle them:
      - Apply k-means featurization only on the real-valued, bounded numeric features
      - Define a custom metric to handle multiple data types and use the k-medoids algorithms. (k-medoids is analogous to k-means but allows for arbitrary distance metrics.)
      - Convert categorical variables to binning statistics (see “Bin Counting” on page 87), then featurize them using k-means
  - Example
    - https://github.com/mungeol/Feature-Engineering-Made-Easy/blob/master/Chapter06/Ch_6.ipynb
      - PCA, LDA
    - https://github.com/mungeol/Feature-Engineering-Made-Easy/blob/master/Chapter08/Ch_8.ipynb
      - PCA, LDA, TruncatedSVD, LSA
- Feature learning
  - RBM
    - Restricted Boltzmann Machines (RBMs) are a simple deep learning architecture set up to learn a set number of new dimensions based on a probabilistic model that the data follows. They are a family of algorithms, only one of which is implemented in scikit-learn. The BernoulliRBM may be a nonparametric feature learner; however, as the name suggests, some expectations are set as to the values of the cells of the dataset.
  - Word embeddings
    - Likely one of the biggest contributors to the recent deep learning-fueled advancements of natural language processing/understanding/generation is the ability to project strings (words and phrases) into an n-dimensional feature set to grasp the context and minute detail in wording.
    - Approaches
      - Word2Vec, GloVe
        https://radimrehurek.com/gensim/
  - Example
    - https://github.com/mungeol/Feature-Engineering-Made-Easy/blob/master/Chapter07/Ch_7.ipynb
      - RBM, Word2Vec
Imbalanced data / skewed classes
- Reference
  - 2017 Mastering Machine Learning with Python in Six Steps
Outlier
- Plot it
  - Box
- Collect more outlier data
- Keep it
  - Anomaly detection
- Replace it with reasonable minimum or maximum value
- Remove it
Shuffling
- Pandas
  - df = df.reindex(np.random.permutation(df.index))
  - df = df.sample(frac=1)
  - df = df.sample(frac=1).reset_index(drop=True)
  - from sklearn.utils import shuffle
    - df = shuffle(df)
Image augmentation
- E.g.
  - https://github.com/mungeol/training-data-analyst/blob/master/courses/machine_learning/deepdive/08_image/flowersmodel/model.py
Training, validation/dev, Test set
- Your dev and test sets should come from the same distribution
- Choose dev and test sets from a distribution that reflects what data you expect to get in the future and want to do well on. This may not be the same as your training data’s distribution
- When you should train and test on different distributions
  - 2018 Machine learning yearning
    - P71
- How to decide whether to use all your data (which have different distributions)
  - 2018 Machine learning yearning
    - P73
- How to decide whether to include inconsistent data
  - 2018 Machine learning yearning
    - P75
- How large do the dev/test sets need to be?
  - The old heuristic of a 70%/30% train/test split does not apply for problems where you have a lot of data; the dev and test sets can be much less than 30% of the data
  - The dev set should be large enough to detect differences between algorithms that you are trying out. For example, if classifier A has an accuracy of 90.0% and classifier B has an accuracy of 90.1%, then a dev set of 100 examples would not be able to detect this 0.1% difference. Compared to other machine learning problems I’ve seen, a 100 example dev set is small. Dev sets with sizes from 1,000 to 10,000 examples are common. With 10,000 examples, you will have a good chance of detecting an improvement of 0.1%
  - There is no need to have excessively large dev/test sets beyond what is needed to evaluate the performance of your algorithms
- Eyeball and BlackBox dev set
  - 2018 Machine learning yearning
    - P36, P38
- Training dev set
  - 2018 Machine learning yearning
    - Generalizing from the training set to the dev set
      - P77
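The dev set size argument above can be made concrete with a little arithmetic: one example changing from wrong to right moves accuracy by 1/n, so that is the smallest difference a dev set of n examples can register. The function name below is hypothetical, and the sketch ignores sampling noise, which makes small differences even harder to trust in practice.

```python
def accuracy_resolution(num_examples):
    """Smallest accuracy change a dev set can register (one example flipping)."""
    return 1.0 / num_examples

for n in (100, 1000, 10000):
    print(f"{n:>6} examples -> resolution {accuracy_resolution(n):.2%}")
```

A 100 example dev set moves in steps of 1%, so a 0.1% difference between classifiers A and B cannot show up at all.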

AI - EDA

Visualization

df.hist
plt.scatter
sns.heatmap

Univariate

Categorical

pd.crosstab
sns.countplot

Continuous

df.describe
boxplot
displot
kdeplot

Bivariate

Category to category

sns.factorplot

Category to continuous

sns.jointplot

Continuous to category

sns.factorplot().map(sns.kde/dist/box)

Other

sns.regplot

Correlation
- Pearson
  - Pandas
    - Dataframe.corr
  - Spark
    - Dataframe.stat.corr
- Spearman
- Kendall
Tool
- https://pair-code.github.io/facets/

AI - Algorithm

Classification
- The recommended approach
  - Use AUC to select the model when you do not know which threshold will be used
  - Then use FN and FP to decide the threshold
- SVM
  - Pros
    - Accurate in high-dimensional spaces
    - Memory efficient
  - Cons
    - Prone to overfitting
    - No probability estimation
    - For small datasets
  - Applications
    - Image recognition
    - Text category assignment
    - Detecting spam
    - Sentiment analysis
    - Gene expression classification
    - Regression, outlier detection and clustering
CNN
- Kernel size?
RNN

Cell_size = N_inputs // (size of the internal state in each of the cell)

Lstm = 4 internal states
Gru = 3

Use custom loss function
- E.g. use several outputs to calculate the loss
Dropout is available

Recommendation
- Collaborative Filtering
  - User-based
  - Item-based
  - Challenges
    - Data sparsity
    - Cold start
    - Scalability
  - WALS: Weighted Alternating Least Squares
- Context-aware
  - Contextual pre-filtering, contextual post-filtering, and contextual modeling
- Hybrid

AI - Tuning

Optimal probability cutoff point
- 2017 Mastering Machine Learning with Python in Six Steps
Bias and Variance
- High variance
- High bias
- Bias = Optimal error rate (“unavoidable bias”) + Avoidable bias = training error
- Optimal error rate / unavoidable bias
  - Use human-level performance to estimate the optimal error rate and also set achievable “desired error rate.”
- Avoidable bias
  - More complex model
    - DL: increase the model size, such as the number of neurons/layers
  - More features
  - More polynomial features
  - Reduce or eliminate regularization
- Variance = dev error - training error
  - More training data
    - Collect more
    - Data augmentation
  - Regularization
    - Works well when we have a lot of features, each of which contributes a bit to predicting y
  - Early stopping
  - Fewer features
    - Model selection or feature selection
    - Dimension reduction
  - Noise robustness
  - Sparse representation
  - Simpler model
    - Try others first
    - DL: decrease the model size, such as the number of neurons/layers
  - If you find that your dev set performance is much better than your test set performance, it is a sign that you have overfitted to the dev set.
    - In this case, get a fresh dev set/get more dev set data
- Both
  - Choosing the right model parameters
    - Regularization
      - Try decreasing lambda (fixes high bias)
      - Try increasing lambda (fixes high variance)
  - Modify input features based on insights from error analysis
  - Modify model architecture
    - Such as neural network architecture, so that it is more suitable for your problem
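The bias and variance definitions above amount to two subtractions. A minimal sketch, with a hypothetical function name and made-up error rates:

```python
def diagnose(optimal_error, training_error, dev_error):
    """Split the errors into unavoidable bias, avoidable bias, and variance."""
    return {
        "unavoidable_bias": optimal_error,
        "avoidable_bias": training_error - optimal_error,
        "variance": dev_error - training_error,
    }

# Assumed numbers: 2% human-level error, 10% training error, 12% dev error.
report = diagnose(optimal_error=0.02, training_error=0.10, dev_error=0.12)
for name, value in report.items():
    print(f"{name}: {value:.2%}")
```

In this made-up case the avoidable bias (8%) dominates the variance (2%), so the bias-reduction items in the list above would be tried first.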
Data Mismatch
- Try to understand what properties of the data differ between the training and the dev set distributions.
- Try to find more training data that better matches the dev set examples that your algorithm has trouble with.
Regularization
- L1 or lasso
- L2 or ridge
  - When to use l1 and l2
- Elastic Net
  - The elastic net is just a linear combination of the L1 and L2 regularizing penalties. This way, you get the benefits of sparsity for really poor predictive features while also keeping decent and great features with smaller weights to provide a good generalization. The only trade-off now is there are two instead of one hyperparameters to tune with the two different Lambda regularization parameters.
- Dropout
  - When to use dropout
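The elastic net penalty described above can be written out directly. This is a sketch of the penalty term only, not of a full training loop; the function name, weights, and lambda values are made up for illustration.

```python
def elastic_net_penalty(weights, lambda_l1, lambda_l2):
    """Linear combination of the L1 (lasso) and L2 (ridge) penalties."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return lambda_l1 * l1 + lambda_l2 * l2

weights = [0.5, -2.0, 0.0]
print(elastic_net_penalty(weights, lambda_l1=0.1, lambda_l2=0.01))
# Setting one lambda to zero recovers pure lasso or pure ridge.
print(elastic_net_penalty(weights, lambda_l1=0.1, lambda_l2=0.0))
print(elastic_net_penalty(weights, lambda_l1=0.0, lambda_l2=0.01))
```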
Hyperparameter
- Tune Hyperparameters When Comparing Models
Neurons stop learning
- Lower the learning rate
  - Increase the number of epoch or steps
- Use another activation function, such as leaky ReLU
- Use dropout
- Batch normalization
  - weight normalization, layer normalization, self normalizing networks
  - Redesign the network
    - Identity shortcut
Visualization
- Tensorboard
  - https://tensorboard.dev/
- TFDV
- TFMA
Error analysis
- 2018 Machine learning yearning
  - P30, P32, P52
The Optimization Verification test
- 2018 Machine learning yearning
  - P85

AI - Evaluation

Plot
- The distribution between y and yhat
  - Closer is better
  - ax1 = sns.distplot(Y, hist=False, color="r", label="Actual Value")
  - sns.distplot(Yhat, hist=False, color="b", label="Fitted Values" , ax=ax1)
Establish a single-number evaluation metric for your team to optimize
- Choose a single-number evaluation metric for your team to optimize. If there are multiple goals that you care about, consider combining them into a single formula (such as averaging multiple error metrics) or defining satisficing and optimizing metrics
- E.g. use F1, F2, or AUC instead of precision and recall
MAE

A good metric for measuring the accuracy of predictions for time series
It does not heavily punish larger errors as square errors do

MAPE
MSE
RMSE

More sensitive to outliers than MAE

RMSLE

A good metric to avoid penalizing differences for large prediction values more heavily than for small prediction values
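A dependency-free sketch of the regression metrics above, useful for seeing how each reacts to one large error. The function names follow the metric names; the sample values are made up.

```python
from math import log1p, sqrt

def mae(y, yhat):
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def mape(y, yhat):
    # Undefined when an actual value is 0.
    return sum(abs(a - b) / abs(a) for a, b in zip(y, yhat)) / len(y)

def mse(y, yhat):
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def rmse(y, yhat):
    return sqrt(mse(y, yhat))

def rmsle(y, yhat):
    # Compares log(1 + value), so the error is relative rather than absolute.
    return sqrt(sum((log1p(b) - log1p(a)) ** 2 for a, b in zip(y, yhat)) / len(y))

y = [100, 200, 300, 400]
yhat = [110, 190, 310, 800]  # the last prediction is far off
print(mae(y, yhat), mse(y, yhat), rmse(y, yhat))
print(mape(y, yhat), rmsle(y, yhat))
```

The single large error raises RMSE well above MAE, which matches the note that RMSE is more sensitive to outliers.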

R^2
- Close to 1 is better
- Negative value = the model fits worse than always predicting the mean (on held-out data this often signals overfitting)
- from sklearn.metrics import r2_score
- Or, lm.fit then lm.score
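To see where the R^2 values come from, here is a hand-written version of the usual definition, 1 - SS_res / SS_tot. The function name `r2_score_manual` and the sample data are made up for illustration.

```python
def r2_score_manual(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - y_mean) ** 2 for a in y)
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
print(r2_score_manual(y, y))                     # perfect fit
print(r2_score_manual(y, [2.5] * 4))             # always predicting the mean
print(r2_score_manual(y, [4.0, 3.0, 2.0, 1.0]))  # worse than the mean
```

The three cases give 1, 0, and a negative score respectively, so a negative R^2 simply means the predictions are worse than a constant mean prediction.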
Confusion matrix
- https://en.wikipedia.org/wiki/Confusion_matrix
- Python function
  - http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
  - http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
- Precision-recall curve
F1 measure
Log Loss
- Performance of a classifier where the predicted output is a probability value between 0 and 1
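A minimal sketch of binary log loss, assuming labels of 0 or 1 and predicted probabilities for class 1. The function name and the clipping constant `eps` are choices made for this illustration.

```python
from math import log

def log_loss_binary(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip so that log(0) never occurs for predictions of exactly 0 or 1.
        p = min(max(p, eps), 1 - eps)
        total += -(y * log(p) + (1 - y) * log(1 - p))
    return total / len(y_true)

print(log_loss_binary([1, 0], [0.9, 0.1]))  # confident and right: small loss
print(log_loss_binary([1, 0], [0.5, 0.5]))  # undecided
print(log_loss_binary([1, 0], [0.1, 0.9]))  # confident and wrong: large loss
```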
AUC: area under the curve
Time-dependent ROC curve
- For game update
IBS: Integrated Brier score
- For game update
Dunn index

A metric for evaluating clustering algorithms

Cross-validation
- https://en.wikipedia.org/wiki/Cross-validation_(statistics)
- Python function
  - http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.cross_val_score.html
- Spark
  - CrossValidator
- K-fold
  - For a small data set
  - E.g., m = about 1000
  - Normally k=10
  - https://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation
  - python function
    - http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html
- Leave-one-out
  - For a very small data set
  - E.g., m < 100, 20 examples, k=20
  - https://en.wikipedia.org/wiki/Cross-validation_(statistics)#Leave-one-out_cross-validation
  - python function
    - http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LeaveOneOut.html
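How K-fold and leave-one-out split the data can be sketched in a few lines. The generator name `k_fold_indices` is hypothetical, and the sketch assumes the examples were already shuffled, since it cuts the index range into contiguous folds.

```python
def k_fold_indices(num_examples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(num_examples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [num_examples // k + (1 if i < num_examples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

for train, validation in k_fold_indices(10, 3):
    print(train, validation)

# Leave-one-out is the special case k == num_examples.
loo = list(k_fold_indices(5, 5))
```

Every example lands in exactly one validation fold, which is why the averaged score uses all of a small data set.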
Learning curves
- Tell whether adding more training data would help
- Diagnosing bias and variance
- 2018 Machine learning yearning
  - P55

AI

The recommended approach
- ML
  - Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
  - Plot learning curves to decide if more data, more features, etc.
  - Error analysis
    - Manually examine the examples (in cross-validation set) that your algorithm made errors on. See if you spot any systematic trend in what type of examples it is making errors on.
- DL
  - Try to have the same number of hidden units in every layer
    - Usually the more the better, but computationally more expensive
  - ReLU variants
    - Softplus, leaky relu, prelu, relu6, elu
- Common
  - Learning rate
  - Learning rate automation
    - https://www.tensorflow.org/api_docs/python/tf/keras/optimizers
      - Some optimizers support it
    - https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules
  - Continue to verify and monitor your data since it may change for many reasons in reality
  - Stop learning in particular circumstances
    - E.g. service failure -> only a few users can access the service -> incomplete/incorrect data
- Transfer learning
  - Where to cut?
  - Should the source model's weights be trainable (allowed to change during subsequent training) or kept constant?
  - Whether to make the pre-trained embeddings trainable or not.
  - Pretrained embedding
    - Tensorflow hub
      - E.g.
        https://github.com/mungeol/training-data-analyst/blob/master/courses/machine_learning/deepdive/09_sequence/reusable-embeddings.ipynb
  - Pretrained model
    - Tensorflow hub
      - https://www.tensorflow.org/hub
    - Keras applications
      - https://keras.io/applications/
Rules of Machine Learning: Best Practices for ML Engineering

See also
- Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
- Choosing the right estimator

Tip 4 Big Data

String 2 BeamRecord (beam)

Option 1

.apply(ParDo.of(new DoFn<String, BeamRecord>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        //System.out.println(c.element());
        c.output(new BeamRecord(type, c.element()));
    }
}))

Option 2

 .apply(MapElements.via(new SimpleFunction<String, BeamRecord>() {
    public BeamRecord apply(String input) {
        //System.out.println(input);
        return new BeamRecord(type, input);
    }
}))
  
/* which can be expressed with a lambda as below
.apply(MapElements.into(TypeDescriptor.of(BeamRecord.class))
        .via((String input) -> new BeamRecord(type, input)))
*/

Using snappy (hive)

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

The place storing table statistics (hive)
- MySQL
- select * from TABLE_PARAMS
- select * from PARTITION_PARAMS

Options for specifying a schema (spark)

// 1
val schema = new StructType()
  .add("i_logid", IntegerType, false)
  .add("i_logdetailid", IntegerType, false)
  .add("i_logdes", new StructType().add("gamecode", StringType, true), false)
 
// 2
val schema = StructType(
  StructField("i_logid", IntegerType, false) ::
    StructField("i_logdetailid", IntegerType, false) ::
    StructField("i_logdes", new StructType().add("gamecode", StringType, true), false) ::
    Nil
)
 
// 3
case class Des(gamecode: String)
case class Log(i_logid: Int, i_logdetailid: Int, i_logdes: Des)
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Log].schema
 
// 4
spark.sql("select get_json_object(lower(cast(value as string)), '$.i_regdatetime') as i_regdatetime from rawData")
 
// 5
val schema = spark.read.table("netmarbles.log_20170813").schema

Estimates the sizes of java objects (spark)
- https://spark.apache.org/docs/2.1.0/api/scala/#org.apache.spark.util.SizeEstimator$
- E.g.
  - import org.apache.spark.util.SizeEstimator
  - SizeEstimator.estimate(myRdd)
  - SizeEstimator.estimate(myDf)
  - SizeEstimator.estimate(myDs)
Using the desc option in the orderBy API (spark)
- orderBy($"count".desc)
- orderBy('count.desc)
- orderBy(-'count)
RDB 2 local using sqoop (sqoop)
- Use -jt option
  - E.g. sqoop import -jt local --target-dir file:///home/hdfs/temp
- Use -fs and -jt options
  - E.g. sqoop import -fs local -jt local
- File file:/hdp/apps/2.6.0.3-8/mapreduce/mapreduce.tar.gz does not exist
  - mkdir -p /hdp/apps/2.6.0.3-8/mapreduce
  - chown -R hdfs:hadoop /hdp
  - cd /hdp/apps/2.6.0.3-8/mapreduce
  - hdfs dfs -get /hdp/apps/2.6.0.3-8/mapreduce/mapreduce.tar.gz
Read files in s3a from spark (spark)
- spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key","XXX")
- spark.sparkContext.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled","false")
- spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint","host:port")
- spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key","XXX")
- spark.read.text("s3a://path/to/the/file")
Setting the logging level of the ambari-agent.log (ambari)
- cd /etc/ambari-agent/conf
- cp logging.conf.sample logging.conf
- vim logging.conf
  - [logger_root]
    level=WARNING
Setting the logging level of the hiveserver2.log (hive)
- Ambari web UI -> hive -> config -> advanced hive-log4j -> hive.root.logger=INFO,DRFA
Push JSON Records (spark)
- val df = temp.toDF("createdAt", "users", "tweet")
- val json_rdd = df.toJSON.rdd
- json_rdd.foreachPartition(partition => { /* Send records to Kinesis / Kafka */ })
How to specify hive tez job name showing at resource manager UI (tez)
- You cannot. At least, not the full name, because it is hard coded.
  - https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java
    - final TezClient session = TezClient.newBuilder("HIVE-" + sessionId, tezConfig)
- However you can set the session ID using hive.session.id
  - hive --hiveconf hive.session.id=session_id_name
    - HIVE-session_id_name