Friday, December 28, 2018
LCS
pystrgrp: https://drive.google.com/open?id=1Ig_ATnmLUJIuHbFPRGlvZdM3Xd5Yp32U
Example
from pystrgrp import Strgrp

def pystrgrp(strings):
    # 0.7 is the similarity threshold passed to Strgrp.
    clusters = Strgrp(0.7)
    for string in (x.strip() for x in strings):
        seq, id_ = string.split(',')
        clusters.add(seq, id_)
    return clusters

data = sorted(['12345,1', '1234567,2', '1234568,3', '2345678,4',
               '2345679,5', '345678,6', '1234578,7', '3456789,8',
               'abcdefg,9', 'bcdefg,10'], reverse=False)
grps = pystrgrp(data)
grps
grps_list = [g for g in grps]
grps_list

import pandas as pd

# Flatten the clusters into one row per (cluster, seq, id).
rows = []
for i, cluster in enumerate(grps_list):
    for item in cluster:
        print(i, item.key(), item.value())
        rows.append((i, item.key(), item.value()))
df = pd.DataFrame(rows, columns=['cluster', 'seq', 'id'])
df
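For intuition about what an LCS-based score measures, here is a minimal standard-library sketch. It is an assumption-laden illustration, not pystrgrp's actual scoring: `lcs_length` is the textbook dynamic-programming longest common subsequence, and `lcs_similarity` is a hypothetical normalization (2 * LCS / total length) chosen only for this example.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence that ends before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of the two neighbours.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Hypothetical normalized score in [0, 1]: 2 * LCS / (len(a) + len(b))."""
    if not a and not b:
        return 1.0
    return 2.0 * lcs_length(a, b) / (len(a) + len(b))


print(lcs_length("1234567", "1234568"))
print(lcs_similarity("abcdefg", "bcdefg"))
```

A threshold such as 0.7 would then decide whether two strings are similar enough to share a group, under this assumed score.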
Clustering
- Partition-based clustering
- k-means, k-median, fuzzy c-means
- Hierarchical clustering
- Produces trees of clusters
- Agglomerative, divisive
- Advantages
- It does not require the number of clusters to be specified.
- Produces a dendrogram which helps with understanding the data.
- Disadvantages
- It can never undo any previous steps throughout the algorithm.
- Sometimes difficult to identify the number of clusters by the dendrogram.
- Density-based clustering
- Produces arbitrary shaped clusters
- Locates regions of high density, and separates outliers
- DBSCAN
- Does not require specification of the number of clusters
- Time-series clustering by features
- Time-series clustering by features.
- Raw data.
- Autocorrelation.
- Spectral density.
- Extreme value behavior.
- Model-based time series clustering.
- Forecast based clustering.
- Model with a cluster structure.
- Time-series clustering by dependence.
- Clustering high dimensional data
- Many clustering algorithms deal with 1-3 dimensions
- These methods may not work well when the number of dimensions grows to 20
- Methods for clustering high dimensional data
- Methods can be grouped into two categories
- Subspace clustering
- CLIQUE, ProClus, and bi-clustering approaches
- Dimensionality reduction approaches
- Spectral clustering and various dimensionality reduction methods
- Subspace clustering
- Clustering should not only consider dimensions but also attributes/features
- Feature selection
- Feature transformation
- Principal component analysis, singular value decomposition
Wednesday, August 22, 2018
Help 4 GCP
Access Denied: Table X:Y.Z: The user 123-compute@developer.gserviceaccount.com does not have permission to query table X:Y.Z (bigQuery)
- Authenticate with a service account key that has permission on the table, instead of relying on the default Compute Engine service account.
- Python
- from google.cloud import bigquery
- from google.oauth2 import service_account
- credentials = service_account.Credentials.from_service_account_file('path/to/file.json')
- project_id = 'my-bq'
- client = bigquery.Client(credentials=credentials, project=project_id)
Cannot connect to the instance using SSH since the disk is full (GCE)
- Check if your operating system supports automatic resizing: If so, using Cloud Console you can edit VM's root disk and increase its size. Your virtual machine instance can automatically resize the partition to recognize the additional space after you restart the instance.
- Use the Interactive Serial Console to log in to the VM, then clean up the disk or copy the files to other storage if you will need them later.
- If you know what data you want to delete, you can configure a startup script to remove the files and reboot your VM to run the script (e.g. rm /tmp/*).
- You can detach the persistent disk and attach this disk to another machine as an additional disk. On the temporary machine, you can mount it and clean up your data or copy them to another storage, if you would need them later. Finally, recreate the original instance with the same boot disk. You can follow the same steps described in this video to add your disk to another Linux VM but add your existing boot disk instead of creating a new disk.
- Check if your operating system supports automatic resizing: If yes, then create a snapshot of your persistent disk, create a new persistent disk with larger size from the snapshot. Finally, recreate the original instance with this larger boot disk.
No scalar data was found (tensorboard)
- Use gcloud command to train the model.
prediction_lib.PredictionError: Failed to load model: Cloud ML only supports TF 1.0 or above and models saved in SavedModel format. (Error code: 0) (ml engine)
- Check the model path which is the value of "--model-dir" flag.
- Note:
- Do not use the model location in the log info.
- E.g.
- INFO:tensorflow:SavedModel written to: b"output/export/census/temp-b'1531882849'/saved_model.pb"
- However, you should use "output/export/census/1531882849"
"error": "Prediction failed: unknown error." (ml engine)
- This is because the model doesn't support the specified instance format.
- E.g.
- The model supports JSON instance for prediction.
- However, a CSV instance has been specified for prediction.
- If the error still happens, then try to specify the version of the model which supports instances for prediction.
ERROR: (gcloud.ml-engine.jobs.submit.training) Could not copy [/tmp/.../output/trainer-0.0.0.tar.gz] to [.../trainer-0.0.0.tar.gz]. Please retry: HTTPError 404: Not Found (ml engine)
- Check the bucket name.
The schema of pandas dataframe created from read_gbq is different from bigQuery table (bigQuery)
- Use from google.cloud import bigquery instead.
- E.g. client.query('SELECT * FROM `projectId.dataset.table` limit 1').result().schema
java.net.UnknownHostException: metadata (general)
- Set one of the configurations below.
- google.cloud.auth.service.account.json.keyfile
- fs.gs.auth.service.account.json.keyfile
- java.io.IOException: Error accessing: bucket: null (hadoop)
- Set "mapred.bq.gcs.bucket" configuration.
- java.lang.NullPointerException: Required parameter projectId must be specified (hadoop)
- Set "mapred.bq.project.id" configuration.
- org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: Failed to create load job with id prefix ${ID prefix}, reached max retries: 3, last failed load job (bigQuery)
- Make sure to use the right data type for each column when creating the TableRow.
- Error detected while parsing row starting at position: 556531513. Error: Bad character (ASCII 0) encountered (bigQuery)
- Find the character causing the problem.
- less +556531513P test.csv
- There will be a character like Ctrl-@ which is ^@.
- Remove it before producing the CSV file, or strip it from the existing file.
- gsutil cp gs://bucket/test.csv - | tr -d '\000' | gsutil cp - gs://bucket/test2.csv
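If the CSV is available locally, the same cleanup can be sketched in Python. This is a minimal illustration under assumptions: `strip_nul_bytes` and its parameters are hypothetical names, and the file is processed as raw bytes in chunks so that large files do not have to fit in memory.

```python
def strip_nul_bytes(src_path: str, dst_path: str, chunk_size: int = 1 << 20) -> int:
    """Copy src_path to dst_path without NUL (ASCII 0) bytes.

    Returns the number of NUL bytes that were dropped.
    """
    removed = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            # Count first, then write the chunk with every NUL byte removed.
            removed += chunk.count(b"\x00")
            dst.write(chunk.replace(b"\x00", b""))
    return removed
```

A non-zero return value confirms that the file really contained the offending character before it is loaded again.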
AI - Preprocessing
- Missing data
- Remove data point
- Remove feature
- Most frequent value (mode)
- Mean value
- Median value
- Predict
- Example
- Text data
- Replace data
- Specific data to specific characters/symbols
- E.g. 123 -> $NUMBER
- E.g. example@gmail.com -> $EMAIL
- E.g. 010-1234-5678 -> $PHONE
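The placeholder replacement above can be sketched with regular expressions. The patterns below are simplified assumptions for illustration (real email and phone formats vary widely), and `replace_tokens` is a hypothetical helper name.

```python
import re

# Simplified, illustrative patterns; adapt them to the formats in your data.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{2,3}-\d{3,4}-\d{4}\b")
NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")


def replace_tokens(text: str) -> str:
    """Replace emails, phone numbers, and plain numbers with placeholders."""
    # Order matters: emails and phone numbers contain digits, so they are
    # replaced before the generic number pattern runs.
    text = EMAIL.sub("$EMAIL", text)
    text = PHONE.sub("$PHONE", text)
    text = NUMBER.sub("$NUMBER", text)
    return text


print(replace_tokens("Call 010-1234-5678 or mail example@gmail.com about order 123"))
```

Running the most specific patterns first keeps the generic number rule from breaking a phone number into several placeholders.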
- Stopwords
- Frequency-based filtering
- Frequent/rare words
- Stemming
- Could hurt more than it helps
- News and new are different
- Lowercase
- Capitalization matters sometimes
- Bag-of-Words / Bag-of-n-Grams
- TF-IDF
- Chunking and part-of-speech tagging
- Examples
- Lib
- Universal Sentence Encoder
- Feature
- Create new features
- Bin/bucket
When to bucketize a numerical column?
When the numbers are not meaningful as magnitudes
When you want to respect a nonlinear relationship with your numeric values
When you try both wide and deep features
Tensorflow
def get_quantile_based_boundaries(feature_values, num_buckets):
    boundaries = np.arange(1.0, num_buckets) / num_buckets
    quantiles = feature_values.quantile(boundaries)
    return [quantiles[q] for q in quantiles.keys()]

# Divide longitude into 10 buckets.
bucketized_longitude = tf.feature_column.bucketized_column(
    longitude, boundaries=get_quantile_based_boundaries(
        training_examples["longitude"], 10))

# Divide latitude into 10 buckets.
bucketized_latitude = tf.feature_column.bucketized_column(
    latitude, boundaries=get_quantile_based_boundaries(
        training_examples["latitude"], 10))
- Python
- df['price-binned'] = pd.cut(df['a'], np.linspace(min(df.a), max(df.a), 4) , labels=['l', 'm', 'h'], include_lowest=True)
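To see what quantile-based bucketing does without TensorFlow or pandas, here is a small standard-library sketch. The helper names `quantile_boundaries` and `bucketize` are hypothetical, and `statistics.quantiles` may interpolate cut points slightly differently from pandas.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_boundaries(values, num_buckets):
    """Return num_buckets - 1 cut points that split values into equal-count buckets."""
    return quantiles(values, n=num_buckets)


def bucketize(value, boundaries):
    """Index of the bucket that value falls into (0 .. len(boundaries))."""
    # A value equal to a boundary goes to the upper bucket.
    return bisect_right(boundaries, value)


values = [1, 2, 3, 4, 5, 6, 7, 8]
boundaries = quantile_boundaries(values, 4)
buckets = [bucketize(v, boundaries) for v in values]
print(boundaries)
print(buckets)
```

Unlike the equal-width bins of `np.linspace`, quantile boundaries put roughly the same number of examples in each bucket, which helps with skewed features.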
- Interaction
- Linear model
- X2 = sklearn.preprocessing.PolynomialFeatures(include_bias=False).fit_transform(X)
- Example
- Crosses
Tensorflow
tf.feature_column.crossed_column( set([bucketized_longitude, bucketized_latitude]), hash_bucket_size=1000)
- One hot encoding, dummy coding, effect coding, label encoding
- Encoding at the
- nominal level
- ordinal level
- Tensorflow
- tf.keras.utils.to_categorical
- Python
- pd.get_dummies(df['a'])
- Example
- Transformation
- A log transform is a powerful tool for dealing with positive numbers with a heavy-tailed distribution
- np.log10(biz_df['review_count'])
- scipy.stats.boxcox(biz_df['review_count'], lmbda=0)
- The Box-Cox formulation only works when the data is positive
- For nonpositive data, one could shift the values by adding a fixed constant
- stats.boxcox(biz_df['review_count'])
- Finds the optimal transform parameter
- Scaling / normalization
- Feature scaling is useful in situations where a set of input features differs wildly in scale.
- Use caution when performing min-max scaling and standardization on sparse features
- Min-Max
- Squeezes (or stretches) all feature values to be within the range of [0, 1]
- E.g. x1 = (x1 - min(x1)) / (max(x1) - min(x1))
- sklearn.preprocessing.minmax_scale(df[['n_tokens_content']])
- This can hurt some models as it takes away weight from outliers
- (z-score) standardization / variance scaling / mean normalization
- Scaled feature has a mean of 0 and a variance of 1
- E.g. x1 = (x1 - avg(x1)) / standard deviation of the x1
- sklearn.preprocessing.StandardScaler().fit_transform(df[['n_tokens_content']])
- Helps algorithms that use Euclidean distance, such as KNN
- L2 / Euclidean
- the feature column has norm 1
- sklearn.preprocessing.normalize(df[['n_tokens_content']], axis=0)
- This comes in handy, especially when working with text data or clustering algorithms
- Robust
- RobustScaler is less sensitive to outliers
from sklearn.preprocessing import RobustScaler
- Spark
StandardScaler
Pandas
import math

def linear_scale(series):
    # Scale linearly to the range [-1, 1].
    min_val = series.min()
    max_val = series.max()
    scale = (max_val - min_val) / 2.0
    return series.apply(lambda x: ((x - min_val) / scale) - 1.0)

def log_normalize(series):
    return series.apply(lambda x: math.log(x + 1.0))

def clip(series, clip_to_min, clip_to_max):
    return series.apply(lambda x: min(max(x, clip_to_min), clip_to_max))

def z_score_normalize(series):
    mean = series.mean()
    std_dv = series.std()
    return series.apply(lambda x: (x - mean) / std_dv)

def binary_threshold(series, threshold):
    return series.apply(lambda x: 1 if x > threshold else 0)
- Try alternate normalizations for various features to further improve performance.
- Pandas
- normalized_training_examples.hist(bins=20, figsize=(18, 12), xlabelsize=10)
- Note: you can't possibly do a logarithm transformation after standardization because about half of the standardized values will be 0 or negative, hence have no logarithm
- Example
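The min-max and z-score formulas in this section can be checked by hand with a small standard-library sketch. The function names are hypothetical. The population standard deviation (`pstdev`) is used here, which is an assumption that matters: it matches scikit-learn's StandardScaler, whereas pandas' `Series.std()` defaults to the sample standard deviation.

```python
from statistics import fmean, pstdev


def min_max_scale(values):
    """Map values linearly onto [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Z-score: (x - mean) / std, giving mean 0 and variance 1."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


data = [1.0, 2.0, 3.0, 4.0, 100.0]
print(min_max_scale(data))
print(standardize(data))
```

With the outlier 100.0 in the data, min-max scaling squeezes the other four values close to 0, which is one reason to inspect histograms before choosing a scaler.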
- Clipping
- roomsPerPerson = min(totalRooms / population, 4)
- Clipping the feature value at 4.0 doesn't mean that we ignore all values greater than 4.0. Rather, it means that all values that were greater than 4.0 now become 4.0. This explains the funny hill at 4.0. Despite that hill, the scaled feature set is now more useful than the original data.
- Hashing
- Feature Selection
- Simple approach
- Train the model with one candidate feature at a time, keep the feature that performs best, and repeat the process to add further features.
- In modern deep learning, when data is plentiful, there has been a shift away from feature selection, and we are now more likely to give all the features we have to the algorithm and let the algorithm sort out which ones to use based on the data
- Rules of thumb
- If your features are mostly categorical, you should start by trying to implement a SelectKBest with a Chi2 ranker or a tree-based model selector
- If your features are largely quantitative, using linear models as model-based selectors and relying on correlations tends to yield greater results
- If you are solving a binary classification problem, using a Support Vector Classification model along with a SelectFromModel selector will probably fit nicely, as the SVC tries to find coefficients to optimize for binary classification tasks
- A little bit of EDA can go a long way in manual feature selection. The importance of having domain knowledge in the domain from which the data originated cannot be overstated
- Filter methods
- correlation coefficient
- ANOVA test
- chi-square test
- variance threshold
- Wrapper methods
- recursive feature elimination
- sequential feature selection algorithms
- genetic algorithms
- Embedded methods
- Decision tree
- L1 regularizer
- Linear model
- Embedding layer
How to choose the number of neurons of an embedding layer?
Try starting from the 4th root of the total number of possible values
Hyperparameter tuning: max = 35
Higher dimensions -> higher chance of overfitting, slower training
Multi-sense embeddings
Do not always work
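The fourth-root starting point can be written as a one-line helper. This is only a sketch of the rule of thumb above; `suggested_embedding_dim` is a hypothetical name, and the result is a starting value for tuning rather than a recommendation.

```python
def suggested_embedding_dim(num_categories: int) -> int:
    """Starting embedding size: the 4th root of the number of possible values."""
    if num_categories < 1:
        raise ValueError("num_categories must be at least 1")
    # Round to the nearest integer and never go below one dimension.
    return max(1, round(num_categories ** 0.25))


for vocab_size in (16, 10_000, 1_000_000):
    print(vocab_size, suggested_embedding_dim(vocab_size))
```

From that starting value, the dimension can be tuned upward while watching for overfitting and slower training.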
- Weight
High weight means high importance
- Spark
- ChiSqSelector
- Python 3
- Example
- Feature Extraction
- Feature transformation
- TSNE
- from sklearn.manifold import TSNE
- PCA
- https://github.com/mungeol/feature-engineering-book/blob/master/06.01_PCA_on_MNIST_digits.ipynb
- De-correlating features
- Try both scaled and un-scaled data
- StandardScaler
- It is best not to apply PCA to the data that has large outliers.
- from sklearn.decomposition import PCA
- SVD
- Singular value decomposition module will return the same components as PCA if our data is scaled, but different components when using the raw unscaled data
- from sklearn.decomposition import TruncatedSVD
- LDA
- Linear Discriminant Analysis (LDA) is a feature transformation technique as well as a supervised classifier. It is commonly used as a preprocessing step for classification pipelines. The goal of LDA, like PCA, is to extract a new coordinate system and project datasets onto a lower-dimensional space. The main difference between LDA and PCA is that instead of focusing on the variance of the data as a whole like PCA, LDA optimizes the lower-dimensional space for the best class separability. This means that the new coordinate system is more useful in finding decision boundaries for classification models, which is perfect when building classification pipelines. The reason that LDA is extremely useful is that separating based on class separability helps us avoid overfitting in our machine learning pipelines. This is also known as preventing the curse of dimensionality. LDA also reduces computational costs.
- LSA
- Latent semantic analysis (LSA) is a feature extraction tool for text. It is a series of these three steps
- A TF-IDF vectorization
- A PCA (SVD, in this case, to account for the sparsity of text)
- Row normalization
- Nonlinear Featurization via K-Means Model Stacking
- https://github.com/mungeol/feature-engineering-book/blob/master/07.03-05_K-means_featurization.ipynb
- With cluster features, the linear classifier performs just as well as nonlinear classifiers
- K-means featurization is useful for real-valued, bounded numeric features that form clumps of dense regions in space
- k-means cannot handle feature spaces where the Euclidean distance does not make sense—i.e., weirdly distributed numeric variables or categorical variables. If the feature set contains those variables, then there are several ways to handle them:
- Apply k-means featurization only on the real-valued, bounded numeric features
- Define a custom metric to handle multiple data types and use the k-medoids algorithms. (k-medoids is analogous to k-means but allows for arbitrary distance metrics.)
- Convert categorical variables to binning statistics (see “Bin Counting” on page 87), then featurize them using k-means
- Example
- Feature learning
- RBM
- Restricted Boltzmann Machines is a simple deep learning architecture that is set up to learn a set number of new dimensions based on a probabilistic model that data follows. These machines are a family of algorithms with only one implemented in scikit-learn. The BernoulliRBM may be a nonparametric feature learner; however, as the name suggests, some expectations are set as to the values of the cells of the dataset.
- Word embeddings
- Likely one of the biggest contributors to the recent deep learning-fueled advancements of natural language processing/understanding/generation is the ability to project strings (words and phrases) into an n-dimensional feature set to grasp the context and minute detail in wording.
- Approaches
- Word2Vec, GloVe
- Example
- Imbalanced data / skewed classes
- Reference
- 2017 Mastering Machine Learning with Python in Six Steps
- Outlier
- Plot it
- Box
- Collect more outlier data
- Keep it
- Anomaly detection
- Replace it with reasonable minimum or maximum value
- Remove it
- Shuffling
- Pandas
df = df.reindex(np.random.permutation(df.index))
- df = df.sample(frac=1)
- df = df.sample(frac=1).reset_index(drop=True)
- from sklearn.utils import shuffle
- df = shuffle(df)
- Image augmentation
- Training, validation/dev, Test set
- Your dev and test sets should come from the same distribution
- Choose dev and test sets from a distribution that reflects what data you expect to get in the future and want to do well on. This may not be the same as your training data’s distribution
- When you should train and test on different distributions
- 2018 Machine learning yearning
- P71
- How to decide whether to use all your data (which have different distributions)
- 2018 Machine learning yearning
- P73
- How to decide whether to include inconsistent data
- 2018 Machine learning yearning
- P75
- How large do the dev/test sets need to be?
- The old heuristic of a 70%/30% train/test split does not apply for problems where you have a lot of data; the dev and test sets can be much less than 30% of the data
- The dev set should be large enough to detect differences between algorithms that you are trying out. For example, if classifier A has an accuracy of 90.0% and classifier B has an accuracy of 90.1%, then a dev set of 100 examples would not be able to detect this 0.1% difference. Compared to other machine learning problems I’ve seen, a 100 example dev set is small. Dev sets with sizes from 1,000 to 10,000 examples are common. With 10,000 examples, you will have a good chance of detecting an improvement of 0.1%
- There is no need to have excessively large dev/test sets beyond what is needed to evaluate the performance of your algorithms
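One way to see why a 100-example dev set cannot show a 0.1% difference: a single example is worth 1/n of accuracy, so n = 100 moves in steps of 1%. The sketch below captures only this resolution argument; the helper names are hypothetical, and it deliberately ignores sampling noise, so passing this check is necessary but not sufficient for detecting a real improvement.

```python
def accuracy_resolution(dev_set_size: int) -> float:
    """Smallest accuracy change a dev set can show: one example out of the set."""
    return 1.0 / dev_set_size


def can_show_gap(dev_set_size: int, accuracy_gap: float) -> bool:
    """True if the gap is at least one example's worth of accuracy."""
    return accuracy_gap >= accuracy_resolution(dev_set_size)


for size in (100, 1_000, 10_000):
    print(size, accuracy_resolution(size), can_show_gap(size, 0.001))
```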
- Eyeball and BlackBox dev set
- 2018 Machine learning yearning
- P36, P38
- Training dev set
- 2018 Machine learning yearning
- Generalizing from the training set to the dev set
- P77
AI - EDA
- Visualization
- df.hist
- plt.scatter
- sns.heatmap
- Univariate
- Categorical
- pd.crosstab
- sns.countplot
- Continuous
- df.describe
- boxplot
- displot
- kdeplot
- Bivariate
- Category to category
- sns.factorplot
- Category to continuous
- sns.jointplot
- Continuous to category
- sns.factorplot().map(sns.kde/dist/box)
- Other
- sns.regplot
- Correlation
- Pearson
- Pandas
- Dataframe.corr
- Spark
- Dataframe.stat.corr
- Spearman
- Kendall
- Tool
AI - Algorithm
- Classification
- The recommended approach
- Use AUC to select the model when you do not know which threshold will be used
- Then use FN and FP to decide the threshold
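Once a model has been selected, the threshold step can be sketched as a search over candidate cutoffs that minimizes a weighted count of false positives and false negatives. The cost weights and function names below are assumptions for illustration; in practice the costs come from the application.

```python
def confusion_counts(y_true, scores, threshold):
    """Return (tp, fp, tn, fn) when predicting 1 for score >= threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def pick_threshold(y_true, scores, fp_cost=1.0, fn_cost=1.0):
    """Try each observed score (plus 'predict nothing') and keep the cheapest."""
    candidates = sorted(set(scores)) + [float("inf")]
    best_threshold, best_cost = None, float("inf")
    for threshold in candidates:
        _, fp, _, fn = confusion_counts(y_true, scores, threshold)
        cost = fp_cost * fp + fn_cost * fn
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pick_threshold(y_true, scores))
print(pick_threshold(y_true, scores, fp_cost=5.0))
```

Raising `fp_cost` pushes the chosen threshold up, and raising `fn_cost` pushes it down, which is the trade-off the FN and FP counts are meant to expose.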
- SVM
- Pros
- Accurate in high-dimensional spaces
- Memory efficient
- Cons
- Prone to overfitting
- No probability estimation
- For small datasets
- Applications
- Image recognition
- Text category assignment
- Detecting spam
- Sentiment analysis
- Gene expression classification
- Regression, outlier detection and clustering
- CNN
Kernel size?
Recent research has shown that it's better to use smaller kernel sizes and add more convolutional layers. In other words, instead of using a nine by nine filter, try sequencing two layers of three by three filters
RNN
Cell_size = N_inputs // (size of the internal state in each of the cell)
Lstm = 4 internal states
Gru = 3
Use custom loss function
E.g. use several outputs to calculate the loss
- Dropout is available
- Recommendation
- Collaborative Filtering
- User-based
- Item-based
- Challenges
- Data sparsity
- Cold start
- Scalability
- WALS: Weighted Alternating Least Squares
Context-aware
Contextual pre-filtering, contextual post-filtering, and contextual modeling
- Hybrid
AI - Tuning
- Optimal probability cutoff point
- 2017 Mastering Machine Learning with Python in Six Steps
- Bias and Variance
- High variance
- High bias
- Bias = Optimal error rate (“unavoidable bias”) + Avoidable bias = training error
- Optimal error rate / unavoidable bias
- Use human-level performance to estimate the optimal error rate and also set achievable “desired error rate.”
- Avoidable bias
- More complex model
- DL: increase the model size, such as the number of neurons/layers
- More features
- More polynomial features
- Reduce or eliminate regularization
- Variance = dev error - training error
- More training data
- Collect more
- Data augmentation
- Regularization
- Works well when we have a lot of features, each of which contributes a bit to predicting y
- Early stopping
- Fewer features
- Model selection or feature selection
- Dimension reduction
- Noise robustness
Sparse representation
- Simpler model
- Try others first
- DL: decrease the model size, such as the number of neurons/layers
If you find that your dev set performance is much better than your test set performance, it is a sign that you have overfitted to the dev set.
In this case, get a fresh dev set/get more dev set data
- Both
- Choosing the right model parameters
- Regularization
- Try decreasing lambda (fixes high bias)
- Try increasing lambda (fixes high variance)
- Modify input features based on insights from error analysis
- Modify model architecture
- Such as neural network architecture, so that it is more suitable for your problem
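The error decomposition used in this section (bias = optimal error + avoidable bias = training error, variance = dev error - training error) is simple arithmetic. A minimal sketch follows; `diagnose` is a hypothetical helper, and the example numbers are made up for illustration.

```python
def diagnose(optimal_error: float, training_error: float, dev_error: float) -> dict:
    """Split error into unavoidable bias, avoidable bias, and variance."""
    return {
        "unavoidable_bias": optimal_error,
        "avoidable_bias": training_error - optimal_error,
        "variance": dev_error - training_error,
    }


# Illustrative numbers: 14% optimal error, 15% training error, 30% dev error.
report = diagnose(optimal_error=0.14, training_error=0.15, dev_error=0.30)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

In this made-up case the variance term dominates, so the variance remedies (more data, regularization, fewer features) would be tried before the bias remedies.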
- Data Mismatch
- Try to understand what properties of the data differ between the training and the dev set distributions.
- Try to find more training data that better matches the dev set examples that your algorithm has trouble with.
- Regularization
- L1 or lasso
- L2 or ridge
When to use l1 and l2
In practice, the L2 norm usually provides more generalizable models than the L1 norm. However, we end up with larger, more complex models if we use L2 instead of L1. This happens because features are often highly correlated with each other: L1 regularization tends to use one of them and throw the others away, whereas L2 regularization keeps them all and keeps their weight magnitudes small. So with L1, you can end up with a smaller model, but it may be less predictive.
- Elastic Net
- The elastic net is just a linear combination of the L1 and L2 regularizing penalties. This way, you get the benefits of sparsity for really poor predictive features while also keeping decent and great features with smaller weights to provide a good generalization. The only trade-off now is there are two instead of one hyperparameters to tune with the two different Lambda regularization parameters.
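As a concrete reading of "a linear combination of the L1 and L2 penalties", the sketch below computes the penalty term for a weight vector. It is a minimal illustration with hypothetical names; libraries often parameterize the same idea differently (for example with a single strength and a mixing ratio).

```python
def elastic_net_penalty(weights, l1_lambda: float, l2_lambda: float) -> float:
    """l1_lambda * sum(|w|) + l2_lambda * sum(w^2), added to the training loss."""
    l1_term = sum(abs(w) for w in weights)
    l2_term = sum(w * w for w in weights)
    return l1_lambda * l1_term + l2_lambda * l2_term


weights = [1.0, -2.0]
print(elastic_net_penalty(weights, l1_lambda=0.5, l2_lambda=0.25))
print(elastic_net_penalty(weights, l1_lambda=0.5, l2_lambda=0.0))  # pure L1 (lasso)
print(elastic_net_penalty(weights, l1_lambda=0.0, l2_lambda=0.25))  # pure L2 (ridge)
```

Setting either lambda to zero recovers plain L1 or L2 regularization, which is why the two values are tuned together.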
- Dropout
When to use dropout
You also want to use this on larger networks because there is more capacity for the model to learn independent representations. In other words, there are more possible paths for the network to try. The more you drop out, therefore the less you keep, the stronger the regularization.
- Hyperparameter
- Tune Hyperparameters When Comparing Models
- Neurons stop learning
Lower the learning rate
Increase the number of epochs or steps
Learn slow
- Use other activation function, like leaky Relu
Use dropout
Limit the ability to learn
Batch normalization
weight normalization, layer normalization, self normalizing networks
Redesign the network
Identity shortcut
have auxiliary outputs at intermediate layers in the network
have alternate routes through the network that are shorter
Train faster
- Visualization
- Tensorboard
TFDV
- Tensorflow Data Validation
Monitor the difference between training, validation and test dataset
TFMA
- TensorFlow Model Analysis
Check the ROC curve for each class
Check hourly performance
- Error analysis
- 2018 Machine learning yearning
- P30, P32, P52
- The Optimization Verification test
- 2018 Machine learning yearning
- P85
AI - Evaluation
- Plot
- The distribution between y and yhat
- Closer is better
- ax1 = sns.distplot(Y, hist=False, color="r", label="Actual Value")
- sns.distplot(Yhat, hist=False, color="b", label="Fitted Values" , ax=ax1)
- Establish a single-number evaluation metric for your team to optimize
- Choose a single-number evaluation metric for your team to optimize. If there are multiple goals that you care about, consider combining them into a single formula (such as averaging multiple error metrics) or defining satisficing and optimizing metrics
- E.g. use F1, F2, or AUC instead of precision and recall
- MAE
- A good metric for measuring the accuracy of predictions for time series
- It does not heavily punish larger errors as square errors do
- MAPE
- MSE
- RMSE
- More sensitive to outliers than MAE
- RMSLE
- A good metric to avoid penalizing differences for large prediction values more heavily than for small prediction values
- R^2
- Close to 1 is better
- Negative value = the model fits worse than always predicting the mean (e.g. overfitting, when measured on held-out data)
- from sklearn.metrics import r2_score
- Or, lm.fit then lm.score
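The regression metrics listed above are short enough to write out by hand, which makes their different sensitivity to large errors easy to see. This is a standard-library sketch with hypothetical function names; RMSLE here uses log(1 + x), which assumes non-negative targets and predictions.

```python
import math


def mae(y, yhat):
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)


def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))


def rmsle(y, yhat):
    # log1p(x) = log(1 + x); assumes all values are >= 0.
    return math.sqrt(
        sum((math.log1p(a) - math.log1p(b)) ** 2 for a, b in zip(y, yhat)) / len(y)
    )


def r2(y, yhat):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


y = [1.0, 2.0, 3.0, 4.0]
yhat = [1.0, 2.0, 3.0, 8.0]  # one large error
print(mae(y, yhat), rmse(y, yhat), rmsle(y, yhat), r2(y, yhat))
```

With a single large error, RMSE grows faster than MAE, and R^2 can drop below zero when the predictions are worse than the mean of y.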
- Confusion matrix
- F1 measure
- Log Loss
- Performance of a classifier where the predicted output is a probability value between 0 and 1
- AUC: area under the curve
- Time-dependent ROC curve
- For game update
- IBS: Integrated Brier score
- For game update
- Dunn index
- A metric for evaluating clustering algorithms
- Cross-validation
- https://en.wikipedia.org/wiki/Cross-validation_(statistics)
- Python function
- Spark
- CrossValidator
- K-fold
- For a small data set
- E.g. , m = about 1000
- Normally k=10
- https://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation
- python function
- Leave-one-out
- For a very small data set
- E.g., m < 100, 20 examples, k=20
- https://en.wikipedia.org/wiki/Cross-validation_(statistics)#Leave-one-out_cross-validation
- python function
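How k-fold and leave-one-out splits relate can be sketched with plain index lists. `k_fold_indices` is a hypothetical helper written for illustration; it does not shuffle, so shuffle the data first if its order is meaningful. Leave-one-out is the special case where k equals the number of examples.

```python
def k_fold_indices(num_examples: int, k: int):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= num_examples:
        raise ValueError("k must be between 2 and num_examples")
    indices = list(range(num_examples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        num_examples // k + (1 if i < num_examples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


for train_idx, test_idx in k_fold_indices(10, 5):
    print(test_idx)

# Leave-one-out: one example per test fold.
print([test for _, test in k_fold_indices(3, 3)])
```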
- Learning curves
- Tell whether adding more training data would help
- Diagnosing bias and variance
- 2018 Machine learning yearning
- P55
AI
- The recommended approach
- ML
- Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
- Plot learning curves to decide if more data, more features, etc.
- Error analysis
- Manually examine the examples (in cross-validation set) that your algorithm made errors on. See if you spot any systematic trend in what type of examples it is making errors on.
- DL
- Try to have the same number of hidden units in every layer
- Usually the more the better, but at a higher computational cost
- ReLU variants
- Softplus, leaky relu, prelu, relu6, elu
- Common
Learning rate
< 1/sqrt(num_features)
- Learning rate automation
Continue to verify and monitor your data since it may change for many reasons in reality
Stop learning in particular circumstances
E.g. service failure -> only a few users can access the service -> incomplete/incorrect data
- Transfer learning
- Where to cut?
- By convention, we cut the source network after the convolutional layers and append a number of fully connected layers of our own. This is consistent with the view that convolutional layers are excellent feature extractors for the image domain
- Do I make the source models weights trainable, as in allowing to change values during subsequent model training or do I make them constant?
- Leaving them constant effectively treats the source model as a feature extractor. If your new dataset is small, this is the recommended approach, because letting the source weights train risks overfitting your data.
- The larger your data set is, the more confident you can be that letting the source network continue to train will not result in overfitting
- Whether to make the pre-trained embeddings trainable or not.
- the primary factor to consider when making this decision is dataset size. The larger your dataset is, the less likely that letting the embeddings be trainable, will result in over-fitting.
- Pretrained embedding
- Pretrained model
- Tensorflow hub
- Keras applications
Rules of Machine Learning: Best Practices for ML Engineering
- See also
Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
Choosing the right estimator
Tip 4 Big Data
String 2 BeamRecord (beam)
Option 1
.apply(ParDo.of(new DoFn<String, BeamRecord>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        // System.out.println(c.element());
        c.output(new BeamRecord(type, c.element()));
    }
}))
Option 2
.apply(MapElements.via(new SimpleFunction<String, BeamRecord>() {
    public BeamRecord apply(String input) {
        // System.out.println(input);
        return new BeamRecord(type, input);
    }
}))
/* which can be expressed as below
.apply(MapElements.via(apply(input) -> {
    return new BeamRecord(type, input);
}))
*/
Using snappy (hive)
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
The place storing table statistics (hive)
- MySQL
- select * from TABLE_PARAMS
- select * from PARTITION_PARAMS
Options for specifying a schema (spark)
// 1
val schema = new StructType()
  .add("i_logid", IntegerType, false)
  .add("i_logdetailid", IntegerType, false)
  .add("i_logdes", new StructType().add("gamecode", StringType, true), false)
// 2
val schema = StructType(
  StructField("i_logid", IntegerType, false) ::
  StructField("i_logdetailid", IntegerType, false) ::
  StructField("i_logdes", new StructType().add("gamecode", StringType, true), false) ::
  Nil)
// 3
case class Des(gamecode: String)
case class Log(i_logid: Int, i_logdetailid: Int, i_logdes: Des)
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Log].schema
// 4
spark.sql("select get_json_object(lower(cast(value as string)), '$.i_regdatetime') as i_regdatetime from rawData")
// 5
val schema = spark.read.table("netmarbles.log_20170813").schema
- Estimates the sizes of java objects (spark)
- https://spark.apache.org/docs/2.1.0/api/scala/#org.apache.spark.util.SizeEstimator$
- E.g.
- import org.apache.spark.util.SizeEstimator
- SizeEstimator.estimate(myRdd)
- SizeEstimator.estimate(myDf)
- SizeEstimator.estimate(myDs)
- Using the desc option in the orderBy API (spark)
orderBy($"count".desc)
orderBy('count.desc)
orderBy(-'count)
- RDB 2 local using sqoop (sqoop)
- Use -jt option
- E.g. sqoop import -jt local --target-dir file:///home/hdfs/temp
- Use -fs and -jt options
- E.g. sqoop import -fs local -jt local
- File file:/hdp/apps/2.6.0.3-8/mapreduce/mapreduce.tar.gz does not exist
- mkdir -p /hdp/apps/2.6.0.3-8/mapreduce
- chown -R hdfs:hadoop /hdp
- cd /hdp/apps/2.6.0.3-8/mapreduce
- hdfs dfs -get /hdp/apps/2.6.0.3-8/mapreduce/mapreduce.tar.gz
- Read files in s3a from spark (spark)
- spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key","XXX")
- spark.sparkContext.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled","false")
- spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint","host:port")
- spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key","XXX")
- spark.read.text("s3a://path/to/the/file")
- Setting the logging level of the ambari-agent.log (ambari)
- cd /etc/ambari-agent/conf
- cp logging.conf.sample logging.conf
- vim logging.conf
[logger_root]
level=WARNING
- Setting the logging level of the hiveserver2.log (hive)
- Ambari web UI -> hive--> config --> advanced hive-log4j --> hive.root.logger=INFO,DRFA
- Push JSON Records (spark)
- val df = temp.toDF("createdAt", "users", "tweet")
- val json_rdd = df.toJSON.rdd
- json_rdd.foreachPartition { partition =>
    // Send records to Kinesis / Kafka
  }
- How to specify hive tez job name showing at resource manager UI (tez)
- You cannot. At least, not the full name, because it is hard coded.
- https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java
- final TezClient session = TezClient.newBuilder("HIVE-" + sessionId, tezConfig)
- However you can set the session ID using hive.session.id
- hive --hiveconf hive.session.id=session_id_name
- HIVE-session_id_name