Wednesday, August 22, 2018

Help 4 GCP

  • Access Denied: Table X:Y.Z: The user 123-compute@developer.gserviceaccount.com does not have permission to query table X:Y.Z (bigQuery)

    • Python
      • from google.cloud import bigquery 
      • from google.oauth2 import service_account 
      • credentials = service_account.Credentials.from_service_account_file( 'path/to/file.json') 
      • project_id = 'my-bq' 
      • client = bigquery.Client(credentials= credentials, project=project_id)
  • Cannot connect to the instance using SSH since the disk is full (GCE)

    • Check if your operating system supports automatic resizing: If so, using Cloud Console you can edit VM's root disk and increase its size. Your virtual machine instance can automatically resize the partition to recognize the additional space after you restart the instance. 
    • Use the Interactive Serial Console feature to log in to your VM, then clean up the disk or copy the data you will need later to other storage. 
    • If you know what data you want to delete, you can configure a startup script to remove the files and reboot your VM to run the script (e.g. rm /tmp/*). 
    • You can detach the persistent disk and attach it to another machine as an additional disk. On the temporary machine, mount it, then clean up your data or copy what you will need later to other storage. Finally, recreate the original instance with the same boot disk. The steps are the same as for adding a disk to another Linux VM, except that you attach your existing boot disk instead of creating a new disk. 
    • Check if your operating system supports automatic resizing: If so, create a snapshot of your persistent disk, then create a new, larger persistent disk from the snapshot. Finally, recreate the original instance with this larger boot disk.
  • No scalar data was found (tensorboard)

    • Use gcloud command to train the model.
  • prediction_lib.PredictionError: Failed to load model: Cloud ML only supports TF 1.0 or above and models saved in SavedModel format. (Error code: 0) (ml engine)

    • Check the model path which is the value of "--model-dir" flag.
    • Note:
      • Do not use the model location in the log info.
      • E.g. 
        • INFO:tensorflow:SavedModel written to: b"output/export/census/temp-b'1531882849'/saved_model.pb"
        • However, you should use "output/export/census/1531882849"
  • "error": "Prediction failed: unknown error." (ml engine)

    • This is because the model doesn't support the specified instance format.
    • E.g.
      • The model supports JSON instance for prediction.
      • However, a CSV instance has been specified for prediction.
    • If the error still happens, then try to specify the version of the model which supports instances for prediction.
  • ERROR: (gcloud.ml-engine.jobs.submit.training) Could not copy [/tmp/.../output/trainer-0.0.0.tar.gz] to [.../trainer-0.0.0.tar.gz]. Please retry: HTTPError 404: Not Found (ml engine)

    • Check the bucket name.
  • The schema of pandas dataframe created from read_gbq is different from bigQuery table (bigQuery)

    • Use from google.cloud import bigquery instead.
    • E.g. client.query('SELECT * FROM `projectId.dataset.table` limit 1').result().schema
  • java.net.UnknownHostException: metadata (general)

    • Set one of the configurations below.
      • google.cloud.auth.service.account.json.keyfile
      • fs.gs.auth.service.account.json.keyfile
  • java.io.IOException: Error accessing: bucket: null (hadoop)
    • Set "mapred.bq.gcs.bucket" configuration.
  • java.lang.NullPointerException: Required parameter projectId must be specified (hadoop)
    • Set "mapred.bq.project.id" configuration.
  • org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: Failed to create load job with id prefix ${ID prefix}, reached max retries: 3, last failed load job (bigQuery)
    • Make sure to use the right data type for each column when creating the TableRow.
  • Error detected while parsing row starting at position: 556531513. Error: Bad character (ASCII 0) encountered (bigQuery)
    • Find the character causing the problem.
      • less +556531513P test.csv
    • There will be a character like Ctrl-@ which is ^@.
    • Avoid or remove it before producing the CSV file, or from the CSV file.
      • gsutil cp gs://bucket/test.csv - | tr -d '\000' | gsutil cp - gs://bucket/test2.csv
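The same cleanup can be done in Python when `tr` is not at hand. This is only a minimal sketch, assuming the CSV has already been copied to local disk; the function names `remove_nul_bytes` and `clean_file` and the chunk size are hypothetical choices, not part of any Google Cloud tool.

```python
def remove_nul_bytes(data: bytes) -> bytes:
    """Return data with every NUL (ASCII 0) byte removed."""
    return data.replace(b"\x00", b"")


def clean_file(src_path: str, dst_path: str, chunk_size: int = 1 << 20) -> int:
    """Copy src_path to dst_path without NUL bytes; return how many were dropped."""
    dropped = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            cleaned = remove_nul_bytes(chunk)
            dropped += len(chunk) - len(cleaned)
            dst.write(cleaned)
    return dropped
```

Reading in fixed-size chunks keeps memory use flat even for a file as large as the one in the error message.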

AI - Preprocessing

  • Missing data
  • Text data
    • Replace data
      • Specific data to specific characters/symbols
  • Feature 
    • Create  new features
    • Bin/bucket
      • When to bucketize a numerical column?

        • Numbers that are not meaningful

        • When you expect a nonlinear relationship between the numeric values and the label

        • When you try both wide and deep features

      • Tensorflow

        def get_quantile_based_boundaries(feature_values, num_buckets):
          boundaries = np.arange(1.0, num_buckets) / num_buckets
          quantiles = feature_values.quantile(boundaries)
          return [quantiles[q] for q in quantiles.keys()]
        
        # Divide longitude into 10 buckets.
        bucketized_longitude = tf.feature_column.bucketized_column(
          longitude, boundaries=get_quantile_based_boundaries(
            training_examples["longitude"], 10))
          
        # Divide latitude into 10 buckets.
        bucketized_latitude = tf.feature_column.bucketized_column(
          latitude, boundaries=get_quantile_based_boundaries(
            training_examples["latitude"], 10))


      • Python
        • df['price-binned'] = pd.cut(df['a'], np.linspace(min(df.a), max(df.a), 4) , labels=['l', 'm', 'h'], include_lowest=True)
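To make the equal-width binning above easier to follow, here is a rough pure-Python sketch of the same idea: split the range between the minimum and maximum into equal-width, right-closed bins and put the lowest value in the first bin. The helper names `equal_width_edges` and `assign_bins` are hypothetical, and this is an illustration of the technique, not how pandas implements `pd.cut`.

```python
from bisect import bisect_left


def equal_width_edges(values, num_bins):
    """Return num_bins + 1 evenly spaced edges from min(values) to max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    return [lo + i * width for i in range(num_bins + 1)]


def assign_bins(values, edges, labels):
    """Label each value by its right-closed bin (edges[i], edges[i + 1]].

    The minimum value falls into the first bin, which mirrors
    include_lowest=True in the pandas call.
    """
    inner_edges = edges[1:-1]  # the outer edges never change the label
    return [labels[bisect_left(inner_edges, v)] for v in values]


prices = [0, 5, 10, 15, 20, 30]
edges = equal_width_edges(prices, 3)
binned = assign_bins(prices, edges, ["l", "m", "h"])
```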
    •  Interaction
    • Crosses
      • Tensorflow

        tf.feature_column.crossed_column(
          set([bucketized_longitude, bucketized_latitude]), hash_bucket_size=1000)


    • One hot encoding, dummy coding, effect coding, label encoding
    • Transformation
      • A log transform is a powerful tool for dealing with positive numbers with a heavy-tailed distribution
        • np.log10(biz_df['review_count'])
        • scipy.stats.boxcox(biz_df['review_count'], lmbda=0)
      • The Box-Cox formulation only works when the data is positive
        • For nonpositive data, one could shift the values by adding a fixed constant
        • stats.boxcox(biz_df['review_count'])
          • Finds the optimal transform parameter
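As a small worked version of the transform discussed above: Box-Cox maps x to (x^lambda - 1) / lambda, and to the natural log of x when lambda is 0. The sketch below assumes plain Python lists; `box_cox` and `shift_to_positive` are hypothetical helper names, and `target_min=1.0` is an arbitrary choice for the shift.

```python
import math


def box_cox(x, lmbda):
    """Box-Cox transform of a single strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lmbda == 0:
        return math.log(x)  # the lambda = 0 case is the natural log
    return (x ** lmbda - 1.0) / lmbda


def shift_to_positive(values, target_min=1.0):
    """Add a fixed constant so that the smallest value becomes target_min.

    Values that are already at or above target_min are left unchanged.
    """
    shift = max(0.0, target_min - min(values))
    return [v + shift for v in values]


raw = [-3, 0, 2]
transformed = [box_cox(v, 0.5) for v in shift_to_positive(raw)]
```

Note that the lambda = 0 case uses the natural log, so it differs from the `np.log10` result by a constant factor.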
    • Scaling / normalization
      • Feature scaling is useful in situations where a set of input features differs wildly in scale.
        • Use caution when performing min-max scaling and standardization on sparse features
      • Min-Max
        • Squeezes (or stretches) all feature values to be within the range of [0, 1]
        • E.g. x1 = (x1 - min(x1)) / (max(x1) - min(x1))
        • sklearn.preprocessing.minmax_scale(df[['n_tokens_content']])
        • This can hurt some models as it takes away weight from outliers
      • (z-score) standardization / variance scaling / mean normalization
        • Scaled feature has a mean of 0 and a variance of 1
        • E.g. x1 = (x1 - avg(x1)) / standard deviation of the x1
        • sklearn.preprocessing.StandardScaler().fit_transform(df[['n_tokens_content']])
        • Useful for algorithms that rely on Euclidean distance, such as KNN
      • L2 / Euclidean
        • the feature column has norm 1
        • sklearn.preprocessing.normalize(df[['n_tokens_content']], axis=0)
        • This comes in handy, especially when working with text data or clustering algorithms
      • Robust
        • RobustScaler is less prone to outliers
        • from sklearn.preprocessing import RobustScaler
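To show why this kind of scaling is less sensitive to outliers, here is a minimal standard-library sketch that centers on the median and divides by the interquartile range, which is also what RobustScaler does with its default settings. The function name `robust_scale` is hypothetical, and `statistics.quantiles` needs Python 3.8 or later.

```python
import statistics


def robust_scale(values):
    """Scale values as (x - median) / IQR, where IQR = Q3 - Q1."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the feature is (almost) constant")
    return [(v - median) / iqr for v in values]


# The outlier (100) does not change how the other four values are scaled.
scaled = robust_scale([1, 2, 3, 4, 100])
```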

      • Spark
        • StandardScaler

      • Pandas

        import math  # needed by log_normalize

        def linear_scale(series):
          min_val = series.min()
          max_val = series.max()
          scale = (max_val - min_val) / 2.0
          return series.apply(lambda x:((x - min_val) / scale) - 1.0)
         
        def log_normalize(series):
          return series.apply(lambda x:math.log(x+1.0))
        
        def clip(series, clip_to_min, clip_to_max):
          return series.apply(lambda x:(
            min(max(x, clip_to_min), clip_to_max)))
        
        def z_score_normalize(series):
          mean = series.mean()
          std_dv = series.std()
          return series.apply(lambda x:(x - mean) / std_dv)
        
        def binary_threshold(series, threshold):
          return series.apply(lambda x:(1 if x > threshold else 0))


      • Try alternate normalizations for various features to further improve performance.
        • Pandas
          • normalized_training_examples.hist(bins=20, figsize=(18, 12), xlabelsize=10)
      • Note: you can't possibly do a logarithm transformation after standardization because about half of the standardized values will be 0 or negative, hence have no logarithm
      • Example
    • Clipping
      • roomsPerPerson = min(totalRooms / population, 4)
        • Clipping the feature value at 4.0 doesn't mean that we ignore all values greater than 4.0. Rather, it means that all values that were greater than 4.0 now become 4.0. This explains the funny hill at 4.0. Despite that hill, the scaled feature set is now more useful than the original data.
    • Hashing
  • Feature Selection
    • Simple approach
      • Train with each candidate feature one at a time, then add the feature that performs best to the model. Repeat this process with the remaining features.
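The simple approach above is a greedy forward selection. A rough sketch follows, assuming you supply a `score_fn` that trains and evaluates a model on a given feature list and returns a score where higher is better; `forward_select`, `score_fn`, and the toy scores are all hypothetical.

```python
def forward_select(features, score_fn, max_features=None):
    """Greedily add the feature that improves score_fn the most.

    Stops when no remaining feature improves the score, or when
    max_features features have been selected.
    """
    selected = []
    remaining = list(features)
    best_score = float("-inf")
    while remaining and (max_features is None or len(selected) < max_features):
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # adding any further feature does not help
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Toy stand-in for "train a model and return validation score".
toy_gain = {"a": 3, "b": 2, "c": -1}
chosen, final_score = forward_select(
    ["c", "b", "a"], lambda feats: sum(toy_gain[f] for f in feats))
```

In real use `score_fn` would retrain the model for every candidate, so the cost grows quickly with the number of features.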
    • In modern deep learning, when data is plentiful, there has been a shift away from feature selection, and we are now more likely to give all the features we have to the algorithm and let the algorithm sort out which ones to use based on the data
    • Rules of thumb
      • If your features are mostly categorical, you should start by trying to implement a SelectKBest with a Chi2 ranker or a tree-based model selector
      • If your features are largely quantitative, using linear models as model-based selectors and relying on correlations tends to yield greater results
      • If you are solving a binary classification problem, using a Support Vector Classification model along with a SelectFromModel selector will probably fit nicely, as the SVC tries to find coefficients to optimize for binary classification tasks
      • A little bit of EDA can go a long way in manual feature selection. The importance of domain knowledge about where the data originated cannot be overstated
    • Filter methods
      • correlation coefficient
      • ANOVA test
      • chi-square test
      • variance threshold
    • Wrapper methods
      • recursive feature elimination 
      • sequential feature selection algorithms 
      • genetic algorithms
    • Embedded methods
      • Decision tree
      • L1 regularizer
        • Linear model
      • Embedding layer
        • How to choose the number of neurons of an embedding layer?

          • Try starting from the 4th root of the total number of possible values

          • Hyperparameter tuning: max = 35 

          • Higher dimensions -> higher chance of overfitting, slower training
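The fourth-root starting point above is simple arithmetic. A tiny sketch, where `embedding_dimension` is a hypothetical helper and the optional `max_dim` cap is an assumption about how an upper bound such as the 35 mentioned above might be applied during tuning:

```python
def embedding_dimension(num_categories, max_dim=None):
    """Starting guess for embedding size: the 4th root of the vocabulary size."""
    if num_categories < 1:
        raise ValueError("num_categories must be at least 1")
    dim = max(1, round(num_categories ** 0.25))
    return dim if max_dim is None else min(dim, max_dim)


start_dim = embedding_dimension(10000)
```

Treat the result only as the first value to try; the tuning run decides the final size.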

        • multi-sense embeddings

          • Do not always work

      • Weight
    • Spark
      • ChiSqSelector
    • Python 3
    • Example
  • Feature Extraction
    • Feature transformation
      • TSNE
        • from sklearn.manifold import TSNE
      • PCA
      • SVD
        • Singular value decomposition module will return the same components as PCA if our data is scaled, but different components when using the raw unscaled data
        • from sklearn.decomposition import TruncatedSVD
      • LDA
        • Linear Discriminant Analysis (LDA) is a feature transformation technique as well as a supervised classifier. It is commonly used as a preprocessing step for classification pipelines. The goal of LDA, like PCA, is to extract a new coordinate system and project datasets onto a lower-dimensional space. The main difference between LDA and PCA is that instead of focusing on the variance of the data as a whole like PCA, LDA optimizes the lower-dimensional space for the best class separability. This means that the new coordinate system is more useful in finding decision boundaries for classification models, which is perfect when building classification pipelines. The reason that LDA is extremely useful is that separating based on class separability helps us avoid overfitting in our machine learning pipelines. This is also known as preventing the curse of dimensionality. LDA also reduces computational costs.
      • LSA
        • Latent semantic analysis (LSA) is a feature extraction tool for text. It is a series of these three steps
          • A TF-IDF vectorization
          • A PCA (SVD, in this case, to account for the sparsity of text)
          • Row normalization
      • Nonlinear Featurization via K-Means Model Stacking
        • https://github.com/mungeol/feature-engineering-book/blob/master/07.03-05_K-means_featurization.ipynb
        • With cluster features, the linear classifier performs just as well as nonlinear classifiers
        • K-means featurization is useful for real-valued, bounded numeric features that form clumps of dense regions in space
        • k-means cannot handle feature spaces where the Euclidean distance does not make sense—i.e., weirdly distributed numeric variables or categorical variables. If the feature set contains those variables, then there are several ways to handle them: 
          • Apply k-means featurization only on the real-valued, bounded numeric features
          • Define a custom metric to handle multiple data types and use the k-medoids algorithms. (k-medoids is analogous to k-means but allows for arbitrary distance metrics.)
          • Convert categorical variables to binning statistics (see “Bin Counting” on page 87), then featurize them using k-means
      • Example
    • Feature learning
      • RBM
        • Restricted Boltzmann Machines is a simple deep learning architecture that is set up to learn a set number of new dimensions based on a probabilistic model that data follows. These machines are a family of algorithms with only one implemented in scikit-learn. The BernoulliRBM may be a nonparametric feature learner; however, as the name suggests, some expectations are set as to the values of the cells of the dataset.
      • Word embeddings
        • Likely one of the biggest contributors to the recent deep learning-fueled advancements of natural language processing/understanding/generation is the ability to project strings (words and phrases) into an n-dimensional feature set to grasp the context and minute detail in wording.
        • Approaches
      • Example
  • Imbalanced data / skewed classes
    • Reference
      • 2017 Mastering Machine Learning with Python in Six Steps
  • Outlier
    • Plot it
      • Box
    • Collect more outlier data
    • Keep it
      • Anomaly detection
    • Replace it with reasonable minimum or maximum value
    • Remove it
  • Shuffling
    • Pandas
      • df = df.reindex(np.random.permutation(df.index))

      • df = df.sample(frac=1)
      • df = df.sample(frac=1).reset_index(drop=True)
      • from sklearn.utils import shuffle
        • df = shuffle(df)
  • Image augmentation
  • Training, validation/dev, Test set
    • Your dev and test sets should come from the same distribution
    • Choose dev and test sets from a distribution that reflects what data you expect to get in the future and want to do well on. This may not be the same as your training data’s distribution
    • When you should train and test on different distributions
      • 2018 Machine learning yearning
        • P71
    • How to decide whether to use all your data (which have different distributions)
      • 2018 Machine learning yearning
        • P73
    • How to decide whether to include inconsistent data
      • 2018 Machine learning yearning
        • P75
    • How large do the dev/test sets need to be?
      • The old heuristic of a 70%/30% train/test split does not apply for problems where you have a lot of data; the dev and test sets can be much less than 30% of the data
      • The dev set should be large enough to detect differences between algorithms that you are trying out. For example, if classifier A has an accuracy of 90.0% and classifier B has an accuracy of 90.1%, then a dev set of 100 examples would not be able to detect this 0.1% difference. Compared to other machine learning problems I’ve seen, a 100 example dev set is small. Dev sets with sizes from 1,000 to 10,000 examples are common. With 10,000 examples, you will have a good chance of detecting an improvement of 0.1%
      • There is no need to have excessively large dev/test sets beyond what is needed to evaluate the performance of your algorithms
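The dev set sizing argument above can be checked with a little arithmetic: one example in a dev set of n examples moves accuracy by 1/n, so a 0.1% gap on 100 examples corresponds to a tenth of an example. The sketch below only counts this granularity and ignores sampling noise, which makes small gaps even harder to trust; both function names are hypothetical.

```python
def accuracy_step(num_examples):
    """Smallest accuracy change that one dev example can cause."""
    return 1.0 / num_examples


def examples_behind_difference(num_examples, accuracy_difference):
    """How many dev examples an accuracy difference corresponds to."""
    return num_examples * accuracy_difference


# 90.0% vs 90.1% is a difference of 0.001.
for n in (100, 1000, 10000):
    print(n, accuracy_step(n), examples_behind_difference(n, 0.001))
```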
    • Eyeball and BlackBox dev set
      • 2018 Machine learning yearning
        • P36, P38
    • Training dev set
      • 2018 Machine learning yearning
        • Generalizing from the training set to the dev set
          • P77

AI - EDA

  • Visualization
    • df.hist
    • plt.scatter
    • sns.heatmap
  • Univariate
    • Categorical
      • pd.crosstab
      • sns.countplot
    • Continuous
      • df.describe
      • boxplot
      • displot
      • kdeplot
  • Bivariate
    • Category to category
      • sns.factorplot
    • Category to continuous
      • sns.jointplot
    • Continuous to category
      • sns.factorplot().map(sns.kde/dist/box)
    • Other
      • sns.regplot
  • Correlation
    • Pearson
      • Pandas
        • Dataframe.corr
      • Spark
        • Dataframe.stat.corr
    • Spearman
    • Kendall
  • Tool

AI - Algorithm

  • Classification
    • The recommended approach
      • Use AUC to select the model when you do not know which threshold will be used
      • Then use FN and FP to decide the threshold
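One way to read the second step above is to pick the threshold that minimizes the total cost of false positives and false negatives on a validation set. This is a minimal sketch under that assumption; the function names, the cost weights, and the toy data are hypothetical.

```python
def confusion_counts(labels, scores, threshold):
    """Return (false positives, false negatives) when predicting 1 if score >= threshold."""
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    return fp, fn


def pick_threshold(labels, scores, fp_cost, fn_cost, candidates):
    """Return the candidate threshold with the lowest weighted FP/FN cost."""
    def cost(threshold):
        fp, fn = confusion_counts(labels, scores, threshold)
        return fp * fp_cost + fn * fn_cost
    return min(candidates, key=cost)


labels = [0, 0, 1, 1]
scores = [0.2, 0.6, 0.4, 0.9]
candidates = [0.3, 0.5, 0.7]
# Missing a positive is 5x worse than a false alarm, so the low threshold wins.
recall_oriented = pick_threshold(labels, scores, fp_cost=1, fn_cost=5, candidates=candidates)
# A false alarm is 5x worse than a miss, so the high threshold wins.
precision_oriented = pick_threshold(labels, scores, fp_cost=5, fn_cost=1, candidates=candidates)
```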
    • SVM
      • Pros
        • Accurate in high-dimensional spaces
        • Memory efficient
      • Cons
        • Prone to overfitting
        • No probability estimation
        • For small datasets
      • Applications
        • Image recognition
        • Text category assignment
        • Detecting spam
        • Sentiment  analysis
        • Gene expression classification
        • Regression, outlier detection and clustering
  • CNN
    • Kernel size?

      • Recent research has shown that it's better to use smaller kernel sizes and add more convolutional layers. In other words, instead of using a nine by nine filter, try sequencing two layers of three by three filters
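A quick way to compare kernel choices is to compute the receptive field and weight count of a stack of convolutions. The sketch assumes stride 1, no dilation, and no pooling between layers; the function names are hypothetical. Under those assumptions two 3x3 layers see a 5x5 region, and it takes four 3x3 layers to see the same 9x9 region as a single 9x9 filter, still with fewer weights.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field (one side) of stacked stride-1 convolutions."""
    return 1 + sum(k - 1 for k in kernel_sizes)


def weights_per_channel_pair(kernel_sizes):
    """Kernel weights per input/output channel pair, ignoring biases."""
    return sum(k * k for k in kernel_sizes)


for stack in ([9], [3, 3], [3, 3, 3, 3]):
    print(stack, stacked_receptive_field(stack), weights_per_channel_pair(stack))
```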

  • RNN

    • Cell_size = N_inputs // (size of the internal state in each of the cell)

      • Lstm = 4 internal states

      • Gru = 3

    • Use custom loss function

      • E.g. use several outputs to calculate the loss

    • Dropout is available
  • Recommendation
    • Collaborative Filtering
      • User-based
      • Item-based
      • Challenges
        • Data sparsity
        • Cold start
        • Scalability
      • WALS: Weighted Alternating Least Squares
    • Context-aware

      • Contextual pre-filtering, contextual post-filtering, and contextual modeling

    • Hybrid

AI - Tuning

  • Optimal probability cutoff point
    • 2017 Mastering Machine Learning with Python in Six Steps
  • Bias and Variance

    • High variance


    • High bias

    • Bias = Optimal error rate (“unavoidable bias”) + Avoidable bias = training error
    • Optimal error rate / unavoidable bias
      • Use human-level performance to estimate the optimal error rate and also set achievable “desired error rate.”
    • Avoidable bias
      • More complex model
        • DL: increase the  model size, such as the number of neurons/layers
      • More features
      • More polynomial features
      • Reduce or eliminate regularization
    • Variance = dev error - training error
      • More training data
        • Collect more
        • Data augmentation
      • Regularization
        • Works well when we have a lot of features, each of which contributes a bit to predicting y
      • Early stopping
      • Fewer features
        • Model selection or feature selection
        • Dimension reduction
      • Noise robustness
      • Sparse representation

      • Simpler model
        • Try others first
        • DL: decrease the  model size, such as the number of neurons/layers
      • If you find that your dev set performance is much better than your test set performance, it is a sign that you have overfitted to the dev set. 

        • In this case, get a fresh dev set/get more dev set data

    • Both
      • Choosing the right model parameters
        • Regularization
          • Try decreasing lambda (fixes high bias)
          • Try increasing lambda (fixes high variance)
      • Modify input features based on insights from error analysis
      • Modify model architecture
        • Such as neural network architecture, so that it is more suitable for your problem
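The definitions in this section reduce to two subtractions: avoidable bias is the training error minus the optimal error rate, and variance is the dev error minus the training error. A small sketch of that bookkeeping follows; `diagnose` is a hypothetical helper, and choosing the larger of the two as the focus is a simplification, not a rule from the source.

```python
def diagnose(optimal_error, training_error, dev_error):
    """Return (avoidable_bias, variance, focus) from three error rates."""
    avoidable_bias = training_error - optimal_error
    variance = dev_error - training_error
    focus = "bias" if avoidable_bias >= variance else "variance"
    return avoidable_bias, variance, focus


# Training error is close to optimal but dev error is far worse: work on variance.
overfit_case = diagnose(optimal_error=0.14, training_error=0.15, dev_error=0.30)
# Training error is far from optimal: work on avoidable bias first.
underfit_case = diagnose(optimal_error=0.02, training_error=0.15, dev_error=0.16)
```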
  • Data Mismatch


    • Try to understand what properties of the data differ between the training and the dev set distributions.
    • Try to find more training data that better matches the dev set examples that your algorithm has trouble with.
  •  Regularization
    • L1 or lasso
    • L2 or ridge
      • When to use l1 and l2

        • In practice, the L2 norm usually provides more generalizable models than the L1 norm. However, we end up with much heavier, more complex models if we use L2 instead of L1. This happens because features are often highly correlated with each other: L1 regularization will use one of them and throw the other away, whereas L2 regularization will keep both features and keep their weight magnitudes small. So with L1, you can end up with a smaller model, but it may be less predictive.

    • Elastic Net
      • The elastic net is just a linear combination of the L1 and L2 regularizing penalties. This way, you get the benefits of sparsity for really poor predictive features while also keeping decent and great features with smaller weights to provide good generalization. The only trade-off is that there are now two hyperparameters to tune instead of one: the two lambda regularization parameters.
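Written out, the elastic net penalty is lambda_l1 times the sum of absolute weights plus lambda_l2 times the sum of squared weights. The sketch below computes only the penalty term that would be added to the training loss; the function and parameter names are hypothetical, and libraries often parameterize the same idea differently (for example with one overall strength and a mixing ratio).

```python
def l1_penalty(weights):
    """Sum of absolute weights (lasso term)."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of squared weights (ridge term)."""
    return sum(w * w for w in weights)


def elastic_net_penalty(weights, lambda_l1, lambda_l2):
    """Linear combination of the L1 and L2 penalties."""
    return lambda_l1 * l1_penalty(weights) + lambda_l2 * l2_penalty(weights)


penalty = elastic_net_penalty([1.0, -2.0, 0.0], lambda_l1=0.5, lambda_l2=0.1)
```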
    • Dropout
      • When to use dropout

        • You also want to use this on larger networks because there is more capacity for the model to learn independent representations. In other words, there are more possible paths for the network to try. The more you drop out, therefore the less you keep, the stronger the regularization.

  • Hyperparameter
    • Tune Hyperparameters When Comparing Models
  • Neurons stop learning
    • Lower the learning rate

      • Increase the number of epoch or steps

      • Learn slow

    • Use another activation function, such as leaky ReLU
    • Use dropout

      • Limit the ability to learn

    • Batch normalization

      • weight normalization, layer normalization, self normalizing networks

      • Redesign the network

        • Identity shortcut

        • have auxiliary outputs at intermediate layers in the network

        • have alternate routes through the network that are shorter

      • Train faster

  • Visualization
    • Tensorboard
    • TFDV 

      • Tensorflow Data Validation
      • Monitor the difference between training, validation and test dataset

    • TFMA

      • TensorFlow Model Analysis
      • Check the ROC curve for each class

      • Check hourly performance

  • Error analysis
    • 2018 Machine learning yearning
      • P30, P32, P52
  • The Optimization Verification test
    • 2018 Machine learning yearning
      • P85

AI - Evaluation

AI

Tip 4 Big Data

  • String 2 BeamRecord (beam)

    • Option 1

      .apply(ParDo.of(new DoFn<String, BeamRecord>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
              //System.out.println(c.element());
              c.output(new BeamRecord(type, c.element()));
          }
      }))
    • Option 2

       .apply(MapElements.via(new SimpleFunction<String, BeamRecord>() {
          public BeamRecord apply(String input) {
              //System.out.println(input);
              return new BeamRecord(type, input);
          }
      }))
        
      /* which can be expressed with a lambda as below
      .apply(MapElements
          .into(TypeDescriptor.of(BeamRecord.class))
          .via((String input) -> new BeamRecord(type, input)))
      */
  • Using snappy (hive)

    • SET hive.exec.compress.output=true;
      SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
      SET mapred.output.compression.type=BLOCK;
  • The place storing table statistics (hive)

    • MySQL
    • select * from TABLE_PARAMS
    • select * from PARTITION_PARAMS
  • Options for specifying a schema (spark)

    // 1
    val schema = new StructType()
      .add("i_logid", IntegerType, false)
      .add("i_logdetailid", IntegerType, false)
      .add("i_logdes", new StructType().add("gamecode", StringType, true), false)
     
    // 2
    val schema = StructType(
      StructField("i_logid", IntegerType, false) ::
        StructField("i_logdetailid", IntegerType, false) ::
        StructField("i_logdes", new StructType().add("gamecode", StringType, true), false) ::
        Nil
    )
     
    // 3
    case class Des(gamecode: String)
    case class Log(i_logid: Int, i_logdetailid: Int, i_logdes: Des)
    import org.apache.spark.sql.Encoders
    val schema = Encoders.product[Log].schema
     
    // 4
    spark.sql("select get_json_object(lower(cast(value as string)), '$.i_regdatetime') as i_regdatetime from rawData")
     
    // 5
    val schema = spark.read.table("netmarbles.log_20170813").schema
  • Estimates the sizes of java objects (spark)
  • Using the desc option in the orderBy API (spark)
    • orderBy($"count".desc)

    • orderBy('count.desc)

    • orderBy(-'count)

  • RDB 2 local using sqoop (sqoop)
    • Use -jt option
    • Use -fs and -jt options
      • E.g. sqoop import -fs local -jt local
    • File file:/hdp/apps/2.6.0.3-8/mapreduce/mapreduce.tar.gz does not exist
      • mkdir -p /hdp/apps/2.6.0.3-8/mapreduce
      • chown -R hdfs:hadoop /hdp
      • cd /hdp/apps/2.6.0.3-8/mapreduce
      • hdfs dfs -get /hdp/apps/2.6.0.3-8/mapreduce/mapreduce.tar.gz
  • Read files in s3a from spark (spark)
    • spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key","XXX") 
    • spark.sparkContext.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled","false") 
    • spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint","host:port") 
    • spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key","XXX")
    • spark.read.text("s3a://path/to/the/file")
  • Setting the logging level of the ambari-agent.log (ambari)
    • cd /etc/ambari-agent/conf
    • cp logging.conf.sample logging.conf
    • vim logging.conf
      • [logger_root]
        level=WARNING

  • Setting the logging level of the hiveserver2.log (hive)
    • Ambari web UI -> hive -> config -> advanced hive-log4j -> hive.root.logger=INFO,DRFA
  • Push JSON Records (spark)
    • val df = temp.toDF("createdAt", "users", "tweet")
    • val json_rdd = df.toJSON.rdd
    • json_rdd.foreachPartition ( partition => { // Send records to Kinesis / Kafka })

  • How to specify hive tez job name showing at resource manager UI (tez)