Monday, May 11, 2020

AI - Optimizer

  • Gradient descent (parameter learning)
    • Stochastic gradient descent
      • Big data
      • Online/streaming learning
      • Data shuffling
      • Step size
        • Learning curve
        • Step size that decreases with iterations is very important
      • Coefficient
        • Never use the latest learned coefficients on their own
          • They keep oscillating and never fully converge
        • Use the average of the learned coefficients instead
      • Regularization
      • Mini-batch (suggested)
        • batch size = 100 (general)
        • batch size = 32, 64, 128 ...
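
The SGD notes above (shuffling, a step size that decreases with iterations, averaged coefficients, mini-batches) can be sketched in plain Python. This is a minimal illustration under assumptions, not a reference implementation: the function name `minibatch_sgd`, the toy data, the `alpha0 / (1 + decay * step)` schedule, averaging over the second half of training, and the tiny batch size of 4 are all choices made for this example.

```python
import random

def minibatch_sgd(xs, ys, epochs=200, batch_size=4, alpha0=0.1, decay=0.001, seed=0):
    """Fit y ~ w*x + b with mini-batch SGD and return averaged coefficients."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    w_sum, b_sum, n_avg = 0.0, 0.0, 0
    indices = list(range(len(xs)))
    step = 0
    for epoch in range(epochs):
        rng.shuffle(indices)  # shuffle the data before every pass
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            # Gradient of the mean squared error over the mini-batch.
            grad_w = 2 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = 2 * sum(errors) / len(batch)
            step += 1
            alpha = alpha0 / (1 + decay * step)  # step size decreases with iterations
            w -= alpha * grad_w
            b -= alpha * grad_b
            # Assumed averaging rule: average the iterates of the second half
            # of training instead of returning the latest coefficients.
            if epoch >= epochs // 2:
                w_sum += w
                b_sum += b
                n_avg += 1
    return w_sum / n_avg, b_sum / n_avg

# Toy data generated from y = 3x + 1 (hypothetical example values).
xs = [i / 10 for i in range(20)]
ys = [3 * x + 1 for x in xs]
w_avg, b_avg = minibatch_sgd(xs, ys)
print(w_avg, b_avg)
```

On real data the batch size would more likely be one of the values listed above (32, 64, 128), and the schedule constants would need tuning against the learning curve.
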
  • Conjugate gradient, BFGS, and L-BFGS vs. gradient descent
    • Advantages
      • No need to manually pick alpha which is the learning rate
      • Often faster than gradient descent
    • Disadvantage
      • more complex
  • Newton's method
    • Converge fast
  • Normal equation
    • Versus gradient descent
      • No need to choose alpha
      • Do not need to iterate
      • Slow if the number of features is very large
    • Feature scaling is not actually necessary
    • When will X^T X (X transpose times X) be non-invertible?
      • Redundant features
        • E.g. x1 = size in feet^2, x2 = size in m^2 -> x1 = (3.28)^2 * x2
      • Too many features
        • E.g. 100 features and 10 training examples
      • Solutions
        • Delete some features
        • Use regularization
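
A minimal sketch of the normal equation, theta = (X^T X + lam * I)^-1 X^T y, showing how the redundant feature from the example above makes X^T X singular and how regularization restores invertibility. The helper names, the toy sizes and prices, and the singularity tolerance are assumptions for illustration. For simplicity this sketch also penalizes the intercept, which the usual regularized form does not.

```python
def solve_linear_system(a, b, tol=1e-9):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    scale = max(abs(v) for row in a for v in row) or 1.0
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) <= tol * scale:
            raise ValueError("X^T X is singular (non-invertible)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def normal_equation(X, y, lam=0.0):
    """Return theta solving (X^T X + lam * I) theta = X^T y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * target for row, target in zip(X, y)) for i in range(k)]
    for i in range(k):
        xtx[i][i] += lam  # assumption: every coefficient is penalized
    return solve_linear_system(xtx, xty)

# Columns: intercept, size in m^2, size in feet^2 (redundant with m^2).
sizes_m2 = [50.0, 80.0, 120.0]
X_redundant = [[1.0, s, s * 3.28 ** 2] for s in sizes_m2]
y_price = [150.0, 240.0, 360.0]  # hypothetical prices

try:
    normal_equation(X_redundant, y_price, lam=0.0)
except ValueError as err:
    print("Without regularization:", err)

theta = normal_equation(X_redundant, y_price, lam=1.0)
print("With regularization:", theta)
```

Deleting the feet^2 column would be the other fix listed above; the regularized solve keeps both columns and shares the weight between them.
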
  • Adagrad
    • The Adagrad optimizer is one alternative. Its key insight is that it adapts the learning rate for each coefficient in a model, monotonically lowering the effective learning rate. This works well for convex problems but isn't always ideal for non-convex problems such as neural network training. You can use Adagrad by specifying AdagradOptimizer instead of GradientDescentOptimizer. Note that you may need to use a larger learning rate with Adagrad.
      • my_optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate)

      • my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
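
To make the per-coefficient adaptation concrete, here is a small sketch of the Adagrad update rule in plain Python, independent of TensorFlow. The function name `adagrad`, the toy objective f(x, y) = x^2 + 10*y^2, and the hyperparameter values are assumptions for illustration only.

```python
import math

def adagrad(grad_fn, theta, learning_rate=0.5, steps=100, eps=1e-8):
    """Run Adagrad and return the final coefficients plus per-step effective rates."""
    theta = list(theta)
    accum = [0.0] * len(theta)  # running sum of squared gradients per coefficient
    rate_history = []
    for _ in range(steps):
        grads = grad_fn(theta)
        rates = []
        for i, g in enumerate(grads):
            accum[i] += g * g
            # Effective learning rate: can only shrink because accum only grows.
            rate = learning_rate / (math.sqrt(accum[i]) + eps)
            theta[i] -= rate * g
            rates.append(rate)
        rate_history.append(rates)
    return theta, rate_history

def loss(theta):
    return theta[0] ** 2 + 10 * theta[1] ** 2

def grad(theta):
    return [2 * theta[0], 20 * theta[1]]

start = [3.0, -2.0]
final_theta, rate_history = adagrad(grad, start)
print(loss(start), loss(final_theta))
print(rate_history[0], rate_history[-1])
```

The steeper coordinate (the second one here) accumulates larger squared gradients and therefore gets a smaller effective rate, which is also why a larger base learning rate is often needed.
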

  • Adam
    • For non-convex optimization problems, Adam is sometimes more efficient than Adagrad. To use Adam, invoke the tf.train.AdamOptimizer method. This method takes several optional hyperparameters as arguments. In a production setting, you should specify and tune the optional hyperparameters carefully.
  • Ftrl
    • For wide network
  • Weighting data (custom loss function)
    • 2018 Machine learning yearning
      • P76
  • Loss function

Tip 4 GCP

BigQuery
  • Unpivot a table (bigQuery)
  • select date, m_name, m_value
    from test.kpi_daily
    unpivot(m_value for m_name in (b2c_signup_cnt, b2b_signup_cnt))

  • Format an array (bigQuery)
    • format("%T", array_agg(date order by date))
  • Get the last value (bigQuery)

    select array_agg(value order by time desc)[OFFSET(0)] value_last 
    from (
    select 1 time, 33 value
    union all
    select 2 time, 22 value
    union all
    select 3 time, 11 value
    )


  • Unnest multiple arrays (bigQuery)

    WITH data AS(
      SELECT 1 n, ['a', 'b'] AS r, [1,2] b, ['a1','b2'] c UNION ALL
      SELECT 2, ['c', 'd', 'e'], [3,4,5], ['c3','d4','e5'] UNION ALL
      select 3, ['f'], [6], ['f6']
    )
     
    SELECT n, r, b, c
    FROM data, UNNEST(r) r WITH OFFSET pos1, unnest(b) b WITH OFFSET pos2, unnest(c) c WITH OFFSET pos3
    where pos1=pos2 and pos2=pos3


  • Create dataset, execute a query and write to a table, export table to CSV in GCS (bigQuery)

  • Using the BigQuery Storage API to download large results (bigQuery)

  • Readability (bigQuery)

    • format("%'d", 1000) = 1,000
  • Login streak (bigQuery)


    with tmp as (
    select 'a' usn, date('2019-01-01') login_date
    union all select 'a', date('2019-01-02') union all select 'a', date('2019-01-04')
    union all select 'a', date('2019-01-05') union all select 'a', date('2019-01-06')
    union all select 'b', date('2019-01-02') union all select 'b', date('2019-01-03'))

    , tmp_user_min_login_date as (select usn, min(login_date) start_login_date from tmp group by usn)

    select usn, min(login_date) std_dt, max(login_date) end_dt, count(*) cnt from (
    select ta.usn, login_date, row_number() over (order by ta.usn, login_date) + date_diff(start_login_date, login_date, day) num
    from tmp ta inner join tmp_user_min_login_date tb on ta.usn=tb.usn
    )
    group by usn, num
    having cnt > 1
    order by usn, num desc
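
The same login-streak idea can be sketched in Python for checking the query's result on a small sample. Within one user, consecutive dates share the same value of (date minus position), which plays the role of the `num` column in the query above. The function name `login_streaks` and the `min_length` parameter are assumptions for this example; the sample rows mirror the `tmp` data.

```python
from datetime import date, timedelta
from itertools import groupby

def login_streaks(logins, min_length=2):
    """Return (user, start, end, count) for each run of consecutive login dates."""
    streaks = []
    rows = sorted(set(logins))  # order by user, then date; drop duplicate logins
    for user, user_rows in groupby(rows, key=lambda r: r[0]):
        dates = [d for _, d in user_rows]
        # Dates in one streak map to the same anchor: date - position.
        anchored = groupby(enumerate(dates), key=lambda p: p[1] - timedelta(days=p[0]))
        for _, run in anchored:
            run_dates = [d for _, d in run]
            if len(run_dates) >= min_length:  # mirrors "having cnt > 1"
                streaks.append((user, run_dates[0], run_dates[-1], len(run_dates)))
    return streaks

logins = [
    ("a", date(2019, 1, 1)), ("a", date(2019, 1, 2)), ("a", date(2019, 1, 4)),
    ("a", date(2019, 1, 5)), ("a", date(2019, 1, 6)),
    ("b", date(2019, 1, 2)), ("b", date(2019, 1, 3)),
]
streaks = login_streaks(logins)
for streak in streaks:
    print(streak)
```
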


  • Column number of a table (bigQuery)

    • select array_length(regexp_extract_all(to_json_string(t),"\":")) total_columns
      from `netmarble-gameservice-ai.rmt_stonemmotw_ml.feature_20190517` t limit 1

AI - Imbalanced data

  • Data

  • Metrics

    • Confusion matrix, precision, recall, F1

    • ROC

    • AUC PR

      • Maximize the area under the precision-recall curve for the less common class

    • Normalized Gini Coefficient

    • Business impact

      • E.g. reducing false positives is more important than reducing false negatives

  • Algorithms

  • Resampling

    • Over/up-sampling minority class

      • Note

        • Always split into test and train sets before trying oversampling techniques

      • Options

        • Adding more copies of the minority class

        • Synthetic sampling

          • from imblearn.over_sampling import SMOTE

        • Adaptive sampling

          • Borderline-SMOTE

            • An improvement on SMOTE

        • ADASYN (adaptive synthetic sampling)

    • Under/down-sampling majority class

      • Options

        • Removing examples from the majority class

        • Tomek links

          • from imblearn.under_sampling import TomekLinks

        • Cluster centroids

          • from imblearn.under_sampling import ClusterCentroids

    • Over-sampling followed by under-sampling

      • Options

        • SMOTE and Tomek links

          • from imblearn.combine import SMOTETomek

    • Stratified sampling

      • Removes the variance in class proportions across batches

      • Can be useful when batch training a classifier
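
The "split first, then oversample" rule above can be sketched with the standard library only. This illustrates the simplest option (adding more copies of the minority class), not SMOTE; the helper names and the toy dataset of 90 negatives and 10 positives are assumptions for this example.

```python
import random
from collections import Counter

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and split them into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def oversample_minority(rows, seed=0):
    """Duplicate random rows of the smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append((features, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Toy imbalanced dataset: (features, label) pairs.
data = [([i], 0) for i in range(90)] + [([i], 1) for i in range(10)]

# 1. Split first, so no duplicated minority row can leak into the test set.
train, test = train_test_split(data)
# 2. Oversample the training set only; the test set keeps the real distribution.
train_balanced = oversample_minority(train)

print(Counter(label for _, label in train))
print(Counter(label for _, label in train_balanced))
print(Counter(label for _, label in test))
```
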

  • Outliers

    • Interquartile Range (IQR)

      • The IQR is the difference between the 75th percentile and the 25th percentile. The aim is to set thresholds beyond the 75th and 25th percentiles and delete any instance that falls outside them.
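
A minimal sketch of IQR-based outlier removal using the standard library. The multiplier of 1.5 is the conventional choice and an assumption here, since the note does not fix how far beyond the quartiles the thresholds sit; the function name and sample values are also made up for illustration.

```python
import statistics

def remove_outliers_iqr(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # 75th percentile minus 25th percentile
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]

amounts = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
cleaned = remove_outliers_iqr(amounts)
print(cleaned)
```

A larger `k` keeps more of the extreme values, which matters with imbalanced data because rare-class instances can look like outliers.
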

  • Features

    • If the majority-class and minority-class data are well separated, the imbalance is not a problem

      • from sklearn.manifold import TSNE 

      • from sklearn.decomposition import PCA, TruncatedSVD

    • Adding

      • If the classes are not well separable, maybe we can find an additional feature that helps distinguish between the two classes and so improves the classifier's accuracy

      • Compared with resampling approaches, which change the reality of the data, enriching the data with more information from reality is a far better idea when it is possible.

    • Deleting

      • Drop all of the features that have very similar distributions between the two types of transactions

  • Hyperparameters

    • Classification threshold

      • Use the Precision-Recall curve to choose the threshold
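
One way to pick a threshold from the precision-recall trade-off is to score every candidate threshold and keep the best one. The sketch below uses maximum F1 as the selection rule, which is an assumption; a business constraint such as a minimum precision would work the same way. The function names, labels, and scores are made up for illustration.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, y_true) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def best_threshold(y_true, scores):
    """Return (f1, threshold, precision, recall) for the threshold with the highest F1."""
    best = None
    for threshold in sorted(set(scores)):
        precision, recall = precision_recall_at(y_true, scores, threshold)
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        if best is None or f1 > best[0]:
            best = (f1, threshold, precision, recall)
    return best

y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.6, 0.7, 0.9]
f1, threshold, precision, recall = best_threshold(y_true, scores)
print(threshold, precision, recall, f1)
```

The threshold should be chosen on validation data rather than on the test set, so the reported metrics stay unbiased.
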

  • Cross-validation

    • from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline

      • Resampling happens during CV

AI - Production

GCP

Tip 4 Other

  • Increase the density of x-ticks (python)
    • test = sns.distplot(df.BattlePower_diff, 100, kde=False)
      plt.xticks(rotation=90)
      test.xaxis.set_major_locator(plt.MaxNLocator(200))
  • Automatic timestamp when a cell on the same row gets updated (google sheet)

    function onEdit(e) {
      var ss = SpreadsheetApp.getActiveSheet();
      var r = ss.getActiveCell();
      //1.Change 'Sheet1' to be matching your sheet name
      if (r.getColumn() > 1 && ss.getName()=='Sheet1') { // 2. If Edit is done in any column after Column (A)  And sheet name is Sheet1 then:
        var celladdress = 'A' + r.getRowIndex();
        ss.getRange(celladdress).setValue(new Date()).setNumberFormat("yyyy-MM-dd hh:mm:ss");
      }
    };


  • Automatic sorting when a cell gets updated (google sheet)

    function onEdit(event){
      var sheet = event.source.getActiveSheet();
      var editedCell = sheet.getActiveCell();
    
      var columnToSortBy = 4;
      var tableRange = "B3:E9";
    
      if(editedCell.getColumn() == columnToSortBy){   
        var range = sheet.getRange(tableRange);
        range.sort( { column : columnToSortBy } );
      }
    }