- Gradient descent (parameter learning)
- Stochastic gradient descent
- Big data
- Online/streaming learning
- Data shuffling
- Step size
- Learning curve
- Step size that decreases with iterations is very important
- Coefficient
- Don't use only the latest learned coefficients
- They never converge
- Use the average instead
- Regularization
- Mini-batch (suggested; see the SGD sketch after this list)
- batch size = 100 (general)
- batch size = 32, 64, 128 ...
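- A minimal NumPy sketch (not from the original notes) of mini-batch SGD for linear regression, illustrating the shuffling, decreasing step size, and coefficient-averaging points above; the toy data, batch size, and decay schedule are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                                # toy features (illustrative only)
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
w_avg = np.zeros(3)                                           # running average of the coefficients
batch_size, n_steps, alpha0 = 32, 500, 0.1

for t in range(1, n_steps + 1):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # sample (shuffle) a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 / batch_size * Xb.T @ (Xb @ w - yb)            # MSE gradient on the batch
    alpha = alpha0 / np.sqrt(t)                               # step size that decreases with iterations
    w -= alpha * grad
    w_avg += (w - w_avg) / t                                  # keep the average, not only the latest coefficients

print(w_avg)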
- Conjugate gradient, BFGS, and L-BFGS vs. gradient descent
- Advantages
- No need to manually pick alpha (the learning rate)
- Often faster than gradient descent
- Disadvantage
- More complex
- Newton's method
- Converges fast
- Normal equation
- Versus gradient descent
- No need to choose alpha
- No need to iterate
- Slow if the number of features is very large
- Feature scaling is not actually necessary
- When will X transpose times X (XᵀX) be non-invertible?
- Redundant features
- E.g. x1 = size in feet^2, x2 = size in m^2 -> x1 = (3.28)^2 * x2
- Too many features
- E.g. 100 features but only 10 training examples
- Solutions
- Delete some features
- Use regularization
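- A small NumPy sketch (not from the original notes) of the normal-equation notes above; np.linalg.pinv (pseudo-inverse) is used so the example still runs when XᵀX is non-invertible because of the redundant feet²/m² feature, and no feature scaling or iteration is needed. The toy data is made up.
import numpy as np

rng = np.random.default_rng(0)
m = 50
size_ft2 = rng.uniform(500, 3000, size=m)              # x1: size in feet^2
size_m2 = size_ft2 / 3.28 ** 2                         # x2: redundant feature, x1 = 3.28^2 * x2
X = np.column_stack([np.ones(m), size_ft2, size_m2])   # intercept column + features
y = 100 * size_m2 + rng.normal(scale=50, size=m)       # toy target

# Normal equation: theta = (X^T X)^(-1) X^T y.
# X^T X is singular here (redundant features), so use the pseudo-inverse instead of inv.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)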
- Adagrad
- The Adagrad optimizer is one alternative. The key insight of Adagrad is that it modifies the learning rate adaptively for each coefficient in a model, monotonically lowering the effective learning rate. This works great for convex problems but isn't always ideal for the non-convex problem of neural-network training. You can use Adagrad by specifying AdagradOptimizer instead of GradientDescentOptimizer. Note that you may need to use a larger learning rate with Adagrad.
import tensorflow as tf  # TF 1.x API; learning_rate is defined elsewhere
my_optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate)
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)  # clip gradients to a max norm of 5.0
- Adam
- For non-convex optimization problems, Adam is sometimes more efficient than Adagrad. To use Adam, invoke the tf.train.AdamOptimizer method. This method takes several optional hyperparameters as arguments. In a production setting, you should specify and tune the optional hyperparameters carefully.
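- A one-line sketch of the drop-in swap described above, using the TF 1.x API named in the note; the hyperparameter values below are only illustrative defaults, not tuned settings.
import tensorflow as tf  # TF 1.x API

# Adam with explicitly specified (illustrative) hyperparameters.
my_optimizer = tf.train.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8)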
- Ftrl
- For wide networks
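- A corresponding sketch for FTRL (TF 1.x API); the learning rate and L1 strength are illustrative values, and FTRL is typically used for the wide, sparse linear part of a model.
import tensorflow as tf  # TF 1.x API

# FTRL optimizer, commonly paired with wide, sparse linear models.
my_optimizer = tf.train.FtrlOptimizer(learning_rate=0.1, l1_regularization_strength=0.001)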
- Weighting data (custom loss function)
- Machine Learning Yearning (2018), p. 76
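- As an illustration of weighting data through the loss (a minimal sketch, not the Machine Learning Yearning example itself), a per-example weighted squared error in NumPy; the data and weights are made up.
import numpy as np

def weighted_mse(y_true, y_pred, weights):
    # Per-example weighted squared error: a larger weight makes that example matter more.
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return np.sum(weights * (y_true - y_pred) ** 2) / np.sum(weights)

# Toy usage: the second example is weighted 10x more than the others.
print(weighted_mse([1.0, 2.0, 3.0], [1.1, 2.5, 2.9], weights=[1, 10, 1]))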
- Loss function
- Huber loss
- In statistics, the Huber loss is a loss function used in robust regression that is less sensitive to outliers in the data than the squared error loss.
- https://en.wikipedia.org/wiki/Huber_loss
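- A small NumPy sketch of the Huber loss definition above: quadratic for residuals within delta, linear beyond it, so outliers contribute less than with squared error. delta=1.0 is just a common default.
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    # 0.5 * a^2 if |a| <= delta, else delta * (|a| - 0.5 * delta), where a = y_true - y_pred.
    a = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    quadratic = 0.5 * a ** 2
    linear = delta * (a - 0.5 * delta)
    return np.where(a <= delta, quadratic, linear).mean()

# The outlier residual of 10 is penalized linearly rather than quadratically.
print(huber_loss([0.0, 0.0], [0.5, 10.0]))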
Monday, May 11, 2020
AI - Optimizer
Tips for GCP
- pivot table
- select date, m_name, m_value from test.kpi_daily unpivot(m_value for m_name in (b2c_signup_cnt, b2b_signup_cnt))
- Format an array (BigQuery)
- format("%T", array_agg(date order by date))
Get the last value (bigQuery)
select array_agg(value order by time desc)[OFFSET(0)] value_last
from (
  select 1 time, 33 value
  union all select 2 time, 22 value
  union all select 3 time, 11 value
)
Unnest multiple arrays (bigQuery)
WITH data AS (
  SELECT 1 n, ['a', 'b'] AS r, [1,2] b, ['a1','b2'] c
  UNION ALL SELECT 2, ['c', 'd', 'e'], [3,4,5], ['c3','d4','e5']
  UNION ALL SELECT 3, ['f'], [6], ['f6']
)
SELECT n, r, b, c
FROM data,
  UNNEST(r) r WITH OFFSET pos1,
  UNNEST(b) b WITH OFFSET pos2,
  UNNEST(c) c WITH OFFSET pos3
WHERE pos1 = pos2 AND pos2 = pos3
Create dataset, execute a query and write to a table, export table to CSV in GCS (bigQuery)
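No snippet was kept for this item; below is a hedged sketch using the google-cloud-bigquery Python client, where the project, dataset, table, and bucket names are placeholders rather than real resources.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

# 1. Create a dataset (no-op if it already exists).
client.create_dataset(bigquery.Dataset("my-project.my_dataset"), exists_ok=True)

# 2. Execute a query and write the result to a destination table.
table_id = "my-project.my_dataset.my_table"
job_config = bigquery.QueryJobConfig(destination=table_id, write_disposition="WRITE_TRUNCATE")
client.query("SELECT 1 AS x", job_config=job_config).result()

# 3. Export the table to CSV files in GCS.
client.extract_table(table_id, "gs://my-bucket/my_table_*.csv").result()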
Using the BigQuery Storage API to download large results (bigQuery)
- https://cloud.google.com/bigquery/docs/pandas-gbq-migration#using_the_to_download_large_results
import pandas

sql = "SELECT * FROM `bigquery-public-data.irs_990.irs_990_2012`"

# Use the BigQuery Storage API to download results more quickly.
df = pandas.read_gbq(sql, dialect='standard', use_bqstorage_api=True)
Readability (bigQuery)
- format("%'d", 1000) = 1,000 (the %' flag adds the thousands separators)
Login streak (bigQuery)
with tmp as (
  select 'a' usn, date('2019-01-01') login_date
  union all select 'a', date('2019-01-02')
  union all select 'a', date('2019-01-04')
  union all select 'a', date('2019-01-05')
  union all select 'a', date('2019-01-06')
  union all select 'b', date('2019-01-02')
  union all select 'b', date('2019-01-03')
),
tmp_user_min_login_date as (
  select usn, min(login_date) start_login_date
  from tmp
  group by usn
)
select usn, min(login_date) std_dt, max(login_date) end_dt, count(*) cnt
from (
  select ta.usn, login_date,
    row_number() over (order by ta.usn, login_date)
      + date_diff(start_login_date, login_date, day) num
  from tmp ta
  inner join tmp_user_min_login_date tb on ta.usn = tb.usn
)
group by usn, num
having cnt > 1
order by usn, num desc
Column number of a table (bigQuery)
select array_length(regexp_extract_all(to_json_string(`netmarble-gameservice-ai.rmt_stonemmotw_ml.feature_20190517`), "\":")) total_columns
from `netmarble-gameservice-ai.rmt_stonemmotw_ml.feature_20190517` limit 1
AI - Imbalanced data
Data
30
https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data
58
https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset
24
Metrics
Confusion matrix, precision, recall, F1
ROC
AUC PR
Maximize the area under the precision-recall curve (AUC PR) for the less common class
Normalized Gini Coefficient
Business impact
E.g. reducing false positives may be more important than reducing false negatives
Algorithms
Decision tree
Random forest, xgboost, lightGBM
from sklearn.ensemble import IsolationForest
AutoEncoders
Use latent representations as inputs for another model
GAN
Resampling
Over/up-sampling minority class
Note
Always split into test and train sets before trying oversampling techniques
Options
Adding more copies of the minority class
Synthetic sampling
from imblearn.over_sampling import SMOTE
Adaptive sampling
Borderline-SMOTE
An improvement over SMOTE
ADASYN (adaptive synthetic sampling)
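A minimal usage sketch for the SMOTE import above; the toy dataset is generated only for illustration, and, per the note, only the training split is resampled.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy imbalanced data: roughly 5% positives (illustrative only).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Oversample the minority class on the training split only.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(Counter(y_train), Counter(y_res))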
Under/down-sampling majority class
Options
Removing examples from the majority class
Tomek links
from imblearn.under_sampling import TomekLinks
Cluster centroids
from imblearn.under_sampling import ClusterCentroids
Over-sampling followed by under-sampling
Options
SMOTE and Tomek links
from imblearn.combine import SMOTETomek
Stratified sampling
Removes the variance of the class proportion across batches
Can be useful when batch training a classifier
Outliers
Interquartile Range (IQR)
The IQR is the difference between the 75th percentile and the 25th percentile. The aim is to set thresholds beyond the 75th and 25th percentiles; any instance that falls outside these thresholds is removed.
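A small NumPy sketch of the IQR rule just described; the 1.5x multiplier is the usual convention and the sample values are made up.
import numpy as np

values = np.array([10, 12, 11, 13, 12, 95, 11, 10, -40, 12], dtype=float)  # toy data with two outliers

q25, q75 = np.percentile(values, [25, 75])
iqr = q75 - q25
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr   # thresholds beyond the 25th/75th percentiles

filtered = values[(values >= lower) & (values <= upper)]  # instances past the thresholds are removed
print(lower, upper, filtered)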
Features
If the majority-class and minority-class data are well separated, imbalance is not a problem
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD
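An illustrative sketch of using the imports above to check visually whether the classes are separable; the toy data is generated for illustration, and TSNE could be swapped in for PCA in the same way (it is just slower).
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Toy imbalanced data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.95, 0.05], random_state=0)

# Project to 2D and color by class; well-separated clusters suggest the imbalance is less of a problem.
X_2d = PCA(n_components=2).fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5, cmap="coolwarm")
plt.show()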
Adding
If the classes are not well separable, maybe we can find an additional feature that helps distinguish between the two classes and so improve classifier accuracy
Compared to the approaches in the previous subsection, which change the reality of the data, this approach of enriching the data with more information from reality is a far better idea when it is possible.
Deleting
Drop all of the features that have very similar distributions between the two types of transactions
Hyperparameters
Classification threshold
Use the Precision-Recall curve to choose the threshold
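A minimal sketch of choosing a classification threshold from the precision-recall curve; the toy data and the best-F1 criterion are illustrative choices, not the only option.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Toy imbalanced data (illustrative only).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probs)

# precision/recall have one more entry than thresholds, so drop the last point before scoring.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
print("best threshold:", thresholds[np.argmax(f1)])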
Cross-validation
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
Resampling happens inside CV (on the training folds only)
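A small sketch of the imblearn pipeline import above: the pipeline re-fits SMOTE inside each training fold during cross-validation instead of resampling the whole dataset up front. The toy data and model are illustrative.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy imbalanced data (illustrative only).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# SMOTE is applied only to the training folds inside each CV split.
pipeline = imbalanced_make_pipeline(SMOTE(random_state=0), LogisticRegression(max_iter=1000))
print(cross_val_score(pipeline, X, y, scoring="f1", cv=5))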
Tips for other tools
- Increase the density of x-ticks (Python)
import matplotlib.pyplot as plt
import seaborn as sns

ax = sns.distplot(df.BattlePower_diff, 100, kde=False)  # df is an existing DataFrame with a BattlePower_diff column
plt.xticks(rotation=90)
ax.xaxis.set_major_locator(plt.MaxNLocator(200))  # allow up to 200 tick locations on the x-axis
Automatic timestamp when a cell on the same row gets updated (google sheet)
function onEdit(e) {
  var ss = SpreadsheetApp.getActiveSheet();
  var r = ss.getActiveCell();
  // 1. Change 'Sheet1' to match your sheet name
  if (r.getColumn() > 1 && ss.getName() == 'Sheet1') {
    // 2. If the edit is in any column after column A and the sheet name is Sheet1:
    var celladdress = 'A' + r.getRowIndex();
    ss.getRange(celladdress).setValue(new Date()).setNumberFormat("yyyy-MM-dd hh:mm:ss");
  }
}
Automatic sorting when a cell gets updated (google sheet)
function onEdit(event) {
  var sheet = event.source.getActiveSheet();
  var editedCell = sheet.getActiveCell();
  var columnToSortBy = 4;
  var tableRange = "B3:E9";
  if (editedCell.getColumn() == columnToSortBy) {
    var range = sheet.getRange(tableRange);
    range.sort({ column: columnToSortBy });
  }
}