- Gradient descent (parameter learning)
- Stochastic gradient descent
- Big data
- Online/streaming learning
- Data shuffling
- Step size
- Plot the learning curve to check whether the step size is too large or too small
- A step size that decreases with the number of iterations is very important
- Coefficients
- Do not use only the latest learned coefficients
- SGD never truly converges; the coefficients keep oscillating
- Use the average of the coefficients over the final iterations instead
- Regularization
- Mini-batch (suggested; see the sketch after this list)
- A batch size around 100 generally works well
- Common batch sizes: 32, 64, 128, ...
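- A minimal NumPy sketch (my addition, not from the original notes) tying these points together: shuffle the data each epoch, decay the step size with the iteration count, and return averaged coefficients instead of the latest ones. The function and parameter names are illustrative.
import numpy as np

def minibatch_sgd(X, y, batch_size=32, epochs=10, eta0=0.1):
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)                         # running sum for coefficient averaging
    steps = 0
    for epoch in range(epochs):
        order = np.random.permutation(n)        # shuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            eta = eta0 / np.sqrt(steps + 1)     # step size decreases with iterations
            w -= eta * grad
            w_sum += w
            steps += 1
    return w_sum / steps                        # averaged coefficients, not the latest w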
- Conjugate gradient, BFGS, and L-BFGS vs. gradient descent (see the sketch after this list)
- Advantages
- No need to manually pick alpha (the learning rate)
- Often faster than gradient descent
- Disadvantage
- More complex
- Newton's method
- Converges fast
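- A small SciPy sketch (my addition) of the "no alpha to pick" advantage: L-BFGS chooses its own step sizes via line search. The data and objective here are made up for illustration.
import numpy as np
from scipy.optimize import minimize

X = np.random.randn(200, 5)
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * np.random.randn(200)

def loss(w):
    residual = X @ w - y
    return 0.5 * residual @ residual            # least-squares objective

def grad(w):
    return X.T @ (X @ w - y)                    # its gradient

result = minimize(loss, x0=np.zeros(5), jac=grad, method="L-BFGS-B")
print(result.x)                                 # learned coefficients; no learning rate chosen by hand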
- Normal equation
- Versus gradient descent
- No need to choose alpha
- No need to iterate
- Slow if the number of features is very large (inverting XᵀX is expensive)
- Feature scaling is not necessary
- When will XᵀX (X transpose times X) be non-invertible?
- Redundant features
- E.g. x1 = size in feet², x2 = size in m²; then x1 = (3.28)² * x2, so the features are linearly dependent
- Too many features
- E.g. 100 features but only 10 training examples
- Solutions (see the sketch after this list)
- Delete some features
- Use regularization
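- A short NumPy sketch (my addition) of the normal equation with an optional ridge term: adding lam * I keeps XᵀX invertible even with redundant features or more features than training examples, which is the regularization fix mentioned above.
import numpy as np

def normal_equation(X, y, lam=0.0):
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)      # lam > 0 regularizes / fixes non-invertibility
    return np.linalg.solve(A, X.T @ y)          # w = (XᵀX + lam*I)^(-1) Xᵀy, no alpha, no iterations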
- Adagrad
- The Adagrad optimizer is one alternative. The key insight of Adagrad is that it adapts the learning rate separately for each coefficient in the model, monotonically lowering the effective learning rate. This works well for convex problems but isn't always ideal for non-convex problems such as neural network training. You can use Adagrad by specifying AdagradOptimizer instead of GradientDescentOptimizer. Note that you may need a larger learning rate with Adagrad.
import tensorflow as tf  # TF 1.x API; learning_rate is defined elsewhere in the exercise
my_optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate)
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)  # clip gradients for stability
- Adam
- For non-convex optimization problems, Adam is sometimes more efficient than Adagrad. To use Adam, invoke the tf.train.AdamOptimizer method. This method takes several optional hyperparameters as arguments. In a production setting, you should specify and tune those hyperparameters carefully.
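- A minimal TF 1.x sketch (my addition); the values shown are tf.train.AdamOptimizer's defaults, written out explicitly because these are the hyperparameters you would tune in production.
my_optimizer = tf.train.AdamOptimizer(
    learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08)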
- Ftrl
- For wide networks
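- A hedged TF 1.x sketch (my addition): FTRL with L1 regularization, a common pairing for wide, sparse linear parts of a model; the hyperparameter values are illustrative.
my_optimizer = tf.train.FtrlOptimizer(
    learning_rate=0.1, l1_regularization_strength=0.001)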
- Weighting data (custom loss function; see the sketch below)
- Machine Learning Yearning (2018), p. 76
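- A tiny NumPy sketch (my addition) of the weighting idea: give each example its own weight in the loss so that important or under-represented data counts more.
import numpy as np

def weighted_mse(y_true, y_pred, weights):
    # weights: one non-negative weight per training example
    return np.average((y_true - y_pred) ** 2, weights=weights)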
- Loss function
- Huber loss (see the sketch below)
- In statistics, the Huber loss is a loss function used in robust regression that is less sensitive to outliers than the squared error loss.
- https://en.wikipedia.org/wiki/Huber_loss
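- A short NumPy sketch (my addition) of the definition on the Wikipedia page: quadratic for small errors, linear beyond the threshold delta, which is what makes it less sensitive to outliers.
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    small = np.abs(error) <= delta
    squared = 0.5 * error ** 2                          # quadratic near zero
    linear = delta * (np.abs(error) - 0.5 * delta)      # linear for outliers
    return np.mean(np.where(small, squared, linear))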
Monday, May 11, 2020
AI - Optimizer