Monday, May 11, 2020

AI - Optimizer

  • Gradient descent (parameter learning)
    • Stochastic gradient descent
      • Big data
      • Online/streaming learning
      • Data shuffling (shuffle the training data before each pass)
      • Step size
        • Learning curve
        • A step size that decreases with the number of iterations is very important
      • Coefficient
        • Do not rely only on the latest learned coefficients
          • The last iterate may never converge (it keeps oscillating around the optimum)
        • Use the average of the iterates instead
      • Regularization
      • Mini-batch (suggested; see the sketch below)
        • batch size = 100 (general)
        • batch size = 32, 64, 128 ...
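      • A minimal sketch of mini-batch SGD for linear regression (NumPy, assuming a squared-error loss; the function and variable names are illustrative, not from any library), showing the shuffling, the decreasing step size, and the averaged coefficients from the points above:

        import numpy as np

        def minibatch_sgd(X, y, batch_size=32, epochs=50, eta0=0.1):
            """Mini-batch SGD for linear regression with squared-error loss."""
            n, d = X.shape
            w = np.zeros(d)
            w_avg = np.zeros(d)      # running average of the iterates
            step = 0
            for epoch in range(epochs):
                idx = np.random.permutation(n)       # shuffle the data each pass
                for start in range(0, n, batch_size):
                    batch = idx[start:start + batch_size]
                    grad = 2 * X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)
                    step += 1
                    eta = eta0 / np.sqrt(step)       # step size decreases with iterations
                    w -= eta * grad
                    w_avg += (w - w_avg) / step      # keep the average, not just the last iterate
            return w_avg

      • Called as, e.g., w = minibatch_sgd(X_train, y_train, batch_size=100), matching the batch sizes above
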
  • Conjugate gradient, BFGS, and L-BFGS vs. gradient descent
    • Advantages
      • No need to manually pick alpha (the learning rate)
      • Often faster than gradient descent
    • Disadvantage
      • More complex (in practice usually called through a library; see the sketch below)
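    • A minimal sketch with SciPy's L-BFGS-B on a least-squares objective (the data and function names here are illustrative assumptions, not from the notes); note that no learning rate has to be supplied:

      import numpy as np
      from scipy.optimize import minimize

      X = np.random.randn(200, 5)
      y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * np.random.randn(200)

      def loss(w):
          return np.mean((X @ w - y) ** 2)

      def grad(w):
          return 2 * X.T @ (X @ w - y) / len(y)

      # No alpha to pick: the line search and curvature estimates are handled internally.
      result = minimize(loss, x0=np.zeros(5), jac=grad, method="L-BFGS-B")
      print(result.x)
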
  • Newton's method
    • Converges fast (quadratic convergence near the optimum); see the sketch below
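    • A minimal sketch of the Newton update w ← w − H⁻¹∇ for logistic regression (labels assumed to be 0/1; names illustrative). Convergence typically takes only a handful of iterations, but each one builds and solves a d×d Hessian system:

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def newton_logistic(X, y, iterations=10):
          """Newton's method for logistic regression; y is assumed to be 0/1."""
          n, d = X.shape
          w = np.zeros(d)
          for _ in range(iterations):
              p = sigmoid(X @ w)
              gradient = X.T @ (p - y) / n
              hessian = X.T @ (X * (p * (1 - p))[:, None]) / n
              w -= np.linalg.solve(hessian, gradient)
          return w
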
  • Normal equation
    • Versus gradient descent
      • No need to choose alpha
      • Do not need to iterate
      • Slow if the number of features is very large (solving for the coefficients costs roughly cubic time in the number of features)
    • Feature scaling is not actually necessary
    • When will XᵀX (X transpose times X) be non-invertible?
      • Redundant features
        • E.g. x1 = size in feet², x2 = size in m², so x1 = (3.28)² * x2 (linearly dependent columns)
      • Too many features
        • E.g. 100 features but only 10 training examples
      • Solutions
        • Delete some features
        • Use regularization (see the normal-equation sketch below)
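    • A minimal sketch of the normal equation θ = (XᵀX)⁻¹Xᵀy with an optional L2 term, which is the regularization fix for non-invertibility mentioned above (names illustrative; a real implementation would usually leave the bias column unregularized):

      import numpy as np

      def normal_equation(X, y, l2=0.0):
          """Closed-form least squares: theta = (X^T X + l2*I)^{-1} X^T y."""
          d = X.shape[1]
          # With l2 > 0 the matrix is always invertible, even with redundant
          # features or more features than training examples.
          A = X.T @ X + l2 * np.eye(d)
          # Solving the d x d system costs roughly O(d^3), which is why this
          # approach gets slow when the number of features is very large.
          return np.linalg.solve(A, X.T @ y)
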
  • Adagrad
    • The Adagrad optimizer is one alternative. The key insight of Adagrad is that it modifies the learning rate adaptively for each coefficient in a model, monotonically lowering the effective learning rate. This works well for convex problems but isn't always ideal for non-convex problems such as neural-net training. You can use Adagrad by specifying AdagradOptimizer instead of GradientDescentOptimizer. Note that you may need to use a larger learning rate with Adagrad.
      • my_optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate)

      • my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)

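      • For context, in the TF 1.x estimator workflow these lines assume, the wrapped optimizer would then be passed to an estimator, e.g. (feature_columns is assumed to be defined elsewhere):

        # Assumed TF 1.x usage; `feature_columns` is defined elsewhere.
        linear_regressor = tf.estimator.LinearRegressor(
            feature_columns=feature_columns,
            optimizer=my_optimizer)
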
  • Adam
    • For non-convex optimization problems, Adam is sometimes more efficient than Adagrad. To use Adam, invoke the tf.train.AdamOptimizer method. This method takes several optional hyperparameters as arguments. In a production setting, you should specify and tune the optional hyperparameters carefully; see the sketch below.
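    • A minimal invocation sketch (TF 1.x API; the values shown are the documented defaults, listed only to make the tunable hyperparameters explicit):

      my_optimizer = tf.train.AdamOptimizer(
          learning_rate=0.001,
          beta1=0.9,
          beta2=0.999,
          epsilon=1e-08)
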
  • Ftrl
    • Works well for wide networks (linear models with many sparse features); see the sketch below
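    • A minimal sketch (TF 1.x API; the learning rate and L1 strength are illustrative values). FTRL's built-in L1 regularization helps keep wide, sparse models compact:

      my_optimizer = tf.train.FtrlOptimizer(
          learning_rate=0.1,
          l1_regularization_strength=0.001)
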
  • Weighting data (custom loss function; see the weighted-loss sketch below)
    • 2018 Machine learning yearning
      • P76
  • Loss function
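    • A minimal sketch of weighting data through a custom loss, assuming per-example weights (e.g. to up-weight an under-represented slice of the data, in the spirit of the Machine Learning Yearning reference above); the names are illustrative:

      import numpy as np

      def weighted_mse(y_true, y_pred, sample_weight):
          """Mean squared error where each example contributes in proportion to its weight."""
          y_true = np.asarray(y_true, dtype=float)
          y_pred = np.asarray(y_pred, dtype=float)
          sample_weight = np.asarray(sample_weight, dtype=float)
          errors = (y_true - y_pred) ** 2
          return np.sum(sample_weight * errors) / np.sum(sample_weight)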
