- Gradient descent (parameter learning)
- Stochastic gradient descent
- Big data
- Online/streaming learning
- Data shuffling
- Step size
- Plot the learning curve to check whether the step size is too large or too small
- A step size that decreases with the number of iterations is very important
- Coefficients
- Do not use only the latest learned coefficients
- SGD never truly converges; the coefficients keep oscillating
- Use the average of the coefficients over the final iterations instead
- Regularization
- Mini-batch (suggested; see the sketch after this list)
- A batch size around 100 generally works well
- Common batch sizes: 32, 64, 128, ...
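- A minimal NumPy sketch (my addition, not from the original notes) tying these points together: shuffle the data each epoch, decay the step size with the iteration count, and return averaged coefficients instead of the latest ones. The function and parameter names are illustrative.
import numpy as np

def minibatch_sgd(X, y, batch_size=32, epochs=10, eta0=0.1):
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)                         # running sum for coefficient averaging
    steps = 0
    for epoch in range(epochs):
        order = np.random.permutation(n)        # shuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            eta = eta0 / np.sqrt(steps + 1)     # step size decreases with iterations
            w -= eta * grad
            w_sum += w
            steps += 1
    return w_sum / steps                        # averaged coefficients, not the latest w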
- Conjugate gradient, BFGS, and L-BFGS vs. gradient descent (see the sketch after this list)
- Advantages
- No need to manually pick alpha (the learning rate)
- Often faster than gradient descent
- Disadvantage
- More complex
- Newton's method
- Converges fast
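- A small SciPy sketch (my addition) of the "no alpha to pick" advantage: L-BFGS chooses its own step sizes via line search. The data and objective here are made up for illustration.
import numpy as np
from scipy.optimize import minimize

X = np.random.randn(200, 5)
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * np.random.randn(200)

def loss(w):
    residual = X @ w - y
    return 0.5 * residual @ residual            # least-squares objective

def grad(w):
    return X.T @ (X @ w - y)                    # its gradient

result = minimize(loss, x0=np.zeros(5), jac=grad, method="L-BFGS-B")
print(result.x)                                 # learned coefficients; no learning rate chosen by hand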
- Normal equation
- Versus gradient descent
- No need to choose alpha
- No need to iterate
- Slow if the number of features is very large (inverting XᵀX is expensive)
- Feature scaling is not necessary
- When will XᵀX (X transpose times X) be non-invertible?
- Redundant features
- E.g. x1 = size in feet², x2 = size in m²; then x1 = (3.28)² * x2, so the features are linearly dependent
- Too many features
- E.g. 100 features but only 10 training examples
- Solutions (see the sketch after this list)
- Delete some features
- Use regularization
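- A short NumPy sketch (my addition) of the normal equation with an optional ridge term: adding lam * I keeps XᵀX invertible even with redundant features or more features than training examples, which is the regularization fix mentioned above.
import numpy as np

def normal_equation(X, y, lam=0.0):
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)      # lam > 0 regularizes / fixes non-invertibility
    return np.linalg.solve(A, X.T @ y)          # w = (XᵀX + lam*I)^(-1) Xᵀy, no alpha, no iterations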
- Adagrad
- The Adagrad optimizer is one alternative. The key insight of Adagrad is that it adapts the learning rate separately for each coefficient in the model, monotonically lowering the effective learning rate. This works well for convex problems but isn't always ideal for non-convex problems such as neural network training. You can use Adagrad by specifying AdagradOptimizer instead of GradientDescentOptimizer. Note that you may need a larger learning rate with Adagrad.
import tensorflow as tf  # TF 1.x API; learning_rate is defined elsewhere in the exercise
my_optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate)
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)  # clip gradients for stability
- Adam
- For non-convex optimization problems, Adam is sometimes more efficient than Adagrad. To use Adam, invoke the tf.train.AdamOptimizer method. This method takes several optional hyperparameters as arguments. In a production setting, you should specify and tune those hyperparameters carefully.
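- A minimal TF 1.x sketch (my addition); the values shown are tf.train.AdamOptimizer's defaults, written out explicitly because these are the hyperparameters you would tune in production.
my_optimizer = tf.train.AdamOptimizer(
    learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08)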
- Ftrl
- For wide networks
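- A hedged TF 1.x sketch (my addition): FTRL with L1 regularization, a common pairing for wide, sparse linear parts of a model; the hyperparameter values are illustrative.
my_optimizer = tf.train.FtrlOptimizer(
    learning_rate=0.1, l1_regularization_strength=0.001)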
- Weighting data (custom loss function; see the sketch below)
- Machine Learning Yearning (2018), p. 76
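- A tiny NumPy sketch (my addition) of the weighting idea: give each example its own weight in the loss so that important or under-represented data counts more.
import numpy as np

def weighted_mse(y_true, y_pred, weights):
    # weights: one non-negative weight per training example
    return np.average((y_true - y_pred) ** 2, weights=weights)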
- Loss function
- Huber loss (see the sketch below)
- In statistics, the Huber loss is a loss function used in robust regression that is less sensitive to outliers than the squared error loss.
- https://en.wikipedia.org/wiki/Huber_loss
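- A short NumPy sketch (my addition) of the definition on the Wikipedia page: quadratic for small errors, linear beyond the threshold delta, which is what makes it less sensitive to outliers.
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    small = np.abs(error) <= delta
    squared = 0.5 * error ** 2                          # quadratic near zero
    linear = delta * (np.abs(error) - 0.5 * delta)      # linear for outliers
    return np.mean(np.where(small, squared, linear))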
Monday, May 11, 2020
AI - Optimizer