Friday, December 30, 2016

ML - Algorithm

  • Correlation
  • Regularization (sketch below)
    • L1 or lasso
    • L2 or ridge
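
A minimal sketch of the two penalties above, using scikit-learn's Ridge and Lasso; the toy data and the alpha values are made up purely for illustration:

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(100, 10)
    # only the first three features actually matter in this toy data
    y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + 0.1 * rng.randn(100)

    ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all coefficients toward 0
    lasso = Lasso(alpha=0.1).fit(X, y)   # L1: drives some coefficients to exactly 0

    print("ridge:", np.round(ridge.coef_, 3))
    print("lasso:", np.round(lasso.coef_, 3))
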
  • gradient descent (parameter learning)
    • (batch) gradient descent
      • small data
      • choose learning rate
    • stochastic gradient descent (sketch below)
      • big data
      • online / streaming learning
      • data shuffling
      • step size
        • learning curve
        • a step size that decreases with iterations is very important
      • coefficient
        • never use the latest learned coefficients
          • they may never converge
        • use the average of the coefficients over iterations instead
      • regularization
      • mini batch (suggested)
        • batch size = 100 (general)
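
A rough numpy sketch of the mini-batch SGD recipe above (shuffle each pass, decrease the step size with the iteration count, report the averaged coefficients rather than the latest ones); the batch size, step-size schedule, and toy data are arbitrary illustrative choices:

    import numpy as np

    def minibatch_sgd(X, y, batch_size=100, passes=10, step0=0.1):
        n, d = X.shape
        w = np.zeros(d)
        w_sum = np.zeros(d)
        t = 0
        for _ in range(passes):
            order = np.random.permutation(n)      # shuffle the data each pass
            for start in range(0, n, batch_size):
                idx = order[start:start + batch_size]
                grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
                t += 1
                w -= (step0 / np.sqrt(t)) * grad  # step size decreases with iterations
                w_sum += w
        return w_sum / t                          # averaged coefficients, not the latest

    # toy data for illustration
    X = np.random.randn(1000, 3)
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * np.random.randn(1000)
    print(minibatch_sgd(X, y))
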
    • conjugate gradient, BFGS, and L-BFGS vs. gradient descent (sketch below)
      • advantages
        • no need to manually pick alpha (the learning rate)
        • often faster than gradient descent
      • disadvantage
        • more complex
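
These optimizers are usually called through a library rather than written by hand; a minimal sketch with SciPy's L-BFGS, using a plain least-squares loss chosen only for illustration:

    import numpy as np
    from scipy.optimize import minimize

    X = np.random.randn(200, 3)                   # toy data for illustration
    y = X @ np.array([1.0, -2.0, 0.5])

    def loss(w):
        r = X @ w - y
        return r @ r

    def grad(w):
        return 2 * X.T @ (X @ w - y)

    # no learning rate (alpha) to pick; the line search is handled internally
    res = minimize(loss, x0=np.zeros(3), jac=grad, method="L-BFGS-B")
    print(res.x)
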
  • normal equation (sketch below)
    • versus gradient descent
      • no need to choose alpha
      • do not need to iterate
      • slow if the number of features is very large
    • feature scaling is not actually necessary
    • when will X transpose X be non-invertible?
      • redundant features
        • e.g. x1 = size in feet^2, x2 = size in m^2 -> x1 = (3.28)^2 * x2
      • too many features
        • e.g. 100 features but only 10 training examples
      • solutions
        • delete some features
        • use regularization
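
A minimal sketch of the normal equation and its regularized variant, which keeps the matrix invertible even with redundant features or more features than training examples; the toy data and lambda value are made up:

    import numpy as np

    X = np.random.randn(50, 4)                    # toy data for illustration
    y = X @ np.array([2.0, -1.0, 0.5, 3.0])

    # normal equation: theta = (X^T X)^(-1) X^T y -- no alpha, no iterations,
    # but solving the d x d system is roughly O(d^3), slow for very large d
    theta = np.linalg.solve(X.T @ X, X.T @ y)

    # regularized version: X^T X + lambda * I is always invertible
    lam = 1.0
    theta_reg = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    print(theta, theta_reg)
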
  • linear regression
    • y = continuous value (e.g. house price)
  • logistic regression
    • classification
      • y = 1, 0
    • multiclass classification
      • y = 1, 2, 3 ...
      • one-vs-all (one-vs-rest); see the sketch below
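
A small sketch of one-vs-all: fit one binary classifier per class and predict the class whose classifier is most confident. scikit-learn's LogisticRegression and make_blobs are used only as convenient stand-ins for the classifier and the data:

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.linear_model import LogisticRegression

    X, y = make_blobs(n_samples=300, centers=3, random_state=0)  # toy 3-class data

    classifiers = []
    for k in np.unique(y):
        clf = LogisticRegression()
        clf.fit(X, (y == k).astype(int))          # class k vs. the rest
        classifiers.append(clf)

    # predict the class with the highest "is class k" probability
    scores = np.column_stack([c.predict_proba(X)[:, 1] for c in classifiers])
    print("training accuracy:", np.mean(np.argmax(scores, axis=1) == y))
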
  • boosting (sketch below)
    • face detection, malware classification, credit fraud detection, ads click-through rate estimation, sales forecasting, ranking web pages for search, Higgs boson detection
      • Netflix
    • ensemble methods
      • AdaBoost
        • basic classification
      • gradient boosting
        • beyond basic classification
      • random forests
        • bagging
        • simpler
        • easier to parallelize
        • typically higher error
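
All three ensemble methods above are available in scikit-learn; a quick comparison sketch on a synthetic dataset with default hyperparameters (nothing tuned here):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                                  RandomForestClassifier)
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    for model in (AdaBoostClassifier(), GradientBoostingClassifier(),
                  RandomForestClassifier()):
        score = cross_val_score(model, X, y, cv=5).mean()
        print(type(model).__name__, round(score, 3))
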
  • neural networks (sketch below)
    • try to have the same number of hidden units in every layer
      • usually the more the better, but computationally more expensive
    • random initialization
    • forward propagation
    • compute cost function
    • back propagation
    • gradient checking
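
A compact numpy sketch of the steps listed above for a one-hidden-layer network (random initialization, forward propagation, cost, back propagation, gradient checking); the layer sizes, squared-error cost, and toy data are arbitrary choices for illustration:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.RandomState(0)
    X = rng.randn(20, 3)                        # toy inputs
    y = rng.rand(20, 1)                         # toy targets in [0, 1]

    # random initialization: small values to break symmetry between units
    W1 = 0.01 * rng.randn(3, 5)
    W2 = 0.01 * rng.randn(5, 1)

    def cost(W1, W2):
        hidden = sigmoid(X @ W1)                # forward propagation
        out = sigmoid(hidden @ W2)
        return 0.5 * np.mean((out - y) ** 2)    # cost function (squared error)

    # back propagation
    hidden = sigmoid(X @ W1)
    out = sigmoid(hidden @ W2)
    d_out = (out - y) * out * (1 - out) / len(X)
    grad_W2 = hidden.T @ d_out
    d_hidden = (d_out @ W2.T) * hidden * (1 - hidden)
    grad_W1 = X.T @ d_hidden

    # gradient checking: compare one backprop entry with a finite difference
    eps = 1e-5
    W1_plus, W1_minus = W1.copy(), W1.copy()
    W1_plus[0, 0] += eps
    W1_minus[0, 0] -= eps
    numeric = (cost(W1_plus, W2) - cost(W1_minus, W2)) / (2 * eps)
    print("backprop:", grad_W1[0, 0], "numeric:", numeric)
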
  • stemming (sketch below)
    • for text mining / classifiers
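
Stemming is normally done with an existing library; a tiny sketch using NLTK's Porter stemmer (any stemmer would work, this one is just a common choice):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["running", "runs", "classification", "classifiers", "mining"]:
        print(word, "->", stemmer.stem(word))
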
