Thursday, June 15, 2017

Spark 2.X

  • API
  • Configuration
    • Executor cores
      • <= 5
    • Partitions
      • "spark.sql.shuffle.partitions=2001" for big data
      • Have the number of partitions two to three times the number of cores to maximize parallelism
    • GC
      • G1
      • Parallel GC
    • Driver
      • Automatic restart
        • spark.yarn.maxAppAttempts
        • spark.yarn.am.attemptFailuresValidityInterval
    • CBO (since 2.2)
      • spark.sql.cbo.enabled
    • Other
      • spark.memory.fraction
      • spark.memory.storageFraction
      • spark.speculation
      • spark.sql.streaming.schemaInference
  • Development
  • Monitoring
    • Spark has a configurable metrics system
      • Based on Dropwizard Metrics Library
      • Use Graphite/Frafana to dashboard metrics
    • In structured streaming, use StreamingQueryListener (starting apache spark 2.1)
  • Hardware
    • Xeon+FPGA

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.