- API
  - Prefer the Dataset API (a minimal sketch follows this list)
    - Case classes give you a typed schema
    - Less code
    - More auto-complete
    - Less memory (rows are stored in Tungsten's compact binary format)
    - More room for optimization by Catalyst
    - However, it depends on the workload
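A minimal sketch of the Dataset-over-DataFrame idea. The `Event` case class and the input path are made up for illustration; the point is that fields are checked at compile time while Catalyst still sees the column expressions.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; field names must match the input columns.
case class Event(userId: Long, action: String, durationMs: Long)

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataset-example").getOrCreate()
    import spark.implicits._

    // DataFrame of untyped Rows -> Dataset[Event] with compile-time checked fields.
    val events = spark.read.json("/data/events.json").as[Event]

    // Typed lambda: the compiler (and auto-complete) knows `e.durationMs` exists.
    val slow = events.filter(e => e.durationMs > 1000L)

    // Column expressions remain available and optimize better than opaque
    // lambdas, because Catalyst can see into them.
    slow.groupBy($"action").count().show()

    spark.stop()
  }
}
```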
- Configuration (a combined sketch of these settings follows this list)
  - Executor cores
    - Keep executors small: <= 5 cores each
  - Partitions
    - spark.sql.shuffle.partitions=2001 for big data (above 2000 partitions Spark switches to a more compact map-status format)
    - Use two to three times as many partitions as total cores to maximize parallelism
  - GC
    - G1 GC
    - Parallel GC
  - Driver
    - Automatic restart on YARN
      - spark.yarn.maxAppAttempts
      - spark.yarn.am.attemptFailuresValidityInterval
  - CBO: cost-based optimizer (since Spark 2.2; example after this list)
    - spark.sql.cbo.enabled
  - Other
    - spark.memory.fraction
    - spark.memory.storageFraction
    - spark.speculation
    - spark.sql.streaming.schemaInference
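A sketch of how the settings above might be wired together, with placeholder values rather than a tuned recipe. Executor cores and the GC flag are normally passed to spark-submit or put in spark-defaults.conf; the same goes for the driver auto-restart keys (spark.yarn.maxAppAttempts, spark.yarn.am.attemptFailuresValidityInterval), which are picked up when the application is submitted, so they usually go on the spark-submit command line rather than in code.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("tuning-sketch")
  // Small executors: at most ~5 cores each.
  .config("spark.executor.cores", "5")
  // G1 GC on the executors.
  .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
  // Above 2000 shuffle partitions Spark uses the compressed map-status format.
  .config("spark.sql.shuffle.partitions", "2001")
  // Split of the unified memory region between execution and storage
  // (the values shown are the 2.x defaults).
  .config("spark.memory.fraction", "0.6")
  .config("spark.memory.storageFraction", "0.5")
  // Re-launch suspiciously slow tasks speculatively.
  .config("spark.speculation", "true")
  // Let file-based streaming sources infer their schema.
  .config("spark.sql.streaming.schemaInference", "true")
  .getOrCreate()
```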
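The cost-based optimizer only helps if it has statistics to work with, so enabling the flag goes hand in hand with ANALYZE TABLE. The `sales` table and its columns are hypothetical:

```scala
// Enable the cost-based optimizer (off by default in Spark 2.2).
spark.conf.set("spark.sql.cbo.enabled", "true")

// CBO decisions (join ordering, build-side selection, ...) rely on
// statistics collected ahead of time; `sales` is a placeholder table name.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")
```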
- Development
  - Bucketing (see the sketch after this list)
  - ETL
    - Skip corrupt files
      - spark.sql.files.ignoreCorruptFiles = true
    - Skip corrupt records (see the sketch after this list)
      - JSON: malformed records go to the "_corrupt_record" column; the column name can be configured via spark.sql.columnNameOfCorruptRecord
      - CSV: use the mode option (PERMISSIVE, DROPMALFORMED, or FAILFAST)
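Bucketing pre-partitions the data by a key at write time, so later joins or aggregations on that key can skip the shuffle. The DataFrame, path, and table/column names below are placeholders; note that bucketBy only works together with saveAsTable:

```scala
import org.apache.spark.sql.SaveMode

// Placeholder input; any DataFrame with a user_id column would do.
val orders = spark.read.parquet("/data/orders")

orders.write
  .mode(SaveMode.Overwrite)
  .bucketBy(32, "user_id")   // hash into 32 buckets by the join key
  .sortBy("user_id")         // keep each bucket sorted
  .saveAsTable("orders_bucketed")
```

Two tables bucketed into the same number of buckets on the join key can then be joined without shuffling either side.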
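A sketch of the corrupt-data options above; the paths are placeholders:

```scala
// Skip files that cannot be read at all (e.g. truncated Parquet parts).
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

// JSON: malformed lines are kept and routed into a dedicated column.
// The column name defaults to "_corrupt_record" (spark.sql.columnNameOfCorruptRecord)
// and is overridden per read here.
val json = spark.read
  .option("columnNameOfCorruptRecord", "_bad_record")
  .json("/data/input/*.json")

// CSV: the parser mode decides what happens to malformed rows;
// DROPMALFORMED discards them (PERMISSIVE and FAILFAST are the alternatives).
val csv = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .csv("/data/input/*.csv")
```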
- Monitoring
  - Spark has a configurable metrics system
    - Based on the Dropwizard Metrics library
    - Use Graphite/Grafana to dashboard the metrics
  - In Structured Streaming, use StreamingQueryListener (available since Apache Spark 2.1); a sketch follows this list
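A minimal StreamingQueryListener sketch: it just prints the progress numbers, but the same callbacks could push them into the metrics system or Graphite instead.

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Gets callbacks for every streaming query running in this SparkSession.
val listener = new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"query started: ${event.id}")

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // event.progress also carries batch duration, watermark, state metrics, ...
    val p = event.progress
    println(s"${p.name}: ${p.inputRowsPerSecond} rows/s in, " +
      s"${p.processedRowsPerSecond} rows/s processed")
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"query terminated: ${event.id}, error = ${event.exception}")
}

spark.streams.addListener(listener)
```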
- Hardware
  - Xeon+FPGA