- API
  - Prefer the Dataset API (a minimal sketch follows this list)
    - Case classes give you a typed schema
    - Less code
    - More auto-complete
    - Less memory (rows are stored in Tungsten's compact binary format)
    - More room for optimization by Catalyst
    - However, it depends on the workload
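A minimal sketch of the Dataset-over-DataFrame idea. The `Event` case class and the input path are made up for illustration; the point is that fields are checked at compile time while Catalyst still sees the column expressions.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; field names must match the input columns.
case class Event(userId: Long, action: String, durationMs: Long)

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataset-example").getOrCreate()
    import spark.implicits._

    // DataFrame of untyped Rows -> Dataset[Event] with compile-time checked fields.
    val events = spark.read.json("/data/events.json").as[Event]

    // Typed lambda: the compiler (and auto-complete) knows `e.durationMs` exists.
    val slow = events.filter(e => e.durationMs > 1000L)

    // Column expressions remain available and optimize better than opaque
    // lambdas, because Catalyst can see into them.
    slow.groupBy($"action").count().show()

    spark.stop()
  }
}
```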
- Configuration (a combined sketch of these settings follows this list)
  - Executor cores
    - Keep executors small: <= 5 cores each
  - Partitions
    - spark.sql.shuffle.partitions=2001 for big data (above 2000 partitions Spark switches to a more compact map-status format)
    - Use two to three times as many partitions as total cores to maximize parallelism
  - GC
    - G1 GC
    - Parallel GC
  - Driver
    - Automatic restart on YARN
      - spark.yarn.maxAppAttempts
      - spark.yarn.am.attemptFailuresValidityInterval
  - CBO: cost-based optimizer (since Spark 2.2; example after this list)
    - spark.sql.cbo.enabled
  - Other
    - spark.memory.fraction
    - spark.memory.storageFraction
    - spark.speculation
    - spark.sql.streaming.schemaInference
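A sketch of how the settings above might be wired together, with placeholder values rather than a tuned recipe. Executor cores and the GC flag are normally passed to spark-submit or put in spark-defaults.conf; the same goes for the driver auto-restart keys (spark.yarn.maxAppAttempts, spark.yarn.am.attemptFailuresValidityInterval), which are picked up when the application is submitted, so they usually go on the spark-submit command line rather than in code.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("tuning-sketch")
  // Small executors: at most ~5 cores each.
  .config("spark.executor.cores", "5")
  // G1 GC on the executors.
  .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
  // Above 2000 shuffle partitions Spark uses the compressed map-status format.
  .config("spark.sql.shuffle.partitions", "2001")
  // Split of the unified memory region between execution and storage
  // (the values shown are the 2.x defaults).
  .config("spark.memory.fraction", "0.6")
  .config("spark.memory.storageFraction", "0.5")
  // Re-launch suspiciously slow tasks speculatively.
  .config("spark.speculation", "true")
  // Let file-based streaming sources infer their schema.
  .config("spark.sql.streaming.schemaInference", "true")
  .getOrCreate()
```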
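The cost-based optimizer only helps if it has statistics to work with, so enabling the flag goes hand in hand with ANALYZE TABLE. The `sales` table and its columns are hypothetical:

```scala
// Enable the cost-based optimizer (off by default in Spark 2.2).
spark.conf.set("spark.sql.cbo.enabled", "true")

// CBO decisions (join ordering, build-side selection, ...) rely on
// statistics collected ahead of time; `sales` is a placeholder table name.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")
```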
- Development
  - Bucketing (see the sketch after this list)
  - ETL
    - Skip corrupt files
      - spark.sql.files.ignoreCorruptFiles = true
    - Skip corrupt records (see the sketch after this list)
      - JSON: malformed records go to the "_corrupt_record" column; the column name can be configured via spark.sql.columnNameOfCorruptRecord
      - CSV: use the mode option (PERMISSIVE, DROPMALFORMED, or FAILFAST)
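Bucketing pre-partitions the data by a key at write time, so later joins or aggregations on that key can skip the shuffle. The DataFrame, path, and table/column names below are placeholders; note that bucketBy only works together with saveAsTable:

```scala
import org.apache.spark.sql.SaveMode

// Placeholder input; any DataFrame with a user_id column would do.
val orders = spark.read.parquet("/data/orders")

orders.write
  .mode(SaveMode.Overwrite)
  .bucketBy(32, "user_id")   // hash into 32 buckets by the join key
  .sortBy("user_id")         // keep each bucket sorted
  .saveAsTable("orders_bucketed")
```

Two tables bucketed into the same number of buckets on the join key can then be joined without shuffling either side.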
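A sketch of the corrupt-data options above; the paths are placeholders:

```scala
// Skip files that cannot be read at all (e.g. truncated Parquet parts).
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

// JSON: malformed lines are kept and routed into a dedicated column.
// The column name defaults to "_corrupt_record" (spark.sql.columnNameOfCorruptRecord)
// and is overridden per read here.
val json = spark.read
  .option("columnNameOfCorruptRecord", "_bad_record")
  .json("/data/input/*.json")

// CSV: the parser mode decides what happens to malformed rows;
// DROPMALFORMED discards them (PERMISSIVE and FAILFAST are the alternatives).
val csv = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .csv("/data/input/*.csv")
```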
- Monitoring
  - Spark has a configurable metrics system
    - Based on the Dropwizard Metrics library
    - Use Graphite/Grafana to dashboard the metrics
  - In Structured Streaming, use StreamingQueryListener (available since Apache Spark 2.1); a sketch follows this list
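A minimal StreamingQueryListener sketch: it just prints the progress numbers, but the same callbacks could push them into the metrics system or Graphite instead.

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Gets callbacks for every streaming query running in this SparkSession.
val listener = new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"query started: ${event.id}")

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // event.progress also carries batch duration, watermark, state metrics, ...
    val p = event.progress
    println(s"${p.name}: ${p.inputRowsPerSecond} rows/s in, " +
      s"${p.processedRowsPerSecond} rows/s processed")
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"query terminated: ${event.id}, error = ${event.exception}")
}

spark.streams.addListener(listener)
```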
- Hardware
  - Xeon+FPGA