- hardware provisioning
  - storage systems
    - run on the same nodes as HDFS
      - e.g. via YARN, with a fixed amount of memory and cores dedicated to Spark on each node
    - run on different nodes in the same local-area network as HDFS
    - run computing jobs on different nodes than the storage system
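For the shared-nodes setup, the fixed memory and cores per node are typically declared when submitting the job to YARN. A sketch with made-up resource numbers (`my_app.py` is a placeholder, and the figures are illustrative, not recommendations):

```shell
# dedicating a fixed slice of each node to Spark under YARN
spark-submit \
  --master yarn \
  --num-executors 8 \
  --executor-cores 4 \
  --executor-memory 8g \
  my_app.py
```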
  - local disks
    - 4-8 disks per node
    - without RAID (just as separate mount points)
    - mounted with the noatime option
    - configure spark.local.dir to a comma-separated list of the local disks
    - same disks as HDFS, if running HDFS
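The spark.local.dir setting goes in conf/spark-defaults.conf (or as SPARK_LOCAL_DIRS in spark-env.sh); the mount points below are made-up examples:

```properties
# conf/spark-defaults.conf — one scratch directory per physical disk (example paths)
spark.local.dir /mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark,/mnt/disk4/spark
```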
  - memory
    - 8 GB to hundreds of GB per machine
    - allocate at most 75% of the memory to Spark
    - if memory > 200 GB, run multiple worker JVMs per node
      - standalone mode: in conf/spark-env.sh, set
        - SPARK_WORKER_INSTANCES: the number of workers per node
        - SPARK_WORKER_CORES: the number of cores per worker
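For the >200 GB case, a sketch of conf/spark-env.sh for a hypothetical 256 GB, 32-core node split into four workers (numbers are assumptions for illustration):

```shell
# conf/spark-env.sh (standalone mode)
SPARK_WORKER_INSTANCES=4   # 4 worker JVMs per node
SPARK_WORKER_CORES=8       # 8 cores each
SPARK_WORKER_MEMORY=48g    # keeps each JVM well under ~45 GB, ~75% of RAM in total
```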
  - network
    - >= 10 gigabit
  - CPU cores
    - 8-16 cores per machine, or more
  - reference
- storage systems
  - third-party hadoop distributions
    - CDH
    - HDP (recommended)
  - inheriting cluster configuration
    - in spark-env.sh, point HADOOP_CONF_DIR at the directory containing hdfs-site.xml and core-site.xml
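A minimal sketch of the inheritance setup; the path is an example, use wherever the Hadoop client configs actually live:

```shell
# conf/spark-env.sh — let Spark pick up hdfs-site.xml and core-site.xml
export HADOOP_CONF_DIR=/etc/hadoop/conf
```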
  - reference
- external tools
  - cluster-wide monitoring tools
    - Ganglia
  - OS profiling tools
    - dstat
    - iostat
    - iotop
  - JVM utilities
    - jstack
    - jmap
    - jstat
    - jconsole
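Typical invocations of the JVM utilities against a running worker or executor (`<pid>` is a placeholder for the process id found with jps):

```shell
jps -lm                    # find the executor/worker PID
jstack <pid>               # thread dump: where is it stuck?
jmap -histo <pid>          # heap histogram: what is filling memory?
jstat -gcutil <pid> 1000   # GC utilization, sampled every second
```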
- optimization

| problem | configuration |
|---|---|
| out of memory | `sysctl -w vm.max_map_count=65535`; `spark.storage.memoryMapThreshold 131072` |
| too many open files | `sysctl -w fs.file-max=1000000`; `spark.shuffle.consolidateFiles true`; `spark.shuffle.manager sort` |
| connection reset by peer | `-XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=12 -XX:NewRatio=3 -XX:SurvivorRatio=3` |
| error communicating with MapOutputTracker | `spark.akka.askTimeout 120`; `spark.akka.lookupTimeout 120` |

- configuration
  - give Spark at most 75% of a machine's memory (standalone)
  - minimum executor heap size: 8 GB
  - maximum executor heap size: 40 GB / under 45 GB (watch GC)
  - use kryo serialization
  - parallel (old) / CMS / G1 GC
  - pypy is faster than cpython
  - notes
    - memory usage is not the same as data size (often 2-3x bigger)
    - prefer reduceByKey over groupByKey
    - there are limitations when using python with spark streaming (at least for now)
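The reduceByKey-over-groupByKey note is about shuffle volume: groupByKey ships every (key, value) pair across the network, while reduceByKey combines values per key inside each map partition first (map-side combine), so only one record per key per partition is shuffled. A minimal pure-Python sketch of that difference, with no Spark required and made-up partition data:

```python
from collections import defaultdict

# Two made-up "map partitions" of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

# groupByKey-style shuffle: every pair crosses the network as-is.
group_shuffled = sum(len(p) for p in partitions)

# reduceByKey-style shuffle: pre-aggregate per key inside each partition
# first, then only one record per key per partition crosses the network.
reduce_shuffled = 0
totals = defaultdict(int)
for part in partitions:
    combined = defaultdict(int)
    for key, value in part:
        combined[key] += value          # local map-side combine
    reduce_shuffled += len(combined)    # records actually shuffled
    for key, value in combined.items():
        totals[key] += value            # reduce side merges partials

print(group_shuffled, reduce_shuffled, dict(totals))
# 7 records vs 4 records shuffled, same final counts: {'a': 4, 'b': 3}
```

Both paths produce identical word counts; the reduceByKey-style path simply moves less data, and the gap grows with the number of repeated keys per partition.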
Monday, November 30, 2015
Spark basis