String to BeamRecord (beam)
- Option 1
- Option 2
Using Snappy compression (hive)
- SET hive.exec.compress.output=true;
- SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
- SET mapred.output.compression.type=BLOCK;
Where table statistics are stored (hive)
- In the metastore database (e.g., MySQL)
- select * from TABLE_PARAMS
- select * from PARTITION_PARAMS
Options for specifying a schema (spark)
- Estimating the sizes of Java objects (spark)
- https://spark.apache.org/docs/2.1.0/api/scala/#org.apache.spark.util.SizeEstimator$
- E.g.
- import org.apache.spark.util.SizeEstimator
- SizeEstimator.estimate(myRdd)
- SizeEstimator.estimate(myDf)
- SizeEstimator.estimate(myDs)
- Using the desc option of the orderBy API (spark)
- orderBy($"count".desc)
- orderBy('count.desc)
- orderBy(-'count)
- RDB to local filesystem using Sqoop (sqoop)
- Use -jt option
- E.g. sqoop import -jt local --target-dir file:///home/hdfs/temp
- Use -fs and -jt options
- E.g. sqoop import -fs local -jt local
- If you hit: File file:/hdp/apps/2.6.0.3-8/mapreduce/mapreduce.tar.gz does not exist
- mkdir -p /hdp/apps/2.6.0.3-8/mapreduce
- chown -R hdfs:hadoop /hdp
- cd /hdp/apps/2.6.0.3-8/mapreduce
- hdfs dfs -get /hdp/apps/2.6.0.3-8/mapreduce/mapreduce.tar.gz
- Read files in s3a from spark (spark)
- spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key","XXX")
- spark.sparkContext.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled","false")
- spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint","host:port")
- spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key","XXX")
- spark.read.text("s3a://path/to/the/file")
- Setting the logging level of the ambari-agent.log (ambari)
- cd /etc/ambari-agent/conf
- cp logging.conf.sample logging.conf
- vim logging.conf and set:
[logger_root]
level=WARNING
- Restart the agent: ambari-agent restart
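ambari-agent is a Python daemon, and logging.conf follows Python's standard logging.config.fileConfig format. Below is a minimal sketch of how such a file is parsed and applied; the handler and formatter sections are assumptions modeled on typical fileConfig examples, not the exact contents of Ambari's sample file.

```python
import logging
import logging.config
import os
import tempfile

# A minimal fileConfig-style logging.conf. Section and key names follow
# Python's logging.config conventions; the shipped ambari sample may
# differ in its handler/formatter details (an assumption here).
CONFIG = """\
[loggers]
keys=root

[handlers]
keys=console

[formatters]
keys=plain

[logger_root]
level=WARNING
handlers=console

[handler_console]
class=StreamHandler
level=WARNING
formatter=plain
args=(sys.stderr,)

[formatter_plain]
format=%(levelname)s %(asctime)s %(message)s
"""

# Write the config to a file and load it, as the agent does at startup.
with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as f:
    f.write(CONFIG)
    path = f.name
logging.config.fileConfig(path)
os.unlink(path)

root = logging.getLogger()  # INFO and DEBUG records are now dropped
```

With level=WARNING on logger_root, INFO-level chatter from the agent no longer reaches the log file.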
- Setting the logging level of hiveserver2.log (hive)
- Ambari web UI → Hive → Configs → Advanced hive-log4j → hive.root.logger=INFO,DRFA
- Push JSON records (spark)
- val df = temp.toDF("createdAt", "users", "tweet")
- val jsonRdd = df.toJSON.rdd
- jsonRdd.foreachPartition { partition => /* send records to Kinesis / Kafka */ }
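The same per-partition batching pattern in plain Python, with a hypothetical send_batch stub standing in for the Kinesis/Kafka producer call — a sketch of the idea rather than real Spark code:

```python
import json

sent = []  # collects what the stub "producer" was given

def send_batch(records):
    """Hypothetical stand-in for a Kinesis/Kafka producer call."""
    sent.append(list(records))

# Simulate an RDD already split into partitions of row dicts
# (stand-ins for the createdAt/users/tweet rows above).
partitions = [
    [{"createdAt": "2018-08-22", "users": 1, "tweet": "hello"}],
    [{"createdAt": "2018-08-23", "users": 2, "tweet": "world"}],
]

for partition in partitions:  # mirrors jsonRdd.foreachPartition
    json_records = [json.dumps(row) for row in partition]  # mirrors df.toJSON
    send_batch(json_records)  # one producer call per partition
```

Batching per partition keeps producer setup/teardown off the per-record path, which is the point of foreachPartition over foreach.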
- How to set the Hive-on-Tez job name shown in the ResourceManager UI (tez)
- You cannot, at least not the full name, because the prefix is hard-coded:
- https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java
- final TezClient session = TezClient.newBuilder("HIVE-" + sessionId, tezConfig)
- However, you can set the session ID part via hive.session.id
- hive --hiveconf hive.session.id=session_id_name
- The job then shows up as HIVE-session_id_name
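Per the hard-coded TezClient.newBuilder("HIVE-" + sessionId, tezConfig) line quoted above, the displayed application name is just string concatenation, so only the suffix is yours to choose. A trivial illustration:

```python
# Mirrors TezClient.newBuilder("HIVE-" + sessionId, tezConfig): the
# "HIVE-" prefix is fixed; only the session ID (hive.session.id) varies.
def tez_app_name(session_id):
    return "HIVE-" + session_id

print(tez_app_name("session_id_name"))  # HIVE-session_id_name
```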
Wednesday, August 22, 2018
Tip 4 Big Data