String to BeamRecord (beam)
- Option 1
- Option 2
Using Snappy compression (hive)
- SET hive.exec.compress.output=true;
- SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
- SET mapred.output.compression.type=BLOCK;
Where table statistics are stored (hive)
- In the metastore database (e.g., MySQL)
- select * from TABLE_PARAMS
- select * from PARTITION_PARAMS
Options for specifying a schema (spark)
- Estimating the sizes of Java objects (spark)
- https://spark.apache.org/docs/2.1.0/api/scala/#org.apache.spark.util.SizeEstimator$
- E.g.
- import org.apache.spark.util.SizeEstimator
- SizeEstimator.estimate(myRdd)
- SizeEstimator.estimate(myDf)
- SizeEstimator.estimate(myDs)
- Using the desc option of the orderBy API (spark)
- orderBy($"count".desc)
- orderBy('count.desc)
- orderBy(-'count)
- RDB to local filesystem using Sqoop (sqoop)
- Use -jt option
- E.g. sqoop import -jt local --target-dir file:///home/hdfs/temp
- Use -fs and -jt options
- E.g. sqoop import -fs local -jt local
- If you hit: File file:/hdp/apps/2.6.0.3-8/mapreduce/mapreduce.tar.gz does not exist
- mkdir -p /hdp/apps/2.6.0.3-8/mapreduce
- chown -R hdfs:hadoop /hdp
- cd /hdp/apps/2.6.0.3-8/mapreduce
- hdfs dfs -get /hdp/apps/2.6.0.3-8/mapreduce/mapreduce.tar.gz
- Read files in s3a from spark (spark)
- spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key","XXX")
- spark.sparkContext.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled","false")
- spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint","host:port")
- spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key","XXX")
- spark.read.text("s3a://path/to/the/file")
- Setting the logging level of the ambari-agent.log (ambari)
- cd /etc/ambari-agent/conf
- cp logging.conf.sample logging.conf
- vim logging.conf and set:
[logger_root]
level=WARNING
- Restart the agent: ambari-agent restart
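ambari-agent is a Python daemon, and logging.conf follows Python's standard logging.config.fileConfig format. Below is a minimal sketch of how such a file is parsed and applied; the handler and formatter sections are assumptions modeled on typical fileConfig examples, not the exact contents of Ambari's sample file.

```python
import logging
import logging.config
import os
import tempfile

# A minimal fileConfig-style logging.conf. Section and key names follow
# Python's logging.config conventions; the shipped ambari sample may
# differ in its handler/formatter details (an assumption here).
CONFIG = """\
[loggers]
keys=root

[handlers]
keys=console

[formatters]
keys=plain

[logger_root]
level=WARNING
handlers=console

[handler_console]
class=StreamHandler
level=WARNING
formatter=plain
args=(sys.stderr,)

[formatter_plain]
format=%(levelname)s %(asctime)s %(message)s
"""

# Write the config to a file and load it, as the agent does at startup.
with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as f:
    f.write(CONFIG)
    path = f.name
logging.config.fileConfig(path)
os.unlink(path)

root = logging.getLogger()  # INFO and DEBUG records are now dropped
```

With level=WARNING on logger_root, INFO-level chatter from the agent no longer reaches the log file.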
- Setting the logging level of hiveserver2.log (hive)
- Ambari web UI → Hive → Configs → Advanced hive-log4j → hive.root.logger=INFO,DRFA
- Push JSON records (spark)
- val df = temp.toDF("createdAt", "users", "tweet")
- val jsonRdd = df.toJSON.rdd
- jsonRdd.foreachPartition { partition => /* send records to Kinesis / Kafka */ }
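The same per-partition batching pattern in plain Python, with a hypothetical send_batch stub standing in for the Kinesis/Kafka producer call — a sketch of the idea rather than real Spark code:

```python
import json

sent = []  # collects what the stub "producer" was given

def send_batch(records):
    """Hypothetical stand-in for a Kinesis/Kafka producer call."""
    sent.append(list(records))

# Simulate an RDD already split into partitions of row dicts
# (stand-ins for the createdAt/users/tweet rows above).
partitions = [
    [{"createdAt": "2018-08-22", "users": 1, "tweet": "hello"}],
    [{"createdAt": "2018-08-23", "users": 2, "tweet": "world"}],
]

for partition in partitions:  # mirrors jsonRdd.foreachPartition
    json_records = [json.dumps(row) for row in partition]  # mirrors df.toJSON
    send_batch(json_records)  # one producer call per partition
```

Batching per partition keeps producer setup/teardown off the per-record path, which is the point of foreachPartition over foreach.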
- How to set the Hive-on-Tez job name shown in the ResourceManager UI (tez)
- You cannot, at least not the full name, because the prefix is hard-coded:
- https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java
- final TezClient session = TezClient.newBuilder("HIVE-" + sessionId, tezConfig)
- However, you can set the session ID part via hive.session.id
- hive --hiveconf hive.session.id=session_id_name
- The job then shows up as HIVE-session_id_name
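Per the hard-coded TezClient.newBuilder("HIVE-" + sessionId, tezConfig) line quoted above, the displayed application name is just string concatenation, so only the suffix is yours to choose. A trivial illustration:

```python
# Mirrors TezClient.newBuilder("HIVE-" + sessionId, tezConfig): the
# "HIVE-" prefix is fixed; only the session ID (hive.session.id) varies.
def tez_app_name(session_id):
    return "HIVE-" + session_id

print(tez_app_name("session_id_name"))  # HIVE-session_id_name
```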
Wednesday, August 22, 2018
Tip 4 Big Data