Wednesday, August 22, 2018

Tip 4 Big Data

  • String 2 BeamRecord (beam)

    • Option 1

      .apply(ParDo.of(new DoFn<String, BeamRecord>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
              //System.out.println(c.element());
              c.output(new BeamRecord(type, c.element()));
          }
      }))
    • Option 2

       .apply(MapElements.via(new SimpleFunction<String, BeamRecord>() {
          public BeamRecord apply(String input) {
              //System.out.println(input);
              return new BeamRecord(type, input);
          }
      }))
        
      /* which can be expressed with a lambda as below
      .apply(MapElements.into(TypeDescriptor.of(BeamRecord.class))
              .via((String input) -> new BeamRecord(type, input)))
      */
  • Using snappy (hive)

    • SET hive.exec.compress.output=true;
      SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
      SET mapred.output.compression.type=BLOCK;
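    • A minimal usage sketch (the table names are hypothetical): run the settings above in the same session, then write the output so its files come out Snappy-compressed
      • INSERT OVERWRITE TABLE log_snappy SELECT * FROM log_raw;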
  • Where table statistics are stored (hive)

    • In the Hive metastore database (MySQL in this setup)
    • select * from TABLE_PARAMS
    • select * from PARTITION_PARAMS
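    • A sketch for checking a single table's statistics (standard metastore schema; the table name is an assumption)
      • select t.TBL_NAME, p.PARAM_KEY, p.PARAM_VALUE from TBLS t join TABLE_PARAMS p on t.TBL_ID = p.TBL_ID where t.TBL_NAME = 'log_20170813';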
  • Options for specifying a schema (spark)

    // 1: build the StructType incrementally
    import org.apache.spark.sql.types._
    val schema = new StructType()
      .add("i_logid", IntegerType, false)
      .add("i_logdetailid", IntegerType, false)
      .add("i_logdes", new StructType().add("gamecode", StringType, true), false)
     
    // 2: build the same schema from a list of StructFields
    val schema = StructType(
      StructField("i_logid", IntegerType, false) ::
        StructField("i_logdetailid", IntegerType, false) ::
        StructField("i_logdes", new StructType().add("gamecode", StringType, true), false) ::
        Nil
    )
     
    // 3: derive the schema from case classes via Encoders
    case class Des(gamecode: String)
    case class Log(i_logid: Int, i_logdetailid: Int, i_logdes: Des)
    import org.apache.spark.sql.Encoders
    val schema = Encoders.product[Log].schema
     
    // 4: skip the explicit schema and extract fields with get_json_object
    spark.sql("select get_json_object(lower(cast(value as string)), '$.i_regdatetime') as i_regdatetime from rawData")
     
    // 5: reuse the schema of an existing table
    val schema = spark.read.table("netmarbles.log_20170813").schema
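     
    // Usage sketch: whichever option built `schema`, it can be passed when reading (the path is an assumption)
    val df = spark.read.schema(schema).json("/path/to/log_20170813.json")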
  • Estimating the sizes of Java objects (spark)
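    • Presumably this refers to org.apache.spark.util.SizeEstimator; a rough sketch (the measured object is arbitrary)

      import org.apache.spark.util.SizeEstimator
      // approximate in-memory footprint of a driver-side object, in bytes
      val bytes = SizeEstimator.estimate(Seq.fill(1000)("sample"))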
  • Using the desc option in the orderBy API (spark)
    • orderBy($"count".desc)

    • orderBy('count.desc)

    • orderBy(-'count)
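    • A small usage sketch (the DataFrame and column are assumptions; -'count only works because count is numeric)

      df.groupBy("word").count().orderBy($"count".desc).show()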

  • RDB 2 local using sqoop (sqoop)
    • Use the -jt option
    • Use the -fs and -jt options together
      • E.g. sqoop import -fs local -jt local (a fuller command sketch follows below)
    • File file:/hdp/apps/2.6.0.3-8/mapreduce/mapreduce.tar.gz does not exist
      • mkdir -p /hdp/apps/2.6.0.3-8/mapreduce
      • chown -R hdfs:hadoop /hdp
      • cd /hdp/apps/2.6.0.3-8/mapreduce
      • hdfs dfs -get /hdp/apps/2.6.0.3-8/mapreduce/mapreduce.tar.gz
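    • A fuller command sketch (connection details and paths are assumptions)
      • sqoop import -fs local -jt local --connect jdbc:mysql://dbhost:3306/mydb --username scott -P --table mytable --target-dir file:///tmp/mytable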
  • Read files in s3a from spark (spark)
    • spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key","XXX") 
    • spark.sparkContext.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled","false") 
    • spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint","host:port") 
    • spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key","XXX")
    • spark.read.text("s3a://path/to/the/file")
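    • Note: the calls above assume the hadoop-aws module and its AWS SDK dependency are on the classpath, e.g. (the version is an assumption)
      • spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3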
  • Setting the logging level of the ambari-agent.log (ambari)
    • cd /etc/ambari-agent/conf
    • cp logging.conf.sample logging.conf
    • vim logging.conf
      • [logger_root]
        level=WARNING
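    • Restart the agent so the new level takes effect (assuming the standard service command)
      • ambari-agent restart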

  • Setting the logging level of the hiveserver2.log (hive)
    • Ambari web UI -> Hive -> Configs -> Advanced hive-log4j -> hive.root.logger=INFO,DRFA
  • Push JSON Records (spark)
    • val df = temp.toDF("createdAt", "users", "tweet")
    • val json_rdd = df.toJSON.rdd
    • json_rdd.foreachPartition { partition => /* send records to Kinesis / Kafka; see the sketch below */ }
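    • A fuller sketch of the send step using a Kafka producer (the broker address and topic are assumptions; Kinesis would follow the same per-partition pattern)

      import java.util.Properties
      import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

      json_rdd.foreachPartition { partition =>
        // one producer per partition, created on the executor
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        partition.foreach(json => producer.send(new ProducerRecord[String, String]("tweets", json)))
        producer.close()
      }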

  • How to specify the hive tez job name shown in the resource manager UI (tez)
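    • One option (an assumption; available in recent Hive versions) is hive.query.name, which sets the Tez DAG name
      • SET hive.query.name=daily_report;
    • Depending on the version, the YARN application itself may still appear as HIVE-<session id>, since one Tez session AM can run many DAGs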
