Monday, November 30, 2015

Sqoop setting (CDH)

  1. Creating password file
    1. echo -n password > .password
    2. hdfs dfs -put .password /user/$USER/
  2. Installing the MySQL JDBC driver in CDH
    1. mkdir -p /var/lib/sqoop 
    2. chown sqoop:sqoop /var/lib/sqoop 
    3. chmod 755 /var/lib/sqoop
    4. download the JDBC driver from http://dev.mysql.com/downloads/connector/j/5.1.html
    5. sudo cp mysql-connector-java-<version>/mysql-connector-java-<version>-bin.jar /var/lib/sqoop/
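
The .password file uploaded to HDFS above is meant to be consumed with Sqoop's --password-file option, which keeps the password off the command line. A minimal sketch of such an import driven from Python; the connection details and table name are placeholders, not taken from a real setup:

  import getpass
  import subprocess

  # hypothetical MySQL connection and table; replace with real values
  user = getpass.getuser()
  cmd = [
      "sqoop", "import",
      "--connect", "jdbc:mysql://10.0.1.b:3306/test",
      "--username", "test",
      # read the password from the HDFS file created above, instead of --password
      "--password-file", "/user/%s/.password" % user,
      "--table", "member_company",
      "--target-dir", "/user/hive/warehouse/member_company",
      "-m", "1",
  ]
  subprocess.check_call(cmd)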

Sqoop - use case example

  1. command
    1. hdfs dfs -rm -r -skipTrash /user/hive/warehouse/member_company;
      sqoop --options-file mysql2hdfs.option --query "SELECT * FROM test WHERE count > 10 and \$CONDITIONS" --target-dir /user/hive/warehouse/member_company
    2. sqoop export --connect jdbc:mysql://10.0.2.a/test?characterEncoding=utf-8 --username root --password 'yourpassword' --table r_input --export-dir /result/r_input --input-fields-terminated-by '\001' --outdir /tmp/sqoop-mungeol/code/
  2. options
    1. import
      --connect
      jdbc:mysql://10.0.1.b:3306/test
      --username
      test
      --password
      tkfkadlselqlt
      --split-by
      mem_idx
      -m
      3
      --outdir
      /tmp/sqoop-mungeol/code/

Spark use case

Baidu real-time security product


Spark test

  1. spark + mariaDB test
    1. SPARK_CLASSPATH=mysql-connector-java-5.1.34-bin.jar bin/pyspark
    2. df = sqlContext.load(source="jdbc", url="jdbc:mysql://10.0.2.a/test?user=root&password=yourpassword", dbtable="r_input_day")
    3. df.first()
  2. spark + elasticsearch test
    1. SPARK_CLASSPATH=/data/elasticsearch-hadoop-2.1.0.Beta4/dist/elasticsearch-hadoop-2.1.0.Beta4.jar ./bin/pyspark
    2. conf = {"es.resource":"sec-team/access-report"}
    3. rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat", "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
    4. rdd.first()
  3. spark streaming test
    1. network_wordcount.py
      1. from __future__ import print_function
        import sys
        from pyspark import SparkContext
        from pyspark.streaming import StreamingContext
        if __name__ == "__main__":
            if len(sys.argv) != 3:
                print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
                exit(-1)
            sc = SparkContext(appName="PythonStreamingNetworkWordCount")
            ssc = StreamingContext(sc, 1)
            lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
            counts = lines.flatMap(lambda line: line.split(" "))\
                          .map(lambda word: (word, 1))\
                          .reduceByKey(lambda a, b: a+b)
            counts.pprint()
            ssc.start()
            ssc.awaitTermination()
    2. nc -lk 9999
    3. spark-submit network_wordcount.py localhost 9999
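
For the spark + mariaDB test above, the same JDBC read can also be written with the DataFrameReader API available from Spark 1.4 onward; a minimal sketch that reuses the host, credentials and table from the test, so they are placeholders here:

  from pyspark import SparkContext
  from pyspark.sql import SQLContext

  # launch with the connector on the classpath, e.g.
  #   spark-submit --driver-class-path mysql-connector-java-5.1.34-bin.jar jdbc_read.py
  sc = SparkContext(appName="MariaDBJdbcRead")
  sqlContext = SQLContext(sc)

  df = (sqlContext.read
        .format("jdbc")
        .option("url", "jdbc:mysql://10.0.2.a/test?user=root&password=yourpassword")
        .option("dbtable", "r_input_day")
        .load())

  print(df.first())
  sc.stop()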

Spark basis

  1. hardware provisioning
    1. storage systems
      1. run on the same nodes as HDFS 
        1. yarn
        2. with a fixed amount of memory and cores dedicated to Spark on each node
      2. run on different nodes in the same local-area network as HDFS
      3. run computing jobs on different nodes than the storage system
    2. local disks
      1. 4-8 disks per node
      2. without RAID (just as separate mount points)
      3. noatime option
      4. configure the spark.local.dir variable to be a comma-separated list of the local disks
      5. same disks as HDFS, if running HDFS
    3. memory
      1. 8 GB - hundreds of GB
      2. 75% of the memory
      3. if memory > 200 GB, then run multiple worker JVMs per node
        1. standalone mode
          1. conf/spark-env.sh
            1. SPARK_WORKER_INSTANCES: set the number of workers per node
            2. SPARK_WORKER_CORES: the number of cores per worker
    4. network
      1. >= 10 gigabit
    5. CPU cores
      1. 8-16 cores per machine, or more
    6. reference
      1. https://spark.apache.org/docs/latest/hardware-provisioning.html
  2. third-party hadoop distributions
    1. CDH
    2. HDP (recommended)
    3. inheriting cluster configuration
      1. spark-env.sh
        1. HADOOP_CONF_DIR
          1. hdfs-site.xml
          2. core-site.xml
    4. reference
      1. https://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
  3. external tools
    1. cluster-wide monitoring tool
      1. Ganglia
    2. OS profiling tools
      1. dstat
      2. iostat
      3. iotop
    3. JVM utilities
      1. jstack
      2. jmap
      3. jstat
      4. jconsole
  4. optimization (problem -> configuration)
    1. out of memory
      1. sysctl -w vm.max_map_count=65535
      2. spark.storage.memoryMapThreshold 131072
    2. too many open files
      1. sysctl -w fs.file-max=1000000
      2. spark.shuffle.consolidateFiles true
      3. spark.shuffle.manager sort
    3. connection reset by peer
      1. -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=12 -XX:NewRatio=3 -XX:SurvivorRatio=3
    4. error communicating with MapOutputTracker
      1. spark.akka.askTimeout 120
      2. spark.akka.lookupTimeout 120
  5. configuration
    1. 75% of a machine's memory (standalone)
    2. minimum executor heap size: 8 GB
    3. maximum executor heap size: 40 GB / under 45 GB (watch GC)
    4. kryo serialization
    5. parallel (old) / CMS / G1 GC
    6. pypy > cpython
  6. notes
    1. memory usage is not the same as the data size (often 2-3x bigger)
    2. prefer reduceByKey over groupByKey (see the sketch below)
    3. there are limitations when using Python with Spark Streaming (at least for now)
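
A short illustration of two of the points above (Kryo serialization from the configuration list, and preferring reduceByKey over groupByKey); the sample data is made up:

  from pyspark import SparkConf, SparkContext

  conf = (SparkConf()
          .setAppName("ReduceByKeyDemo")
          # Kryo is usually faster and more compact than the default Java serialization
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
  sc = SparkContext(conf=conf)

  pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1)])

  # reduceByKey combines values on each partition before the shuffle, so far less
  # data crosses the network than with groupByKey().mapValues(sum)
  counts = pairs.reduceByKey(lambda x, y: x + y)
  print(counts.collect())
  sc.stop()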

Solr - HDFS configuration (HDP)

  1. The following changes only need to be completed for the first Solr node that is started
  2. vim /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf/solrconfig.xml
    1.  <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
            <str name="solr.hdfs.home">hdfs://<host:port>/user/solr</str>
            <bool name="solr.hdfs.blockcache.enabled">true</bool>
            <int name="solr.hdfs.blockcache.slab.count">1</int>
            <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
            <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
            <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
            <bool name="solr.hdfs.blockcache.write.enabled">false</bool>
            <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
            <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
            <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
          </directoryFactory>

    2. add <str name="solr.hdfs.confdir">/usr/hdp/current/hadoop-client/conf</str>, if namenode HA is configured
    3. set lockType to hdfs
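
Once the configset above points at HDFS, a collection can be created and checked through the Collections API; a minimal sketch using the Python requests library, where the Solr host, the collection name, and the assumption that the configset has already been uploaded to ZooKeeper are all placeholders:

  import requests

  SOLR = "http://localhost:8983/solr"          # assumed Solr node
  CONFIGSET = "data_driven_schema_configs"     # the configset edited above

  # create a collection whose index lives in HDFS via the HdfsDirectoryFactory
  resp = requests.get(SOLR + "/admin/collections", params={
      "action": "CREATE",
      "name": "hdfs_test",
      "numShards": 1,
      "replicationFactor": 1,
      "collection.configName": CONFIGSET,
      "wt": "json",
  })
  print(resp.json())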

Kafka test

  1. Cluster Setting
    1. setting zookeeper cluster
    2. perform the following steps on each cluster node
    3. cd $KAFKA_HOME
    4. vim config/server.properties
    5. edit
      1. broker.id
    6. bin/kafka-server-start.sh config/server.properties & (or: sudo bin/kafka-server-start.sh -daemon config/test.properties &)
  2. Basic Test
    1. bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic rep-test
    2. bin/kafka-topics.sh --list --zookeeper localhost:2181
    3. bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic rep-test
    4. bin/kafka-console-producer.sh --broker-list localhost:9092 --topic rep-test
      test 1
      test 2
      test 3
    5. bin/kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic rep-test
  3. Fault Tolerance Test
    1. kill the leader
    2. do the basic test again
    3. start the killed server again
    4. do the basic test again
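
The basic test above can also be driven from Python rather than the console scripts; a minimal sketch, assuming the kafka-python client is installed and the rep-test topic already exists:

  from kafka import KafkaConsumer, KafkaProducer

  # produce a few test messages to the replicated topic
  producer = KafkaProducer(bootstrap_servers="localhost:9092")
  for i in range(1, 4):
      producer.send("rep-test", ("test %d" % i).encode("utf-8"))
  producer.flush()

  # read everything back from the beginning, like --from-beginning on the console consumer
  consumer = KafkaConsumer("rep-test",
                           bootstrap_servers="localhost:9092",
                           auto_offset_reset="earliest",
                           consumer_timeout_ms=5000)
  for message in consumer:
      print(message.value)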

Kafka - configuration example

broker.id=81
port=9092
host.name=10.0.2.a
num.io.threads=8
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/data/kafka-logs
log.retention.hours=168
log.cleaner.enable=false
zookeeper.connect=localhost:2181,10.0.2.a:2181,10.0.2.b:2181,10.0.2.c:2181
controlled.shutdown.enable=true
auto.leader.rebalance.enable=true
# Replication configurations
num.replica.fetchers=4
replica.fetch.max.bytes=1048576
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.socket.timeout.ms=30000
replica.socket.receive.buffer.bytes=65536
replica.lag.time.max.ms=10000
replica.lag.max.messages=4000
controller.socket.timeout.ms=30000
controller.message.queue.size=10
# Log configuration
num.partitions=8
message.max.bytes=1000000
auto.create.topics.enable=false
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760
log.flush.interval.ms=10000
log.flush.interval.messages=20000
log.flush.scheduler.interval.ms=2000
log.roll.hours=168
log.retention.check.interval.ms=300000
log.segment.bytes=1073741824
# ZK configuration
zookeeper.connection.timeout.ms=6000
zookeeper.sync.time.ms=2000
# Socket server configuration
#num.io.threads=8
num.network.threads=8
socket.request.max.bytes=104857600
socket.receive.buffer.bytes=1048576
socket.send.buffer.bytes=1048576
queued.max.requests=16
fetch.purgatory.purge.interval.requests=100
producer.purgatory.purge.interval.requests=100

Kafka basis

  1. important properties
    1. broker
      1. broker.id
      2. log.dirs
      3. host.name
      4. zookeeper.connect
      5. controlled.shutdown.enable=true
      6. auto.leader.rebalance.enable=true
    2. consumer
      1. group.id
      2. zookeeper.connect
      3. fetch.message.max.bytes
      4. topic.id
    3. producer
      1. metadata.broker.list
      2. request.required.acks
      3. producer.type
      4. compression.codec
      5. topic.metadata.refresh.interval.ms
      6. batch.num.messages
      7. topic.id
  2. topic-level setting
    1. bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic my-topic --partitions 1 --replication-factor 1 --config max.message.bytes=64000 --config flush.messages=1
    2. bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic my-topic --config max.message.bytes=128000
    3. bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic my-topic --deleteConfig max.message.bytes
  3. Operations
    1. bin/kafka-topics.sh --zookeeper zk_host:port/chroot --create --topic my_topic_name --partitions 20 --replication-factor 3 --config x=y
    2. bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --partitions 40
    3. bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --config x=y
    4. bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --deleteConfig x
    5. bin/kafka-topics.sh --zookeeper zk_host:port/chroot --delete --topic my_topic_name
    6. bin/kafka-preferred-replica-election.sh --zookeeper zk_host:port/chroot
    7. bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect localhost:2181 --group test
  4. production server configuration (from the Kafka documentation)
    1. # Replication configurations
      num.replica.fetchers=4
      replica.fetch.max.bytes=1048576
      replica.fetch.wait.max.ms=500
      replica.high.watermark.checkpoint.interval.ms=5000
      replica.socket.timeout.ms=30000
      replica.socket.receive.buffer.bytes=65536
      replica.lag.time.max.ms=10000
      replica.lag.max.messages=4000
      
      controller.socket.timeout.ms=30000
      controller.message.queue.size=10
      
      # Log configuration
      num.partitions=8
      message.max.bytes=1000000
      auto.create.topics.enable=true
      log.index.interval.bytes=4096
      log.index.size.max.bytes=10485760
      log.retention.hours=168
      log.flush.interval.ms=10000
      log.flush.interval.messages=20000
      log.flush.scheduler.interval.ms=2000
      log.roll.hours=168
      log.cleanup.interval.mins=30
      log.segment.bytes=1073741824
      
      # ZK configuration
      zookeeper.connection.timeout.ms=6000
      zookeeper.sync.time.ms=2000
      
      # Socket server configuration
      num.io.threads=8
      num.network.threads=8
      socket.request.max.bytes=104857600
      socket.receive.buffer.bytes=1048576
      socket.send.buffer.bytes=1048576
      queued.max.requests=16
      fetch.purgatory.purge.interval.requests=100
      producer.purgatory.purge.interval.requests=100

  5. Hardware
    1. disk throughput
    2. 8x7200 rpm SATA drives
    3. higher RPM SAS drives
  6. OS setting
    1. file descriptors
    2. max socket buffer size
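
The producer properties listed above belong to the old Scala clients; if the kafka-python client is used instead (an assumption, not something covered above), they map roughly onto KafkaProducer keyword arguments, as in this sketch:

  from kafka import KafkaProducer

  # rough equivalents of the old producer properties:
  #   metadata.broker.list  -> bootstrap_servers
  #   request.required.acks -> acks
  #   compression.codec     -> compression_type
  #   batch.num.messages    -> batching here is size/time based (batch_size, linger_ms)
  producer = KafkaProducer(bootstrap_servers=["localhost:9092"],
                           acks="all",
                           compression_type="gzip",
                           batch_size=16384,
                           linger_ms=5)
  producer.send("my-topic", b"hello")
  producer.flush()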

Hive use case

  1. Software
    1. Hive 0.13.0 Setup: HDP 2.1 General Availability
      • Hadoop 2.4.0
      • Tez 0.4.0
      • Hive 0.13.0
      HDP was deployed using Ambari 1.5.1. For the most part, the cluster used the Ambari defaults (except where noted below).  Hive 0.13.0 runs were done using Java 7 (default JVM).
      Tez and MapReduce were tuned to process all queries using 4 GB containers at a target container-to-disk ratio of 2.0. The ratio is important because it minimizes disk thrash and maximizes throughput.
      Other Settings:
      • yarn.nodemanager.resource.memory-mb was set to 49152
      • Default virtual memory for a job’s map-task and reduce-task were set to 4096
      • hive.tez.container.size was set to 4096
      • hive.tez.java.opts was set to -Xmx3800m
      • Tez app masters were given 8 GB
      • mapreduce.map.java.opts and mapreduce.reduce.java.opts were set to -Xmx3800m. This is smaller than 4096 to allow for some garbage collection overhead
      • hive.auto.convert.join.noconditionaltask.size was set to 1252698795
      Note: this is 1/3 of the Xmx value, about 1.2 GB.
      The following additional optimizations were used for Hive 0.13.0:
    • Vectorized Query enabled
    • ORCFile formatted data
    • Map-join auto conversion enabled
  2. Hardware
    1. 20 physical nodes, each with:
      • 2x Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz for total of 16 CPU cores/machine
      • Hyper-threading enabled
      • 256GB RAM per node
      • 6x 4TB WDC WD4000FYYZ-0 drives per node
      • 10 Gigabit interconnect between the nodes
      Notes: Based on the YARN Node Manager’s Memory Resource setting used below, only 48 GB of RAM per node was dedicated to query processing, the remaining 200 GB of RAM were available for system caches and HDFS.
      Linux Configurations:
    • /proc/sys/net/core/somaxconn = 512
    • /proc/sys/vm/dirty_writeback_centisecs = 6000
    • /proc/sys/vm/swappiness = 0
    • /proc/sys/vm/zone_reclaim_mode = 0
    • /sys/kernel/mm/redhat_transparent_hugepage/defrag = never
    • /sys/kernel/mm/redhat_transparent_hugepage/khugepaged/defrag = no
    • /sys/kernel/mm/transparent_hugepage/khugepaged/defrag = 0

Hive - row_sequence()

  1. add jar /opt/cloudera/parcels/CDH/jars/hive-contrib-0.13.1-cdh5.3.0.jar;
  2. CREATE TEMPORARY FUNCTION row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
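
A quick way to exercise the UDF is to run both statements plus a query in a single hive -e call; a minimal sketch from Python, where the table name (member_company) is purely illustrative:

  import subprocess

  # row_sequence() returns an increasing number per row (stateful within a mapper),
  # a common trick for adding surrogate row ids
  hql = """
  add jar /opt/cloudera/parcels/CDH/jars/hive-contrib-0.13.1-cdh5.3.0.jar;
  CREATE TEMPORARY FUNCTION row_sequence AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
  SELECT row_sequence() AS row_id, t.* FROM member_company t LIMIT 10;
  """
  subprocess.check_call(["hive", "-e", hql])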

Hive installation (HDP)

  1. mysql-connector-java (skip this step if you have installed it at 'HDP 2.3 installation')
    1. 2015.07.28 -> HDP 2.3 installation -> mysql-connector-java
    2. ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar
  2. RDB configuration
    1. mysql -u root -p
    2. CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hroqkf';
    3. GRANT ALL PRIVILEGES ON *.* TO 'hive'@'localhost';
    4. CREATE USER 'hive'@'%' IDENTIFIED BY 'hroqkf';
    5. GRANT ALL PRIVILEGES ON *.* TO 'hive'@'%';
    6. CREATE USER 'hive'@'bigdata-dev03.co.kr' IDENTIFIED BY 'hroqkf';
      1. be sure the hostname is the host where you installed hive metastore.
    7. GRANT ALL PRIVILEGES ON *.* TO 'hive'@'bigdata-dev03.co.kr'; 
    8. FLUSH PRIVILEGES;
    9. CREATE DATABASE hive;
  3. ambari web -> add service -> choose hive and tez -> assign masters -> assign slaves and clients
  4. customize services
    1. hive -> advanced -> hive metastore -> hive database -> existing mysql database -> database host, database name, username, password -> test connection
  5. configure identities -> review -> install, start and test -> summary -> complete

Hive HA (HDP)

  1. prerequisite
    1. The relational database that backs the Hive Metastore itself should also be made highly available using best practices defined for the database system in use
  2. metastore
    1. ambari web -> services -> hive -> service actions -> add hive metastore -> choose host -> restart related services
    2. install RDB client on the host where you installed hive metastore
      1. install MariaDB
        1. http://mungeol-heo.blogspot.kr/2015/07/mariadb.html
    3. RDB setting
      1. mysql -u root -p
      2. CREATE USER 'hive'@'bigdata-dev02.co.kr' IDENTIFIED BY 'hroqkf';
        1. be sure the hostname is the host where you installed hive metastore.
      3. GRANT ALL PRIVILEGES ON *.* TO 'hive'@'bigdata-dev02.co.kr'; 
      4. FLUSH PRIVILEGES;
  3. hiveserver2
    1. ambari web -> services -> hive -> service actions -> add hiveserver2 -> choose host -> restart related services
  4. webhcat
    1. ambari web -> hosts -> click hostname -> add -> webhcat server -> restart related services

Flume installation

  1. wget http://apache.mirror.cdnetworks.com/flume/1.5.0.1/apache-flume-1.5.0.1-bin.tar.gz
  2. tar xvf apache-flume-1.5.0.1-bin.tar.gz
  3. cd apache-flume-1.5.0.1-bin
  4. cp conf/flume-env.sh.template conf/flume-env.sh
  5. vim conf/flume-env.sh
  6. JAVA_OPTS="-Xms1g -Xmx1g"
  7. cp conf/flume-conf.properties.template conf/flume-conf.properties
  8. vim conf/flume-conf.properties
  9. bin/flume-ng agent -n agent01 -c conf -f conf/flume-conf.properties

Flume - use case example

  1. wiselog access log
    1. log server
      1. ## agent(s): agent01
        ## names of source(s), channel(s) and sink(s)
        agent01.sources = source01
        agent01.channels = channel01
        agent01.sinks = sink01
        
        ## channel01
        agent01.channels.channel01.type = memory
        agent01.channels.channel01.capacity = 100000
        
        ## source01
        agent01.sources.source01.channels = channel01
        agent01.sources.source01.type = spooldir
        agent01.sources.source01.spoolDir = /data/log-collector/spool-dir
        agent01.sources.source01.deletePolicy = immediate
        agent01.sources.source01.basenameHeader = true
        agent01.sources.source01.basenameHeaderKey = type
        agent01.sources.source01.deserializer.maxLineLength = 10240
        agent01.sources.source01.interceptors = interceptor02
        agent01.sources.source01.interceptors.interceptor02.type = static
        agent01.sources.source01.interceptors.interceptor02.key = timestamp
        agent01.sources.source01.interceptors.interceptor02.value = 0
        
        ## sink01
        agent01.sinks.sink01.channel = channel01
        agent01.sinks.sink01.type = avro
        agent01.sinks.sink01.hostname = <hostname>
        agent01.sinks.sink01.port = 4545
        
        ## test
        #agent01.sinks.sink01.channel = channel01
        #agent01.sinks.sink01.type = logger
    2. HDFS server
      1. ## agent(s): agent01
        ## names of source(s), channel(s) and sink(s)
        agent01.sources = source01
        agent01.channels = channel01
        agent01.sinks = sink01
          
        ## channel01
        agent01.channels.channel01.type = memory
        agent01.channels.channel01.capacity = 100000
          
        ## source01
        agent01.sources.source01.channels = channel01
        agent01.sources.source01.type = avro
        agent01.sources.source01.bind = <hostname>
        agent01.sources.source01.port = 4545
        agent01.sources.source01.interceptors = interceptor01 interceptor02
        agent01.sources.source01.interceptors.interceptor01.type = host
        agent01.sources.source01.interceptors.interceptor02.type = regex_extractor
        agent01.sources.source01.interceptors.interceptor02.regex = ^\\d+\\.\\d+.\\d+.\\d+\\s\\[(\\d{2}\\/[a-zA-Z]{3}\\/\\d{4}:\\d{2}:\\d{2}:\\d{2})\\s\\+0900\\]\\s
        agent01.sources.source01.interceptors.interceptor02.serializers = s01
        agent01.sources.source01.interceptors.interceptor02.serializers.s01.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
        agent01.sources.source01.interceptors.interceptor02.serializers.s01.pattern = dd/MMM/yyyy:HH:mm:ss
        agent01.sources.source01.interceptors.interceptor02.serializers.s01.name = timestamp
          
        ## sink01
        agent01.sinks.sink01.channel = channel01
        agent01.sinks.sink01.type = hdfs
        agent01.sinks.sink01.hdfs.path = /log/wiselog/access/%{type}/yyyymmdd=%Y%m%d/hh=%H
        agent01.sinks.sink01.hdfs.filePrefix = %{host}
        agent01.sinks.sink01.hdfs.inUsePrefix = .
        agent01.sinks.sink01.hdfs.rollInterval = 300
        agent01.sinks.sink01.hdfs.rollSize = 0
        agent01.sinks.sink01.hdfs.rollCount = 0
        agent01.sinks.sink01.hdfs.idleTimeout = 60
        agent01.sinks.sink01.hdfs.writeFormat = Text
        agent01.sinks.sink01.hdfs.codeC = gzip
          
        ## test
        #agent01.sinks.sink01.channel = channel01
        #agent01.sinks.sink01.type = logger
  2. apache access log
    1. log server
      1. ## agent(s): agent01
        ## names of source(s), channel(s) and sink(s)
        agent01.sources = source01 source02
        agent01.channels = channel01
        agent01.sinks = sink01
          
        ## channel01
        agent01.channels.channel01.type = memory
        agent01.channels.channel01.capacity = 100000
          
        ## source01
        agent01.sources.source01.channels = channel01
        agent01.sources.source01.type = spooldir
        agent01.sources.source01.spoolDir = /home/log-collector/test
        agent01.sources.source01.deletePolicy = immediate
        agent01.sources.source01.deserializer.maxLineLength = 204800
        agent01.sources.source01.interceptors = interceptor01 interceptor02
        agent01.sources.source01.interceptors.interceptor01.type = static
        agent01.sources.source01.interceptors.interceptor01.key = type
        agent01.sources.source01.interceptors.interceptor01.value = test
        agent01.sources.source01.interceptors.interceptor02.type = static
        agent01.sources.source01.interceptors.interceptor02.key = timestamp
        agent01.sources.source01.interceptors.interceptor02.value = 0
        
        ## source02
        agent01.sources.source02.channels = channel01
        agent01.sources.source02.type = spooldir
        agent01.sources.source02.spoolDir = /home/log-collector/test2
        agent01.sources.source02.deletePolicy = immediate
        agent01.sources.source02.deserializer.maxLineLength = 204800
        agent01.sources.source02.interceptors = interceptor01 interceptor02
        agent01.sources.source02.interceptors.interceptor01.type = static
        agent01.sources.source02.interceptors.interceptor01.key = type
        agent01.sources.source02.interceptors.interceptor01.value = test2
        agent01.sources.source02.interceptors.interceptor02.type = static
        agent01.sources.source02.interceptors.interceptor02.key = timestamp
        agent01.sources.source02.interceptors.interceptor02.value = 0
          
        ## sink01
        agent01.sinks.sink01.channel = channel01
        agent01.sinks.sink01.type = avro
        agent01.sinks.sink01.hostname = 10.0.2.a
        agent01.sinks.sink01.port = 4545
          
        ## test
        #agent01.sinks.sink01.channel = channel01
        #agent01.sinks.sink01.type = logger
    2. HDFS server
      1. ## agent(s): agent01
        ## names of source(s), channel(s) and sink(s)
        agent01.sources = source01
        agent01.channels = channel01
        agent01.sinks = sink01
          
        ## channel01
        agent01.channels.channel01.type = memory
        agent01.channels.channel01.capacity = 100000
          
        ## source01
        agent01.sources.source01.channels = channel01
        agent01.sources.source01.type = avro
        agent01.sources.source01.bind = 10.0.2.a
        agent01.sources.source01.port = 4545
        agent01.sources.source01.interceptors = interceptor01 interceptor02
        agent01.sources.source01.interceptors.interceptor01.type = host
        agent01.sources.source01.interceptors.interceptor02.type = regex_extractor
        agent01.sources.source01.interceptors.interceptor02.regex = ^[\\s\\S]+\\[(\\d{1,2}\\/[A-Z][a-z]{2}\\/\\d{4}:\\d{2}:\\d{2}:\\d{2})\\s\\+0900\\][\\s\\S]+
        agent01.sources.source01.interceptors.interceptor02.serializers = s01
        agent01.sources.source01.interceptors.interceptor02.serializers.s01.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
        agent01.sources.source01.interceptors.interceptor02.serializers.s01.pattern = dd/MMM/yyyy:HH:mm:ss
        agent01.sources.source01.interceptors.interceptor02.serializers.s01.name = timestamp
          
        ## sink01
        agent01.sinks.sink01.channel = channel01
        agent01.sinks.sink01.type = hdfs
        agent01.sinks.sink01.hdfs.path = /log/apache/access/%{type}/yyyymmdd=%Y%m%d/hh=%H
        agent01.sinks.sink01.hdfs.filePrefix = %{host}
        agent01.sinks.sink01.hdfs.inUsePrefix = .
        agent01.sinks.sink01.hdfs.rollInterval = 300
        agent01.sinks.sink01.hdfs.rollSize = 0
        agent01.sinks.sink01.hdfs.rollCount = 0
        agent01.sinks.sink01.hdfs.idleTimeout = 120
        agent01.sinks.sink01.hdfs.writeFormat = Text
        agent01.sinks.sink01.hdfs.codeC = gzip
          
        ## test
        #agent01.sinks.sink01.channel = channel01
        #agent01.sinks.sink01.type = logger
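
To exercise the spooldir source on the log server side, it is enough to drop a finished file into the spool directory; a small helper sketch, with the path taken from the wiselog config above and the log line itself made up to match the dd/MMM/yyyy:HH:mm:ss +0900 timestamp the HDFS-side regex extracts:

  import os
  import time

  # spoolDir from the wiselog log-server config above
  spool_dir = "/data/log-collector/spool-dir"

  # a fabricated access-log line; only the bracketed timestamp format matters here
  line = "10.0.0.1 [30/Nov/2015:12:00:00 +0900] GET /index.html 200\n"

  # spooldir sources expect files that are complete and not written to afterwards
  path = os.path.join(spool_dir, "wiselog-test-%d.log" % int(time.time()))
  with open(path, "w") as f:
      f.write(line)
  print("wrote %s" % path)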

Ambari - enable kerberos (HDP)

  1. prerequisite
    1. Kerberos installation
      1. http://mungeol-heo.blogspot.kr/2015/08/kerberos.html
    2. Java Cryptography Extension (not needed if it is installed by ambari)
  2. ambari web -> admin -> kerberos -> enable kerberos -> existing MIT KDC -> provide information about the KDC and admin account -> configure kerberos -> install and test kerberos client -> confirm configuration -> stop services -> kerberize cluster -> start and test services

Ambari - configure email notification using sendmail (HDP)

  1. start sendmail on the host where ambari server runs
    1. service sendmail start
  2. ambari web -> alerts -> actions -> manage notifications -> "+"
  3. name
    1. test
  4. groups
    1. all
  5. use default value for others
  6. save
  7. test
    1. select one service/component and stop/start it

Manual Upgrade HDP 2.2 to 2.3

  1. option - local repository setup for the HDP 2.2 to 2.3 upgrade (requires temporary internet access)
    1. cd /etc/yum.repos.d
    2. rm -f ambari* HDP*
    3. wget http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.1.0/ambari.repo
    4. wget http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.3.0.0/hdp.repo
    5. yum -y install yum-utils createrepo
    6. install and run web server
      1. HTTPD
    7. mkdir -p /var/www/html/
    8. cd /var/www/html/
    9. mkdir -p ambari/centos6
    10. cd ambari/centos6
    11. reposync -r Updates-ambari-2.1.0
    12. createrepo Updates-ambari-2.1.0
    13. cd /var/www/html/
    14. mkdir -p hdp/centos6
    15. cd hdp/centos6
    16. reposync -r HDP-2.3.0.0
    17. reposync -r HDP-UTILS-1.1.0.20
    18. createrepo HDP-2.3.0.0
    19. createrepo HDP-UTILS-1.1.0.20
    20. edit DocumentRoot and Directory in /usr/local/apache2/conf/httpd.conf, or create symbolic links:
      1. cd /usr/local/apache2/htdocs
      2. ln -s /var/www/html/ambari ambari
      3. ln -s /var/www/html/hdp hdp
    21. http://hostname/ambari/centos6/Updates-ambari-2.1.0
    22. http://hostname/hdp/centos6/HDP-2.3.0.0
    23. http://hostname/hdp/centos6/HDP-UTILS-1.1.0.20
    24. option (If you have multiple repositories configured in your environment)
      1. yum -y install yum-plugin-priorities
      2. vim /etc/yum/pluginconf.d/priorities.conf
        1. [main]
        2. enabled=1
        3. gpgcheck=0
    25. stop web server at the end
      1. /usr/local/apache2/bin/apachectl -k stop
  2. Upgrading ambari 2.0 to 2.1 (HDP)
    1. http://mungeol-heo.blogspot.kr/2015/11/upgrading-ambari-20-to-21-hdp.html
  3. Upgrading ambari metrics (HDP)
    1. http://mungeol-heo.blogspot.kr/2015/11/upgrading-ambari-metrics-hdp.html
  4. ambari web -> admin > stack and versions > manage versions > + register version
  5. enter 0.0
  6. enter 'http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.3.0.0' for HDP base URL
  7. enter 'http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/centos6' for HDP-UTILS base URL
  8. save
  9. go to dashboard > admin > stack and versions > verions > install packages > ok
  10. record the component layout (a Python sketch for this REST call is at the end of this section), then browse to each service except HDFS and ZooKeeper and perform Stop
    1. http://ambari.server:8080/api/v1/clusters/cluster.name/hosts?fields=host_components
  11. log in to the namenode (active) host
    1. cd <dfs.namenode.name.dir>
    2. Make sure that only a "/current" directory and no "/previous" directory exists
    3. cp -r current /home/hdfs/
    4. su hdfs
    5. cd
    6. hdfs fsck / -files -blocks -locations > dfs-old-fsck-1.log
    7. hdfs dfsadmin -report > dfs-old-report-1.log
    8. hdfs dfsadmin -safemode enter
    9. hdfs dfsadmin -saveNamespace
    10. hdfs dfsadmin -finalizeUpgrade
  12. Using Ambari Web, stop HDFS service and stop ZooKeeper service
  13. run hdp-select set all 2.3.0.0-2557 on all hosts, then on the namenode (active) host:
    1. cd <dfs.namenode.name.dir>/current
    2. hdfs oev -i edits_inprogress_* -o edits.out
    3. Verify edits.out file. It should only have OP_START_LOG_SEGMENT transaction
  14. mkdir -p /work/upgrade_hdp_2
  15. cd /work/upgrade_hdp_2
  16. curl -O https://raw.githubusercontent.com/apache/ambari/branch-2.1/ambari-server/src/main/python/upgradeHelper.py
  17. chmod 777 upgradeHelper.py
  18. curl -O https://raw.githubusercontent.com/apache/ambari/branch-2.1/ambari-server/src/main/resources/upgrade/catalog/UpgradeCatalog_2.2_to_2.3.json
  19. python upgradeHelper.py --hostname $HOSTNAME --user $USERNAME --password $PASSWORD --clustername $CLUSTERNAME --fromStack $FROMSTACK --toStack $TOSTACK --upgradeCatalog UpgradeCatalog_2.2_to_2.3.json update-configs [config-type], then curl -O https://raw.githubusercontent.com/apache/ambari/branch-2.1/ambari-server/src/main/resources/upgrade/catalog/UpgradeCatalog_2.2_to_2.3_step2.json
    1. variables
      $HOSTNAME: Ambari Server hostname. This should be the FQDN for the host running the Ambari Server.
      $USERNAME: Ambari admin user.
      $PASSWORD: password for the user.
      $CLUSTERNAME: name of the cluster. This is the name you provided when you installed the cluster with Ambari. Log in to Ambari and the name can be found in the upper-left of the Ambari Web screen. This is case-sensitive.
      $FROMSTACK: the "from" stack. For example: 2.2
      $TOSTACK: the "to" stack. For example: 2.3
      config-type: optional; the config-type to upgrade. For example: hdfs-site. By default, all configurations are updated.
  20. upgrade zookeeper
    1. start it at ambari web
    2. mv /etc/zookeeper/conf /etc/zookeeper/conf.saved
    3. ln -s /usr/hdp/current/zookeeper-client/conf /etc/zookeeper/conf
  21. upgrade HDFS
    1. \cp -r /etc/hadoop/conf/* /etc/hadoop/2.3.0.0-2557/0/
    2. mv /etc/hadoop/conf /etc/hadoop/conf.saved
    3. ln -s /usr/hdp/current/hadoop-client/conf /etc/hadoop/conf
    4. su -l hdfs -c "/usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh start namenode -upgrade"
    5. ps -ef | grep -i NameNode
    6. su -l hdfs -c "/usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh start datanode"
    7. ps -ef | grep DataNode
    8. ambari web > services > HDFS > restart all
    9. run service check
    10. su -l hdfs -c "hdfs dfsadmin -safemode get"
      1. safe mode is off
  22. upgrade YARN and MR2 (note that you may have to upgrade other services to complete the upgrade, depending on your cluster configuration)
    1. su -l hdfs -c "hdfs dfs -mkdir -p /hdp/apps/2.3.0.0-2557/mapreduce/"
    2. su -l hdfs -c "hdfs dfs -put /usr/hdp/current/hadoop-client/mapreduce.tar.gz /hdp/apps/2.3.0.0-2557/mapreduce/."
    3. su -l hdfs -c "hdfs dfs -put /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar /hdp/apps/2.3.0.0-2557/mapreduce/."
    4. su -l hdfs -c "hdfs dfs -chown -R hdfs:hadoop /hdp" 
    5. su -l hdfs -c "hdfs dfs -chmod -R 555 /hdp/apps/2.3.0.0-2557/mapreduce" 
    6. su -l hdfs -c "hdfs dfs -chmod -R 444 /hdp/apps/2.3.0.0-2557/mapreduce/mapreduce.tar.gz" 
    7. su -l hdfs -c "hdfs dfs -chmod -R 444 /hdp/apps/2.3.0.0-2557/mapreduce/hadoop-streaming.jar"
    8. su -l yarn -c "yarn resourcemanager -format-state-store"
    9. start YARN and MR2 from ambari web
    10. run service check for YARN and MR2
  23. sudo su -l hdfs -c "hdfs dfsadmin -finalizeUpgrade"
  24. ambari-server set-current --cluster-name=dev --version-display-name=HDP-2.3.0.0
  25. option
    1. If your cluster includes Ranger
      1. cd /work/upgrade_hdp_2
        1. python upgradeHelper.py --hostname $HOSTNAME --user $USERNAME --password $PASSWORD --clustername $CLUSTERNAME --fromStack $FROMSTACK --toStack $TOSTACK --upgradeCatalog UpgradeCatalog_2.2_to_2.3_step2.json update-configs [config-type]
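
For step 10 above, the component layout can be recorded with a few lines of Python against the Ambari REST API; a minimal sketch where the Ambari host, credentials and cluster name are placeholders:

  import json
  import requests

  AMBARI = "http://ambari.server:8080"   # placeholder Ambari Server
  CLUSTER = "cluster.name"               # placeholder cluster name
  AUTH = ("admin", "admin")              # placeholder credentials

  url = "%s/api/v1/clusters/%s/hosts" % (AMBARI, CLUSTER)
  resp = requests.get(url, params={"fields": "host_components"},
                      auth=AUTH, headers={"X-Requested-By": "ambari"})
  resp.raise_for_status()

  # keep a copy of the pre-upgrade layout for later reference
  with open("component_layout_before_upgrade.json", "w") as f:
      json.dump(resp.json(), f, indent=2)
  print("saved component layout for %s" % CLUSTER)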

Upgrading ambari 2.0 to 2.1 (HDP)

  1. stop Ambari Metrics
  2. ambari-server stop
  3. all hosts
    1. ambari-agent stop
    2. cd /etc/yum.repos.d
    3. rm -f ambari* HDP*
    4. wget -nv http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.1.0/ambari.repo -O /etc/yum.repos.d/ambari.repo
  4. yum clean all; yum -y upgrade ambari-server
  5. Confirm there is only one ambari-server*.jar file in /usr/lib/ambari-server. If there is more than one JAR file with name ambari-server*.jar, move all JARs except ambari-server-2.1.0.*.jar to /tmp before proceeding with upgrade
  6. all hosts
    1. yum -y upgrade ambari-agent
    2. rpm -qa | grep ambari-agent
  7. ambari-server upgrade
  8. ambari-server start
  9. ambari-agent start (each host)
  10. http://<your.ambari.server>:8080
  11. restart services
  12. note that you may have to perform other steps to complete the upgrade, depending on your cluster configuration

Upgrading ambari metrics (HDP)

  1. stop ambari metrics
  2. all hosts
    1. yum clean all
    2. yum -y upgrade ambari-metrics-monitor ambari-metrics-hadoop-sink
    3. yum -y upgrade ambari-metrics-collector
  3. start ambari metrics
  4. restart related services

HDP - verifying memory configuration

  1. https://github.com/hortonworks/hdp-configuration-utils
  2. unzip the downloaded master.zip
  3. cd hdp-configuration-utils-master/2.1
  4. ./hdp-configuration-utils.py -h

HDP - log archival process

#!/bin/bash
# Cleaning hadoop logs older than N days (default 30) in all hadoop-related folders under /var/log
# Command line params:
#   $1 = interval in days; logs older than this are removed
DAY_ARCHIVE_THRESHOLD=${1:-30}
LOG_BASE=/var/log
COMPONENTS="accumulo ambari-agent ambari-server falcon hadoop hadoop-hdfs hadoop-mapreduce hadoop-yarn hbase hive hive-hcatalog hue knox nagios oozie storm webhcat zookeeper"
echo "Reviewing Logs for $COMPONENTS"
for i in $COMPONENTS; do
  if [ -d $LOG_BASE/$i ]; then
    echo "Removing logs for $LOG_BASE/$i that are $DAY_ARCHIVE_THRESHOLD (or more) days old"
    find $LOG_BASE/$i -mtime +$DAY_ARCHIVE_THRESHOLD -exec rm -f {} \;
  else
    echo "Component $i logs not found on this server"
  fi
done
# Cleanup OS components (rotated files such as /var/log/messages-YYYYMMDD)
OS_COMPONENTS="messages maillog secure spooler"
for i in $OS_COMPONENTS; do
  echo "Removing logs for $LOG_BASE/$i that are $DAY_ARCHIVE_THRESHOLD (or more) days old"
  find $LOG_BASE/$i-* -mtime +$DAY_ARCHIVE_THRESHOLD -exec rm -f {} \; 2>/dev/null
done

HDP - HDFS performance test

  1. sudo su hdfs -l -c 'yarn jar /usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000'
  2. sudo su hdfs -l -c 'yarn jar /usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1000'
  3. sudo su hdfs -l -c 'yarn jar /usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean'

HDP - Hadoop HA

  1. namenode HA
    1. Check to make sure you have at least three hosts in your cluster and are running at least three ZooKeeper servers
    2. ambari web -> services -> HDFS -> service actions -> enable namenode HA -> nameservice ID -> select hosts -> review
    3. create checkpoint
      1. login to the namenode host
      2. sudo su hdfs -l -c 'hdfs dfsadmin -safemode enter'
      3. sudo su hdfs -l -c 'hdfs dfsadmin -saveNamespace'
    4. configure components
    5. initialize journalnodes
      1. login to the namenode host
      2. sudo su hdfs -l -c 'hdfs namenode -initializeSharedEdits'
    6. start components
    7. initialize metadata
      1. login to the namenode host
      2. sudo su hdfs -l -c 'hdfs zkfc -formatZK'
      3. login to the additional namenode host
      4. sudo su hdfs -l -c 'hdfs namenode -bootstrapStandby'
    8. finalize HA setup
  2. resourcemanager HA
    1. Check to make sure you have at least three hosts in your cluster and are running at least three ZooKeeper servers
    2. ambari web -> services -> YARN -> service actions -> enable resourcemanager HA -> get started -> select hosts -> review -> configure components

HDP 2.3 installation

  1. software requirements
    1. yum, rpm, scp, curl, unzip, tar, wget, OpenSSL (v1.01, build 16 or later), Python v2.6
    2. option
      1. chmod 755 /usr/bin/yum /bin/rpm /usr/bin/scp /usr/bin/curl /usr/bin/unzip /bin/tar /usr/bin/wget /usr/bin/ssh /usr/bin/python
  2. check java version
    1. Oracle JDK 1.8 64-bit (minimum JDK 1.8_40) (default)
    2. Oracle JDK 1.7 64-bit (minimum JDK 1.7_67)
    3. OpenJDK 8 64-bit
    4. OpenJDK 7 64-bit
  3. ulimit -Sn / ulimit -Hn
    1. 65535
  4. configure FQDN at /etc/hosts, /etc/sysconfig/network and using the hostname command
  5. set up password-less SSH
    1. ssh-keygen
    2. Copy the SSH Public Key (id_rsa.pub) to the root account on your target hosts
      1. option
        1. vim /etc/ssh/sshd_config
        2. PermitRootLogin yes
        3. service sshd restart
        4. vim /etc/ssh/sshd_config
    3. all hosts
      1. cat id_rsa.pub >> ~/.ssh/authorized_keys
    4. option
      1. chmod 700 ~/.ssh 
      2. chmod 600 ~/.ssh/authorized_keys
  6. NTP (option: yum -y install sudo first)
    1. yum -y install ntp
    2. chkconfig ntpd on
    3. service ntpd start
  7. SELinux
    1. vim /etc/selinux/config
    2. if it is enabled
      1. set SELINUX=disabled
      2. command: setenforce 0
    3. if PackageKit is installed
      1. vim /etc/yum/pluginconf.d/refresh-packagekit.conf
      2. enabled=0
  8. umask
    1. umask
      1. 0022
    2. if it is not 0022
      1. vim /etc/profile
      2. umask 022
  9. THP
    1. check
      1. cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
      2. cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
    2. command and /etc/rc.local
      1. echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
      2. echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
  10. permission (option)
    1. ambari
      1. /bin/hostname
      2. /usr/bin/sudo
    2. namenode
      1. /usr/bin/which
      2. /bin/ps
      3. /bin/df
    3. ambari metrics
      1. /usr/bin/gcc
      2. /usr/bin/ld
      3. /etc/centos-release
    4. yarn, MR2
      1. /tmp (check)
        1. /etc/fstab (default)
        2. /proc/mounts (current)
        3. mount -o remount,exec /tmp
      2. /usr/bin/curl
    5. knox
      1. /bin/netstat
    6. all
      1. chmod 755 /bin/hostname /usr/bin/sudo /usr/bin/which /bin/ps /bin/df /usr/bin/gcc /usr/bin/ld /etc/centos-release /usr/bin/curl /bin/netstat
  11. Install ambari 2.1.0 using non-default databases (HDP)
    1. http://mungeol-heo.blogspot.kr/2015/11/install-mariadb-httpmungeol-heo.html
  12. http://<your.ambari.server>:8080 -> admin / admin -> launch install wizard -> cluster name -> HDP 2.3 -> install options -> confirm hosts -> choose services -> assign masters -> assign slaves and clients -> customize services -> review -> install, start and test -> summary

Install ambari 2.1.0 using non-default databases (HDP)

  1. install mariaDB
    1. http://mungeol-heo.blogspot.kr/2015/07/mariadb.html
  2. mysql-connector-java
    1. yum -y install mysql-connector-java
      1. use latest version if "option sql_select_limit=default" error occurs
        1. cd /usr/share
        2. mkdir java
        3. cd java
        4. wget http://cdn.mysql.com/Downloads/Connector-J/mysql-connector-java-5.1.36.zip
        5. unzip mysql-connector-java-5.1.36.zip
        6. cp mysql-connector-java-5.1.36/mysql-connector-java-5.1.36-bin.jar .
        7. ln -s mysql-connector-java-5.1.36-bin.jar mysql-connector-java.jar
    2. ls /usr/share/java/mysql-connector-java.jar
    3. chmod 644 /usr/share/java/mysql-connector-java.jar (option)
  3. wget -nv http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.1.0/ambari.repo -O /etc/yum.repos.d/ambari.repo
  4. yum -y install ambari-server
  5. service postgresql stop
  6. RDB setting
    1. mysql -u root -p
    2. CREATE USER 'ambari'@'%' IDENTIFIED BY 'bigdata';
    3. GRANT ALL PRIVILEGES ON *.* TO 'ambari'@'%';
    4. CREATE USER 'ambari'@'localhost' IDENTIFIED BY 'bigdata';
    5. GRANT ALL PRIVILEGES ON *.* TO 'ambari'@'localhost';
    6. CREATE USER 'ambari'@'bigdata-dev01.co.kr' IDENTIFIED BY 'bigdata';
    7. GRANT ALL PRIVILEGES ON *.* TO 'ambari'@'bigdata-dev01.co.kr';
    8. FLUSH PRIVILEGES;
    9. mysql -u ambari -p
    10. CREATE DATABASE ambari;
    11. USE ambari;
    12. SOURCE /var/lib/ambari-server/resources/Ambari-DDL-MySQL-CREATE.sql;
  7. ambari-server setup
    1. Select y at Enter advanced database configuration
    2. enter 3 (mysql)
    3. select default values for all left
  8. ambari-server start

HDP 2.2 installation

  1. yum -y install java-1.7.0-openjdk (option)
  2. ulimit -Sn / ulimit -Hn; configure FQDN at /etc/hosts, /etc/sysconfig/network and using the hostname command
    1. ulimit values should be 65535
  3. ssh-keygen (passwordless connection)
  4. NTP (option: yum -y install sudo first)
    1. yum -y install ntp
    2. chkconfig ntpd on
    3. service ntpd start
  5. THP
    1. check
      1. cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
      2. cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
    2. command and /etc/rc.local
      1. echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
      2. echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
  6. permission (option)
    1. ambari
      1. /bin/hostname
      2. /usr/bin/sudo
    2. namenode
      1. /usr/bin/which
      2. /bin/ps
      3. /bin/df
    3. ambari metrics
      1. /usr/bin/gcc
      2. /usr/bin/ld
      3. /etc/centos-release
    4. yarn, MR2
      1. /tmp (check)
        1. /etc/fstab (default)
        2. /proc/mounts (current)
        3. mount -o remount,exec /tmp
      2. /usr/bin/curl
    5. knox
      1. /bin/netstat
  7. Install ambari 2.0.1 (HDP)
    1. http://mungeol-heo.blogspot.kr/2015/11/install-ambari-201-hdp.html
  8. install and run services from ambari web

Install ambari 2.0.1 (HDP)

  1. wget -nv http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.0.1/ambari.repo -O /etc/yum.repos.d/ambari.repo
  2. yum -y install ambari-server
  3. ambari-server setup
  4. ambari-server start