- Creating password file
- echo -n password > .password
- hdfs dfs -put .password /user/$USER/
- Installing the MySQL JDBC driver in CDH
- mkdir -p /var/lib/sqoop
- chown sqoop:sqoop /var/lib/sqoop
- chmod 755 /var/lib/sqoop
- download the JDBC driver from http://dev.mysql.com/downloads/connector/j/5.1.html
- sudo cp mysql-connector-java-<version>/mysql-connector-java-<version>-bin.jar /var/lib/sqoop/
Monday, November 30, 2015
Sqoop setting (CDH)
Sqoop - use case example
- command
hdfs dfs -rm -r -skipTrash /user/hive/warehouse/member_company; sqoop --options-file mysql2hdfs.option --query "SELECT * FROM test WHERE count > 10 and \$CONDITIONS" --target-dir /user/hive/warehouse/member_company
sqoop export --connect jdbc:mysql://10.0.2.a/test?characterEncoding=utf-8 --username root --password 'yourpassword' --table r_input --export-dir /result/r_input --input-fields-terminated-by '\001' --outdir /tmp/sqoop-mungeol/code/
- options
import --connect jdbc:mysql://10.0.1.b:3306/test --username test --password tkfkadlselqlt --split-by mem_idx -m 3 --outdir /tmp/sqoop-mungeol/code/
Spark test
- spark + mariaDB test
- SPARK_CLASSPATH=mysql-connector-java-5.1.34-bin.jar bin/pyspark
- df = sqlContext.load(source="jdbc", url="jdbc:mysql://10.0.2.a/test?user=root&password=yourpassword", dbtable="r_input_day")
- df.first()
- spark + elasticsearch test
- SPARK_CLASSPATH=/data/elasticsearch-hadoop-2.1.0.Beta4/dist/elasticsearch-hadoop-2.1.0.Beta4.jar ./bin/pyspark
- conf = {"es.resource":"sec-team/access-report"}
- rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat", "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
- rdd.first()
- spark streaming test
- network_wordcount.py
from __future__ import print_function

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 1)
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    counts = lines.flatMap(lambda line: line.split(" "))\
                  .map(lambda word: (word, 1))\
                  .reduceByKey(lambda a, b: a + b)
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
- nc -lk 9999
- spark-submit network_wordcount.py localhost 9999
Spark basis
- hardware provisioning
- storage systems
- run on the same nodes as HDFS
- yarn
- with a fixed amount memory and cores dedicated to Spark on each node
- run on different nodes in the same local-area network as HDFS
- run computing jobs on different nodes than the storage system
- local disks
- 4-8 disks per node
- without RAID (just as separate mount points)
- noatime option
- configure spark.local.dir to be a comma-separated list of the local disks (see the PySpark sketch at the end of this hardware provisioning list)
- same disks as HDFS, if running HDFS
- memory
- 8 GB - hundreds of GB
- 75% of the memory
- if memory > 200 GB, then run multiple worker JVMs per node
- standalone mode
- conf/spark-env.sh
- SPARK_WORKER_INSTANCES: set the number of workers per node
- SPARK_WORKER_CORES: the number of cores per worker
- network
- >= 10 gigabit
- CPU cores
- 8-16 cores per machine, or more
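- A minimal PySpark sketch of how settings like spark.local.dir and the per-executor memory could be wired up programmatically; the mount points and sizes below are placeholder assumptions, and in practice these values usually go into conf/spark-env.sh or spark-defaults.conf instead.
from pyspark import SparkConf, SparkContext

# Hypothetical local disk mount points; replace with the separate disks on your nodes.
conf = (SparkConf()
        .setAppName("hardware-provisioning-example")
        # comma-separated list of local disks used for shuffle and spill files
        .set("spark.local.dir", "/data1/spark,/data2/spark,/data3/spark")
        # example slice of the node's memory given to each executor
        .set("spark.executor.memory", "8g"))
sc = SparkContext(conf=conf)
print(conf.get("spark.local.dir"))
sc.stop()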
- reference
- storage systems
- third-party hadoop distributions
- CDH
- HDP (recommended)
- inheriting cluster configuration
- spark-env.sh
- HADOOP_CONF_DIR
- hdfs-site.xml
- core-site.xml
- reference
- external tools
- cluster-wide monitoring tool
- Ganglia
- OS profiling tools
- dstat
- iostat
- iotop
- JVM utilities
- jstack
- jmap
- jstat
- jconsole
- optimization
- problem: out of memory
- sysctl -w vm.max_map_count=65535
- spark.storage.memoryMapThreshold 131072
- problem: too many open files
- sysctl -w fs.file-max=1000000
- spark.shuffle.consolidateFiles true
- spark.shuffle.manager sort
- problem: connection reset by peer
- -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=12 -XX:NewRatio=3 -XX:SurvivorRatio=3
- problem: error communicating with MapOutputTracker
- spark.akka.askTimeout 120
- spark.akka.lookupTimeout 120
- configuration
- 75% of a machine's memory (standalone)
- minimum executor heap size: 8 GB
- maximum executor heap size: 40 GB / under 45 GB (watch GC)
- kryo serialization
- parallel (old) / CMS / G1 GC
- pypy > cpython
- notification
- in-memory usage is not the same as the on-disk data size (often 2-3x bigger)
- prefer reduceByKey over groupByKey (see the sketch below)
- there are limitations when using Python with Spark Streaming (at least for now)
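- A small PySpark sketch of the reduceByKey-over-groupByKey point above: both give the same word counts, but reduceByKey combines values on each partition before the shuffle, so far less data crosses the network.
from pyspark import SparkContext

sc = SparkContext(appName="reduce-vs-group")
pairs = sc.parallelize(["a", "b", "a", "c", "b", "a"]).map(lambda w: (w, 1))

# preferred: partial sums are computed locally before the shuffle
counts_reduce = pairs.reduceByKey(lambda a, b: a + b)

# works, but ships every (word, 1) pair across the network before summing
counts_group = pairs.groupByKey().mapValues(sum)

print(sorted(counts_reduce.collect()))
print(sorted(counts_group.collect()))
sc.stop()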
Solr - HDFS configuration (HDP)
- The following changes only need to be completed for the first Solr node that is started
- vim /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_drive_schema_configs/conf/solrconfig.xml
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://<host:port>/user/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">1</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
  <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.write.enabled">false</bool>
  <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
  <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
  <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
</directoryFactory>
- add <str name="solr.hdfs.confdir">/usr/hdp/current/hadoop-client/conf</str>, if namenode HA is configured
- set lockType to hdfs
Kafka test
- Cluster Setting
- set up a zookeeper cluster
- perform the following on each cluster node
- cd $KAFKA_HOME
- vim config/server.properties
- edit
- bin/kafka-server-start.sh config/server.properties & (or run it as a daemon: sudo bin/kafka-server-start.sh -daemon config/test.properties &)
- Basic Test
- bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic rep-test
- bin/kafka-topics.sh --list --zookeeper localhost:2181
- bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic rep-test
- bin/kafka-console-producer.sh --broker-list localhost:9092 --topic rep-test
test 1
test 2
test 3
- bin/kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic rep-test
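- The same produce/consume round trip can also be scripted from Python; a minimal sketch assuming the third-party kafka-python package is installed and a broker is reachable at localhost:9092.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")
for msg in [b"test 1", b"test 2", b"test 3"]:
    producer.send("rep-test", msg)  # the topic created above
producer.flush()

consumer = KafkaConsumer("rep-test",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",  # read from the beginning
                         consumer_timeout_ms=5000)      # stop iterating after 5s of silence
for record in consumer:
    print(record.value)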
- Fault Tolerance Test
- kill the leader
- do the basic test again
- start the killed server again
- do the basic test again
Kafka - configuration example
broker.id=81
port=9092
host.name=10.0.2.a
num.io.threads=8
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/data/kafka-logs
log.retention.hours=168
log.cleaner.enable=false
zookeeper.connect=localhost:2181,10.0.2.a:2181,10.0.2.b:2181,10.0.2.c:2181
controlled.shutdown.enable=true
auto.leader.rebalance.enable=true

# Replication configurations
num.replica.fetchers=4
replica.fetch.max.bytes=1048576
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.socket.timeout.ms=30000
replica.socket.receive.buffer.bytes=65536
replica.lag.time.max.ms=10000
replica.lag.max.messages=4000
controller.socket.timeout.ms=30000
controller.message.queue.size=10

# Log configuration
num.partitions=8
message.max.bytes=1000000
auto.create.topics.enable=false
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760
log.flush.interval.ms=10000
log.flush.interval.messages=20000
log.flush.scheduler.interval.ms=2000
log.roll.hours=168
log.retention.check.interval.ms=300000
log.segment.bytes=1073741824

# ZK configuration
zookeeper.connection.timeout.ms=6000
zookeeper.sync.time.ms=2000

# Socket server configuration
#num.io.threads=8
num.network.threads=8
socket.request.max.bytes=104857600
socket.receive.buffer.bytes=1048576
socket.send.buffer.bytes=1048576
queued.max.requests=16
fetch.purgatory.purge.interval.requests=100
producer.purgatory.purge.interval.requests=100
Kafka basis
- important properties
- broker
- broker.id
- log.dirs
- host.name
- zookeeper.connect
- controlled.shutdown.enable=true
- auto.leader.rebalance.enable=true
- consumer
- producer
- metadata.broker.list
- request.required.acks
- producer.type
- compression.codec
- topic.metadata.refresh.interval.ms
- batch.num.messages
- topic.id
- topic-level setting
- bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic my-topic --partitions 1 --replication-factor 1 --config max.message.bytes=64000 --config flush.messages=1
- bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic my-topic --config max.message.bytes=128000
- bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic my-topic --deleteConfig max.message.bytes
- Operations
- bin/kafka-topics.sh --zookeeper zk_host:port/chroot --create --topic my_topic_name --partitions 20 --replication-factor 3 --config x=y
- bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --partitions 40
- bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --config x=y
- bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --deleteConfig x
- bin/kafka-topics.sh --zookeeper zk_host:port/chroot --delete --topic my_topic_name
- bin/kafka-preferred-replica-election.sh --zookeeper zk_host:port/chroot
- bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect localhost:2181 --group test
- production server configuration (from the Kafka documentation)
# Replication configurations
num.replica.fetchers=4
replica.fetch.max.bytes=1048576
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.socket.timeout.ms=30000
replica.socket.receive.buffer.bytes=65536
replica.lag.time.max.ms=10000
replica.lag.max.messages=4000
controller.socket.timeout.ms=30000
controller.message.queue.size=10

# Log configuration
num.partitions=8
message.max.bytes=1000000
auto.create.topics.enable=true
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760
log.retention.hours=168
log.flush.interval.ms=10000
log.flush.interval.messages=20000
log.flush.scheduler.interval.ms=2000
log.roll.hours=168
log.cleanup.interval.mins=30
log.segment.bytes=1073741824

# ZK configuration
zookeeper.connection.timeout.ms=6000
zookeeper.sync.time.ms=2000

# Socket server configuration
num.io.threads=8
num.network.threads=8
socket.request.max.bytes=104857600
socket.receive.buffer.bytes=1048576
socket.send.buffer.bytes=1048576
queued.max.requests=16
fetch.purgatory.purge.interval.requests=100
producer.purgatory.purge.interval.requests=100
- Hardware
- disk throughput
- 8x7200 rpm SATA drives
- higher RPM SAS drives
- OS setting
- file descriptors
- max socket buffer size
Hive use case
- Software
- Hive 0.13.0 setup
- HDP 2.1 General Availability
- Hadoop 2.4.0
- Tez 0.4.0
- Hive 0.13.0
HDP was deployed using Ambari 1.5.1. For the most part, the cluster used the Ambari defaults (except where noted below). Hive 0.13.0 runs were done using Java 7 (default JVM). Tez and MapReduce were tuned to process all queries using 4 GB containers at a target container-to-disk ratio of 2.0. The ratio is important because it minimizes disk thrash and maximizes throughput.
Other settings:
- yarn.nodemanager.resource.memory-mb was set to 49152
- Default virtual memory for a job’s map-task and reduce-task were set to 4096
- hive.tez.container.size was set to 4096
- hive.tez.java.opts was set to -Xmx3800m
- Tez app masters were given 8 GB
- mapreduce.map.java.opts and mapreduce.reduce.java.opts were set to -Xmx3800m. This is smaller than 4096 to allow for some garbage collection overhead
- hive.auto.convert.join.noconditionaltask.size was set to 1252698795
Note: this is 1/3 of the Xmx value, about 1.2 GB (see the sketch after the list of optimizations below).
The following additional optimizations were used for Hive 0.13.0:
- Vectorized Query enabled
- ORCFile formatted data
- Map-join auto conversion enabled
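- A short sketch of how these memory numbers relate to each other; the values are copied from the settings above and the comparison is only illustrative.
# Rough relationship between the Hive/Tez memory settings above.
container_mb = 4096          # hive.tez.container.size
xmx_mb = 3800                # hive.tez.java.opts (-Xmx3800m), leaving ~300 MB of the container for GC overhead
configured = 1252698795      # hive.auto.convert.join.noconditionaltask.size, in bytes

print(xmx_mb * 1024 * 1024 // 3)   # ~1.33e9 bytes, i.e. the configured value is roughly one third of the heap
print(configured / (1024 ** 3))    # ~1.17, i.e. "about 1.2 GB"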
- Hardware
- 20 physical nodes, each with:
- 2x Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz for total of 16 CPU cores/machine
- Hyper-threading enabled
- 256GB RAM per node
- 6x 4TB WDC WD4000FYYZ-0 drives per node
- 10 Gigabit interconnect between the nodes
Notes: based on the YARN NodeManager memory resource setting used below, only 48 GB of RAM per node was dedicated to query processing; the remaining 200 GB of RAM were available for system caches and HDFS.
Linux configurations:
- /proc/sys/net/core/somaxconn = 512
- /proc/sys/vm/dirty_writeback_centisecs = 6000
- /proc/sys/vm/swappiness = 0
- /proc/sys/vm/zone_reclaim_mode = 0
- /sys/kernel/mm/redhat_transparent_hugepage/defrag = never
- /sys/kernel/mm/redhat_transparent_hugepage/khugepaged/defrag = no
- /sys/kernel/mm/transparent_hugepage/khugepaged/defrag = 0
Hive - row_sequence()
- add jar /opt/cloudera/parcels/CDH/jars/hive-contrib-0.13.1-cdh5.3.0.jar;
- CREATE TEMPORARY FUNCTION row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
Hive installation (HDP)
- mysql-connector-java (skip this step if you have installed it at 'HDP 2.3 installation')
- 2015.07.28 -> HDP 2.3 installation -> mysql-connector-java
- ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar
- RDB configuration
- mysql -u root -p
- CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hroqkf';
- GRANT ALL PRIVILEGES ON *.* TO 'hive'@'localhost';
- CREATE USER 'hive'@'%' IDENTIFIED BY 'hroqkf';
- GRANT ALL PRIVILEGES ON *.* TO 'hive'@'%';
- CREATE USER 'hive'@'bigdata-dev03.co.kr' IDENTIFIED BY 'hroqkf';
- be sure the hostname is the host where you installed hive metastore.
- GRANT ALL PRIVILEGES ON *.* TO 'hive'@'bigdata-dev03.co.kr';
- FLUSH PRIVILEGES;
- CREATE DATABASE hive;
- ambari web -> add service -> choose hive and tez -> assign masters -> assign slaves and clients
- customize services
- hive -> advanced -> hive metastore -> hive database -> existing mysql database -> database host, database name, username, password -> test connection
- configure identities -> review -> install, start and test -> summary -> complete
Hive HA (HDP)
- prerequisite
- The relational database that backs the Hive Metastore itself should also be made highly available using best practices defined for the database system in use
- metastore
- ambari web -> services -> hive -> service actions -> add hive metastore -> choose host -> restart related services
- install RDB client on the host where you installed hive metastore
- install MariaDB
- http://mungeol-heo.blogspot.kr/2015/07/mariadb.html
- RDB setting
- mysql -u root -p
- CREATE USER 'hive'@'bigdata-dev02.co.kr' IDENTIFIED BY 'hroqkf';
- be sure the hostname is the host where you installed hive metastore.
- GRANT ALL PRIVILEGES ON *.* TO 'hive'@'bigdata-dev02.co.kr';
- FLUSH PRIVILEGES;
- hiveserver2
- ambari web -> services -> hive -> service actions -> add hiveserver2 -> choose host -> restart related services
- webhcat
- ambari web -> hosts -> click hostname -> add -> webhcat server -> restart related services
Flume installation
- wget http://apache.mirror.cdnetworks.com/flume/1.5.0.1/apache-flume-1.5.0.1-bin.tar.gz
- tar xvf apache-flume-1.5.0.1-bin.tar.gz
- cd apache-flume-1.5.0.1-bin
- cp conf/flume-env.sh.template conf/flume-env.sh
- vim conf/flume-env.sh
- JAVA_OPTS="-Xms1g -Xmx1g"
- cp conf/flume-conf.properties.template conf/flume-conf.properties
- vim conf/flume-conf.properties
- bin/flume-ng agent -n agent01 -c conf -f conf/flume-conf.properties
Flume - use case example
- wiselog access log
- log server
## agent(s): agent01
## names of source(s), channel(s) and sink(s)
agent01.sources = source01
agent01.channels = channel01
agent01.sinks = sink01

## channel01
agent01.channels.channel01.type = memory
agent01.channels.channel01.capacity = 100000

## source01
agent01.sources.source01.channels = channel01
agent01.sources.source01.type = spooldir
agent01.sources.source01.spoolDir = /data/log-collector/spool-dir
agent01.sources.source01.deletePolicy = immediate
agent01.sources.source01.basenameHeader = true
agent01.sources.source01.basenameHeaderKey = type
agent01.sources.source01.deserializer.maxLineLength = 10240
agent01.sources.source01.interceptors = interceptor02
agent01.sources.source01.interceptors.interceptor02.type = static
agent01.sources.source01.interceptors.interceptor02.key = timestamp
agent01.sources.source01.interceptors.interceptor02.value = 0

## sink01
agent01.sinks.sink01.channel = channel01
agent01.sinks.sink01.type = avro
agent01.sinks.sink01.hostname = <hostname>
agent01.sinks.sink01.port = 4545

## test
#agent01.sinks.sink01.channel = channel01
#agent01.sinks.sink01.type = logger
- HDFS server
## agent(s): agent01
## names of source(s), channel(s) and sink(s)
agent01.sources = source01
agent01.channels = channel01
agent01.sinks = sink01

## channel01
agent01.channels.channel01.type = memory
agent01.channels.channel01.capacity = 100000

## source01
agent01.sources.source01.channels = channel01
agent01.sources.source01.type = avro
agent01.sources.source01.bind = <hostname>
agent01.sources.source01.port = 4545
agent01.sources.source01.interceptors = interceptor01 interceptor02
agent01.sources.source01.interceptors.interceptor01.type = host
agent01.sources.source01.interceptors.interceptor02.type = regex_extractor
agent01.sources.source01.interceptors.interceptor02.regex = ^\\d+\\.\\d+.\\d+.\\d+\\s\\[(\\d{2}\\/[a-zA-Z]{3}\\/\\d{4}:\\d{2}:\\d{2}:\\d{2})\\s\\+0900\\]\\s
agent01.sources.source01.interceptors.interceptor02.serializers = s01
agent01.sources.source01.interceptors.interceptor02.serializers.s01.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
agent01.sources.source01.interceptors.interceptor02.serializers.s01.pattern = dd/MMM/yyyy:HH:mm:ss
agent01.sources.source01.interceptors.interceptor02.serializers.s01.name = timestamp

## sink01
agent01.sinks.sink01.channel = channel01
agent01.sinks.sink01.type = hdfs
agent01.sinks.sink01.hdfs.path = /log/wiselog/access/%{type}/yyyymmdd=%Y%m%d/hh=%H
agent01.sinks.sink01.hdfs.filePrefix = %{host}
agent01.sinks.sink01.hdfs.inUsePrefix = .
agent01.sinks.sink01.hdfs.rollInterval = 300
agent01.sinks.sink01.hdfs.rollSize = 0
agent01.sinks.sink01.hdfs.rollCount = 0
agent01.sinks.sink01.hdfs.idleTimeout = 60
agent01.sinks.sink01.hdfs.writeFormat = Text
agent01.sinks.sink01.hdfs.codeC = gzip

## test
#agent01.sinks.sink01.channel = channel01
#agent01.sinks.sink01.type = logger
- apache access log
- log server
## agent(s): agent01
## names of source(s), channel(s) and sink(s)
agent01.sources = source01 source02
agent01.channels = channel01
agent01.sinks = sink01

## channel01
agent01.channels.channel01.type = memory
agent01.channels.channel01.capacity = 100000

## source01
agent01.sources.source01.channels = channel01
agent01.sources.source01.type = spooldir
agent01.sources.source01.spoolDir = /home/log-collector/test
agent01.sources.source01.deletePolicy = immediate
agent01.sources.source01.deserializer.maxLineLength = 204800
agent01.sources.source01.interceptors = interceptor01 interceptor02
agent01.sources.source01.interceptors.interceptor01.type = static
agent01.sources.source01.interceptors.interceptor01.key = type
agent01.sources.source01.interceptors.interceptor01.value = test
agent01.sources.source01.interceptors.interceptor02.type = static
agent01.sources.source01.interceptors.interceptor02.key = timestamp
agent01.sources.source01.interceptors.interceptor02.value = 0

## source02
agent01.sources.source02.channels = channel01
agent01.sources.source02.type = spooldir
agent01.sources.source02.spoolDir = /home/log-collector/test2
agent01.sources.source02.deletePolicy = immediate
agent01.sources.source02.deserializer.maxLineLength = 204800
agent01.sources.source02.interceptors = interceptor01 interceptor02
agent01.sources.source02.interceptors.interceptor01.type = static
agent01.sources.source02.interceptors.interceptor01.key = type
agent01.sources.source02.interceptors.interceptor01.value = test2
agent01.sources.source02.interceptors.interceptor02.type = static
agent01.sources.source02.interceptors.interceptor02.key = timestamp
agent01.sources.source02.interceptors.interceptor02.value = 0

## sink01
agent01.sinks.sink01.channel = channel01
agent01.sinks.sink01.type = avro
agent01.sinks.sink01.hostname = 10.0.2.a
agent01.sinks.sink01.port = 4545

## test
#agent01.sinks.sink01.channel = channel01
#agent01.sinks.sink01.type = logger
- HDFS server
## agent(s): agent01
## names of source(s), channel(s) and sink(s)
agent01.sources = source01
agent01.channels = channel01
agent01.sinks = sink01

## channel01
agent01.channels.channel01.type = memory
agent01.channels.channel01.capacity = 100000

## source01
agent01.sources.source01.channels = channel01
agent01.sources.source01.type = avro
agent01.sources.source01.bind = 10.0.2.a
agent01.sources.source01.port = 4545
agent01.sources.source01.interceptors = interceptor01 interceptor02
agent01.sources.source01.interceptors.interceptor01.type = host
agent01.sources.source01.interceptors.interceptor02.type = regex_extractor
agent01.sources.source01.interceptors.interceptor02.regex = ^[\\s\\S]+\\[(\\d{1,2}\\/[A-Z][a-z]{2}\\/\\d{4}:\\d{2}:\\d{2}:\\d{2})\\s\\+0900\\][\\s\\S]+
agent01.sources.source01.interceptors.interceptor02.serializers = s01
agent01.sources.source01.interceptors.interceptor02.serializers.s01.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
agent01.sources.source01.interceptors.interceptor02.serializers.s01.pattern = dd/MMM/yyyy:HH:mm:ss
agent01.sources.source01.interceptors.interceptor02.serializers.s01.name = timestamp

## sink01
agent01.sinks.sink01.channel = channel01
agent01.sinks.sink01.type = hdfs
agent01.sinks.sink01.hdfs.path = /log/apache/access/%{type}/yyyymmdd=%Y%m%d/hh=%H
agent01.sinks.sink01.hdfs.filePrefix = %{host}
agent01.sinks.sink01.hdfs.inUsePrefix = .
agent01.sinks.sink01.hdfs.rollInterval = 300
agent01.sinks.sink01.hdfs.rollSize = 0
agent01.sinks.sink01.hdfs.rollCount = 0
agent01.sinks.sink01.hdfs.idleTimeout = 120
agent01.sinks.sink01.hdfs.writeFormat = Text
agent01.sinks.sink01.hdfs.codeC = gzip

## test
#agent01.sinks.sink01.channel = channel01
#agent01.sinks.sink01.type = logger
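- To see what the regex_extractor interceptor above pulls out of an access-log line, here is a small Python sketch; the sample line is made up for illustration, and the pattern is the same one as in the config with the property-file backslash escaping removed.
import re

pattern = r'^[\s\S]+\[(\d{1,2}/[A-Z][a-z]{2}/\d{4}:\d{2}:\d{2}:\d{2})\s\+0900\][\s\S]+'

# Hypothetical Apache access-log line.
line = '10.0.2.1 - - [30/Nov/2015:10:15:30 +0900] "GET /index.html HTTP/1.1" 200 2326'

match = re.match(pattern, line)
if match:
    # Flume's RegexExtractorInterceptorMillisSerializer parses this group with dd/MMM/yyyy:HH:mm:ss
    # and stores the result in the event's timestamp header.
    print(match.group(1))  # 30/Nov/2015:10:15:30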
Ambari - enable kerberos (HDP)
- prerequisite
- Kerberos installation
- http://mungeol-heo.blogspot.kr/2015/08/kerberos.html
- Java Cryptography Extension (no need if it is installed by ambari)
- ambari web -> admin -> kerberos -> enable kerberos -> existing MIT KDC -> provide information about the KDC and admin account -> configure kerberos -> install and test kerberos client -> confirm configuration -> stop services -> kerberize cluster -> start and test services
Ambari - configure email notification using sendmail (HDP)
- start sendmail on the host where ambari server runs
- service sendmail start
- ambari web -> alerts -> actions -> manage notifications -> "+"
- name
- test
- groups
- all
- use default value for others
- save
- test
- select one service/component and stop/start it
Manual Upgrade HDP 2.2 to 2.3
- option - local repository setting for HDP 2.2 update (temporary internet access)
- cd /etc/yum.repos.d
- rm -f ambari* HDP*
- wget http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.1.0/ambari.repo
- wget http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.3.0.0/hdp.repo
- yum -y install yum-utils createrepo
- install and run web server
- mkdir -p /var/www/html/
- cd /var/www/html/
- mkdir -p ambari/centos6
- cd ambari/centos6
- reposync -r Updates-ambari-2.1.0
- createrepo Updates-ambari-2.1.0
- cd /var/www/html/
- mkdir -p hdp/centos6
- cd hdp/centos6
- reposync -r HDP-2.3.0.0
- reposync -r HDP-UTILS-1.1.0.20
- createrepo HDP-2.3.0.0
- createrepo HDP-UTILS-1.1.0.20
- edit DocumentRoot and Directory at /usr/local/apache2/conf/httpd.conf or
- cd /usr/local/apache2/htdocs
- ln -s /var/www/html/ambari ambari
- ln -s /var/www/html/hdp hdp
- http://hostname/ambari/centos6/Updates-ambari-2.1.0
- http://hostname/hdp/centos6/HDP-2.3.0.0
- http://hostname/hdp/centos6/HDP-UTILS-1.1.0.20
- option (If you have multiple repositories configured in your environment)
- yum -y install yum-plugin-priorities
- vim /etc/yum/pluginconf.d/priorities.conf
- [main]
- enabled=1
- gpgcheck=0
- stop web server at the end
- /usr/local/apache2/bin/apachectl -k stop
- Upgrading ambari 2.0 to 2.1 (HDP)
- http://mungeol-heo.blogspot.kr/2015/11/upgrading-ambari-20-to-21-hdp.html
- Upgrading ambari metrics (HDP)
- http://mungeol-heo.blogspot.kr/2015/11/upgrading-ambari-metrics-hdp.html
- ambari web -> admin > stack and versions > manage versions > + register version
- enter 0.0
- enter 'http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.3.0.0' for HDP base URL
- enter 'http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/centos6' for HDP-UTILS base URL
- save
- go to dashboard > admin > stack and versions > versions > install packages > ok
- record the component layout
- browse to each service except HDFS and ZooKeeper and perform stop
- namenode (active) host
- log in to ambari
- cd <dfs.namenode.name.dir>
- Make sure that only a "/current" directory and no "/previous" directory exists
- cp -r current /home/hdfs/
- su hdfs
- cd
- hdfs fsck / -files -blocks -locations > dfs-old-fsck-1.log
- hdfs dfsadmin -report > dfs-old-report-1.log
- hdfs dfsadmin -safemode enter
- hdfs dfsadmin -saveNamespace
- hdfs dfsadmin -finalizeUpgrade
- Using Ambari Web, stop HDFS service and stop ZooKeeper service
- namenode (active) host
- hdp-select set all 2.3.0.0-2557 (all hosts)
- cd <dfs.namenode.name.dir>/current
- hdfs oev -i edits_inprogress_* -o edits.out
- Verify edits.out file. It should only have OP_START_LOG_SEGMENT transaction
- mkdir -p /work/upgrade_hdp_2
- cd /work/upgrade_hdp_2
- curl -O https://raw.githubusercontent.com/apache/ambari/branch-2.1/ambari-server/src/main/python/upgradeHelper.py
- chmod 777 upgradeHelper.py
- curl -O https://raw.githubusercontent.com/apache/ambari/branch-2.1/ambari-server/src/main/resources/upgrade/catalog/UpgradeCatalog_2.2_to_2.3.json
- python upgradeHelper.py --hostname $HOSTNAME --user $USERNAME --password $PASSWORD --clustername $CLUSTERNAME --fromStack $FROMSTACK --toStack $TOSTACK --upgradeCatalog UpgradeCatalog_2.2_to_2.3.json update-configs [config-type]
- curl -O https://raw.githubusercontent.com/apache/ambari/branch-2.1/ambari-server/src/main/resources/upgrade/catalog/UpgradeCatalog_2.2_to_2.3_step2.json
- $HOSTNAME: Ambari Server hostname. This should be the FQDN of the host running the Ambari Server.
- $USERNAME: Ambari admin user.
- $PASSWORD: password for the user.
- $CLUSTERNAME: name of the cluster. This is the name you provided when you installed the cluster with Ambari. Log in to Ambari and the name can be found in the upper-left of the Ambari Web screen. It is case-sensitive.
- $FROMSTACK: the "from" stack. For example: 2.2
- $TOSTACK: the "to" stack. For example: 2.3
- config-type (optional): the config-type to upgrade. For example: hdfs-site. By default, all configurations are updated.
- upgrade zookeeper
- start it at ambari web
- mv /etc/zookeeper/conf /etc/zookeeper/conf.saved
- ln -s /usr/hdp/current/zookeeper-client/conf /etc/zookeeper/conf
- upgrade HDFS
- \cp -r /etc/hadoop/conf/* /etc/hadoop/2.3.0.0-2557/0/
- mv /etc/hadoop/conf /etc/hadoop/conf.saved
- ln -s /usr/hdp/current/hadoop-client/conf /etc/hadoop/conf
- su -l hdfs -c "/usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh start namenode -upgrade"
- ps -ef | grep -i NameNode
- su -l hdfs -c "/usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh start datanode"
- ps -ef | grep DataNode
- ambari web > services > HDFS > restart all
- run service check
- su -l hdfs -c "hdfs dfsadmin -safemode get"
- safe mode is off
- upgrade YARN and MR2
- note that you may have to upgrade other services as well to complete the upgrade, depending on your cluster configuration
- su -l hdfs -c "hdfs dfs -mkdir -p /hdp/apps/2.3.0.0-2557/mapreduce/"
- su -l hdfs -c "hdfs dfs -put /usr/hdp/current/hadoop-client/mapreduce.tar.gz /hdp/apps/2.3.0.0-2557/mapreduce/."
- su -l hdfs -c "hdfs dfs -put /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar /hdp/apps/2.3.0.0-2557/mapreduce/."
- su -l hdfs -c "hdfs dfs -chown -R hdfs:hadoop /hdp"
- su -l hdfs -c "hdfs dfs -chmod -R 555 /hdp/apps/2.3.0.0-2557/mapreduce"
- su -l hdfs -c "hdfs dfs -chmod -R 444 /hdp/apps/2.3.0.0-2557/mapreduce/mapreduce.tar.gz"
- su -l hdfs -c "hdfs dfs -chmod -R 444 /hdp/apps/2.3.0.0-2557/mapreduce/hadoop-streaming.jar"
- su -l yarn -c "yarn resourcemanager -format-state-store"
- start YARN and MR2 from ambari web
- run service check for YARN and MR2
- sudo su -l hdfs -c "hdfs dfsadmin -finalizeUpgrade"
- ambari-server set-current --cluster-name=dev --version-display-name=HDP-2.3.0.0
- option
- If your cluster includes Ranger
- cd /work/upgrade_hdp_2
- python upgradeHelper.py --hostname $HOSTNAME --user $USERNAME --password $PASSWORD --clustername $CLUSTERNAME --fromStack $FROMSTACK --toStack $TOSTACK --upgradeCatalog UpgradeCatalog_2.2_to_2.3_step2.json update-configs [config-type]
Upgrading ambari 2.0 to 2.1 (HDP)
- stop Ambari Metrics
- ambari-server stop
- all hosts
- ambari-agent stop
- cd /etc/yum.repos.d
- rm -f ambari* HDP*
- wget -nv http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.1.0/ambari.repo -O /etc/yum.repos.d/ambari.repo
- yum clean all; yum -y upgrade ambari-server
- Confirm there is only one ambari-server*.jar file in /usr/lib/ambari-server. If there is more than one JAR file with name ambari-server*.jar, move all JARs except ambari-server-2.1.0.*.jar to /tmp before proceeding with upgrade
- all hosts
- yum -y upgrade ambari-agent
- rpm -qa | grep ambari-agent
- ambari-server upgrade
- ambari-server start
- ambari-agent start (each host)
- http://<your.ambari.server>:8080
- restart services
- note that you may have to perform additional steps to complete the upgrade, depending on your cluster configuration
Upgrading ambari metrics (HDP)
- stop ambari metrics
- all hosts
- yum clean all
- yum -y upgrade ambari-metrics-monitor ambari-metrics-hadoop-sink
- yum -y upgrade ambari-metrics-collector
- start ambari metrics
- restart related services
HDP - verifying memory configuration
- https://github.com/hortonworks/hdp-configuration-utils
- unzip master
- cd hdp-configuration-utils-master/2.1
- ./hdp-configuration-utils.py -h
HDP - log archival process
#!/bin/bash
# Clean Hadoop logs older than N days in all Hadoop-related folders under /var/log.

# Command line params
# $1 = interval in days; files older than this are removed (default: 30)
DAY_ARCHIVE_THRESHOLD=${1-30}

LOG_BASE=/var/log

# Cleanup Hadoop components
COMPONENTS="accumulo ambari-agent ambari-server falcon hadoop hadoop-hdfs hadoop-mapreduce hadoop-yarn hbase hive hive-hcatalog hue knox nagios oozie storm webhcat zookeeper"
echo "Reviewing logs for $COMPONENTS"
for i in $COMPONENTS; do
  if [ -d $LOG_BASE/$i ]; then
    echo "Removing logs for $LOG_BASE/$i that are $DAY_ARCHIVE_THRESHOLD (or more) days old"
    find $LOG_BASE/$i -mtime +$DAY_ARCHIVE_THRESHOLD -exec rm -f {} \;
  else
    echo "Component $i logs not found on this server"
  fi
done

# Cleanup OS components
OS_COMPONENTS="messages maillog secure spooler"
for i in $OS_COMPONENTS; do
  # /var/log/messages etc. are files; the rotated copies are named $i-*
  if [ -f $LOG_BASE/$i ]; then
    echo "Removing logs for $LOG_BASE/$i that are $DAY_ARCHIVE_THRESHOLD (or more) days old"
    find $LOG_BASE/$i-* -mtime +$DAY_ARCHIVE_THRESHOLD -exec rm -f {} \;
  else
    echo "Component $i logs not found on this server"
  fi
done
HDP - HDFS performance test
- sudo su hdfs -l -c 'yarn jar /usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000'
- sudo su hdfs -l -c 'yarn jar /usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1000'
- sudo su hdfs -l -c 'yarn jar /usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean'
HDP - Hadoop HA
- namenode HA
- Check to make sure you have at least three hosts in your cluster and are running at least three ZooKeeper servers
- ambari web -> services -> HDFS -> service actions -> enable namenode HA -> nameservice ID -> select hosts -> review
- create checkpoint
- login to the namenode host
- sudo su hdfs -l -c 'hdfs dfsadmin -safemode enter'
- sudo su hdfs -l -c 'hdfs dfsadmin -saveNamespace'
- configure components
- initialize journalnodes
- login to the namenode host
- sudo su hdfs -l -c 'hdfs namenode -initializeSharedEdits'
- start components
- initialize metadata
- login to the namenode host
- sudo su hdfs -l -c 'hdfs zkfc -formatZK'
- login to the additional namenode host
- sudo su hdfs -l -c 'hdfs namenode -bootstrapStandby'
- finalize HA setup
- resourcemanager HA
- Check to make sure you have at least three hosts in your cluster and are running at least three ZooKeeper servers
- ambari web -> services -> YARN -> service actions -> enable resourcemanager HA -> get started -> select hosts -> review -> configure components
HDP 2.3 installation
- software requirements
- yum, rpm, scp, curl, unzip, tar, wget, OpenSSL (v1.01, build 16 or later), Python v2.6
- option
- chmod 755 /usr/bin/yum /bin/rpm /usr/bin/scp /usr/bin/curl /usr/bin/unzip /bin/tar /usr/bin/wget /usr/bin/ssh /usr/bin/python
- check java version
- Oracle JDK 1.8 64-bit (minimum JDK 1.8_40) (default)
- Oracle JDK 1.7 64-bit (minimum JDK 1.7_67)
- OpenJDK 8 64-bit
- OpenJDK 7 64-bit
- ulimit -Sn / ulimit -Hn
- 65535
- configure the FQDN in /etc/hosts and /etc/sysconfig/network, and with the hostname command
- set up password-less SSH
- ssh-keygen
- Copy the SSH Public Key (id_rsa.pub) to the root account on your target hosts
- option
- vim /etc/ssh/sshd_config
- PermitRootLogin yes
- service sshd restart
- all hosts
- cat id_rsa.pub >> ~/.ssh/authorized_keys
- option
- chmod 700 ~/.ssh
- chmod 600 ~/.ssh/authorized_keys
- yum -y install sudo (option)
- NTP
- yum -y install ntp
- chkconfig ntpd on
- service ntpd start
- SELinux
- vim /etc/selinux/config
- if it is enabled
- set SELINUX=disabled
- command: setenforce 0
- if PackageKit is installed
- vim /etc/yum/pluginconf.d/refresh-packagekit.conf
- enabled=0
- umask
- umask
- 0022
- if it is not 0022
- vim /etc/profile
- umask 022
- THP
- check
- cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
- cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
- command and /etc/rc.local
- echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
- echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
- permission (option)
- ambari
- /bin/hostname
- /usr/bin/sudo
- namenode
- /usr/bin/which
- /bin/ps
- /bin/df
- ambari metrics
- /usr/bin/gcc
- /usr/bin/ld
- /etc/centos-release
- yarn, MR2
- /tmp (check)
- /etc/fstab (default)
- /proc/mounts (current)
- mount -o remount,exec /tmp
- /usr/bin/curl
- /tmp (check)
- knox
- /bin/netstat
- all
- chmod 755 /bin/hostname /usr/bin/sudo /usr/bin/which /bin/ps /bin/df /usr/bin/gcc /usr/bin/ld /etc/centos-release /usr/bin/curl /bin/netstat
- Install ambari 2.1.0 using non-default databases (HDP)
- http://mungeol-heo.blogspot.kr/2015/11/install-mariadb-httpmungeol-heo.html
- http://<your.ambari.server>:8080 -> admin / admin -> launch install wizard -> cluster name -> HDP 2.3 -> install options -> confirm hosts -> choose services -> assign masters -> assign slaves and clients -> customize services -> review -> install, start and test -> summary
Install ambari 2.1.0 using non-default databases (HDP)
- install mariaDB
- http://mungeol-heo.blogspot.kr/2015/07/mariadb.html
- mysql-connector-java
- yum -y install mysql-connector-java
- use latest version if "option sql_select_limit=default" error occurs
- cd /usr/share
- mkdir java
- cd java
- wget http://cdn.mysql.com/Downloads/Connector-J/mysql-connector-java-5.1.36.zip
- unzip mysql-connector-java-5.1.36.zip
- cp mysql-connector-java-5.1.36/mysql-connector-java-5.1.36-bin.jar .
- ln -s mysql-connector-java-5.1.36-bin.jar mysql-connector-java.jar
- ls /usr/share/java/mysql-connector-java.jar
- chmod 644 /usr/share/java/mysql-connector-java.jar (option)
- wget -nv http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.1.0/ambari.repo -O /etc/yum.repos.d/ambari.repo
- yum -y install ambari-server
- service postgresql stop
- RDB setting
- mysql -u root -p
- CREATE USER 'ambari'@'%' IDENTIFIED BY 'bigdata';
- GRANT ALL PRIVILEGES ON *.* TO 'ambari'@'%';
- CREATE USER 'ambari'@'localhost' IDENTIFIED BY 'bigdata';
- GRANT ALL PRIVILEGES ON *.* TO 'ambari'@'localhost';
- CREATE USER 'ambari'@'bigdata-dev01.co.kr' IDENTIFIED BY 'bigdata';
- GRANT ALL PRIVILEGES ON *.* TO 'ambari'@'bigdata-dev01.co.kr';
- FLUSH PRIVILEGES;
- mysql -u ambari -p
- CREATE DATABASE ambari;
- USE ambari;
- SOURCE /var/lib/ambari-server/resources/Ambari-DDL-MySQL-CREATE.sql;
- ambari-server setup
- Select y at Enter advanced database configuration
- enter 3 (mysql)
- select default values for all left
- ambari-server start
HDP 2.2 installation
- yum -y install java-1.7.0-openjdk (option)
- ulimit -Sn / ulimit -Hn
- 65535
- configure the FQDN in /etc/hosts and /etc/sysconfig/network, and with the hostname command
- ssh-keygen (passwordless connection)
- yum -y install sudo (option)
- NTP
- yum -y install ntp
- chkconfig ntpd on
- service ntpd start
- THP
- check
- cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
- cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
- command and /etc/rc.local
- echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
- echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
- permission (option)
- ambari
- /bin/hostname
- /usr/bin/sudo
- namenode
- /usr/bin/which
- /bin/ps
- /bin/df
- ambari metrics
- /usr/bin/gcc
- /usr/bin/ld
- /etc/centos-release
- yarn, MR2
- /tmp (check)
- /etc/fstab (default)
- /proc/mounts (current)
- mount -o remount,exec /tmp
- /usr/bin/curl
- /tmp (check)
- knox
- /bin/netstat
- Install ambari 2.0.1 (HDP)
- http://mungeol-heo.blogspot.kr/2015/11/install-ambari-201-hdp.html
- install and run services from ambari web
Install ambari 2.0.1 (HDP)
- wget -nv http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.0.1/ambari.repo -O /etc/yum.repos.d/ambari.repo
- yum -y install ambari-server
- ambari-server setup
- ambari-server start
Friday, November 27, 2015
Machine learning tools
- confusion matrix
- cross-validation
- https://en.wikipedia.org/wiki/Cross-validation_(statistics)
- python function
- k-fold
- leave-one-out
- standard error of the mean
- one hot encoding
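- A small scikit-learn sketch (assuming a recent scikit-learn is installed) tying a few of these together: k-fold cross-validation, a confusion matrix per fold, and one-hot encoding of the class labels; swap KFold for LeaveOneOut to get leave-one-out cross-validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.preprocessing import OneHotEncoder

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation with a simple classifier
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    # rows = true class, columns = predicted class
    print(confusion_matrix(y[test_idx], pred))

# one-hot encoding of the integer class labels
encoder = OneHotEncoder()
print(encoder.fit_transform(y.reshape(-1, 1)).toarray()[:3])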