- Creating password file
- echo -n password > .password
- hdfs dfs -put .password /user/$USER/
- Installing the MySQL JDBC driver in CDH
- mkdir -p /var/lib/sqoop
- chown sqoop:sqoop /var/lib/sqoop
- chmod 755 /var/lib/sqoop
- download the JDBC driver from http://dev.mysql.com/downloads/connector/j/5.1.html
- sudo cp mysql-connector-java-<version>/mysql-connector-java-<version>-bin.jar /var/lib/sqoop/
Monday, November 30, 2015
Sqoop setting (CDH)
Sqoop - use case example
- command
hdfs dfs -rm -r -skipTrash /user/hive/warehouse/member_company; sqoop --options-file mysql2hdfs.option --query "SELECT * FROM test WHERE count > 10 and \$CONDITIONS" --target-dir /user/hive/warehouse/member_company
sqoop export --connect jdbc:mysql://10.0.2.a/test?characterEncoding=utf-8 --username root --password 'yourpassword' --table r_input --export-dir /result/r_input --input-fields-terminated-by '\001' --outdir /tmp/sqoop-mungeol/code/
- options
import --connect jdbc:mysql://10.0.1.b:3306/test --username test --password tkfkadlselqlt --split-by mem_idx -m 3 --outdir /tmp/sqoop-mungeol/code/
Spark test
- spark + mariaDB test
- SPARK_CLASSPATH=mysql-connector-java-5.1.34-bin.jar bin/pyspark
- df = sqlContext.load(source="jdbc", url="jdbc:mysql://10.0.2.a/test?user=root&password=yourpassword", dbtable="r_input_day")
- df.first()
- spark + elasticsearch test
- SPARK_CLASSPATH=/data/elasticsearch-hadoop-2.1.0.Beta4/dist/elasticsearch-hadoop-2.1.0.Beta4.jar ./bin/pyspark
- conf = {"es.resource":"sec-team/access-report"}
- rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat", "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
- rdd.first()
- spark streaming test
- network_wordcount.py
from __future__ import print_function

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 1)
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    counts = lines.flatMap(lambda line: line.split(" "))\
                  .map(lambda word: (word, 1))\
                  .reduceByKey(lambda a, b: a + b)
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
- nc -lk 9999
- spark-submit network_wordcount.py localhost 9999
Spark basis
- hardware provisioning
- storage systems
- run on the same nodes as HDFS
- yarn
- with a fixed amount memory and cores dedicated to Spark on each node
- run on different nodes in the same local-area network as HDFS
- run computing jobs on different nodes than the storage system
- local disks
- 4-8 disks per node
- without RAID (just as separate mount points)
- noatime option
- configure spark.local.dir to be a comma-separated list of the local disks (see the PySpark sketch at the end of this hardware provisioning list)
- same disks as HDFS, if running HDFS
- memory
- 8 GB - hundreds of GB
- 75% of the memory
- if memory > 200 GB, then run multiple worker JVMs per node
- standalone mode
- conf/spark-env.sh
- SPARK_WORKER_INSTANCES: set the number of workers per node
- SPARK_WORKER_CORES: the number of cores per worker
- network
- >= 10 gigabit
- CPU cores
- 8-16 cores per machine, or more
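- A minimal PySpark sketch of how settings like spark.local.dir and the per-executor memory could be wired up programmatically; the mount points and sizes below are placeholder assumptions, and in practice these values usually go into conf/spark-env.sh or spark-defaults.conf instead.
from pyspark import SparkConf, SparkContext

# Hypothetical local disk mount points; replace with the separate disks on your nodes.
conf = (SparkConf()
        .setAppName("hardware-provisioning-example")
        # comma-separated list of local disks used for shuffle and spill files
        .set("spark.local.dir", "/data1/spark,/data2/spark,/data3/spark")
        # example slice of the node's memory given to each executor
        .set("spark.executor.memory", "8g"))
sc = SparkContext(conf=conf)
print(conf.get("spark.local.dir"))
sc.stop()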
- reference
- storage systems
- third-party hadoop distributions
- CDH
- HDP (recommended)
- inheriting cluster configuration
- spark-env.sh
- HADOOP_CONF_DIR
- hdfs-site.xml
- core-site.xml
- reference
- external tools
- cluster-wide monitoring tool
- Ganglia
- OS profiling tools
- dstat
- iostat
- iotop
- JVM utilities
- jstack
- jmap
- jstat
- jconsole
- optimization
- problem: out of memory
- sysctl -w vm.max_map_count=65535
- spark.storage.memoryMapThreshold 131072
- problem: too many open files
- sysctl -w fs.file-max=1000000
- spark.shuffle.consolidateFiles true
- spark.shuffle.manager sort
- problem: connection reset by peer
- -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=12 -XX:NewRatio=3 -XX:SurvivorRatio=3
- problem: error communicating with MapOutputTracker
- spark.akka.askTimeout 120
- spark.akka.lookupTimeout 120
- configuration
- 75% of a machine's memory (standalone)
- minimum executor heap size: 8 GB
- maximum executor heap size: 40 GB / under 45 GB (watch GC)
- kryo serialization
- parallel (old) / CMS / G1 GC
- pypy > cpython
- notification
- in-memory usage is not the same as the on-disk data size (often 2-3x bigger)
- prefer reduceByKey over groupByKey (see the sketch below)
- there are limitations when using Python with Spark Streaming (at least for now)
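- A small PySpark sketch of the reduceByKey-over-groupByKey point above: both give the same word counts, but reduceByKey combines values on each partition before the shuffle, so far less data crosses the network.
from pyspark import SparkContext

sc = SparkContext(appName="reduce-vs-group")
pairs = sc.parallelize(["a", "b", "a", "c", "b", "a"]).map(lambda w: (w, 1))

# preferred: partial sums are computed locally before the shuffle
counts_reduce = pairs.reduceByKey(lambda a, b: a + b)

# works, but ships every (word, 1) pair across the network before summing
counts_group = pairs.groupByKey().mapValues(sum)

print(sorted(counts_reduce.collect()))
print(sorted(counts_group.collect()))
sc.stop()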
Solr - HDFS configuration (HDP)
- The following changes only need to be completed for the first Solr node that is started
- vim /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_drive_schema_configs/conf/solrconfig.xml
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://<host:port>/user/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">1</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
  <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.write.enabled">false</bool>
  <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
  <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
  <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
</directoryFactory>
- add <str name="solr.hdfs.confdir">/usr/hdp/current/hadoop-client/conf</str>, if namenode HA is configured
- set lockType to hdfs
Kafka test
- Cluster Setting
- set up a zookeeper cluster
- perform the following on each cluster node
- cd $KAFKA_HOME
- vim config/server.properties
- edit
- bin/kafka-server-start.sh config/server.properties & (or run it as a daemon: sudo bin/kafka-server-start.sh -daemon config/test.properties &)
- Basic Test
- bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic rep-test
- bin/kafka-topics.sh --list --zookeeper localhost:2181
- bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic rep-test
- bin/kafka-console-producer.sh --broker-list localhost:9092 --topic rep-test
test 1
test 2
test 3
- bin/kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic rep-test
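- The same produce/consume round trip can also be scripted from Python; a minimal sketch assuming the third-party kafka-python package is installed and a broker is reachable at localhost:9092.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")
for msg in [b"test 1", b"test 2", b"test 3"]:
    producer.send("rep-test", msg)  # the topic created above
producer.flush()

consumer = KafkaConsumer("rep-test",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",  # read from the beginning
                         consumer_timeout_ms=5000)      # stop iterating after 5s of silence
for record in consumer:
    print(record.value)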
- Fault Tolerance Test
- kill the leader
- do the basic test again
- start the killed server again
- do the basic test again
Kafka - configuration example
broker.id=81
port=9092
host.name=10.0.2.a
num.io.threads=8
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/data/kafka-logs
log.retention.hours=168
log.cleaner.enable=false
zookeeper.connect=localhost:2181,10.0.2.a:2181,10.0.2.b:2181,10.0.2.c:2181
controlled.shutdown.enable=true
auto.leader.rebalance.enable=true

# Replication configurations
num.replica.fetchers=4
replica.fetch.max.bytes=1048576
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.socket.timeout.ms=30000
replica.socket.receive.buffer.bytes=65536
replica.lag.time.max.ms=10000
replica.lag.max.messages=4000
controller.socket.timeout.ms=30000
controller.message.queue.size=10

# Log configuration
num.partitions=8
message.max.bytes=1000000
auto.create.topics.enable=false
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760
log.flush.interval.ms=10000
log.flush.interval.messages=20000
log.flush.scheduler.interval.ms=2000
log.roll.hours=168
log.retention.check.interval.ms=300000
log.segment.bytes=1073741824

# ZK configuration
zookeeper.connection.timeout.ms=6000
zookeeper.sync.time.ms=2000

# Socket server configuration
#num.io.threads=8
num.network.threads=8
socket.request.max.bytes=104857600
socket.receive.buffer.bytes=1048576
socket.send.buffer.bytes=1048576
queued.max.requests=16
fetch.purgatory.purge.interval.requests=100
producer.purgatory.purge.interval.requests=100
Kafka basis
- important properties
- broker
- broker.id
- log.dirs
- host.name
- zookeeper.connect
- controlled.shutdown.enable=true
- auto.leader.rebalance.enable=true
- consumer
- producer
- metadata.broker.list
- request.required.acks
- producer.type
- compression.codec
- topic.metadata.refresh.interval.ms
- batch.num.messages
- topic.id
- topic-level setting
- bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic my-topic --partitions 1 --replication-factor 1 --config max.message.bytes=64000 --config flush.messages=1
- bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic my-topic --config max.message.bytes=128000
- bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic my-topic --deleteConfig max.message.bytes
- Operations
- bin/kafka-topics.sh --zookeeper zk_host:port/chroot --create --topic my_topic_name --partitions 20 --replication-factor 3 --config x=y
- bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --partitions 40
- bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --config x=y
- bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --deleteConfig x
- bin/kafka-topics.sh --zookeeper zk_host:port/chroot --delete --topic my_topic_name
- bin/kafka-preferred-replica-election.sh --zookeeper zk_host:port/chroot
- bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zkconnect localhost:2181 --group test
- production server configuration (from the Kafka documentation)
# Replication configurations
num.replica.fetchers=4
replica.fetch.max.bytes=1048576
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.socket.timeout.ms=30000
replica.socket.receive.buffer.bytes=65536
replica.lag.time.max.ms=10000
replica.lag.max.messages=4000
controller.socket.timeout.ms=30000
controller.message.queue.size=10

# Log configuration
num.partitions=8
message.max.bytes=1000000
auto.create.topics.enable=true
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760
log.retention.hours=168
log.flush.interval.ms=10000
log.flush.interval.messages=20000
log.flush.scheduler.interval.ms=2000
log.roll.hours=168
log.cleanup.interval.mins=30
log.segment.bytes=1073741824

# ZK configuration
zookeeper.connection.timeout.ms=6000
zookeeper.sync.time.ms=2000

# Socket server configuration
num.io.threads=8
num.network.threads=8
socket.request.max.bytes=104857600
socket.receive.buffer.bytes=1048576
socket.send.buffer.bytes=1048576
queued.max.requests=16
fetch.purgatory.purge.interval.requests=100
producer.purgatory.purge.interval.requests=100
- Hardware
- disk throughput
- 8x7200 rpm SATA drives
- higher RPM SAS drives
- OS setting
- file descriptors
- max socket buffer size
Hive use case
- Software
- Hive 0.13.0 setup
- HDP 2.1 General Availability
- Hadoop 2.4.0
- Tez 0.4.0
- Hive 0.13.0
HDP was deployed using Ambari 1.5.1. For the most part, the cluster used the Ambari defaults (except where noted below). Hive 0.13.0 runs were done using Java 7 (default JVM). Tez and MapReduce were tuned to process all queries using 4 GB containers at a target container-to-disk ratio of 2.0. The ratio is important because it minimizes disk thrash and maximizes throughput.
Other settings:
- yarn.nodemanager.resource.memory-mb was set to 49152
- Default virtual memory for a job’s map-task and reduce-task were set to 4096
- hive.tez.container.size was set to 4096
- hive.tez.java.opts was set to -Xmx3800m
- Tez app masters were given 8 GB
- mapreduce.map.java.opts and mapreduce.reduce.java.opts were set to -Xmx3800m. This is smaller than 4096 to allow for some garbage collection overhead
- hive.auto.convert.join.noconditionaltask.size was set to 1252698795
Note: this is 1/3 of the Xmx value, about 1.2 GB (see the sketch after the list of optimizations below).
The following additional optimizations were used for Hive 0.13.0:
- Vectorized Query enabled
- ORCFile formatted data
- Map-join auto conversion enabled
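- A short sketch of how these memory numbers relate to each other; the values are copied from the settings above and the comparison is only illustrative.
# Rough relationship between the Hive/Tez memory settings above.
container_mb = 4096          # hive.tez.container.size
xmx_mb = 3800                # hive.tez.java.opts (-Xmx3800m), leaving ~300 MB of the container for GC overhead
configured = 1252698795      # hive.auto.convert.join.noconditionaltask.size, in bytes

print(xmx_mb * 1024 * 1024 // 3)   # ~1.33e9 bytes, i.e. the configured value is roughly one third of the heap
print(configured / (1024 ** 3))    # ~1.17, i.e. "about 1.2 GB"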
- Hardware
- 20 physical nodes, each with:
- 2x Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz for total of 16 CPU cores/machine
- Hyper-threading enabled
- 256GB RAM per node
- 6x 4TB WDC WD4000FYYZ-0 drives per node
- 10 Gigabit interconnect between the nodes
Notes: based on the YARN NodeManager memory resource setting used below, only 48 GB of RAM per node was dedicated to query processing; the remaining 200 GB of RAM were available for system caches and HDFS.
Linux configurations:
- /proc/sys/net/core/somaxconn = 512
- /proc/sys/vm/dirty_writeback_centisecs = 6000
- /proc/sys/vm/swappiness = 0
- /proc/sys/vm/zone_reclaim_mode = 0
- /sys/kernel/mm/redhat_transparent_hugepage/defrag = never
- /sys/kernel/mm/redhat_transparent_hugepage/khugepaged/defrag = no
- /sys/kernel/mm/transparent_hugepage/khugepaged/defrag = 0
Hive - row_sequence()
- add jar /opt/cloudera/parcels/CDH/jars/hive-contrib-0.13.1-cdh5.3.0.jar;
- CREATE TEMPORARY FUNCTION row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
Hive installation (HDP)
- mysql-connector-java (skip this step if you have installed it at 'HDP 2.3 installation')
- 2015.07.28 -> HDP 2.3 installation -> mysql-connector-java
- ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar
- RDB configuration
- mysql -u root -p
- CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hroqkf';
- GRANT ALL PRIVILEGES ON *.* TO 'hive'@'localhost';
- CREATE USER 'hive'@'%' IDENTIFIED BY 'hroqkf';
- GRANT ALL PRIVILEGES ON *.* TO 'hive'@'%';
- CREATE USER 'hive'@'bigdata-dev03.co.kr' IDENTIFIED BY 'hroqkf';
- be sure the hostname is the host where you installed hive metastore.
- GRANT ALL PRIVILEGES ON *.* TO 'hive'@'bigdata-dev03.co.kr';
- FLUSH PRIVILEGES;
- CREATE DATABASE hive;
- ambari web -> add service -> choose hive and tez -> assign masters -> assign slaves and clients
- customize services
- hive -> advanced -> hive metastore -> hive database -> existing mysql database -> database host, database name, username, password -> test connection
- configure identities -> review -> install, start and test -> summary -> complete
Hive HA (HDP)
- prerequisite
- The relational database that backs the Hive Metastore itself should also be made highly available using best practices defined for the database system in use
- metastore
- ambari web -> services -> hive -> service actions -> add hive metastore -> choose host -> restart related services
- install RDB client on the host where you installed hive metastore
- install MariaDB
- http://mungeol-heo.blogspot.kr/2015/07/mariadb.html
- RDB setting
- mysql -u root -p
- CREATE USER 'hive'@'bigdata-dev02.co.kr' IDENTIFIED BY 'hroqkf';
- be sure the hostname is the host where you installed hive metastore.
- GRANT ALL PRIVILEGES ON *.* TO 'hive'@'bigdata-dev02.co.kr';
- FLUSH PRIVILEGES;
- hiveserver2
- ambari web -> services -> hive -> service actions -> add hiveserver2 -> choose host -> restart related services
- webhcat
- ambari web -> hosts -> click hostname -> add -> webhcat server -> restart related services
Flume installation
- wget http://apache.mirror.cdnetworks.com/flume/1.5.0.1/apache-flume-1.5.0.1-bin.tar.gz
- tar xvf apache-flume-1.5.0.1-bin.tar.gz
- cd apache-flume-1.5.0.1-bin
- cp conf/flume-env.sh.template conf/flume-env.sh
- vim conf/flume-env.sh
- JAVA_OPTS="-Xms1g -Xmx1g"
- cp conf/flume-conf.properties.template conf/flume-conf.properties
- vim conf/flume-conf.properties
- bin/flume-ng agent -n agent01 -c conf -f conf/flume-conf.properties
Flume - use case example
- wiselog access log
- log server
## agent(s): agent01
## names of source(s), channel(s) and sink(s)
agent01.sources = source01
agent01.channels = channel01
agent01.sinks = sink01

## channel01
agent01.channels.channel01.type = memory
agent01.channels.channel01.capacity = 100000

## source01
agent01.sources.source01.channels = channel01
agent01.sources.source01.type = spooldir
agent01.sources.source01.spoolDir = /data/log-collector/spool-dir
agent01.sources.source01.deletePolicy = immediate
agent01.sources.source01.basenameHeader = true
agent01.sources.source01.basenameHeaderKey = type
agent01.sources.source01.deserializer.maxLineLength = 10240
agent01.sources.source01.interceptors = interceptor02
agent01.sources.source01.interceptors.interceptor02.type = static
agent01.sources.source01.interceptors.interceptor02.key = timestamp
agent01.sources.source01.interceptors.interceptor02.value = 0

## sink01
agent01.sinks.sink01.channel = channel01
agent01.sinks.sink01.type = avro
agent01.sinks.sink01.hostname = <hostname>
agent01.sinks.sink01.port = 4545

## test
#agent01.sinks.sink01.channel = channel01
#agent01.sinks.sink01.type = logger
- HDFS server
## agent(s): agent01
## names of source(s), channel(s) and sink(s)
agent01.sources = source01
agent01.channels = channel01
agent01.sinks = sink01

## channel01
agent01.channels.channel01.type = memory
agent01.channels.channel01.capacity = 100000

## source01
agent01.sources.source01.channels = channel01
agent01.sources.source01.type = avro
agent01.sources.source01.bind = <hostname>
agent01.sources.source01.port = 4545
agent01.sources.source01.interceptors = interceptor01 interceptor02
agent01.sources.source01.interceptors.interceptor01.type = host
agent01.sources.source01.interceptors.interceptor02.type = regex_extractor
agent01.sources.source01.interceptors.interceptor02.regex = ^\\d+\\.\\d+.\\d+.\\d+\\s\\[(\\d{2}\\/[a-zA-Z]{3}\\/\\d{4}:\\d{2}:\\d{2}:\\d{2})\\s\\+0900\\]\\s
agent01.sources.source01.interceptors.interceptor02.serializers = s01
agent01.sources.source01.interceptors.interceptor02.serializers.s01.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
agent01.sources.source01.interceptors.interceptor02.serializers.s01.pattern = dd/MMM/yyyy:HH:mm:ss
agent01.sources.source01.interceptors.interceptor02.serializers.s01.name = timestamp

## sink01
agent01.sinks.sink01.channel = channel01
agent01.sinks.sink01.type = hdfs
agent01.sinks.sink01.hdfs.path = /log/wiselog/access/%{type}/yyyymmdd=%Y%m%d/hh=%H
agent01.sinks.sink01.hdfs.filePrefix = %{host}
agent01.sinks.sink01.hdfs.inUsePrefix = .
agent01.sinks.sink01.hdfs.rollInterval = 300
agent01.sinks.sink01.hdfs.rollSize = 0
agent01.sinks.sink01.hdfs.rollCount = 0
agent01.sinks.sink01.hdfs.idleTimeout = 60
agent01.sinks.sink01.hdfs.writeFormat = Text
agent01.sinks.sink01.hdfs.codeC = gzip

## test
#agent01.sinks.sink01.channel = channel01
#agent01.sinks.sink01.type = logger
- apache access log
- log server
## agent(s): agent01
## names of source(s), channel(s) and sink(s)
agent01.sources = source01 source02
agent01.channels = channel01
agent01.sinks = sink01

## channel01
agent01.channels.channel01.type = memory
agent01.channels.channel01.capacity = 100000

## source01
agent01.sources.source01.channels = channel01
agent01.sources.source01.type = spooldir
agent01.sources.source01.spoolDir = /home/log-collector/test
agent01.sources.source01.deletePolicy = immediate
agent01.sources.source01.deserializer.maxLineLength = 204800
agent01.sources.source01.interceptors = interceptor01 interceptor02
agent01.sources.source01.interceptors.interceptor01.type = static
agent01.sources.source01.interceptors.interceptor01.key = type
agent01.sources.source01.interceptors.interceptor01.value = test
agent01.sources.source01.interceptors.interceptor02.type = static
agent01.sources.source01.interceptors.interceptor02.key = timestamp
agent01.sources.source01.interceptors.interceptor02.value = 0

## source02
agent01.sources.source02.channels = channel01
agent01.sources.source02.type = spooldir
agent01.sources.source02.spoolDir = /home/log-collector/test2
agent01.sources.source02.deletePolicy = immediate
agent01.sources.source02.deserializer.maxLineLength = 204800
agent01.sources.source02.interceptors = interceptor01 interceptor02
agent01.sources.source02.interceptors.interceptor01.type = static
agent01.sources.source02.interceptors.interceptor01.key = type
agent01.sources.source02.interceptors.interceptor01.value = test2
agent01.sources.source02.interceptors.interceptor02.type = static
agent01.sources.source02.interceptors.interceptor02.key = timestamp
agent01.sources.source02.interceptors.interceptor02.value = 0

## sink01
agent01.sinks.sink01.channel = channel01
agent01.sinks.sink01.type = avro
agent01.sinks.sink01.hostname = 10.0.2.a
agent01.sinks.sink01.port = 4545

## test
#agent01.sinks.sink01.channel = channel01
#agent01.sinks.sink01.type = logger
- HDFS server
## agent(s): agent01
## names of source(s), channel(s) and sink(s)
agent01.sources = source01
agent01.channels = channel01
agent01.sinks = sink01

## channel01
agent01.channels.channel01.type = memory
agent01.channels.channel01.capacity = 100000

## source01
agent01.sources.source01.channels = channel01
agent01.sources.source01.type = avro
agent01.sources.source01.bind = 10.0.2.a
agent01.sources.source01.port = 4545
agent01.sources.source01.interceptors = interceptor01 interceptor02
agent01.sources.source01.interceptors.interceptor01.type = host
agent01.sources.source01.interceptors.interceptor02.type = regex_extractor
agent01.sources.source01.interceptors.interceptor02.regex = ^[\\s\\S]+\\[(\\d{1,2}\\/[A-Z][a-z]{2}\\/\\d{4}:\\d{2}:\\d{2}:\\d{2})\\s\\+0900\\][\\s\\S]+
agent01.sources.source01.interceptors.interceptor02.serializers = s01
agent01.sources.source01.interceptors.interceptor02.serializers.s01.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
agent01.sources.source01.interceptors.interceptor02.serializers.s01.pattern = dd/MMM/yyyy:HH:mm:ss
agent01.sources.source01.interceptors.interceptor02.serializers.s01.name = timestamp

## sink01
agent01.sinks.sink01.channel = channel01
agent01.sinks.sink01.type = hdfs
agent01.sinks.sink01.hdfs.path = /log/apache/access/%{type}/yyyymmdd=%Y%m%d/hh=%H
agent01.sinks.sink01.hdfs.filePrefix = %{host}
agent01.sinks.sink01.hdfs.inUsePrefix = .
agent01.sinks.sink01.hdfs.rollInterval = 300
agent01.sinks.sink01.hdfs.rollSize = 0
agent01.sinks.sink01.hdfs.rollCount = 0
agent01.sinks.sink01.hdfs.idleTimeout = 120
agent01.sinks.sink01.hdfs.writeFormat = Text
agent01.sinks.sink01.hdfs.codeC = gzip

## test
#agent01.sinks.sink01.channel = channel01
#agent01.sinks.sink01.type = logger
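- To see what the regex_extractor interceptor above pulls out of an access-log line, here is a small Python sketch; the sample line is made up for illustration, and the pattern is the same one as in the config with the property-file backslash escaping removed.
import re

pattern = r'^[\s\S]+\[(\d{1,2}/[A-Z][a-z]{2}/\d{4}:\d{2}:\d{2}:\d{2})\s\+0900\][\s\S]+'

# Hypothetical Apache access-log line.
line = '10.0.2.1 - - [30/Nov/2015:10:15:30 +0900] "GET /index.html HTTP/1.1" 200 2326'

match = re.match(pattern, line)
if match:
    # Flume's RegexExtractorInterceptorMillisSerializer parses this group with dd/MMM/yyyy:HH:mm:ss
    # and stores the result in the event's timestamp header.
    print(match.group(1))  # 30/Nov/2015:10:15:30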
Ambari - enable kerberos (HDP)
- prerequisite
- Kerberos installation
- http://mungeol-heo.blogspot.kr/2015/08/kerberos.html
- Java Cryptography Extension (no need if it is installed by ambari)
- ambari web -> admin -> kerberos -> enable kerberos -> existing MIT KDC -> provide information about the KDC and admin account -> configure kerberos -> install and test kerberos client -> confirm configuration -> stop services -> kerberize cluster -> start and test services
Ambari - configure email notification using sendmail (HDP)
- start sendmail on the host where ambari server runs
- service sendmail start
- ambari web -> alerts -> actions -> manage notifications -> "+"
- name
- test
- groups
- all
- use default value for others
- save
- test
- select one service/component and stop/start it
Manual Upgrade HDP 2.2 to 2.3
- option - local repository setting for HDP 2.2 update (temporary internet access)
- cd /etc/yum.repos.d
- rm -f ambari* HDP*
- wget http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.1.0/ambari.repo
- wget http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.3.0.0/hdp.repo
- yum -y install yum-utils createrepo
- install and run web server
- mkdir -p /var/www/html/
- cd /var/www/html/
- mkdir -p ambari/centos6
- cd ambari/centos6
- reposync -r Updates-ambari-2.1.0
- createrepo Updates-ambari-2.1.0
- cd /var/www/html/
- mkdir -p hdp/centos6
- cd hdp/centos6
- reposync -r HDP-2.3.0.0
- reposync -r HDP-UTILS-1.1.0.20
- createrepo HDP-2.3.0.0
- createrepo HDP-UTILS-1.1.0.20
- edit DocumentRoot and Directory at /usr/local/apache2/conf/httpd.conf or
- cd /usr/local/apache2/htdocs
- ln -s /var/www/html/ambari ambari
- ln -s /var/www/html/hdp hdp
- http://hostname/ambari/centos6/Updates-ambari-2.1.0
- http://hostname/hdp/centos6/HDP-2.3.0.0
- http://hostname/hdp/centos6/HDP-UTILS-1.1.0.20
- option (If you have multiple repositories configured in your environment)
- yum -y install yum-plugin-priorities
- vim /etc/yum/pluginconf.d/priorities.conf
- [main]
- enabled=1
- gpgcheck=0
- stop web server at the end
- /usr/local/apache2/bin/apachectl -k stop
- Upgrading ambari 2.0 to 2.1 (HDP)
- http://mungeol-heo.blogspot.kr/2015/11/upgrading-ambari-20-to-21-hdp.html
- Upgrading ambari metrics (HDP)
- http://mungeol-heo.blogspot.kr/2015/11/upgrading-ambari-metrics-hdp.html
- ambari web -> admin > stack and versions > manage versions > + register version
- enter 0.0
- enter 'http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.3.0.0' for HDP base URL
- enter 'http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/centos6' for HDP-UTILS base URL
- save
- go to dashboard > admin > stack and versions > versions > install packages > ok
- record the component layout
- browse to each service except HDFS and ZooKeeper and perform stop
- namenode (active) host
- log in to ambari
- cd <dfs.namenode.name.dir>
- Make sure that only a "/current" directory and no "/previous" directory exists
- cp -r current /home/hdfs/
- su hdfs
- cd
- hdfs fsck / -files -blocks -locations > dfs-old-fsck-1.log
- hdfs dfsadmin -report > dfs-old-report-1.log
- hdfs dfsadmin -safemode enter
- hdfs dfsadmin -saveNamespace
- hdfs dfsadmin -finalizeUpgrade
- Using Ambari Web, stop HDFS service and stop ZooKeeper service
- namenode (active) host
- hdp-select set all 2.3.0.0-2557 (all hosts)
- cd <dfs.namenode.name.dir>/current
- hdfs oev -i edits_inprogress_* -o edits.out
- Verify edits.out file. It should only have OP_START_LOG_SEGMENT transaction
- mkdir -p /work/upgrade_hdp_2
- cd /work/upgrade_hdp_2
- curl -O https://raw.githubusercontent.com/apache/ambari/branch-2.1/ambari-server/src/main/python/upgradeHelper.py
- chmod 777 upgradeHelper.py
- curl -O https://raw.githubusercontent.com/apache/ambari/branch-2.1/ambari-server/src/main/resources/upgrade/catalog/UpgradeCatalog_2.2_to_2.3.json
- python upgradeHelper.py --hostname $HOSTNAME --user $USERNAME --password $PASSWORD --clustername $CLUSTERNAME --fromStack $FROMSTACK --toStack $TOSTACK --upgradeCatalog UpgradeCatalog_2.2_to_2.3.json update-configs [config-type]
- curl -O https://raw.githubusercontent.com/apache/ambari/branch-2.1/ambari-server/src/main/resources/upgrade/catalog/UpgradeCatalog_2.2_to_2.3_step2.json
- $HOSTNAME: Ambari Server hostname. This should be the FQDN of the host running the Ambari Server.
- $USERNAME: Ambari admin user.
- $PASSWORD: password for the user.
- $CLUSTERNAME: name of the cluster. This is the name you provided when you installed the cluster with Ambari. Log in to Ambari and the name can be found in the upper-left of the Ambari Web screen. It is case-sensitive.
- $FROMSTACK: the "from" stack. For example: 2.2
- $TOSTACK: the "to" stack. For example: 2.3
- config-type (optional): the config-type to upgrade. For example: hdfs-site. By default, all configurations are updated.
- upgrade zookeeper
- start it at ambari web
- mv /etc/zookeeper/conf /etc/zookeeper/conf.saved
- ln -s /usr/hdp/current/zookeeper-client/conf /etc/zookeeper/conf
- upgrade HDFS
- \cp -r /etc/hadoop/conf/* /etc/hadoop/2.3.0.0-2557/0/
- mv /etc/hadoop/conf /etc/hadoop/conf.saved
- ln -s /usr/hdp/current/hadoop-client/conf /etc/hadoop/conf
- su -l hdfs -c "/usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh start namenode -upgrade"
- ps -ef | grep -i NameNode
- su -l hdfs -c "/usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh start datanode"
- ps -ef | grep DataNode
- ambari web > services > HDFS > restart all
- run service check
- su -l hdfs -c "hdfs dfsadmin -safemode get"
- safe mode is off
- upgrade YARN and MR2
- note that you may have to upgrade other services as well to complete the upgrade, depending on your cluster configuration
- su -l hdfs -c "hdfs dfs -mkdir -p /hdp/apps/2.3.0.0-2557/mapreduce/"
- su -l hdfs -c "hdfs dfs -put /usr/hdp/current/hadoop-client/mapreduce.tar.gz /hdp/apps/2.3.0.0-2557/mapreduce/."
- su -l hdfs -c "hdfs dfs -put /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar /hdp/apps/2.3.0.0-2557/mapreduce/."
- su -l hdfs -c "hdfs dfs -chown -R hdfs:hadoop /hdp"
- su -l hdfs -c "hdfs dfs -chmod -R 555 /hdp/apps/2.3.0.0-2557/mapreduce"
- su -l hdfs -c "hdfs dfs -chmod -R 444 /hdp/apps/2.3.0.0-2557/mapreduce/mapreduce.tar.gz"
- su -l hdfs -c "hdfs dfs -chmod -R 444 /hdp/apps/2.3.0.0-2557/mapreduce/hadoop-streaming.jar"
- su -l yarn -c "yarn resourcemanager -format-state-store"
- start YARN and MR2 from ambari web
- run service check for YARN and MR2
- sudo su -l hdfs -c "hdfs dfsadmin -finalizeUpgrade"
- ambari-server set-current --cluster-name=dev --version-display-name=HDP-2.3.0.0
- option
- If your cluster includes Ranger
- cd /work/upgrade_hdp_2
- python upgradeHelper.py --hostname $HOSTNAME --user $USERNAME --password $PASSWORD --clustername $CLUSTERNAME --fromStack $FROMSTACK --toStack $TOSTACK --upgradeCatalog UpgradeCatalog_2.2_to_2.3_step2.json update-configs [config-type]
Upgrading ambari 2.0 to 2.1 (HDP)
- stop Ambari Metrics
- ambari-server stop
- all hosts
- ambari-agent stop
- cd /etc/yum.repos.d
- rm -f ambari* HDP*
- wget -nv http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.1.0/ambari.repo -O /etc/yum.repos.d/ambari.repo
- yum clean all; yum -y upgrade ambari-server
- Confirm there is only one ambari-server*.jar file in /usr/lib/ambari-server. If there is more than one JAR file with name ambari-server*.jar, move all JARs except ambari-server-2.1.0.*.jar to /tmp before proceeding with upgrade
- all hosts
- yum -y upgrade ambari-agent
- rpm -qa | grep ambari-agent
- ambari-server upgrade
- ambari-server start
- ambari-agent start (each host)
- http://<your.ambari.server>:8080
- restart services
- note that you may have to perform additional steps to complete the upgrade, depending on your cluster configuration
Upgrading ambari metrics (HDP)
- stop ambari metrics
- all hosts
- yum clean all
- yum -y upgrade ambari-metrics-monitor ambari-metrics-hadoop-sink
- yum -y upgrade ambari-metrics-collector
- start ambari metrics
- restart related services
HDP - verifying memory configuration
- https://github.com/hortonworks/hdp-configuration-utils
- unzip master
- cd hdp-configuration-utils-master/2.1
- ./hdp-configuration-utils.py -h
HDP - log archival process
#!/bin/bash
# Clean Hadoop logs older than N days in all Hadoop-related folders under /var/log.

# Command line params
# $1 = interval in days; files older than this are removed (default: 30)
DAY_ARCHIVE_THRESHOLD=${1-30}

LOG_BASE=/var/log

# Cleanup Hadoop components
COMPONENTS="accumulo ambari-agent ambari-server falcon hadoop hadoop-hdfs hadoop-mapreduce hadoop-yarn hbase hive hive-hcatalog hue knox nagios oozie storm webhcat zookeeper"
echo "Reviewing logs for $COMPONENTS"
for i in $COMPONENTS; do
  if [ -d $LOG_BASE/$i ]; then
    echo "Removing logs for $LOG_BASE/$i that are $DAY_ARCHIVE_THRESHOLD (or more) days old"
    find $LOG_BASE/$i -mtime +$DAY_ARCHIVE_THRESHOLD -exec rm -f {} \;
  else
    echo "Component $i logs not found on this server"
  fi
done

# Cleanup OS components
OS_COMPONENTS="messages maillog secure spooler"
for i in $OS_COMPONENTS; do
  # /var/log/messages etc. are files; the rotated copies are named $i-*
  if [ -f $LOG_BASE/$i ]; then
    echo "Removing logs for $LOG_BASE/$i that are $DAY_ARCHIVE_THRESHOLD (or more) days old"
    find $LOG_BASE/$i-* -mtime +$DAY_ARCHIVE_THRESHOLD -exec rm -f {} \;
  else
    echo "Component $i logs not found on this server"
  fi
done
HDP - HDFS performance test
- sudo su hdfs -l -c 'yarn jar /usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000'
- sudo su hdfs -l -c 'yarn jar /usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1000'
- sudo su hdfs -l -c 'yarn jar /usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean'
HDP - Hadoop HA
- namenode HA
- Check to make sure you have at least three hosts in your cluster and are running at least three ZooKeeper servers
- ambari web -> services -> HDFS -> service actions -> enable namenode HA -> nameservice ID -> select hosts -> review
- create checkpoint
- login to the namenode host
- sudo su hdfs -l -c 'hdfs dfsadmin -safemode enter'
- sudo su hdfs -l -c 'hdfs dfsadmin -saveNamespace'
- configure components
- initialize journalnodes
- login to the namenode host
- sudo su hdfs -l -c 'hdfs namenode -initializeSharedEdits'
- start components
- initialize metadata
- login to the namenode host
- sudo su hdfs -l -c 'hdfs zkfc -formatZK'
- login to the additional namenode host
- sudo su hdfs -l -c 'hdfs namenode -bootstrapStandby'
- finalize HA setup
- resourcemanager HA
- Check to make sure you have at least three hosts in your cluster and are running at least three ZooKeeper servers
- ambari web -> services -> YARN -> service actions -> enable resourcemanager HA -> get started -> select hosts -> review -> configure components
HDP 2.3 installation
- software requirements
- yum, rpm, scp, curl, unzip, tar, wget, OpenSSL (v1.01, build 16 or later), Python v2.6
- option
- chmod 755 /usr/bin/yum /bin/rpm /usr/bin/scp /usr/bin/curl /usr/bin/unzip /bin/tar /usr/bin/wget /usr/bin/ssh /usr/bin/python
- check java version
- Oracle JDK 1.8 64-bit (minimum JDK 1.8_40) (default)
- Oracle JDK 1.7 64-bit (minimum JDK 1.7_67)
- OpenJDK 8 64-bit
- OpenJDK 7 64-bit
- ulimit -Sn / ulimit -Hn
- 65535
- configure the FQDN in /etc/hosts and /etc/sysconfig/network, and with the hostname command
- set up password-less SSH
- ssh-keygen
- Copy the SSH Public Key (id_rsa.pub) to the root account on your target hosts
- option
- vim /etc/ssh/sshd_config
- PermitRootLogin yes
- service sshd restart
- all hosts
- cat id_rsa.pub >> ~/.ssh/authorized_keys
- option
- chmod 700 ~/.ssh
- chmod 600 ~/.ssh/authorized_keys
- yum -y install sudo (option)
- NTP
- yum -y install ntp
- chkconfig ntpd on
- service ntpd start
- SELinux
- vim /etc/selinux/config
- if it is enabled
- set SELINUX=disabled
- command: setenforce 0
- if PackageKit is installed
- vim /etc/yum/pluginconf.d/refresh-packagekit.conf
- enabled=0
- umask
- umask
- 0022
- if it is not 0022
- vim /etc/profile
- umask 022
- THP
- check
- cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
- cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
- command and /etc/rc.local
- echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
- echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
- permission (option)
- ambari
- /bin/hostname
- /usr/bin/sudo
- namenode
- /usr/bin/which
- /bin/ps
- /bin/df
- ambari metrics
- /usr/bin/gcc
- /usr/bin/ld
- /etc/centos-release
- yarn, MR2
- /tmp (check)
- /etc/fstab (default)
- /proc/mounts (current)
- mount -o remount,exec /tmp
- /usr/bin/curl
- /tmp (check)
- knox
- /bin/netstat
- all
- chmod 755 /bin/hostname /usr/bin/sudo /usr/bin/which /bin/ps /bin/df /usr/bin/gcc /usr/bin/ld /etc/centos-release /usr/bin/curl /bin/netstat
- Install ambari 2.1.0 using non-default databases (HDP)
- http://mungeol-heo.blogspot.kr/2015/11/install-mariadb-httpmungeol-heo.html
- http://<your.ambari.server>:8080 -> admin / admin -> launch install wizard -> cluster name -> HDP 2.3 -> install options -> confirm hosts -> choose services -> assign masters -> assign slaves and clients -> customize services -> review -> install, start and test -> summary
Install ambari 2.1.0 using non-default databases (HDP)
- install mariaDB
- http://mungeol-heo.blogspot.kr/2015/07/mariadb.html
- mysql-connector-java
- yum -y install mysql-connector-java
- use latest version if "option sql_select_limit=default" error occurs
- cd /usr/share
- mkdir java
- cd java
- wget http://cdn.mysql.com/Downloads/Connector-J/mysql-connector-java-5.1.36.zip
- unzip mysql-connector-java-5.1.36.zip
- cp mysql-connector-java-5.1.36/mysql-connector-java-5.1.36-bin.jar .
- ln -s mysql-connector-java-5.1.36-bin.jar mysql-connector-java.jar
- ls /usr/share/java/mysql-connector-java.jar
- chmod 644 /usr/share/java/mysql-connector-java.jar (option)
- wget -nv http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.1.0/ambari.repo -O /etc/yum.repos.d/ambari.repo
- yum -y install ambari-server
- service postgresql stop
- RDB setting
- mysql -u root -p
- CREATE USER 'ambari'@'%' IDENTIFIED BY 'bigdata';
- GRANT ALL PRIVILEGES ON *.* TO 'ambari'@'%';
- CREATE USER 'ambari'@'localhost' IDENTIFIED BY 'bigdata';
- GRANT ALL PRIVILEGES ON *.* TO 'ambari'@'localhost';
- CREATE USER 'ambari'@'bigdata-dev01.co.kr' IDENTIFIED BY 'bigdata';
- GRANT ALL PRIVILEGES ON *.* TO 'ambari'@'bigdata-dev01.co.kr';
- FLUSH PRIVILEGES;
- mysql -u ambari -p
- CREATE DATABASE ambari;
- USE ambari;
- SOURCE /var/lib/ambari-server/resources/Ambari-DDL-MySQL-CREATE.sql;
- ambari-server setup
- Select y at Enter advanced database configuration
- enter 3 (mysql)
- select default values for all left
- ambari-server start
HDP 2.2 installation
- yum -y install java-1.7.0-openjdk (option)
- ulimit -Sn / ulimit -Hn
- 65535
- configure the FQDN in /etc/hosts and /etc/sysconfig/network, and with the hostname command
- ssh-keygen (passwordless connection)
- yum -y install sudo (option)
- NTP
- yum -y install ntp
- chkconfig ntpd on
- service ntpd start
- THP
- check
- cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
- cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
- command and /etc/rc.local
- echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
- echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
- permission (option)
- ambari
- /bin/hostname
- /usr/bin/sudo
- namenode
- /usr/bin/which
- /bin/ps
- /bin/df
- ambari metrics
- /usr/bin/gcc
- /usr/bin/ld
- /etc/centos-release
- yarn, MR2
- /tmp (check)
- /etc/fstab (default)
- /proc/mounts (current)
- mount -o remount,exec /tmp
- /usr/bin/curl
- /tmp (check)
- knox
- /bin/netstat
- Install ambari 2.0.1 (HDP)
- http://mungeol-heo.blogspot.kr/2015/11/install-ambari-201-hdp.html
- install and run services from ambari web
Install ambari 2.0.1 (HDP)
- wget -nv http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.0.1/ambari.repo -O /etc/yum.repos.d/ambari.repo
- yum -y install ambari-server
- ambari-server setup
- ambari-server start
Friday, November 27, 2015
Machine learning tools
- confusion matrix
- cross-validation
- https://en.wikipedia.org/wiki/Cross-validation_(statistics)
- python function
- k-fold
- leave-one-out
- standard error of the mean
- one hot encoding
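- A small scikit-learn sketch (assuming a recent scikit-learn is installed) tying a few of these together: k-fold cross-validation, a confusion matrix per fold, and one-hot encoding of the class labels; swap KFold for LeaveOneOut to get leave-one-out cross-validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.preprocessing import OneHotEncoder

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation with a simple classifier
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    # rows = true class, columns = predicted class
    print(confusion_matrix(y[test_idx], pred))

# one-hot encoding of the integer class labels
encoder = OneHotEncoder()
print(encoder.fit_transform(y.reshape(-1, 1)).toarray()[:3])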