Friday, July 31, 2015

Solution 4 Error (HDP)

  • Note: apply the solution(s) described below to the corresponding error
  1. 2015.07.31
    1. no response from the NameNode UI / port 50070 is bound to a private IP
      1. ambari web -> HDFS -> configs -> custom core-site -> add property
        1. key: dfs.namenode.http-bind-host
        2. value: 0.0.0.0
      2. save it and restart related services
      3. note that the 'dfs.namenode.rpc-bind-host', 'dfs.namenode.servicerpc-bind-host' and 'dfs.namenode.https-bind-host' properties address similar binding issues
  2. 2015.07.30
    1. root is not allowed to impersonate <username>
      1. ambari web -> HDFS -> configs -> custom core-site -> add property
        1. key: hadoop.proxyuser.root.groups
        2. value: *
        3. key: hadoop.proxyuser.root.hosts
        4. value: *
      2. save it and restart related services
      3. note that 'root' should be replaced with the name of the user who runs/submits the service/job (a quick check sketch follows this list)
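  • A minimal check sketch, run on a cluster node after the restarts, to confirm both changes took effect (port and keys are the ones used above):
    1. hdfs getconf -confKey dfs.namenode.http-bind-host      # expect 0.0.0.0
    2. netstat -tlnp | grep 50070                             # the NameNode UI should now listen on 0.0.0.0:50070
    3. hdfs getconf -confKey hadoop.proxyuser.root.hosts      # expect *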

Wednesday, July 29, 2015

Log Format

  1. 2015.07.08
    1. Well-known log formats
      1. W3C Extended Log File Format: http://www.w3.org/TR/WD-logfile.html
      2. Apache access log: http://httpd.apache.org/docs/current/logs.html
      3. Cisco (SDEE/CIDEE): http://www.cisco.com/c/en/us/td/docs/security/ips/specs/CIDEE_Specification.html
      4. ArcSight Common Event Format (CEF): CommonEventFormat.pdf
      5. syslog: RFC3195, RFC5424
      6. IDMEF (XML-based format): RFC4765


    2. Common field set (an illustrative line is given at the end of this entry)
      1. date/time (with time zone)
      2. log entry type
      3. system that generated the log
      4. application or component that generated it
      5. indication of whether the action succeeded or failed
      6. severity, priority, or importance of the log message
      7. for logs about user activity, the user name as well
    3. Logging categories
      1. the 5 W's of logging
        1. what happened
        2. when it happened
        3. where it happened
        4. who was involved
        5. where he/she/it came from
      2. additional questions
        1. where can I get more information
        2. how do I know the event really happened
        3. what was affected
        4. what will happen next
        5. what else happened that I should care about
        6. what should I do
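    4. purely illustrative example (not taken from any real log): an Apache-style access line showing several of the common fields above
      10.0.2.15 - alice [31/Jul/2015:10:22:01 +0900] "GET /index.html HTTP/1.1" 200 1043 "-" "Mozilla/5.0"
      (date/time with time zone, the action and its success/failure via the 200 status, the response size, and the user name alice)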

LogStash 4 Test

  1. 2015.01.19
    1. wiselog configuration
      1. input {
          file {
            type => "access"
            path => "/home/mungeol/access.log"
          }
        }
        filter {
                if [type] == "access" {
                        grok {
                                match => { "message" => "%{IPORHOST:clientip} \[%{HTTPDATE:timestamp}\] \"(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})\" \"(?<referrer>.+)\" %{NUMBER:response} (?:%{NUMBER:bytes}|-) %{QS:agent} %{QS:cookie}" }
                        }
                        grok {
                                match => { "cookie" => "PCID=(?<pcid>\d+);" }
                        }
                        grok {
                                match => { "cookie" => "UID=(?<uid>\d+);" }
                        }
                        grok {
                                match => { "cookie" => "n_ss=(?<n_ss>\d+.\d+);" }
                        }
                        grok {
                                match => { "cookie" => "n_cs=(?<n_cs>.+);" }
                        }
                }
                date {
                        match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
                }
        }
        output {
                elasticsearch {
                        cluster => "dev"
                        index => "wiselog_test"
                        protocol => "http"
                        workers => 4
                }
                #stdout { codec => rubydebug }
        }
    2. kafka output configuration
      1. input {
          file {
            type => "apache-access"
            path => "/home/mungeol/access-test"
          }
        #  file {
        #    type => "apache-error"
        #    path => "/home/weblog/test/data/test-error"
        #  }
        }
        #filter {
        #  if [type] == "apache-access" {
        #      grok {
        #        match => { "message" => "%{COMBINEDAPACHELOG}" }
        #      }
        #    date {
        #      match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
        #    }
        #  }
        #  if [type] == "apache-error" {
        #      grok {
        #        match => { "message" => "%{APACHEERRORLOG}" }
        #        patterns_dir => ["/var/lib/logstash/etc/grok"]
        #      }
        #  }
        #  if [clientip]  {
        #    geoip {
        #      source => "clientip"
        #      target => "geoip"
        #      remove_field => ["[geoip][ip]", "[geoip][country_code3]", "[geoip][country_name]", "[geoip][continent_code]", "[geoip][region_name]",
        #                       "[geoip][city_name]", "[geoip][latitude]", "[geoip][longitude]", "[geoip][timezone]", "[geoip][real_region_name]"]
        #    }
        #  }
        #}
        output {
                kafka {
                        broker_list => "10.0.2.81:9092,10.0.2.82:9092,10.0.2.83:9092"
                        topic_id => "access-test-02"
                        topic_metadata_refresh_interval_ms => 30000
                        request_required_acks => 0
        #               producer_type => "async"
                }
        }
    3. kafka input configuration
      1. input {
                kafka {
                        zk_connect => "localhost:2181,10.0.2.81:2181,10.0.2.82:2181,10.0.2.83:2181"
                        topic_id => "access-test-07"
                        type => "apache-access"
        queue_size => 200
        fetch_message_max_bytes => 2097152
                }
        }
        filter {
          if [type] == "apache-access" {
              grok {
                match => { "message" => "%{COMBINEDAPACHELOG}" }
              }
            date {
              match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
            }
          }
        #  if [type] == "apache-error" {
        #      grok {
        #        match => { "message" => "%{APACHEERRORLOG}" }
        #        patterns_dir => ["/var/lib/logstash/etc/grok"]
        #      }
        #  }
        #  if [clientip]  {
        #    geoip {
        #      source => "clientip"
        #      target => "geoip"
        #      remove_field => ["[geoip][ip]", "[geoip][country_code3]", "[geoip][country_name]", "[geoip][continent_code]", "[geoip][region_name]",
        #                       "[geoip][city_name]", "[geoip][latitude]", "[geoip][longitude]", "[geoip][timezone]", "[geoip][real_region_name]"]
        #    }
        #  }
        }
        output {
          elasticsearch {
            cluster => "dev"
            index => "test"
            protocol => "http"
                workers => 4
          }
        #stdout { codec => rubydebug }
        }

      1. input {
          file {
                type => "access-report"
            path => "/home/mungeol/workspace/securityTeam/sec_report"
          }
        }
        filter {
                if [type] == "access-report" {
                        csv {
                                columns => ["id","name","department","date","ip","request"]
                        }
                        date {
                                match => [ "date", "yyyy-MM-dd HH:mm:ss" ]
                        }
                }
        }
        output {
          elasticsearch {
            cluster => "dev"
            index => "sec-team"
            protocol => "http"
          }
        #stdout { codec => rubydebug }
        }
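    4. running a configuration (a hedged sketch; 'logstash.conf' stands for any of the files above, and the exact CLI differs slightly across Logstash 1.x versions)
      1. bin/logstash agent -f logstash.conf --configtest
      2. bin/logstash agent -f logstash.conf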
  2. ~ 2015.01.01
    1. introduction
      1. considering how the Apache access logs are currently generated, test transforming them with LogStash and shipping them to the ElasticSearch cluster
    2. real-time shipping
      1. use /home/weblog/test/data/test-access
      2. generate logs in the same format on a virtual log server
      3. configuration file
        1. input {
            file {
              path => "/home/weblog/test/data/test-access"
            }
          }
          filter {
            if [path] =~ "access" {
              mutate { replace => { type => "apache_access" } }
              grok {
                match => { "message" => "%{COMBINEDAPACHELOG}" }
              }
              date {
                match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
              }
            } else if [path] =~ "error" {
              mutate { replace => { type => "apache_error" } }
            } else {
              mutate { replace => { type => "random_logs" } }
            }
          }
          output {
            elasticsearch {
              host => "211.49.227.177"
              protocol => "http"
            }
          }
      4. whenever new log entries are appended to /home/weblog/test/data/test-access, they are indexed into the ElasticSearch cluster in real time

    3. hourly shipping
      1. use the /home/weblog/test/data/test-access.082117 naming format
      2. generate logs in the same format on a virtual log server
      3. configuration file
        1. input {
            file {
              path => "/home/weblog/test/data/test-access.*"
            }
          }
          filter {
            if [path] =~ "access" {
              mutate { replace => { type => "apache_access" } }
              grok {
                match => { "message" => "%{COMBINEDAPACHELOG}" }
              }
              date {
                match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
              }
            } else if [path] =~ "error" {
              mutate { replace => { type => "apache_error" } }
            } else {
              mutate { replace => { type => "random_logs" } }
            }
          }
          output {
            elasticsearch {
              host => "211.49.227.177"
              protocol => "http"
            }
          }
      4. whenever a new log file matching 'test-access.*' is added to /home/weblog/test/data/, it is indexed into the ElasticSearch cluster

    4. elasticsearch output configuration
      1. input {
          file {
            path => "/home/weblog/test/data/test-access.*"
          }
        }
        filter {
          if [path] =~ "access" {
            mutate { replace => { type => "apache_access" } }
            grok {
              match => { "message" => "%{COMBINEDAPACHELOG}" }
            }
            date {
              match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
            }
          } else if [path] =~ "error" {
            mutate { replace => { type => "apache_error" } }
          } else {
            mutate { replace => { type => "random_logs" } }
          }
        }
        output {
          elasticsearch {
            cluster => "elasticsearch"
            host => "211.49.227.177"
            index => "apache-%{+YYYY.MM}"
            protocol => "http"
          }
        stdout { codec => rubydebug }
        }
        

    5. extracting geo information from the IP address
      1. from the geo information, keep only the fields used by the bettermap and map panels and remove the rest
        1. input {
            file {
              type => "apache-access"
              path => "/home/weblog/test/data/test-access"
            }
            file {
              type => "apache-error"
              path => "/home/weblog/test/data/test-error"
            }
          }
          filter {
            if [type] == "apache-access" {
                grok {
                  match => { "message" => "%{COMBINEDAPACHELOG}" }
                }
            }
            if [type] == "apache-error" {
                grok {
                  #match => { "message" => "%{APACHEERRORLOG}" }
                  #patterns_dir => ["/var/lib/logstash/etc/grok"]
                }
            }
            if [clientip]  {
              geoip {
                source => "clientip"
                target => "geoip"
                remove_field => ["[geoip][ip]", "[geoip][country_code3]", "[geoip][country_name]", "[geoip][continent_code]", "[geoip][region_name]",
                                 "[geoip][city_name]", "[geoip][latitude]", "[geoip][longitude]", "[geoip][timezone]", "[geoip][real_region_name]"]
              }
            }
          }
          output {
            elasticsearch {
              cluster => "elasticsearch"
              host => "211.49.227.177"
              index => "apache-%{+YYYY.MM}"
              protocol => "http"
            }
          stdout { codec => rubydebug }
          }
    6. parsing the MySQL slow query log
      1. input {
          file {
            type => "mysql-slow"
            path => "/DBLog/dbmaster-slow.log"
          }
        }
        filter {
          if [message] =~ "# Time: " {
            drop {}
          }
          grok {
            match => {
              message => [
                "^# User@Host: %{USER:user}(?:\[[^\]]+\])?\s+@\s+%{HOST:host}?\s+\[%{IP:ip}?\]",
                "^# Query_time: %{NUMBER:query_time:float}\s+Lock_time: %{NUMBER:lock_time:float} Rows_sent: %{NUMBER:rows_sent:int} \s*Rows_examined: %{NUMBER:rows_examined:float}",
                "^SET timestamp=%{NUMBER:timestamp};",
                "%{GREEDYDATA:query}"
              ]
            }
          }
          multiline {
            pattern => "# User"
            negate => false
            what => "next"
          }
          multiline {
            pattern => "^#"
            negate => true
            what => "previous"
          }
          date {
            match => [ "timestamp", "UNIX" ]
          }
          mutate {
            remove_field => [ "timestamp" ]
          }
        }
        output {
          elasticsearch {
            cluster => "elasticsearch"
            host => "211.49.227.177"
            index => "mysql-%{+YYYY.MM}"
            protocol => "http"
          }
        #  stdout { codec => rubydebug }
        }
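      2. an illustrative slow-log entry in the standard MySQL format that the filters above target (all values are made up); the '# Time:' line is removed by the drop {} filter
        # Time: 150214 10:23:45
        # User@Host: appuser[appuser] @ dbclient01 [10.0.2.15]
        # Query_time: 3.421  Lock_time: 0.000123 Rows_sent: 10  Rows_examined: 123456
        SET timestamp=1423876425;
        SELECT * FROM orders WHERE customer_id = 42;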
        

    7. using elasticsearch-river / RabbitMQ
      1. input {
          file {
            type => "apache-access"
            path => "/home/mungeol/test-access*"
          }
        #  file {
        #    type => "apache-error"
        #    path => "/home/weblog/test/data/test-error"
        #  }
        }
        filter {
          if [type] == "apache-access" {
              grok {
                match => { "message" => "%{COMBINEDAPACHELOG}" }
              }
            date {
              match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
            }
          }
        #  if [type] == "apache-error" {
        #      grok {
        #        match => { "message" => "%{APACHEERRORLOG}" }
        #        patterns_dir => ["/var/lib/logstash/etc/grok"]
        #      }
        #  }
          if [clientip]  {
            geoip {
              source => "clientip"
              target => "geoip"
              remove_field => ["[geoip][ip]", "[geoip][country_code3]", "[geoip][country_name]", "[geoip][continent_code]", "[geoip][region_name]",
                               "[geoip][city_name]", "[geoip][latitude]", "[geoip][longitude]", "[geoip][timezone]", "[geoip][real_region_name]"]
            }
          }
        }
        output {
        #  elasticsearch {
        #    cluster => "dev"
        #    host => "10.0.2.83"
        #    index => "apache-test"
        #    protocol => "http"
        #  }
        elasticsearch_river {
          es_host => "10.0.2.82"
          rabbitmq_host => "10.0.2.81"
          index => "apache-test"
          user => "test"
          password => "test"
        }
        #stdout { codec => rubydebug }
        }
    8. logstash-kafka
      1. bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 3 --topic access-test
      2. producer
        1. input {
            file {
              type => "apache-access"
              path => "/home/mungeol/access-test"
            }
          }
          filter {
            if [type] == "apache-access" {
                grok {
                  #match => { "message" => "%{COMBINEDAPACHELOG}" }
                  match => { "message" => "%{COMBINEDAPACHELOG} %{QS:cookie}" }
                  match => { "cookie" => "UID=(?<mem_idx>\d+);" }
                  break_on_match => false
                }
              date {
                locale => en
                match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
              }
            }
            if [clientip]  {
              geoip {
                source => "clientip"
                target => "geoip"
                remove_field => ["[geoip][ip]", "[geoip][country_code3]", "[geoip][country_name]", "[geoip][continent_code]", "[geoip][region_name]",
                                 "[geoip][city_name]", "[geoip][latitude]", "[geoip][longitude]", "[geoip][timezone]", "[geoip][real_region_name]"]
              }
            }
          }
          output {
            kafka {
              topic_id => "access-test"
              broker_list => "10.0.2.83:9092"
            }
          #stdout { codec => rubydebug }
          }
      3. consumer
        1. input {
            kafka {
              topic_id => "access-test"
            }
          }
          output {
            elasticsearch {
              cluster => "dev"
              index => "test-%{+YYYY.MM.dd}"
              protocol => "http"
            }
          #stdout { codec => rubydebug }
          }
      4. http://x.x.x.x:8080/#/topics
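      5. a quick end-to-end check sketch: read the topic back with the console consumer shipped with Kafka 0.8 (ZooKeeper address and topic name as used above)
        1. bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic access-test --from-beginning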

Kibana 4 Test

  1. 2014.11.07
    1. basic authentication (temporary method)
      1. sudo su -c 'rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm'
      2. sudo yum install nginx
      3. sudo yum -y install httpd-tools
      4. sudo htpasswd -c /etc/nginx/conf.d/.htpasswd tester
      5. cd /etc/nginx/conf.d
      6. sudo vim kibana.conf
        1. server {
                  listen       9201;
                  server_name  hadoopdev-03;
                  charset utf-8;
                  
                  location / {
                      auth_basic "Restricted";
                      auth_basic_user_file /etc/nginx/conf.d/.htpasswd;
                      root   /opt/kibana;
                      index  index.html index.htm;
                  }
          }
      7. cd /opt/
      8. sudo wget https://download.elasticsearch.org/kibana/kibana/kibana-3.1.1.zip
      9. sudo unzip kibana-3.1.1.zip
      10. cd kibana-3.1.1/
      11. sudo vim config.js
      12. elasticsearch: "http://x.x.x.x:9200",
      13. cd ..
      14. sudo ln -s kibana-3.1.1/ kibana
      15. sudo service nginx start
      16. http://x.x.x.x:9201/#/dashboard/elasticsearch/Security_Team
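      17. optional check sketch: verify that unauthenticated requests are rejected and that the htpasswd user created above is accepted (PASSWORD is a placeholder)
        1. curl -I http://x.x.x.x:9201/
        2. curl -u tester:PASSWORD http://x.x.x.x:9201/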

Hive 4 Test

  1. 2015.03.23
    1. row_sequence()
      1. add jar /opt/cloudera/parcels/CDH/jars/hive-contrib-0.13.1-cdh5.3.0.jar;
      2. CREATE TEMPORARY FUNCTION row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
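      3. a hedged usage sketch from the shell ('some_table' is a placeholder table; the jar path is the CDH one used above)
        1. hive -e "ADD JAR /opt/cloudera/parcels/CDH/jars/hive-contrib-0.13.1-cdh5.3.0.jar; CREATE TEMPORARY FUNCTION row_sequence AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence'; SELECT row_sequence() AS row_id, t.* FROM some_table t LIMIT 10;"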
  2. 2015.02.11
    1. Software
      1. Hive 0.13.0 Setup (HDP 2.1 General Availability)
        • Hadoop 2.4.0
        • Tez 0.4.0
        • Hive 0.13.0
        HDP was deployed using Ambari 1.5.1. For the most part, the cluster used the Ambari defaults (except where noted below).  Hive 0.13.0 runs were done using Java 7 (default JVM).
        Tez and MapReduce were tuned to process all queries using 4 GB containers at a target container-to-disk ratio of 2.0. The ratio is important because it minimizes disk thrash and maximizes throughput.
        Other Settings:
        • yarn.nodemanager.resource.memory-mb was set to 49152
        • Default virtual memory for a job’s map-task and reduce-task was set to 4096
        • hive.tez.container.size was set to 4096
        • hive.tez.java.opts was set to -Xmx3800m
        • Tez app masters were given 8 GB
        • mapreduce.map.java.opts and mapreduce.reduce.java.opts were set to -Xmx3800m. This is smaller than 4096 to allow for some garbage collection overhead
        • hive.auto.convert.join.noconditionaltask.size was set to 1252698795
        Note:  this is 1/3 of the Xmx value, about 1.7 GB.
        The following additional optimizations were used for Hive 0.13.0:
        • Vectorized Query enabled
        • ORCFile formatted data
        • Map-join auto conversion enabled
    2. Hardware
      1. 20 physical nodes, each with:
        • 2x Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz for total of 16 CPU cores/machine
        • Hyper-threading enabled
        • 256GB RAM per node
        • 6x 4TB WDC WD4000FYYZ-0 drives per node
        • 10 Gigabit interconnect between the nodes
        Notes: Based on the YARN Node Manager’s Memory Resource setting used above, only 48 GB of RAM per node was dedicated to query processing; the remaining 200 GB of RAM were available for system caches and HDFS.
        Linux Configurations:
        • /proc/sys/net/core/somaxconn = 512
        • /proc/sys/vm/dirty_writeback_centisecs = 6000
        • /proc/sys/vm/swappiness = 0
        • /proc/sys/vm/zone_reclaim_mode = 0
        • /sys/kernel/mm/redhat_transparent_hugepage/defrag = never
        • /sys/kernel/mm/redhat_transparent_hugepage/khugepaged/defrag = no
        • /sys/kernel/mm/transparent_hugepage/khugepaged/defrag = 0
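        A sketch of applying the settings above from a root shell (sysctl changes are not persistent across reboots unless added to /etc/sysctl.conf):
        • sysctl -w net.core.somaxconn=512
        • sysctl -w vm.dirty_writeback_centisecs=6000
        • sysctl -w vm.swappiness=0
        • sysctl -w vm.zone_reclaim_mode=0
        • echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
        • echo no > /sys/kernel/mm/redhat_transparent_hugepage/khugepaged/defrag
        • echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag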

ES 4 Test

  1. 2015.03.03
    1. bashrc
      1. export INNERIP=`hostname -i`
        export ES_HEAP_SIZE=8g
        export ES_CLASSPATH=/etc/hadoop/conf:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop/lib/*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop/.//*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-hdfs/./:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-hdfs/.//*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-yarn/.//*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/.//*
    2. configuration
      1. cluster.name: test
      2. node.name: ${HOSTNAME}
      3. transport.host: ${INNERIP}
      4. discovery.zen.ping.multicast.enabled: false
      5. discovery.zen.ping.unicast.hosts: ["10.0.2.81", "10.0.2.82", "10.0.2.83"]
      6. indices.fielddata.cache.size: 40%
  2. 2015.03.02
    1. snapshot and restore
      1. repository register
        1. PUT _snapshot/hdfs
          {
          "type": "hdfs",
          "settings": {
          "path": "/backup/elasticsearch"
          }
          }
      2. repository verification
        1. POST _snapshot/hdfs/_verify
      3. snapshot
        1. PUT _snapshot/hdfs/20150302
      4. monitoring snapshot/restore progress
        1. GET _snapshot/hdfs/20150302/_status
        2. GET _snapshot/hdfs/20150302
      5. snapshot information and status
        1. GET _snapshot/hdfs/20150302
        2. GET _snapshot/hdfs/_all 
        3. GET _snapshot/_status 
        4. GET _snapshot/hdfs/_status 
        5. GET _snapshot/hdfs/20150302/_status
      6. restore
        1. POST _snapshot/hdfs/20150302/_restore
      7. snapshot deletion / stopping currently running snapshot and restore operations
        1. DELETE _snapshot/hdfs/20150302
      8. repository deletion
        1. DELETE _snapshot/hdfs
      9. reference
        1. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-snapshots.html
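      10. the same operations as curl one-liners (a sketch; assumes a node answering on localhost:9200)
        1. curl -XPUT 'localhost:9200/_snapshot/hdfs' -d '{"type":"hdfs","settings":{"path":"/backup/elasticsearch"}}'
        2. curl -XPOST 'localhost:9200/_snapshot/hdfs/_verify'
        3. curl -XPUT 'localhost:9200/_snapshot/hdfs/20150302?wait_for_completion=true'
        4. curl -XPOST 'localhost:9200/_snapshot/hdfs/20150302/_restore'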
    2. rolling update
      1. Disable shard reallocation
        1. curl -XPUT localhost:9200/_cluster/settings -d '{ "transient" : { "cluster.routing.allocation.enable" : "none" } }'
      2. Shut down a single node within the cluster
        1. curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
      3. Confirm that all shards are correctly reallocated to the remaining running nodes
      4. Download newest version
      5. Extract the zip or tarball to a new directory
      6. Copy the configuration files from the old Elasticsearch installation’s config directory to the new Elasticsearch installation’s config directory
      7. Move data files from the old Elasticsearch installation’s data directory
      8. Install plugins
      9. Start the now upgraded node
      10. Confirm that it joins the cluster
      11. Re-enable shard reallocation
        1. curl -XPUT localhost:9200/_cluster/settings -d '{ "transient" : { "cluster.routing.allocation.enable" : "all" } }'
      12. Observe that all shards are properly allocated on all nodes
      13. Repeat this process for all remaining nodes
      14. Reference
        1. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-upgrade.html#rolling-upgrades
  3. 2015.02.13
    1. MySQL Slow Query Log Mapping
      1. PUT msql-2015
        {
          "mappings": {
            "log": {
              "properties": {
                "@timestamp": {
                  "type": "date",
                  "format": "dateOptionalTime"
                },
                "@version": {
                  "type": "string"
                },
                "host": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "ip": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "lock_time": {
                  "type": "double"
                },
                "message": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "query": {
                  "type": "string"
                },
                "query_time": {
                  "type": "double"
                },
                "rows_examined": {
                  "type": "double"
                },
                "rows_sent": {
                  "type": "double"
                },
                "type": {
                  "type": "string"
                },
                "user": {
                  "type": "string"
                }
              }
            }
          }
        }
    2. MySQL Slow Query Dump Mapping
      1. PUT msqld-2015
        {
          "mappings": {
            "dump": {
              "properties": {
                "@timestamp": {
                  "type": "date",
                  "format": "dateOptionalTime"
                },
                "@version": {
                  "type": "string"
                },
                "count": {
                  "type": "double"
                },
                "host": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "ip": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "lock": {
                  "type": "double"
                },
                "message": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "query": {
                  "type": "string"
                },
                "rows": {
                  "type": "double"
                },
                "time": {
                  "type": "double"
                },
                "type": {
                  "type": "string"
                },
                "user": {
                  "type": "string"
                }
              }
            }
          }
        }
  4. 2015.02.12
    1. MySQL Slow Query Log & Dump Mappings
      1. PUT msqld-2015
        {
          "mappings": {
            "log": {
              "properties": {
                "@timestamp": {
                  "type": "date",
                  "format": "dateOptionalTime"
                },
                "@version": {
                  "type": "string"
                },
                "host": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "ip": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "lock_time": {
                  "type": "double"
                },
                "message": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "query": {
                  "type": "string"
                },
                "query_time": {
                  "type": "double"
                },
                "rows_examined": {
                  "type": "double"
                },
                "rows_sent": {
                  "type": "double"
                },
                "type": {
                  "type": "string"
                },
                "user": {
                  "type": "string"
                }
              }
            },
            "dump": {
              "properties": {
                "@timestamp": {
                  "type": "date",
                  "format": "dateOptionalTime"
                },
                "@version": {
                  "type": "string"
                },
                "count": {
                  "type": "double"
                },
                "host": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "ip": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "lock": {
                  "type": "double"
                },
                "message": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "query": {
                  "type": "string"
                },
                "rows": {
                  "type": "double"
                },
                "time": {
                  "type": "double"
                },
                "type": {
                  "type": "string"
                },
                "user": {
                  "type": "string"
                }
              }
            }
          }
        }
  5. 2015.01.26
    1. bashrc
      1. export ES_HEAP_SIZE=9g
  6. 2015.01.19
    1. restart script
      1. curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":true}}'
        sleep 1s
        curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
        sleep 1s
        bin/elasticsearch -d
        sleep 3s
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":false}}'
    2. configuration
      1. cluster.name: dev
        node.name: "dev01"
        discovery.zen.ping.multicast.enabled: false
        discovery.zen.ping.unicast.hosts: ["10.0.2.81", "10.0.2.82", "10.0.2.83"]
        indices.fielddata.cache.size: 40%
  7. ~ 2015.01.01
    1. Settings
      1. curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"indices.recovery.concurrent_streams":6,"indices.recovery.max_bytes_per_sec":"50mb"}}'
      2. ES_HEAP_SIZE: 50% (< 32g)
      3. indices.fielddata.cache.size: 40%
    2. Scenarios
      1. heap size setting
      2. export ES_HEAP_SIZE=9g
      3. curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":true}}'
      4. curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
      5. bin/elasticsearch -d
      6. curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":false}}'
      7. TBD

Curator 4 Test

  1. 2015.02.23
    1. delete '.marvel-*' indices older than 7 days
      1. recent versions require --prefix instead of -p
        10 0 * * * /usr/bin/curator delete --prefix .marvel- --older-than 7
  2. ~ 2015.02.23
    1. development zone
      1. delete '.marvel-%Y.%m.%d' indices older than 3 days
        10 0 * * * /usr/bin/curator delete -p .marvel- --older-than 3

MariaDB

  1. 2015.07.24
    1. Installing MariaDB 10 with YUM
      1. vim /etc/yum.repos.d/MariaDB.repo
        1. # MariaDB 10.0 CentOS repository list - created 2015-07-24 02:31 UTC 
        2. # http://mariadb.org/mariadb/repositories/
        3. [mariadb] 
        4. name = MariaDB 
        5. baseurl = http://yum.mariadb.org/10.0/centos6-amd64
        6. gpgkey=https://yum.mariadb.org/RPM-GPG-KEY-MariaDB
        7. gpgcheck=1
      2. yum -y install MariaDB-server MariaDB-client
      3. /etc/init.d/mysql start
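      4. a quick check sketch (on a fresh install the root account has no password until one is set)
        1. /etc/init.d/mysql status
        2. mysql -u root -e "SELECT VERSION();"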
  2. 2015.03.13
    1. installation
      1. groupadd mysql 
      2. useradd -g mysql mysql 
      3. cd /usr/local 
      4. download binary tarball from https://downloads.mariadb.org/
      5. tar -zxvpf /path-to/mariadb-VERSION-OS.tar.gz 
      6. ln -s mariadb-VERSION-OS mysql 
      7. cd mysql 
      8. ./scripts/mysql_install_db --user=mysql 
        1. solution for the error
          1. libaio.so.1: cannot open shared object file
            1. yum -y install libaio
      9. chown -R root . 
      10. chown -R mysql data
      11. export PATH=$PATH:/usr/local/mysql/bin/
      12. cd support-files/ 
      13. cp mysql.server /etc/init.d/mysql 
      14. chmod +x /etc/init.d/mysql
      15. chkconfig --add mysql 
      16. chkconfig --level 345 mysql on
      17. service mysql start / stop
    2. management
      1. set root password
        1. mysql -u root
        2. use mysql;
        3. update user set password=PASSWORD("new-password") where User='root';
        4. flush privileges;
        5. exit;
      2. granting user connections from remote hosts
        1. GRANT ALL PRIVILEGES ON *.* TO 'root'@'remote host' IDENTIFIED BY 'password' WITH GRANT OPTION;
    3. reference
      1. https://mariadb.com/kb/en/mariadb/documentation/

Spark MLlib

  1. Common algorithms
    1. classification
    2. regression
    3. clustering
    4. collaborative filtering
    5. dimensionality reduction
  2. data science algorithms (python)
    1. Module: Classification (pyspark.mllib.classification)
      1. Linear Support Vector Machine: the SVMWithSGD class
      2. Naive Bayes: the NaiveBayes class
      3. Logistic regression: the LogisticRegressionWithSGD class
    2. Module: Regression (pyspark.mllib.regression)
      1. Linear least squares: the LinearRegressionWithSGD class
      2. Ridge regression: linear least squares with L2 regularization, implemented by the RidgeRegressionWithSGD class
      3. Lasso: linear least squares with L1 regularization, implemented by the LassoWithSGD class
    3. Module: Clustering (pyspark.mllib.clustering)
      1. K-Means: the KMeans class
    4. Module: Recommendation (pyspark.mllib.recommendation)
      1. Alternating Least Squares: the ALS class

Scikit-learn

  1. Scikit-learn’s algorithms
    1. Support Vector Machines 
      1. Classification
      2. Regression
      3. Outlier detection
    2. Logistic Regression 
    3. Naive Bayes 
    4. Random Forests 
    5. Gradient Boosting 
    6. K-means
  2. Scikit-learn Algorithm Cheat-sheet
  3. Classification
    1. Support Vector Classification
      1. sklearn.svm.SVC
      2. sklearn.svm.NuSVC
      3. sklearn.svm.LinearSVC
    2. Naive Bayes for Classification
      1. sklearn.naive_bayes.GaussianNB
      2. sklearn.naive_bayes.MultinomialNB
      3. sklearn.naive_bayes.BernoulliNB
    3. Nearest neighbors
  4. Regression
    1. Support Vector Regression
      1. sklearn.svm.SVR
      2. sklearn.svm.NuSVR
    2. Decision Trees
  5. Clustering
    1. sklearn.cluster.KMeans
    2. sklearn.cluster.AffinityPropagation
    3. sklearn.cluster.DBSCAN
    4. sklearn.cluster.Ward
    5. sklearn.cluster.MeanShift
    6. sklearn.cluster.SpectralClustering

Machine Learning

Mahout: https://mahout.apache.org/
Spark MLlib: http://spark.apache.org/mllib/
Scikit-learn: http://scikit-learn.org/stable/
NLTK: http://www.nltk.org/
PyBrain: http://pybrain.org/
mlpy: http://mlpy.sourceforge.net/
PyML: http://pyml.sourceforge.net/
Orange: http://orange.biolab.si/
PyMVPA: http://www.pymvpa.org/

elasticsearch-repository-hdfs

  1. Introduction
    1. the elasticsearch-repository-hdfs plugin allows Elasticsearch 1.4 to use the HDFS file system as a repository for snapshot/restore
  2. Installation
    1. version information
      1. CDH: 5.3.0
      2. elasticsearch: 1.4.2
      3. elasticsearch-repository-hdfs: 2.1.0.Beta3-light
        1. note that the stable version (2.0.2) did not work as of this writing.
        2. check https://groups.google.com/forum/#!msg/elasticsearch/CZy1oJpKHyc/1uvoMbI5r5sJ
    2. hadoop installed at the same node
      1. append the output of the "hadoop classpath" command to ES_CLASSPATH
      2. example
        1. hadoop classpath
        2. export ES_CLASSPATH=/etc/hadoop/conf:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop/lib/*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop/.//*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-hdfs/./:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-hdfs/.//*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-yarn/.//*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/.//*
      3. install plugin at each node and restart it
        1. bin/plugin -i elasticsearch/elasticsearch-repository-hdfs/2.1.0.Beta3-light
    3. no hadoop installed at the same node
      1. install plugin at each node and restart it
        1. bin/plugin -i elasticsearch/elasticsearch-repository-hdfs/2.1.0.Beta3-hadoop2
    4. repository register
      1. example
        1. PUT _snapshot/hdfs
          {
          "type": "hdfs",
          "settings": {
          "path": "/backup/elasticsearch"
          }
          }
    5. verification
      1. POST _snapshot/hdfs/_verify
  3. Configuration
    1. uri: "hdfs://<host>:<port>/" # optional - Hadoop file-system URI
    2. path: "some/path" # required - path with the file-system where data is stored/loaded
    3. load_defaults: "true" # optional - whether to load the default Hadoop configuration (default) or not
    4. conf_location: "extra-cfg.xml" # optional - Hadoop configuration XML to be loaded (use commas for multi values)
    5. conf.<key> : "<value>" # optional - 'inlined' key=value added to the Hadoop configuration 
    6. concurrent_streams: 5 # optional - the number of concurrent streams (defaults to 5) 
    7. compress: "false" # optional - whether to compress the data or not (default) 
    8. chunk_size: "10mb" # optional - chunk size (disabled by default)
  4. Reference
    1. https://github.com/elasticsearch/elasticsearch-hadoop/tree/master/repository-hdfs
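  5. Example
    1. a hedged sketch combining the options above into a single repository registration ('namenode:8020' is a placeholder for the actual NameNode address)
      1. curl -XPUT 'localhost:9200/_snapshot/hdfs' -d '{"type":"hdfs","settings":{"uri":"hdfs://namenode:8020/","path":"/backup/elasticsearch","compress":true,"chunk_size":"10mb"}}'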

Apache Spark

  1. 2015.07.09
    1. hardware provisioning
      1. storage systems
        1. run on the same nodes as HDFS 
          1. yarn
          2. with a fixed amount of memory and cores dedicated to Spark on each node
        2. run on different nodes in the same local-area network as HDFS
        3. run computing jobs on different nodes than the storage system
      2. local disks
        1. 4-8 disks per node
        2. without RAID (just as separate mount points)
        3. noatime option
        4. configure the spark.local.dir variable to be a comma-separated list of the local disks
        5. same disks as HDFS, if running HDFS
      3. memory
        1. 8 GB - hundreds of GB
        2. 75% of the memory
        3. if memory > 200 GB, then run multiple worker JVMs per node
          1. standalone mode
            1. conf/spark-env.sh
              1. SPARK_WORKER_INSTANCES: set the number of workers per node
              2. SPARK_WORKER_CORES: the number of cores per worker
      4. network
        1. >= 10 gigabit
      5. CPU cores
        1. 8-16 cores per machine, or more
      6. reference
        1. https://spark.apache.org/docs/latest/hardware-provisioning.html
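      7. a minimal conf/spark-env.sh sketch for the standalone settings above (all values are illustrative)
        1. SPARK_WORKER_INSTANCES=2   # workers per node, for machines with very large RAM
        2. SPARK_WORKER_CORES=8   # cores per worker
        3. SPARK_LOCAL_DIRS=/data1/spark,/data2/spark   # one entry per local disk (spark.local.dir)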
    2. third-party hadoop distributions
      1. CDH
      2. HDP (recommended)
      3. inheriting cluster configuration
        1. spark-env.sh
          1. HADOOP_CONF_DIR
            1. hdfs-site.xml
            2. core-site.xml
      4. reference
        1. https://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
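      5. a minimal conf/spark-env.sh sketch for inheriting the cluster configuration (the path assumes a typical /etc/hadoop/conf layout)
        1. export HADOOP_CONF_DIR=/etc/hadoop/conf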
  2. 2015.06.04
    1. external tools
      1. cluster-wide monitoring tool
        1. Ganglia
      2. OS profiling tools
        1. dstat
        2. iostat
        3. iotop
      3. JVM utilities
        1. jstack
        2. jmap
        3. jstat
        4. jconsole
  3. 2015.05.15
    1. overview
      1. Apache Spark is a fast and general-purpose cluster computing system. 
      2. It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs. 
      3. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
    2. quick start
      1. Quick Start
    3. modules
      1. Spark Streaming
      2. Spark SQL and DataFrames
      3. MLlib
      4. GraphX
      5. Bagel (Pregel on Spark)
    4. reference
      1. https://spark.apache.org
  4. 2015.05.14
    1. optimization (each problem with the configuration applied for it; a spark-defaults.conf sketch follows this entry)
      1. out of memory
        1. sysctl -w vm.max_map_count=65535
        2. spark.storage.memoryMapThreshold 131072
      2. too many open files
        1. sysctl -w fs.file-max=1000000
        2. spark.shuffle.consolidateFiles true
        3. spark.shuffle.manager sort
      3. connection reset by peer
        1. -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=12 -XX:NewRatio=3 -XX:SurvivorRatio=3
      4. error communicating with MapOutputTracker
        1. spark.akka.askTimeout 120
        2. spark.akka.lookupTimeout 120
    2. use case
      1. Baidu real-time security product
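    3. a hedged conf/spark-defaults.conf sketch for the Spark-side properties listed above (the sysctl and GC settings are applied at the OS/JVM level instead)
      1. spark.storage.memoryMapThreshold 131072
      2. spark.shuffle.consolidateFiles true
      3. spark.shuffle.manager sort
      4. spark.akka.askTimeout 120
      5. spark.akka.lookupTimeout 120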
  5. 2015.05.12
    1. configuration
      1. 75% of a machine's memory (standalone)
      2. minimum executor heap size: 8 GB
      3. maximum executor heap size: 40 GB / under 45 GB (watch GC)
      4. kryo serialization
      5. parallel (old) / CMS / G1 GC
      6. pypy > cpython
    2. notes
      1. in-memory usage is not the same as the raw data size (often 2-3x bigger)
      2. prefer reduceByKey over groupByKey
      3. there are limitations when using Python with Spark Streaming (at least for now)
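    3. a minimal conf/spark-defaults.conf sketch for enabling Kryo serialization as mentioned in the configuration list above
      1. spark.serializer org.apache.spark.serializer.KryoSerializer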