Friday, July 31, 2015

Solution 4 Error (HDP)

  • Note: apply the solution(s) described below to the corresponding error
  1. 2015.07.31
    1. no response from the NameNode UI / port 50070 is bound to a private IP
      1. ambari web -> HDFS -> configs -> custom core-site -> add property
        1. key: dfs.namenode.http-bind-host
        2. value: 0.0.0.0
      2. save it and restart related services
      3. note that the 'dfs.namenode.rpc-bind-host', 'dfs.namenode.servicerpc-bind-host' and 'dfs.namenode.https-bind-host' properties address similar binding issues
  2. 2015.07.30
    1. root is not allowed to impersonate <username>
      1. ambari web -> HDFS -> configs -> custom core-site -> add property
        1. key: hadoop.proxyuser.root.groups
        2. value: *
        3. key: hadoop.proxyuser.root.hosts
        4. value: *
      2. save it and restart related services
      3. note that 'root' should be replaced with the name of the user who runs/submits the service/job (a quick check sketch follows this list)
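  • A minimal check sketch, run on a cluster node after the restarts, to confirm both changes took effect (port and keys are the ones used above):
    1. hdfs getconf -confKey dfs.namenode.http-bind-host      # expect 0.0.0.0
    2. netstat -tlnp | grep 50070                             # the NameNode UI should now listen on 0.0.0.0:50070
    3. hdfs getconf -confKey hadoop.proxyuser.root.hosts      # expect *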

Wednesday, July 29, 2015

Log Format

  1. 2015.07.08
    1. Well-known log formats
      1. W3C Extended Log File Format: http://www.w3.org/TR/WD-logfile.html
      2. Apache access log: http://httpd.apache.org/docs/current/logs.html
      3. Cisco (SDEE/CIDEE): http://www.cisco.com/c/en/us/td/docs/security/ips/specs/CIDEE_Specification.html
      4. ArcSight Common Event Format (CEF): CommonEventFormat.pdf
      5. syslog: RFC3195, RFC5424
      6. IDMEF (XML-based format): RFC4765


    2. Common field set (an illustrative line is given at the end of this entry)
      1. date/time (with time zone)
      2. log entry type
      3. system that generated the log
      4. application or component that generated it
      5. indication of whether the action succeeded or failed
      6. severity, priority, or importance of the log message
      7. for logs about user activity, the user name as well
    3. Logging categories
      1. the 5 W's of logging
        1. what happened
        2. when it happened
        3. where it happened
        4. who was involved
        5. where he/she/it came from
      2. additional questions
        1. where can I get more information
        2. how do I know the event really happened
        3. what was affected
        4. what will happen next
        5. what else happened that I should care about
        6. what should I do
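    4. purely illustrative example (not taken from any real log): an Apache-style access line showing several of the common fields above
      10.0.2.15 - alice [31/Jul/2015:10:22:01 +0900] "GET /index.html HTTP/1.1" 200 1043 "-" "Mozilla/5.0"
      (date/time with time zone, the action and its success/failure via the 200 status, the response size, and the user name alice)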

LogStash 4 Test

  1. 2015.01.19
    1. wiselog configuration
      1. input {
          file {
            type => "access"
            path => "/home/mungeol/access.log"
          }
        }
        filter {
                if [type] == "access" {
                        grok {
                                match => { "message" => "%{IPORHOST:clientip} \[%{HTTPDATE:timestamp}\] \"(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})\" \"(?<referrer>.+)\" %{NUMBER:response} (?:%{NUMBER:bytes}|-) %{QS:agent} %{QS:cookie}" }
                        }
                        grok {
                                match => { "cookie" => "PCID=(?<pcid>\d+);" }
                        }
                        grok {
                                match => { "cookie" => "UID=(?<uid>\d+);" }
                        }
                        grok {
                                match => { "cookie" => "n_ss=(?<n_ss>\d+.\d+);" }
                        }
                        grok {
                                match => { "cookie" => "n_cs=(?<n_cs>.+);" }
                        }
                }
                date {
                        match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
                }
        }
        output {
                elasticsearch {
                        cluster => "dev"
                        index => "wiselog_test"
                        protocol => "http"
                        workers => 4
                }
                #stdout { codec => rubydebug }
        }
    2. kafka output configuration
      1. input {
          file {
            type => "apache-access"
            path => "/home/mungeol/access-test"
          }
        #  file {
        #    type => "apache-error"
        #    path => "/home/weblog/test/data/test-error"
        #  }
        }
        #filter {
        #  if [type] == "apache-access" {
        #      grok {
        #        match => { "message" => "%{COMBINEDAPACHELOG}" }
        #      }
        #    date {
        #      match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
        #    }
        #  }
        #  if [type] == "apache-error" {
        #      grok {
        #        match => { "message" => "%{APACHEERRORLOG}" }
        #        patterns_dir => ["/var/lib/logstash/etc/grok"]
        #      }
        #  }
        #  if [clientip]  {
        #    geoip {
        #      source => "clientip"
        #      target => "geoip"
        #      remove_field => ["[geoip][ip]", "[geoip][country_code3]", "[geoip][country_name]", "[geoip][continent_code]", "[geoip][region_name]",
        #                       "[geoip][city_name]", "[geoip][latitude]", "[geoip][longitude]", "[geoip][timezone]", "[geoip][real_region_name]"]
        #    }
        #  }
        #}
        output {
                kafka {
                        broker_list => "10.0.2.81:9092,10.0.2.82:9092,10.0.2.83:9092"
                        topic_id => "access-test-02"
                        topic_metadata_refresh_interval_ms => 30000
                        request_required_acks => 0
        #               producer_type => "async"
                }
        }
    3. kafka input configuration
      1. input {
                kafka {
                        zk_connect => "localhost:2181,10.0.2.81:2181,10.0.2.82:2181,10.0.2.83:2181"
                        topic_id => "access-test-07"
                        type => "apache-access"
        queue_size => 200
        fetch_message_max_bytes => 2097152
                }
        }
        filter {
          if [type] == "apache-access" {
              grok {
                match => { "message" => "%{COMBINEDAPACHELOG}" }
              }
            date {
              match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
            }
          }
        #  if [type] == "apache-error" {
        #      grok {
        #        match => { "message" => "%{APACHEERRORLOG}" }
        #        patterns_dir => ["/var/lib/logstash/etc/grok"]
        #      }
        #  }
        #  if [clientip]  {
        #    geoip {
        #      source => "clientip"
        #      target => "geoip"
        #      remove_field => ["[geoip][ip]", "[geoip][country_code3]", "[geoip][country_name]", "[geoip][continent_code]", "[geoip][region_name]",
        #                       "[geoip][city_name]", "[geoip][latitude]", "[geoip][longitude]", "[geoip][timezone]", "[geoip][real_region_name]"]
        #    }
        #  }
        }
        output {
          elasticsearch {
            cluster => "dev"
            index => "test"
            protocol => "http"
                workers => 4
          }
        #stdout { codec => rubydebug }
        }

      1. input {
          file {
                type => "access-report"
            path => "/home/mungeol/workspace/securityTeam/sec_report"
          }
        }
        filter {
                if [type] == "access-report" {
                        csv {
                                columns => ["id","name","department","date","ip","request"]
                        }
                        date {
                                match => [ "date", "yyyy-MM-dd HH:mm:ss" ]
                        }
                }
        }
        output {
          elasticsearch {
            cluster => "dev"
            index => "sec-team"
            protocol => "http"
          }
        #stdout { codec => rubydebug }
        }
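    4. running a configuration (a hedged sketch; 'logstash.conf' stands for any of the files above, and the exact CLI differs slightly across Logstash 1.x versions)
      1. bin/logstash agent -f logstash.conf --configtest
      2. bin/logstash agent -f logstash.conf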
  2. ~ 2015.01.01
    1. introduction
      1. considering how the Apache access logs are currently generated, test transforming them with LogStash and shipping them to the ElasticSearch cluster
    2. real-time shipping
      1. use /home/weblog/test/data/test-access
      2. generate logs in the same format on a virtual log server
      3. configuration file
        1. input {
            file {
              path => "/home/weblog/test/data/test-access"
            }
          }
          filter {
            if [path] =~ "access" {
              mutate { replace => { type => "apache_access" } }
              grok {
                match => { "message" => "%{COMBINEDAPACHELOG}" }
              }
              date {
                match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
              }
            } else if [path] =~ "error" {
              mutate { replace => { type => "apache_error" } }
            } else {
              mutate { replace => { type => "random_logs" } }
            }
          }
          output {
            elasticsearch {
              host => "211.49.227.177"
              protocol => "http"
            }
          }
      4. whenever new log entries are appended to /home/weblog/test/data/test-access, they are indexed into the ElasticSearch cluster in real time

    3. hourly shipping
      1. use the /home/weblog/test/data/test-access.082117 naming format
      2. generate logs in the same format on a virtual log server
      3. configuration file
        1. input {
            file {
              path => "/home/weblog/test/data/test-access.*"
            }
          }
          filter {
            if [path] =~ "access" {
              mutate { replace => { type => "apache_access" } }
              grok {
                match => { "message" => "%{COMBINEDAPACHELOG}" }
              }
              date {
                match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
              }
            } else if [path] =~ "error" {
              mutate { replace => { type => "apache_error" } }
            } else {
              mutate { replace => { type => "random_logs" } }
            }
          }
          output {
            elasticsearch {
              host => "211.49.227.177"
              protocol => "http"
            }
          }
      4. whenever a new log file matching 'test-access.*' is added to /home/weblog/test/data/, it is indexed into the ElasticSearch cluster

    4. elasticsearch output configuration
      1. input {
          file {
            path => "/home/weblog/test/data/test-access.*"
          }
        }
        filter {
          if [path] =~ "access" {
            mutate { replace => { type => "apache_access" } }
            grok {
              match => { "message" => "%{COMBINEDAPACHELOG}" }
            }
            date {
              match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
            }
          } else if [path] =~ "error" {
            mutate { replace => { type => "apache_error" } }
          } else {
            mutate { replace => { type => "random_logs" } }
          }
        }
        output {
          elasticsearch {
            cluster => "elasticsearch"
            host => "211.49.227.177"
            index => "apache-%{+YYYY.MM}"
            protocol => "http"
          }
        stdout { codec => rubydebug }
        }
        

    5. extracting geo information from the IP address
      1. from the geo information, keep only the fields used by the bettermap and map panels and remove the rest
        1. input {
            file {
              type => "apache-access"
              path => "/home/weblog/test/data/test-access"
            }
            file {
              type => "apache-error"
              path => "/home/weblog/test/data/test-error"
            }
          }
          filter {
            if [type] == "apache-access" {
                grok {
                  match => { "message" => "%{COMBINEDAPACHELOG}" }
                }
            }
            if [type] == "apache-error" {
                grok {
                  #match => { "message" => "%{APACHEERRORLOG}" }
                  #patterns_dir => ["/var/lib/logstash/etc/grok"]
                }
            }
            if [clientip]  {
              geoip {
                source => "clientip"
                target => "geoip"
                remove_field => ["[geoip][ip]", "[geoip][country_code3]", "[geoip][country_name]", "[geoip][continent_code]", "[geoip][region_name]",
                                 "[geoip][city_name]", "[geoip][latitude]", "[geoip][longitude]", "[geoip][timezone]", "[geoip][real_region_name]"]
              }
            }
          }
          output {
            elasticsearch {
              cluster => "elasticsearch"
              host => "211.49.227.177"
              index => "apache-%{+YYYY.MM}"
              protocol => "http"
            }
          stdout { codec => rubydebug }
          }
    6. parsing the MySQL slow query log
      1. input {
          file {
            type => "mysql-slow"
            path => "/DBLog/dbmaster-slow.log"
          }
        }
        filter {
          if [message] =~ "# Time: " {
            drop {}
          }
          grok {
            match => {
              message => [
                "^# User@Host: %{USER:user}(?:\[[^\]]+\])?\s+@\s+%{HOST:host}?\s+\[%{IP:ip}?\]",
                "^# Query_time: %{NUMBER:query_time:float}\s+Lock_time: %{NUMBER:lock_time:float} Rows_sent: %{NUMBER:rows_sent:int} \s*Rows_examined: %{NUMBER:rows_examined:float}",
                "^SET timestamp=%{NUMBER:timestamp};",
                "%{GREEDYDATA:query}"
              ]
            }
          }
          multiline {
            pattern => "# User"
            negate => false
            what => "next"
          }
          multiline {
            pattern => "^#"
            negate => true
            what => "previous"
          }
          date {
            match => [ "timestamp", "UNIX" ]
          }
          mutate {
            remove_field => [ "timestamp" ]
          }
        }
        output {
          elasticsearch {
            cluster => "elasticsearch"
            host => "211.49.227.177"
            index => "mysql-%{+YYYY.MM}"
            protocol => "http"
          }
        #  stdout { codec => rubydebug }
        }
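      2. an illustrative slow-log entry in the standard MySQL format that the filters above target (all values are made up); the '# Time:' line is removed by the drop {} filter
        # Time: 150214 10:23:45
        # User@Host: appuser[appuser] @ dbclient01 [10.0.2.15]
        # Query_time: 3.421  Lock_time: 0.000123 Rows_sent: 10  Rows_examined: 123456
        SET timestamp=1423876425;
        SELECT * FROM orders WHERE customer_id = 42;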
        

    7. using elasticsearch-river / RabbitMQ
      1. input {
          file {
            type => "apache-access"
            path => "/home/mungeol/test-access*"
          }
        #  file {
        #    type => "apache-error"
        #    path => "/home/weblog/test/data/test-error"
        #  }
        }
        filter {
          if [type] == "apache-access" {
              grok {
                match => { "message" => "%{COMBINEDAPACHELOG}" }
              }
            date {
              match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
            }
          }
        #  if [type] == "apache-error" {
        #      grok {
        #        match => { "message" => "%{APACHEERRORLOG}" }
        #        patterns_dir => ["/var/lib/logstash/etc/grok"]
        #      }
        #  }
          if [clientip]  {
            geoip {
              source => "clientip"
              target => "geoip"
              remove_field => ["[geoip][ip]", "[geoip][country_code3]", "[geoip][country_name]", "[geoip][continent_code]", "[geoip][region_name]",
                               "[geoip][city_name]", "[geoip][latitude]", "[geoip][longitude]", "[geoip][timezone]", "[geoip][real_region_name]"]
            }
          }
        }
        output {
        #  elasticsearch {
        #    cluster => "dev"
        #    host => "10.0.2.83"
        #    index => "apache-test"
        #    protocol => "http"
        #  }
        elasticsearch_river {
          es_host => "10.0.2.82"
          rabbitmq_host => "10.0.2.81"
          index => "apache-test"
          user => "test"
          password => "test"
        }
        #stdout { codec => rubydebug }
        }
    8. logstash-kafka
      1. bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 3 --topic access-test
      2. producer
        1. input {
            file {
              type => "apache-access"
              path => "/home/mungeol/access-test"
            }
          }
          filter {
            if [type] == "apache-access" {
                grok {
                  #match => { "message" => "%{COMBINEDAPACHELOG}" }
                  match => { "message" => "%{COMBINEDAPACHELOG} %{QS:cookie}" }
                  match => { "cookie" => "UID=(?<mem_idx>\d+);" }
                  break_on_match => false
                }
              date {
                locale => en
                match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
              }
            }
            if [clientip]  {
              geoip {
                source => "clientip"
                target => "geoip"
                remove_field => ["[geoip][ip]", "[geoip][country_code3]", "[geoip][country_name]", "[geoip][continent_code]", "[geoip][region_name]",
                                 "[geoip][city_name]", "[geoip][latitude]", "[geoip][longitude]", "[geoip][timezone]", "[geoip][real_region_name]"]
              }
            }
          }
          output {
            kafka {
              topic_id => "access-test"
              broker_list => "10.0.2.83:9092"
            }
          #stdout { codec => rubydebug }
          }
      3. consumer
        1. input {
            kafka {
              topic_id => "access-test"
            }
          }
          output {
            elasticsearch {
              cluster => "dev"
              index => "test-%{+YYYY.MM.dd}"
              protocol => "http"
            }
          #stdout { codec => rubydebug }
          }
      4. http://x.x.x.x:8080/#/topics
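      5. a quick end-to-end check sketch: read the topic back with the console consumer shipped with Kafka 0.8 (ZooKeeper address and topic name as used above)
        1. bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic access-test --from-beginning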

Kibana 4 Test

  1. 2014.11.07
    1. basic authentication (temporary method)
      1. sudo su -c 'rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm'
      2. sudo yum install nginx
      3. sudo yum -y install httpd-tools
      4. sudo htpasswd -c /etc/nginx/conf.d/.htpasswd tester
      5. cd /etc/nginx/conf.d
      6. sudo vim kibana.conf
        1. server {
                  listen       9201;
                  server_name  hadoopdev-03;
                  charset utf-8;
                  
                  location / {
                      auth_basic "Restricted";
                      auth_basic_user_file /etc/nginx/conf.d/.htpasswd;
                      root   /opt/kibana;
                      index  index.html index.htm;
                  }
          }
      7. cd /opt/
      8. sudo wget https://download.elasticsearch.org/kibana/kibana/kibana-3.1.1.zip
      9. sudo unzip kibana-3.1.1.zip
      10. cd kibana-3.1.1/
      11. sudo vim config.js
      12. elasticsearch: "http://x.x.x.x:9200",
      13. cd ..
      14. sudo ln -s kibana-3.1.1/ kibana
      15. sudo service nginx start
      16. http://x.x.x.x:9201/#/dashboard/elasticsearch/Security_Team
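      17. optional check sketch: verify that unauthenticated requests are rejected and that the htpasswd user created above is accepted (PASSWORD is a placeholder)
        1. curl -I http://x.x.x.x:9201/
        2. curl -u tester:PASSWORD http://x.x.x.x:9201/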

Hive 4 Test

  1. 2015.03.23
    1. row_sequence()
      1. add jar /opt/cloudera/parcels/CDH/jars/hive-contrib-0.13.1-cdh5.3.0.jar;
      2. CREATE TEMPORARY FUNCTION row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
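      3. a hedged usage sketch from the shell ('some_table' is a placeholder table; the jar path is the CDH one used above)
        1. hive -e "ADD JAR /opt/cloudera/parcels/CDH/jars/hive-contrib-0.13.1-cdh5.3.0.jar; CREATE TEMPORARY FUNCTION row_sequence AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence'; SELECT row_sequence() AS row_id, t.* FROM some_table t LIMIT 10;"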
  2. 2015.02.11
    1. Software
      1. Hive 0.13.0 Setup (HDP 2.1 General Availability)
        • Hadoop 2.4.0
        • Tez 0.4.0
        • Hive 0.13.0
        HDP was deployed using Ambari 1.5.1. For the most part, the cluster used the Ambari defaults (except where noted below).  Hive 0.13.0 runs were done using Java 7 (default JVM).
        Tez and MapReduce were tuned to process all queries using 4 GB containers at a target container-to-disk ratio of 2.0. The ratio is important because it minimizes disk thrash and maximizes throughput.
        Other Settings:
        • yarn.nodemanager.resource.memory-mb was set to 49152
        • Default virtual memory for a job’s map-task and reduce-task was set to 4096
        • hive.tez.container.size was set to 4096
        • hive.tez.java.opts was set to -Xmx3800m
        • Tez app masters were given 8 GB
        • mapreduce.map.java.opts and mapreduce.reduce.java.opts were set to -Xmx3800m. This is smaller than 4096 to allow for some garbage collection overhead
        • hive.auto.convert.join.noconditionaltask.size was set to 1252698795
        Note:  this is 1/3 of the Xmx value, about 1.7 GB.
        The following additional optimizations were used for Hive 0.13.0:
        • Vectorized Query enabled
        • ORCFile formatted data
        • Map-join auto conversion enabled
    2. Hardware
      1. 20 physical nodes, each with:
        • 2x Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz for total of 16 CPU cores/machine
        • Hyper-threading enabled
        • 256GB RAM per node
        • 6x 4TB WDC WD4000FYYZ-0 drives per node
        • 10 Gigabit interconnect between the nodes
        Notes: Based on the YARN Node Manager’s Memory Resource setting used above, only 48 GB of RAM per node was dedicated to query processing; the remaining 200 GB of RAM were available for system caches and HDFS.
        Linux Configurations:
        • /proc/sys/net/core/somaxconn = 512
        • /proc/sys/vm/dirty_writeback_centisecs = 6000
        • /proc/sys/vm/swappiness = 0
        • /proc/sys/vm/zone_reclaim_mode = 0
        • /sys/kernel/mm/redhat_transparent_hugepage/defrag = never
        • /sys/kernel/mm/redhat_transparent_hugepage/khugepaged/defrag = no
        • /sys/kernel/mm/transparent_hugepage/khugepaged/defrag = 0
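        A sketch of applying the settings above from a root shell (sysctl changes are not persistent across reboots unless added to /etc/sysctl.conf):
        • sysctl -w net.core.somaxconn=512
        • sysctl -w vm.dirty_writeback_centisecs=6000
        • sysctl -w vm.swappiness=0
        • sysctl -w vm.zone_reclaim_mode=0
        • echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
        • echo no > /sys/kernel/mm/redhat_transparent_hugepage/khugepaged/defrag
        • echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag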

ES 4 Test

  1. 2015.03.03
    1. bashrc
      1. export INNERIP=`hostname -i`
        export ES_HEAP_SIZE=8g
        export ES_CLASSPATH=/etc/hadoop/conf:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop/lib/*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop/.//*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-hdfs/./:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-hdfs/.//*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-yarn/.//*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/.//*
    2. configuration
      1. cluster.name: test
      2. node.name: ${HOSTNAME}
      3. transport.host: ${INNERIP}
      4. discovery.zen.ping.multicast.enabled: false
      5. discovery.zen.ping.unicast.hosts: ["10.0.2.81", "10.0.2.82", "10.0.2.83"]
      6. indices.fielddata.cache.size: 40%
  2. 2015.03.02
    1. snapshot and restore
      1. repository register
        1. PUT _snapshot/hdfs
          {
          "type": "hdfs",
          "settings": {
          "path": "/backup/elasticsearch"
          }
          }
      2. repository verification
        1. POST _snapshot/hdfs/_verify
      3. snapshot
        1. PUT _snapshot/hdfs/20150302
      4. monitoring snapshot/restore progress
        1. GET _snapshot/hdfs/20150302/_status
        2. GET _snapshot/hdfs/20150302
      5. snapshot information and status
        1. GET _snapshot/hdfs/20150302
        2. GET _snapshot/hdfs/_all 
        3. GET _snapshot/_status 
        4. GET _snapshot/hdfs/_status 
        5. GET _snapshot/hdfs/20150302/_status
      6. restore
        1. POST _snapshot/hdfs/20150302/_restore
      7. snapshot deletion / stopping currently running snapshot and restore operations
        1. DELETE _snapshot/hdfs/20150302
      8. repository deletion
        1. DELETE _snapshot/hdfs
      9. reference
        1. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-snapshots.html
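      10. the same operations as curl one-liners (a sketch; assumes a node answering on localhost:9200)
        1. curl -XPUT 'localhost:9200/_snapshot/hdfs' -d '{"type":"hdfs","settings":{"path":"/backup/elasticsearch"}}'
        2. curl -XPOST 'localhost:9200/_snapshot/hdfs/_verify'
        3. curl -XPUT 'localhost:9200/_snapshot/hdfs/20150302?wait_for_completion=true'
        4. curl -XPOST 'localhost:9200/_snapshot/hdfs/20150302/_restore'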
    2. rolling update
      1. Disable shard reallocation
        1. curl -XPUT localhost:9200/_cluster/settings -d '{ "transient" : { "cluster.routing.allocation.enable" : "none" } }'
      2. Shut down a single node within the cluster
        1. curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
      3. Confirm that all shards are correctly reallocated to the remaining running nodes
      4. Download newest version
      5. Extract the zip or tarball to a new directory
      6. Copy the configuration files from the old Elasticsearch installation’s config directory to the new Elasticsearch installation’s config directory
      7. Move data files from the old Elasticsearch installation’s data directory
      8. Install plugins
      9. Start the now upgraded node
      10. Confirm that it joins the cluster
      11. Re-enable shard reallocation
        1. curl -XPUT localhost:9200/_cluster/settings -d '{ "transient" : { "cluster.routing.allocation.enable" : "all" } }'
      12. Observe that all shards are properly allocated on all nodes
      13. Repeat this process for all remaining nodes
      14. Reference
        1. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-upgrade.html#rolling-upgrades
  3. 2015.02.13
    1. MySQL Slow Query Log Mapping
      1. PUT msql-2015
        {
          "mappings": {
            "log": {
              "properties": {
                "@timestamp": {
                  "type": "date",
                  "format": "dateOptionalTime"
                },
                "@version": {
                  "type": "string"
                },
                "host": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "ip": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "lock_time": {
                  "type": "double"
                },
                "message": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "query": {
                  "type": "string"
                },
                "query_time": {
                  "type": "double"
                },
                "rows_examined": {
                  "type": "double"
                },
                "rows_sent": {
                  "type": "double"
                },
                "type": {
                  "type": "string"
                },
                "user": {
                  "type": "string"
                }
              }
            }
          }
        }
    2. MySQL Slow Query Dump Mapping
      1. PUT msqld-2015
        {
          "mappings": {
            "dump": {
              "properties": {
                "@timestamp": {
                  "type": "date",
                  "format": "dateOptionalTime"
                },
                "@version": {
                  "type": "string"
                },
                "count": {
                  "type": "double"
                },
                "host": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "ip": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "lock": {
                  "type": "double"
                },
                "message": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "query": {
                  "type": "string"
                },
                "rows": {
                  "type": "double"
                },
                "time": {
                  "type": "double"
                },
                "type": {
                  "type": "string"
                },
                "user": {
                  "type": "string"
                }
              }
            }
          }
        }
  4. 2015.02.12
    1. MySQL Slow Query Log & Dump Mappings
      1. PUT msqld-2015
        {
          "mappings": {
            "log": {
              "properties": {
                "@timestamp": {
                  "type": "date",
                  "format": "dateOptionalTime"
                },
                "@version": {
                  "type": "string"
                },
                "host": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "ip": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "lock_time": {
                  "type": "double"
                },
                "message": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "query": {
                  "type": "string"
                },
                "query_time": {
                  "type": "double"
                },
                "rows_examined": {
                  "type": "double"
                },
                "rows_sent": {
                  "type": "double"
                },
                "type": {
                  "type": "string"
                },
                "user": {
                  "type": "string"
                }
              }
            },
            "dump": {
              "properties": {
                "@timestamp": {
                  "type": "date",
                  "format": "dateOptionalTime"
                },
                "@version": {
                  "type": "string"
                },
                "count": {
                  "type": "double"
                },
                "host": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "ip": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "lock": {
                  "type": "double"
                },
                "message": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "query": {
                  "type": "string"
                },
                "rows": {
                  "type": "double"
                },
                "time": {
                  "type": "double"
                },
                "type": {
                  "type": "string"
                },
                "user": {
                  "type": "string"
                }
              }
            }
          }
        }
  5. 2015.01.26
    1. bashrc
      1. export ES_HEAP_SIZE=9g
  6. 2015.01.19
    1. restart script
      1. curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":true}}'
        sleep 1s
        curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
        sleep 1s
        bin/elasticsearch -d
        sleep 3s
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":false}}'
    2. configuration
      1. cluster.name: dev
        node.name: "dev01"
        discovery.zen.ping.multicast.enabled: false
        discovery.zen.ping.unicast.hosts: ["10.0.2.81", "10.0.2.82", "10.0.2.83"]
        indices.fielddata.cache.size: 40%
  7. ~ 2015.01.01
    1. Settings
      1. curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"indices.recovery.concurrent_streams":6,"indices.recovery.max_bytes_per_sec":"50mb"}}'
      2. ES_HEAP_SIZE: 50% (< 32g)
      3. indices.fielddata.cache.size: 40%
    2. Scenarios
      1. heap size setting
      2. export ES_HEAP_SIZE=9g
      3. curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":true}}'
      4. curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
      5. bin/elasticsearch -d
      6. curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":false}}'
      7. TBD

Curator 4 Test

  1. 2015.02.23
    1. delete '.marvel-*' indices older than 7 days
      1. recent versions require --prefix instead of -p
        10 0 * * * /usr/bin/curator delete --prefix .marvel- --older-than 7
  2. ~ 2015.02.23
    1. development zone
      1. delete '.marvel-%Y.%m.%d' indices older than 3 days
        10 0 * * * /usr/bin/curator delete -p .marvel- --older-than 3

MariaDB

  1. 2015.07.24
    1. Installing MariaDB 10 with YUM
      1. vim /etc/yum.repos.d/MariaDB.repo
        1. # MariaDB 10.0 CentOS repository list - created 2015-07-24 02:31 UTC 
        2. # http://mariadb.org/mariadb/repositories/
        3. [mariadb] 
        4. name = MariaDB 
        5. baseurl = http://yum.mariadb.org/10.0/centos6-amd64
        6. gpgkey=https://yum.mariadb.org/RPM-GPG-KEY-MariaDB
        7. gpgcheck=1
      2. yum -y install MariaDB-server MariaDB-client
      3. /etc/init.d/mysql start
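      4. a quick check sketch (on a fresh install the root account has no password until one is set)
        1. /etc/init.d/mysql status
        2. mysql -u root -e "SELECT VERSION();"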
  2. 2015.03.13
    1. installation
      1. groupadd mysql 
      2. useradd -g mysql mysql 
      3. cd /usr/local 
      4. download binary tarball from https://downloads.mariadb.org/
      5. tar -zxvpf /path-to/mariadb-VERSION-OS.tar.gz 
      6. ln -s mariadb-VERSION-OS mysql 
      7. cd mysql 
      8. ./scripts/mysql_install_db --user=mysql 
        1. solution for the error
          1. libaio.so.1: cannot open shared object file
            1. yum -y install libaio
      9. chown -R root . 
      10. chown -R mysql data
      11. export PATH=$PATH:/usr/local/mysql/bin/
      12. cd support-files/ 
      13. cp mysql.server /etc/init.d/mysql 
      14. chmod +x /etc/init.d/mysql
      15. chkconfig --add mysql 
      16. chkconfig --level 345 mysql on
      17. service mysql start / stop
    2. management
      1. set root password
        1. mysql -u root
        2. use mysql;
        3. update user set password=PASSWORD("new-password") where User='root';
        4. flush privileges;
        5. exit;
      2. granting user connections from remote hosts
        1. GRANT ALL PRIVILEGES ON *.* TO 'root'@'remote host' IDENTIFIED BY 'password' WITH GRANT OPTION;
    3. reference
      1. https://mariadb.com/kb/en/mariadb/documentation/

Spark MLlib

  1. Common algorithms
    1. classification
    2. regression
    3. clustering
    4. collaborative filtering
    5. dimensionality reduction
  2. data science algorithms (python)
    1. Module: Classification (pyspark.mllib.classification)
      1. Linear Support Vector Machine: the SVMWithSGD class
      2. Naive Bayes: the NaiveBayes class
      3. Logistic regression: the LogisticRegressionWithSGD class
    2. Module: Regression (pyspark.mllib.regression)
      1. Linear least squares: the LinearRegressionWithSGD class
      2. Ridge regression: linear least squares with L2 regularization, implemented by the RidgeRegressionWithSGD class
      3. Lasso: linear least squares with L1 regularization, implemented by the LassoWithSGD class
    3. Module: Clustering (pyspark.mllib.clustering)
      1. K-Means: the KMeans class
    4. Module: Recommendation (pyspark.mllib.recommendation)
      1. Alternating Least Squares: the ALS class

Scikit-learn

  1. Scikit-learn’s algorithms
    1. Support Vector Machines 
      1. Classification
      2. Regression
      3. Outlier detection
    2. Logistic Regression 
    3. Naive Bayes 
    4. Random Forests 
    5. Gradient Boosting 
    6. K-means
  2. Scikit-learn Algorithm Cheat-sheet
  3. Classification
    1. Support Vector Classification
      1. sklearn.svm.SVC
      2. sklearn.svm.NuSVC
      3. sklearn.svm.LinearSVC
    2. Naive Bayes for Classification
      1. sklearn.naive_bayes.GaussianNB
      2. sklearn.naive_bayes.MultinomialNB
      3. sklearn.naive_bayes.BernoulliNB
    3. Nearest neighbors
  4. Regression
    1. Support Vector Regression
      1. sklearn.svm.SVR
      2. sklearn.svm.NuSVR
    2. Decision Trees
  5. Clustering
    1. sklearn.cluster.KMeans
    2. sklearn.cluster.AffinityPropagation
    3. sklearn.cluster.DBSCAN
    4. sklearn.cluster.Ward
    5. sklearn.cluster.MeanShift
    6. sklearn.cluster.SpectralClustering

Machine Learning

Mahout: https://mahout.apache.org/
Spark MLlib: http://spark.apache.org/mllib/
Scikit-learn: http://scikit-learn.org/stable/
NLTK: http://www.nltk.org/
PyBrain: http://pybrain.org/
mlpy: http://mlpy.sourceforge.net/
PyML: http://pyml.sourceforge.net/
Orange: http://orange.biolab.si/
PyMVPA: http://www.pymvpa.org/

elasticsearch-repository-hdfs

  1. Introduction
    1. the elasticsearch-repository-hdfs plugin allows Elasticsearch 1.4 to use the HDFS file system as a repository for snapshot/restore
  2. Installation
    1. version information
      1. CDH: 5.3.0
      2. elasticsearch: 1.4.2
      3. elasticsearch-repository-hdfs: 2.1.0.Beta3-light
        1. note that the stable version (2.0.2) did not work as of this writing.
        2. check https://groups.google.com/forum/#!msg/elasticsearch/CZy1oJpKHyc/1uvoMbI5r5sJ
    2. hadoop installed at the same node
      1. append the output of the "hadoop classpath" command to ES_CLASSPATH
      2. example
        1. hadoop classpath
        2. export ES_CLASSPATH=/etc/hadoop/conf:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop/lib/*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop/.//*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-hdfs/./:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-hdfs/.//*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/libexec/../../hadoop-yarn/.//*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/.//*
      3. install plugin at each node and restart it
        1. bin/plugin -i elasticsearch/elasticsearch-repository-hdfs/2.1.0.Beta3-light
    3. no hadoop installed at the same node
      1. install plugin at each node and restart it
        1. bin/plugin -i elasticsearch/elasticsearch-repository-hdfs/2.1.0.Beta3-hadoop2
    4. repository register
      1. example
        1. PUT _snapshot/hdfs
          {
          "type": "hdfs",
          "settings": {
          "path": "/backup/elasticsearch"
          }
          }
    5. verification
      1. POST _snapshot/hdfs/_verify
  3. Configuration
    1. uri: "hdfs://<host>:<port>/" # optional - Hadoop file-system URI
    2. path: "some/path" # required - path with the file-system where data is stored/loaded
    3. load_defaults: "true" # optional - whether to load the default Hadoop configuration (default) or not
    4. conf_location: "extra-cfg.xml" # optional - Hadoop configuration XML to be loaded (use commas for multi values)
    5. conf.<key> : "<value>" # optional - 'inlined' key=value added to the Hadoop configuration 
    6. concurrent_streams: 5 # optional - the number of concurrent streams (defaults to 5) 
    7. compress: "false" # optional - whether to compress the data or not (default) 
    8. chunk_size: "10mb" # optional - chunk size (disabled by default)
  4. Reference
    1. https://github.com/elasticsearch/elasticsearch-hadoop/tree/master/repository-hdfs
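  5. Example
    1. a hedged sketch combining the options above into a single repository registration ('namenode:8020' is a placeholder for the actual NameNode address)
      1. curl -XPUT 'localhost:9200/_snapshot/hdfs' -d '{"type":"hdfs","settings":{"uri":"hdfs://namenode:8020/","path":"/backup/elasticsearch","compress":true,"chunk_size":"10mb"}}'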

Apache Spark

  1. 2015.07.09
    1. hardware provisioning
      1. storage systems
        1. run on the same nodes as HDFS 
          1. yarn
          2. with a fixed amount of memory and cores dedicated to Spark on each node
        2. run on different nodes in the same local-area network as HDFS
        3. run computing jobs on different nodes than the storage system
      2. local disks
        1. 4-8 disks per node
        2. without RAID (just as separate mount points)
        3. noatime option
        4. configure the spark.local.dir variable to be a comma-separated list of the local disks
        5. same disks as HDFS, if running HDFS
      3. memory
        1. 8 GB - hundreds of GB
        2. 75% of the memory
        3. if memory > 200 GB, then run multiple worker JVMs per node
          1. standalone mode
            1. conf/spark-env.sh
              1. SPARK_WORKER_INSTANCES: set the number of workers per node
              2. SPARK_WORKER_CORES: the number of cores per worker
      4. network
        1. >= 10 gigabit
      5. CPU cores
        1. 8-16 cores per machine, or more
      6. reference
        1. https://spark.apache.org/docs/latest/hardware-provisioning.html
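      7. a minimal conf/spark-env.sh sketch for the standalone settings above (all values are illustrative)
        1. SPARK_WORKER_INSTANCES=2   # workers per node, for machines with very large RAM
        2. SPARK_WORKER_CORES=8   # cores per worker
        3. SPARK_LOCAL_DIRS=/data1/spark,/data2/spark   # one entry per local disk (spark.local.dir)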
    2. third-party hadoop distributions
      1. CDH
      2. HDP (recommended)
      3. inheriting cluster configuration
        1. spark-env.sh
          1. HADOOP_CONF_DIR
            1. hdfs-site.xml
            2. core-site.xml
      4. reference
        1. https://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
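      5. a minimal conf/spark-env.sh sketch for inheriting the cluster configuration (the path assumes a typical /etc/hadoop/conf layout)
        1. export HADOOP_CONF_DIR=/etc/hadoop/conf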
  2. 2015.06.04
    1. external tools
      1. cluster-wide monitoring tool
        1. Ganglia
      2. OS profiling tools
        1. dstat
        2. iostat
        3. iotop
      3. JVM utilities
        1. jstack
        2. jmap
        3. jstat
        4. jconsole
  3. 2015.05.15
    1. overview
      1. Apache Spark is a fast and general-purpose cluster computing system. 
      2. It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs. 
      3. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
    2. quick start
      1. Quick Start
    3. modules
      1. Spark Streaming
      2. Spark SQL and DataFrames
      3. MLlib
      4. GraphX
      5. Bagel (Pregel on Spark)
    4. reference
      1. https://spark.apache.org
  4. 2015.05.14
    1. optimization (each problem with the configuration applied for it; a spark-defaults.conf sketch follows this entry)
      1. out of memory
        1. sysctl -w vm.max_map_count=65535
        2. spark.storage.memoryMapThreshold 131072
      2. too many open files
        1. sysctl -w fs.file-max=1000000
        2. spark.shuffle.consolidateFiles true
        3. spark.shuffle.manager sort
      3. connection reset by peer
        1. -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=12 -XX:NewRatio=3 -XX:SurvivorRatio=3
      4. error communicating with MapOutputTracker
        1. spark.akka.askTimeout 120
        2. spark.akka.lookupTimeout 120
    2. use case
      1. Baidu real-time security product
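    3. a hedged conf/spark-defaults.conf sketch for the Spark-side properties listed above (the sysctl and GC settings are applied at the OS/JVM level instead)
      1. spark.storage.memoryMapThreshold 131072
      2. spark.shuffle.consolidateFiles true
      3. spark.shuffle.manager sort
      4. spark.akka.askTimeout 120
      5. spark.akka.lookupTimeout 120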
  5. 2015.05.12
    1. configuration
      1. 75% of a machine's memory (standalone)
      2. minimum executor heap size: 8 GB
      3. maximum executor heap size: 40 GB / under 45 GB (watch GC)
      4. kryo serialization
      5. parallel (old) / CMS / G1 GC
      6. pypy > cpython
    2. notes
      1. in-memory usage is not the same as the raw data size (often 2-3x bigger)
      2. prefer reduceByKey over groupByKey
      3. there are limitations when using Python with Spark Streaming (at least for now)
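    3. a minimal conf/spark-defaults.conf sketch for enabling Kryo serialization as mentioned in the configuration list above
      1. spark.serializer org.apache.spark.serializer.KryoSerializer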