Wednesday, December 31, 2014

Elasticsearch

  1. Introduction

    1. Elasticsearch is an open-source search engine built on top of Apache Lucene™, a full-text search engine library
    2. Elasticsearch is a real-time distributed search and analytics engine
    3. It allows you to explore your data at a speed and at a scale never before possible
    4. It is used for full text search, structured search, analytics, and all three in combination

  2. Why ES

    1. what traditional databases can't do
      1. Unfortunately, most databases are astonishingly inept at extracting actionable knowledge from your data
      2. Can they perform full-text search, handle synonyms and score documents by relevance?
      3. Can they generate analytics and aggregations from the same data?
      4. Most importantly, can they do this in real-time without big batch processing jobs?
    2. what Elasticsearch provides
      1. A distributed real-time document store where every field is indexed and searchable
      2. A distributed search engine with real-time analytics
      3. Capable of scaling to hundreds of servers and petabytes of structured and unstructured data

  3. Installation

    1. curl -L -O https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.3.2.zip
    2. unzip elasticsearch-1.3.2.zip
    3. cd elasticsearch-1.3.2
    4. ./bin/elasticsearch -d
  4. Cluster
    1. 3 lower-resource master-eligible nodes in large clusters
    2. lightweight client nodes
    3. bare metal is more configurable
    4. bare metal can utilize SSDs
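    The node roles above are declared per node in elasticsearch.yml via two flags; a minimal sketch of the three flavors (the usual convention, adjust per node):

    ```yml
    # dedicated master-eligible node: holds cluster state only, no data
    node.master: true
    node.data: false

    # lightweight client node: routes requests, holds no data, never master
    node.master: false
    node.data: false

    # regular data node
    node.master: false
    node.data: true
    ```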
  5. Commands

    1. curl 'http://localhost:9200/?pretty'
    2. curl -XPOST 'http://localhost:9200/_shutdown'
    3. curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
    4. curl -XPOST 'http://localhost:9200/_cluster/nodes/nodeId1,nodeId2/_shutdown'
    5. curl -XPOST 'http://localhost:9200/_cluster/nodes/_master/_shutdown'
  6. Configuration

    1. config/elasticsearch.yml
      1. cluster.name
      2. node.name
      3. node.master
      4. node.data
      5. path.*
        1. path.conf: -Des.path.conf
        2. path.data
        3. path.work
        4. path.logs
      6. discovery.zen.ping.multicast.enabled: false
      7. discovery.zen.ping.unicast.hosts
      8. gateway.recover_after_nodes: n
      9. discovery.zen.minimum_master_nodes: (n/2) + 1
      10. action.disable_delete_all_indices: true
      11. action.auto_create_index: false
      12. action.destructive_requires_name: true
      13. index.mapper.dynamic: false
      14. script.disable_dynamic: true
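      The quorum in item 9 is integer division; a quick shell sketch (cluster sizes illustrative) shows what (n/2) + 1 gives:

      ```shell
      # minimum_master_nodes should be a quorum of master-eligible nodes:
      # quorum = floor(n / 2) + 1, which prevents split-brain
      for N in 2 3 4 5 10; do
        echo "$N master-eligible nodes -> quorum $(( N / 2 + 1 ))"
      done
      # e.g. 3 nodes -> quorum 2, 10 nodes -> quorum 6
      ```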
    2. dynamic
      1. discovery.zen.minimum_master_nodes
        curl -XPUT localhost:9200/_cluster/settings -d '{
          "persistent" : {
            "discovery.zen.minimum_master_nodes" : (n/2) + 1
          }
        }'
      2. disable _all
        PUT /my_index/_mapping/my_type
        {
            "my_type": {
                "_all": { "enabled": false }
            }
        }
      3. include_in_all
        PUT /my_index/my_type/_mapping
        {
            "my_type": {
                "include_in_all": false,
                "properties": {
                    "title": {
                        "type":           "string",
                        "include_in_all": true
                    },
                    ...
                }
            }
        }
      4. _alias, _aliases
        1. PUT /my_index_v1 
          PUT /my_index_v1/_alias/my_index
        2. POST /_aliases
          {
              "actions": [
                  { "remove": { "index": "my_index_v1", "alias": "my_index" }},
                  { "add":    { "index": "my_index_v2", "alias": "my_index" }}
              ]
          }
      5. refresh_interval (bulk indexing)
        1. PUT /my_logs
          {
            "settings": {
              "refresh_interval": "30s" 
            }
          }
        2. POST /my_logs/_settings
          { "refresh_interval": -1 } 
          
          POST /my_logs/_settings
          { "refresh_interval": "1s" } 
      6. flush
        1. POST /blogs/_flush 
          
          POST /_flush?wait_for_ongoing
      7. optimize
        1. POST /logstash-old-index/_optimize?max_num_segments=1
      8. field length norm (disable for logging)
        1. PUT /my_index
          {
            "mappings": {
              "doc": {
                "properties": {
                  "text": {
                    "type": "string",
                    "norms": { "enabled": false } 
                  }
                }
              }
            }
          }
      9. tune cluster and index recovery settings (test the value)
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.node_initial_primary_recoveries":25}}'
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.node_concurrent_recoveries":5}}'
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"indices.recovery.max_bytes_per_sec":"100mb"}}'
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"indices.recovery.concurrent_streams":20}}'
    3. logging.yml
      1. use node.name instead of cluster.name
        file: ${path.logs}/${node.name}.log
    4. elasticsearch.in.sh
      1. disable HeapDumpOnOutOfMemoryError
        #JAVA_OPTS="$JAVA_OPTS -XX:+HeapDumpOnOutOfMemoryError"
    5. ES_HEAP_SIZE: 50%
    6. heaps < 32GB
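    Items 5 and 6 combine into one rule: give ES half the machine's RAM, but cap the heap just under 32GB so the JVM keeps compressed object pointers. A sketch, assuming a 64GB machine:

    ```shell
    RAM_GB=64                        # assumed machine RAM
    HEAP_GB=$(( RAM_GB / 2 ))        # rule of thumb: 50% of RAM for the ES heap
    if [ "$HEAP_GB" -gt 31 ]; then   # stay below ~32GB (compressed oops threshold)
      HEAP_GB=31
    fi
    export ES_HEAP_SIZE="${HEAP_GB}g"
    echo "$ES_HEAP_SIZE"             # -> 31g
    ```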
    7. no swap
      1. bootstrap.mlockall: true
      2. ulimit -l unlimited
    8. thread pools
      1. thread pool size
        1. search - 3 * # of processors (3 * 64 = 192)
        2. index - 2 * # of processors (2 * 64 = 128)
        3. bulk - 3 * # of processors (3 * 64 = 192)
      2. queues - set queue_size to -1 (unbounded) to prevent rejections from ES
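      With 64 processors, the sizing above lands in elasticsearch.yml roughly as follows (threadpool.* are the 1.x setting names; -1 makes a queue unbounded; adjust the multiplied core count for your hardware):

      ```yml
      threadpool.search.size: 192        # 3 * 64
      threadpool.search.queue_size: -1
      threadpool.index.size: 128         # 2 * 64
      threadpool.index.queue_size: -1
      threadpool.bulk.size: 192          # 3 * 64
      threadpool.bulk.queue_size: -1
      ```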
    9. buffers
      1. increase the indexing buffer (indices.memory.index_buffer_size) to 40%
    10. dynamic node.name
      1. startup script
        export ES_NODENAME=`hostname -s`
      2. elasticsearch.yml
        node.name: "${ES_NODENAME}"
  7. Hardware
    1. CPU
      1. core
    2. disk
      1. SSD
        1. noop / deadline scheduler
        2. better IOPS
        3. cheaper in terms of IOPS
        4. manufacturing tolerance can vary
      2. RAID
        1. do not necessarily need
        2. ES handles redundancy
  8. Monitoring

    1. curl 'localhost:9200/_cluster/health'
    2. curl 'localhost:9200/_nodes/process'
      1. max_file_descriptors: 30000?
    3. curl 'localhost:9200/_nodes/jvm'
      1. version
      2. mem.heap_max
    4. curl 'localhost:9200/_nodes/jvm/stats'
      1. heap_used
    5. curl 'localhost:9200/_nodes/indices/stats'
      1. fielddata
    6. curl 'localhost:9200/_nodes/indices/stats?fields=created_on'
      1. fields
    7. curl 'localhost:9200/_nodes/http/stats'
      1. http
    8. GET /_stats/fielddata?fields=*
    9. GET /_nodes/stats/indices/fielddata?fields=*
    10. GET /_nodes/stats/indices/fielddata?level=indices&fields=*
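    For scripting against these endpoints, the status field can be pulled out with plain grep/cut. A sketch with a sample 1.x health response inlined so it runs without a live cluster; in practice replace HEALTH with the output of `curl -s 'localhost:9200/_cluster/health'`:

    ```shell
    # sample _cluster/health response (assumed 1.x shape)
    HEALTH='{"cluster_name":"elasticsearch","status":"yellow","timed_out":false,"number_of_nodes":1}'
    # extract the value of "status" without a JSON parser
    STATUS=$(echo "$HEALTH" | grep -o '"status":"[a-z]*"' | cut -d'"' -f4)
    echo "$STATUS"                                        # -> yellow
    [ "$STATUS" = "green" ] || echo "cluster is $STATUS -- investigate"
    ```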
  9. Scenario

    1. adding nodes
      1. disable allocation to stop shard shuffling until ready
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":true}}'
      2. increase speed of transfers
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"indices.recovery.concurrent_streams":6,"indices.recovery.max_bytes_per_sec":"50mb"}}'
      3. start new nodes
      4. enable allocation
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":false}}'
    2. removing nodes
      1. exclude the nodes from the cluster; this tells ES to move shards off them
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude._name":"node-05*,node-06*"}}'
      2. increase speed of transfers
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"indices.recovery.concurrent_streams":6,"indices.recovery.max_bytes_per_sec":"50mb"}}'
      3. shutdown old nodes after all shards move off
        curl -XPOST 'localhost:9200/_cluster/nodes/node-05*,node-06*/_shutdown'
    3. upgrades / node restarts
      1. disable auto balancing if doing rolling restarts
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":true}}'
      2. restart
      3. re-enable auto balancing
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":false}}'
    4. re / bulk indexing
      1. set replicas to 0
      2. increase after completion
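      number_of_replicas is a dynamic index setting, so the toggle is two calls (index name my_logs is an assumption):

      ```shell
      # before the bulk load: each document is indexed only once
      curl -XPUT 'localhost:9200/my_logs/_settings' -d '{ "number_of_replicas": 0 }'
      # ... run the bulk / re-indexing job ...
      # after: restoring replicas copies segment files over the network,
      # which is far cheaper than re-indexing every document on the replica
      curl -XPUT 'localhost:9200/my_logs/_settings' -d '{ "number_of_replicas": 1 }'
      ```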
  10. Restoration
    1. snapshot
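    In 1.x this means the snapshot/restore API: register a repository once, then snapshot into it and restore from it. A sketch in the same request style as above (repository name and filesystem path are assumptions):

    ```
    PUT /_snapshot/my_backup
    {
        "type": "fs",
        "settings": { "location": "/mount/backups/my_backup" }
    }

    PUT /_snapshot/my_backup/snapshot_1?wait_for_completion=true

    POST /_snapshot/my_backup/snapshot_1/_restore
    ```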
  11. Reference

    1. Elasticsearch - The Definitive Guide
    2. http://www.elasticsearch.org/webinars/elasticsearch-pre-flight-checklist/
    3. http://www.elasticsearch.org/webinars/elk-stack-devops-environment/
    4. http://www.elasticsearch.org/videos/moloch-elasticsearch-powering-network-forensics-aol/
    5. http://www.elasticsearch.org/videos/elastic-searching-big-data/
