Wednesday, December 31, 2014

Elasticsearch

  1. Introduction

    1. Elasticsearch is an open-source search engine built on top of Apache Lucene™, a full-text search engine library
    2. Elasticsearch is a real-time distributed search and analytics engine
    3. It allows you to explore your data at a speed and at a scale never before possible
    4. It is used for full text search, structured search, analytics, and all three in combination

  2. Why ES

    1. what traditional databases can't do
      1. Unfortunately, most databases are astonishingly inept at extracting actionable knowledge from your data
      2. Can they perform full-text search, handle synonyms and score documents by relevance?
      3. Can they generate analytics and aggregations from the same data?
      4. Most importantly, can they do this in real-time without big batch processing jobs?
    2. what Elasticsearch provides
      1. A distributed real-time document store where every field is indexed and searchable
      2. A distributed search engine with real-time analytics
      3. Capable of scaling to hundreds of servers and petabytes of structured and unstructured data

  3. Installation

    1. curl -L -O https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.3.2.zip
    2. unzip elasticsearch-1.3.2.zip
    3. cd elasticsearch-1.3.2
    4. ./bin/elasticsearch -d
  4. Cluster
    1. 3 lower-resource master-eligible nodes in large clusters
    2. lightweight client nodes
    3. bare metal is more configurable
    4. bare metal can utilize SSDs
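    The node roles above are declared per node in elasticsearch.yml via two flags; a minimal sketch of the three flavors (the usual convention, adjust per node):

    ```yml
    # dedicated master-eligible node: holds cluster state only, no data
    node.master: true
    node.data: false

    # lightweight client node: routes requests, holds no data, never master
    node.master: false
    node.data: false

    # regular data node
    node.master: false
    node.data: true
    ```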
  5. Commands

    1. curl 'http://localhost:9200/?pretty'
    2. curl -XPOST 'http://localhost:9200/_shutdown'
    3. curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
    4. curl -XPOST 'http://localhost:9200/_cluster/nodes/nodeId1,nodeId2/_shutdown'
    5. curl -XPOST 'http://localhost:9200/_cluster/nodes/_master/_shutdown'
  6. Configuration

    1. config/elasticsearch.yml
      1. cluster.name
      2. node.name
      3. node.master
      4. node.data
      5. path.*
        1. path.conf: -Des.path.conf
        2. path.data
        3. path.work
        4. path.logs
      6. discovery.zen.ping.multicast.enabled: false
      7. discovery.zen.ping.unicast.hosts
      8. gateway.recover_after_nodes: n
      9. discovery.zen.minimum_master_nodes: (n/2) + 1
      10. action.disable_delete_all_indices: true
      11. action.auto_create_index: false
      12. action.destructive_requires_name: true
      13. index.mapper.dynamic: false
      14. script.disable_dynamic: true
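      The quorum in item 9 is integer division; a quick shell sketch (cluster sizes illustrative) shows what (n/2) + 1 gives:

      ```shell
      # minimum_master_nodes should be a quorum of master-eligible nodes:
      # quorum = floor(n / 2) + 1, which prevents split-brain
      for N in 2 3 4 5 10; do
        echo "$N master-eligible nodes -> quorum $(( N / 2 + 1 ))"
      done
      # e.g. 3 nodes -> quorum 2, 10 nodes -> quorum 6
      ```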
    2. dynamic
      1. discovery.zen.minimum_master_nodes
        curl -XPUT localhost:9200/_cluster/settings -d '{
          "persistent" : {
            "discovery.zen.minimum_master_nodes" : (n/2) + 1
          }
        }'
      2. disable _all
        PUT /my_index/_mapping/my_type
        {
            "my_type": {
                "_all": { "enabled": false }
            }
        }
      3. include_in_all
        PUT /my_index/my_type/_mapping
        {
            "my_type": {
                "include_in_all": false,
                "properties": {
                    "title": {
                        "type":           "string",
                        "include_in_all": true
                    },
                    ...
                }
            }
        }
      4. _alias, _aliases
        1. PUT /my_index_v1 
          PUT /my_index_v1/_alias/my_index
        2. POST /_aliases
          {
              "actions": [
                  { "remove": { "index": "my_index_v1", "alias": "my_index" }},
                  { "add":    { "index": "my_index_v2", "alias": "my_index" }}
              ]
          }
      5. refresh_interval (bulk indexing)
        1. PUT /my_logs
          {
            "settings": {
              "refresh_interval": "30s" 
            }
          }
        2. POST /my_logs/_settings
          { "refresh_interval": -1 } 
          
          POST /my_logs/_settings
          { "refresh_interval": "1s" } 
      6. flush
        1. POST /blogs/_flush 
          
          POST /_flush?wait_for_ongoing
      7. optimize
        1. POST /logstash-old-index/_optimize?max_num_segments=1
      8. field length norm (disable for logging)
        1. PUT /my_index
          {
            "mappings": {
              "doc": {
                "properties": {
                  "text": {
                    "type": "string",
                    "norms": { "enabled": false } 
                  }
                }
              }
            }
          }
      9. tune cluster and index recovery settings (test the value)
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.node_initial_primary_recoveries":25}}'
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.node_concurrent_recoveries":5}}'
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"indices.recovery.max_bytes_per_sec":"100mb"}}'
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"indices.recovery.concurrent_streams":20}}'
    3. logging.yml
      1. use node.name instead of cluster.name
        file: ${path.logs}/${node.name}.log
    4. elasticsearch.in.sh
      1. disable HeapDumpOnOutOfMemoryError
        #JAVA_OPTS="$JAVA_OPTS -XX:+HeapDumpOnOutOfMemoryError"
    5. ES_HEAP_SIZE: 50%
    6. heaps < 32GB
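    Items 5 and 6 combine into one rule: give ES half the machine's RAM, but cap the heap just under 32GB so the JVM keeps compressed object pointers. A sketch, assuming a 64GB machine:

    ```shell
    RAM_GB=64                        # assumed machine RAM
    HEAP_GB=$(( RAM_GB / 2 ))        # rule of thumb: 50% of RAM for the ES heap
    if [ "$HEAP_GB" -gt 31 ]; then   # stay below ~32GB (compressed oops threshold)
      HEAP_GB=31
    fi
    export ES_HEAP_SIZE="${HEAP_GB}g"
    echo "$ES_HEAP_SIZE"             # -> 31g
    ```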
    7. no swap
      1. bootstrap.mlockall: true
      2. ulimit -l unlimited
    8. thread pools
      1. thread pool size
        1. search - 3 * # of processors (3 * 64 = 192)
        2. index - 2 * # of processors (2 * 64 = 128)
        3. bulk - 3 * # of processors (3 * 64 = 192)
      2. queues - set queue_size to -1 (unbounded) to prevent rejections from ES
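      With 64 processors, the sizing above lands in elasticsearch.yml roughly as follows (threadpool.* are the 1.x setting names; -1 makes a queue unbounded; adjust the multiplied core count for your hardware):

      ```yml
      threadpool.search.size: 192        # 3 * 64
      threadpool.search.queue_size: -1
      threadpool.index.size: 128         # 2 * 64
      threadpool.index.queue_size: -1
      threadpool.bulk.size: 192          # 3 * 64
      threadpool.bulk.queue_size: -1
      ```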
    9. buffers
      1. increase the indexing buffer (indices.memory.index_buffer_size) to 40%
    10. dynamic node.name
      1. startup script
        export ES_NODENAME=`hostname -s`
      2. elasticsearch.yml
        node.name: "${ES_NODENAME}"
  7. Hardware
    1. CPU
      1. core
    2. disk
      1. SSD
        1. noop / deadline scheduler
        2. better IOPS
        3. cheaper in terms of IOPS
        4. manufacturing tolerance can vary
      2. RAID
        1. do not necessarily need
        2. ES handles redundancy
  8. Monitoring

    1. curl 'localhost:9200/_cluster/health'
    2. curl 'localhost:9200/_nodes/process'
      1. max_file_descriptors: 30000?
    3. curl 'localhost:9200/_nodes/jvm'
      1. version
      2. mem.heap_max
    4. curl 'localhost:9200/_nodes/jvm/stats'
      1. heap_used
    5. curl 'localhost:9200/_nodes/indices/stats'
      1. fielddata
    6. curl 'localhost:9200/_nodes/indices/stats?fields=created_on'
      1. fields
    7. curl 'localhost:9200/_nodes/http/stats'
      1. http
    8. GET /_stats/fielddata?fields=*
    9. GET /_nodes/stats/indices/fielddata?fields=*
    10. GET /_nodes/stats/indices/fielddata?level=indices&fields=*
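    For scripting against these endpoints, the status field can be pulled out with plain grep/cut. A sketch with a sample 1.x health response inlined so it runs without a live cluster; in practice replace HEALTH with the output of `curl -s 'localhost:9200/_cluster/health'`:

    ```shell
    # sample _cluster/health response (assumed 1.x shape)
    HEALTH='{"cluster_name":"elasticsearch","status":"yellow","timed_out":false,"number_of_nodes":1}'
    # extract the value of "status" without a JSON parser
    STATUS=$(echo "$HEALTH" | grep -o '"status":"[a-z]*"' | cut -d'"' -f4)
    echo "$STATUS"                                        # -> yellow
    [ "$STATUS" = "green" ] || echo "cluster is $STATUS -- investigate"
    ```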
  9. Scenario

    1. adding nodes
      1. disable allocation to stop shard shuffling until ready
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":true}}'
      2. increase speed of transfers
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"indices.recovery.concurrent_streams":6,"indices.recovery.max_bytes_per_sec":"50mb"}}'
      3. start new nodes
      4. enable allocation
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":false}}'
    2. removing nodes
      1. exclude the nodes from the cluster; this tells ES to move shards off them
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude._name":"node-05*,node-06*"}}'
      2. increase speed of transfers
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"indices.recovery.concurrent_streams":6,"indices.recovery.max_bytes_per_sec":"50mb"}}'
      3. shutdown old nodes after all shards move off
        curl -XPOST 'localhost:9200/_cluster/nodes/node-05*,node-06*/_shutdown'
    3. upgrades / node restarts
      1. disable auto balancing if doing rolling restarts
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":true}}'
      2. restart
      3. re-enable auto balancing
        curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation":false}}'
    4. re / bulk indexing
      1. set replicas to 0
      2. increase after completion
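      number_of_replicas is a dynamic index setting, so the toggle is two calls (index name my_logs is an assumption):

      ```shell
      # before the bulk load: each document is indexed only once
      curl -XPUT 'localhost:9200/my_logs/_settings' -d '{ "number_of_replicas": 0 }'
      # ... run the bulk / re-indexing job ...
      # after: restoring replicas copies segment files over the network,
      # which is far cheaper than re-indexing every document on the replica
      curl -XPUT 'localhost:9200/my_logs/_settings' -d '{ "number_of_replicas": 1 }'
      ```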
  10. Restoration
    1. snapshot
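    In 1.x this means the snapshot/restore API: register a repository once, then snapshot into it and restore from it. A sketch in the same request style as above (repository name and filesystem path are assumptions):

    ```
    PUT /_snapshot/my_backup
    {
        "type": "fs",
        "settings": { "location": "/mount/backups/my_backup" }
    }

    PUT /_snapshot/my_backup/snapshot_1?wait_for_completion=true

    POST /_snapshot/my_backup/snapshot_1/_restore
    ```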
  11. Reference

    1. Elasticsearch - The Definitive Guide
    2. http://www.elasticsearch.org/webinars/elasticsearch-pre-flight-checklist/
    3. http://www.elasticsearch.org/webinars/elk-stack-devops-environment/
    4. http://www.elasticsearch.org/videos/moloch-elasticsearch-powering-network-forensics-aol/
    5. http://www.elasticsearch.org/videos/elastic-searching-big-data/
