Friday, January 2, 2015

elasticsearch-hadoop

  1. Introduction
    1. Elasticsearch for Apache Hadoop is an open-source, stand-alone, self-contained, small library that allows Hadoop jobs (whether using Map/Reduce or libraries built upon it such as Hive, Pig or Cascading) to interact with Elasticsearch
    2. Data flows bi-directionally, so applications can transparently leverage the Elasticsearch engine to enrich their capabilities and improve performance
    3. Elasticsearch for Apache Hadoop offers first-class support for vanilla Map/Reduce, Cascading, Pig and Hive so that using Elasticsearch is literally like using resources within the Hadoop cluster
  2. Key features
    1. Scalable Map/Reduce model 
      1. elasticsearch-hadoop is built around Map/Reduce
      2. every operation done in elasticsearch-hadoop results in multiple Hadoop tasks (based on the number of target shards) that interact with Elasticsearch in parallel
    2. REST based 
      1. elasticsearch-hadoop uses the Elasticsearch REST interface for communication, allowing for flexible deployments by minimizing the number of ports that need to be open within a network
    3. Self contained 
      1. the library has been designed to be small and efficient
      2. At around 300KB and with no extra dependencies outside Hadoop itself, distributing elasticsearch-hadoop within your cluster is simple and fast
    4. Universal jar 
      1. whether you are using Hadoop 1.x or Hadoop 2.x, vanilla Apache Hadoop or a specific distribution, the same elasticsearch-hadoop jar works transparently across all of them
    5. Memory and I/O efficient 
      1. elasticsearch-hadoop is focused on performance
      2. From pull-based parsing to bulk updates and direct conversion to/from native types, elasticsearch-hadoop keeps its memory and network I/O usage finely tuned
    6. Adaptive I/O 
      1. elasticsearch-hadoop detects transport errors and retries automatically
      2. If an Elasticsearch node dies, it re-routes the request to the available nodes (which are discovered automatically)
      3. Additionally, if Elasticsearch is overloaded, elasticsearch-hadoop detects the rejected data and resends it until it is either processed or the user-defined policy applies
    7. Facilitates data co-location 
      1. elasticsearch-hadoop fully integrates with Hadoop, exposing its network access information so that co-located Elasticsearch and Hadoop clusters can be aware of each other and reduce network I/O
      2. Map/Reduce API support: at its core, elasticsearch-hadoop uses the low-level Map/Reduce API to read and write data to Elasticsearch, allowing for maximum integration flexibility and performance
    8. Old (mapred) and new (mapreduce) Map/Reduce APIs supported
      1. elasticsearch-hadoop automatically adjusts to your environment
      2. there is no need to switch between the mapred and mapreduce APIs: both are supported, by the same classes, at the same time
    9. Hive support 
      1. Run Hive queries against Elasticsearch for advanced analytics and real-time responses
      2. elasticsearch-hadoop exposes Elasticsearch as a Hive table so your scripts can crunch through data faster than ever
    10. Pig support 
      1. elasticsearch-hadoop supports Apache Pig, exposing Elasticsearch as a native Pig Storage
      2. Run your Pig scripts against Elasticsearch without any modifications to your configuration or the Pig client
    11. Cascading support 
      1. Cascading is an application framework that lets Java developers easily build robust applications on Apache Hadoop
      2. With elasticsearch-hadoop, Cascading can run its flows directly against Elasticsearch
  3. Requirements
    1. JDK
      1. JDK 6.0 update 25
      2. JDK 7.0 update 55
    2. Hadoop
      1. Apache Hadoop
      2. Cloudera CDH
      3. Hortonworks HDP
      4. MapR
  4. Installation
    1. http://www.elasticsearch.org/overview/hadoop/download/
    2. download the zip file
    3. verify its SHA-1 checksum
    4. unzip it
  5. Reference
    1. http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/index.html
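
The checksum step under Installation can be scripted instead of compared by eye. The following is only a minimal sketch in Python using the standard hashlib module; the helper names sha1_of_file and matches_published_sha1 are made up for this post, and it assumes the published SHA-1 value has been copied by hand from the download page (either as a bare hex digest or as a "digest  filename" line).

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large archives are never loaded whole.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_sha1(path, published):
    """Compare the file's digest with a published value.

    `published` may be a bare hex digest or a "digest  filename" line;
    only the first whitespace-separated token is used (an assumption
    about how the value was copied, not a documented format).
    """
    tokens = published.split()
    if not tokens:
        return False
    return sha1_of_file(path) == tokens[0].lower()
```

For example, matches_published_sha1("elasticsearch-hadoop.zip", "<digest from the download page>") should return True before you unzip the archive; the file name here is a placeholder, so use whatever name the download actually has.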
