- Introduction
- Elasticsearch for Apache Hadoop is an open-source, stand-alone, self-contained, small library that allows Hadoop jobs (whether using Map/Reduce or libraries built upon it such as Hive, Pig or Cascading) to interact with Elasticsearch
- Data flows bi-directionally so that applications can transparently leverage the Elasticsearch engine to significantly enrich their capabilities and increase performance
- Elasticsearch for Apache Hadoop offers first-class support for vanilla Map/Reduce, Cascading, Pig and Hive so that using Elasticsearch is literally like using resources within the Hadoop cluster
- Key features
- Scalable Map/Reduce model
- elasticsearch-hadoop is built around Map/Reduce
- every operation done in elasticsearch-hadoop results in multiple Hadoop tasks (based on the number of target shards) that interact, in parallel, with Elasticsearch
- REST based
- elasticsearch-hadoop uses the Elasticsearch REST interface for communication, allowing for flexible deployments by minimizing the number of ports that need to be open within a network
- Self contained
- the library has been designed to be small and efficient
- At around 300KB and with no extra dependencies outside Hadoop itself, distributing elasticsearch-hadoop within your cluster is simple and fast
- Universal jar
- whether you are using Hadoop 1.x or Hadoop 2.x, vanilla Apache Hadoop or a certain distro, the same elasticsearch-hadoop jar works transparently across all of them
- Memory and I/O efficient
- elasticsearch-hadoop is focused on performance
- From pull-based parsing to bulk updates and direct conversion to/from native types, elasticsearch-hadoop keeps its memory and network I/O usage finely tuned
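To make "bulk updates" concrete, here is a minimal sketch of the newline-delimited body that the Elasticsearch bulk endpoint accepts: one action line followed by one document line per record. The helper name `build_bulk_body` is hypothetical, and the sketch assumes the target index is given in the request URL rather than in each action line. The exact requests elasticsearch-hadoop sends are not described in these notes.

```python
import json


def build_bulk_body(docs):
    """Serialize documents into the newline-delimited body of a bulk request."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))  # action line
        lines.append(json.dumps(doc))            # document source line
    # The bulk format requires a newline after the last line as well.
    return "\n".join(lines) + "\n"


body = build_bulk_body([{"name": "a"}, {"name": "b"}])
```

Sending many documents in one request like this is what keeps the number of network round trips low.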
- Adaptive I/O
- elasticsearch-hadoop detects transport errors and retries automatically
- If an Elasticsearch node dies, it re-routes the request to the available nodes (which are discovered automatically)
- Additionally, if Elasticsearch is overloaded, elasticsearch-hadoop detects the rejected data and resends it until it is either processed or the user-defined retry policy kicks in
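The resend-until-processed behaviour can be sketched as a small loop. This is a simplified illustration under stated assumptions, not the library's implementation: `send_with_retry`, `max_retries` and the fake `overloaded_cluster` are hypothetical names, and the "policy" here is just a fixed retry limit.

```python
def send_with_retry(docs, send_bulk, max_retries=3):
    """Resend rejected documents until all are accepted or retries run out.

    send_bulk(batch) must return the list of documents that were rejected.
    Returns the documents still rejected when the retry limit is reached.
    """
    pending = list(docs)
    attempts = 0
    # One initial send plus at most max_retries resends.
    while pending and attempts <= max_retries:
        pending = send_bulk(pending)
        attempts += 1
    return pending


accepted = []


def overloaded_cluster(batch):
    """Hypothetical cluster that accepts only two documents per request."""
    accepted.extend(batch[:2])
    return batch[2:]


leftover = send_with_retry(["d1", "d2", "d3", "d4", "d5"], overloaded_cluster)
```

Only the rejected documents are sent again on each round, so documents that were already accepted are not duplicated.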
- Facilitates data co-location
- elasticsearch-hadoop fully integrates with Hadoop, exposing its network access information so that co-located Elasticsearch and Hadoop clusters are aware of each other and can reduce network I/O
- Map/Reduce API support
- At its core, elasticsearch-hadoop uses the low-level Map/Reduce API to read and write data to Elasticsearch, allowing for maximum integration flexibility and performance
- Both the old (mapred) and new (mapreduce) Map/Reduce APIs are supported
- elasticsearch-hadoop automatically adjusts to your environment
- one does not have to change between using the mapred or mapreduce APIs - both are supported, by the same classes, at the same time
- Hive support
- Run Hive queries against Elasticsearch for advanced analytics and real-time responses
- elasticsearch-hadoop exposes Elasticsearch as a Hive table so your scripts can crunch through data faster than ever
- Pig support
- elasticsearch-hadoop supports Apache Pig, exposing Elasticsearch as a native Pig Storage
- Run your Pig scripts against Elasticsearch without any modifications to your configuration or the Pig client
- Cascading support
- Cascading is an application framework that lets Java developers easily build robust applications on Apache Hadoop
- With elasticsearch-hadoop, Cascading can run its flows directly against Elasticsearch
- Requirements
- JDK
- JDK 6.0 update 25
- JDK 7.0 update 55
- Hadoop
- Apache Hadoop
- Cloudera CDH
- Hortonworks HDP
- MapR
- Installation
- http://www.elasticsearch.org/overview/hadoop/download/
- download the zip file
- verify its SHA-1 checksum
- unzip it
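The checksum step can be done with any SHA-1 tool. As one option, here is a minimal Python sketch using only the standard library. The helper names `sha1_of_file` and `matches_published_sha1` are hypothetical, and the expected digest must come from the download page, since no value is given in these notes.

```python
import hashlib


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_sha1(path, published):
    """Compare the computed digest with the published one, ignoring case."""
    return sha1_of_file(path) == published.strip().lower()
```

Call `matches_published_sha1` with the path of the downloaded zip and the published digest, and unzip the archive only if it returns `True`.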
- Reference
Friday, January 2, 2015
elasticsearch-hadoop