Wednesday, December 31, 2014

Apache Flume

  1. Introduction
    1. Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. 
    2. It has a simple and flexible architecture based on streaming data flows. 
    3. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. 
    4. It uses a simple, extensible data model that allows for online analytic applications.
  2. System Requirements
    1. Java Runtime Environment - Java 1.6 or later (Java 1.7 Recommended) 
    2. Memory - Sufficient memory for configurations used by sources, channels or sinks 
    3. Disk Space - Sufficient disk space for configurations used by channels or sinks 
    4. Directory Permissions - Read/Write permissions for directories used by the agent
  3. Features
    1. Complex flows
      1. Flume allows a user to build multi-hop flows where events travel through multiple agents before reaching the final destination. It also allows fan-in and fan-out flows, contextual routing and backup routes (fail-over) for failed hops.
    2. Reliability
      1. The events are staged in a channel on each agent. The events are then delivered to the next agent or terminal repository (like HDFS) in the flow. The events are removed from a channel only after they are stored in the channel of the next agent or in the terminal repository. This is how the single-hop message delivery semantics in Flume provide end-to-end reliability of the flow. Flume uses a transactional approach to guarantee the reliable delivery of the events. Sources and sinks wrap the storage and retrieval of events, respectively, in a transaction provided by the channel. This ensures that the set of events is reliably passed from point to point in the flow. In the case of a multi-hop flow, the sink from the previous hop and the source from the next hop both have their transactions running to ensure that the data is safely stored in the channel of the next hop.
    3. Recoverability
      1. The events are staged in the channel, which manages recovery from failure. Flume supports a durable file channel which is backed by the local file system. There is also a memory channel, which simply stores the events in an in-memory queue. It is faster, but any events still left in the memory channel when the agent process dies cannot be recovered.
  4. Installation
    1. Download
      1. http://flume.apache.org/download.html, or
      2. wget http://apache.mirror.cdnetworks.com/flume/1.5.0.1/apache-flume-1.5.0.1-bin.tar.gz
    2. tar xvf apache-flume-1.5.0.1-bin.tar.gz
    3. cd apache-flume-1.5.0.1-bin
    4. bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template ($agent_name must match the agent name used in the configuration file)
  5. CDH
    1. Installation
      1. CDH5-Installation-Guide.pdf (P.155)
      2. http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_flume_installation.html
    2. Security Configuration
      1. CDH5-Security-Guide.pdf (P.53)
      2. http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Security-Guide/cdh5sg_flume_security.html
  6. CM
    1. Service
      1. Managing-Clusters-with-Cloudera-Manager.pdf (P.49)
      2. http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Managing-Clusters/cm5mc_flume_service.html
    2. Properties: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Configuration-Properties/cm5config_cdh500_flume.html
    3. Health Tests
      1. http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Health-Tests/ht_flume.html
      2. http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Health-Tests/ht_flume_agent.html
    4. Metrics
      1. Metrics: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Metrics/flume_metrics.html
      2. Channel Metrics: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Metrics/flume_channel_metrics.html
      3. Sink Metrics: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Metrics/flume_sink_metrics.html
      4. Source Metrics: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Metrics/flume_source_metrics.html
  7. Reference
    1. http://flume.apache.org/
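
The agent start command in the Installation section needs a configuration file that wires a source, a channel and a sink together. The script below is a minimal sketch of such a file, not a recommended production setup. The agent name a1, the component names r1, c1 and k1, the file name flume-example.properties, the port 44444 and the /tmp/flume-example directories are all assumptions chosen for illustration. It uses the durable file channel described under Recoverability, and it only prints the start command so that it can be run without Flume installed.

```shell
#!/bin/sh
# Write a minimal single-agent Flume configuration.
# All names (a1, r1, c1, k1), paths and the port are illustrative assumptions.
set -eu

CONF_FILE="${CONF_FILE:-flume-example.properties}"  # hypothetical file name
AGENT_NAME="a1"                                       # must match the property prefix below

cat > "$CONF_FILE" <<'EOF'
# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: accept lines of text on a local TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: durable file channel backed by the local file system
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /tmp/flume-example/checkpoint
a1.channels.c1.dataDirs = /tmp/flume-example/data

# Sink: write each event to the agent log
a1.sinks.k1.type = logger

# Wire the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
EOF

# Print (rather than run) the start command, so this works without Flume installed.
echo "bin/flume-ng agent -n $AGENT_NAME -c conf -f $CONF_FILE"
```

Changing a1.channels.c1.type to memory (and dropping the two directory properties) switches to the in-memory channel, trading the recoverability of staged events for speed.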
