Date posted: 28-Nov-2014
Uploaded by: swapnil-dubey
Apache Flume: Loading Big Data into a Hadoop Cluster using Flume
Swapnil Dubey, Big Data Hacker
GoDataDriven
Agenda
➢ What is Apache Flume?
➢ Problem statement
➢ Use case: collecting web server logs
➢ Overview/architecture of Flume
➢ Demos
What is Flume?
Collection and aggregation of streaming data, typically used for log data.

Advantages over other solutions:
➢ Scalable, Reliable, Customizable
➢ Declarative and Dynamic Configuration
➢ Contextual Routing
➢ Feature Rich and Fully Extensible
➢ Open Source
Collecting web server logs

➢ Collecting web logs using:
 - A single Flume agent
 - Multiple Flume agents
➢ Typical converging flow:
 - Converging flow characteristics: Load Balancing, Multiplexing, Failover
 - Large converging flows
 - Event volume
Problem Statement: Single Flume Agent

[Diagram: application servers each produce logs that are collected by a single Flume agent, which performs the HDFS writes into the Hadoop cluster]
Problem Statement: Multiple Flume Agents - 1

[Diagram: each application server runs its own Flume agent, and every agent writes its logs directly to HDFS on the Hadoop cluster]
Problem Statement: Multiple Flume Agents - 2

[Diagram: the Flume agents on the application servers forward their events to a single aggregating Flume agent, which alone performs the HDFS writes]
Core Concepts
➢ Events
➢ Client
➢ Agents
 - Source, Channel, Sink
 - Interceptor
 - Channel Selector
 - Sink Processor
Core Concept: Event

An Event is the basic unit of data transported by Flume from source to destination.
➢ The payload is opaque to Flume.
➢ Events are accompanied by optional headers.

Headers:
 - Headers are a collection of unique key-value pairs
 - Headers are used for contextual routing
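The event model above can be sketched with plain Java: an opaque byte payload plus a string-to-string header map. This is an illustrative stand-in, not Flume's actual `Event` interface; the header names used here are examples.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Minimal model of a Flume-style event: an opaque byte[] payload
// plus optional string headers used for contextual routing.
public class SimpleEvent {
    private final byte[] body;
    private final Map<String, String> headers = new HashMap<>();

    public SimpleEvent(byte[] body) { this.body = body; }

    public byte[] getBody() { return body; }
    public Map<String, String> getHeaders() { return headers; }

    public static void main(String[] args) {
        SimpleEvent e = new SimpleEvent(
            "GET /index.html 200".getBytes(StandardCharsets.UTF_8));
        // Headers are key-value pairs; Flume itself never inspects the body.
        e.getHeaders().put("timestamp", "1385634600000");
        e.getHeaders().put("host", "web-01");
        System.out.println(e.getHeaders().get("host"));   // header used for routing
        System.out.println(new String(e.getBody(), StandardCharsets.UTF_8));
    }
}
```

Because the payload is opaque, routing decisions (next slide sections) can only look at the headers, never the body.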
Core Concept: Client
Entity that generates events and passes them to one or more agents.

➢ Example: Flume log4j appender
➢ Decouples Flume from the system where event data is generated.
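As a concrete example of the log4j appender client mentioned above, a `log4j.properties` fragment along these lines points application logging at a Flume agent's Avro source (the hostname and port here are assumptions for illustration):

```properties
# Hypothetical log4j.properties fragment: ship application logs to a
# local Flume agent listening with an Avro source on port 41414.
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = localhost
log4j.appender.flume.Port = 41414
log4j.rootLogger = INFO, flume
```

The application keeps using plain log4j calls; only the configuration knows about Flume, which is exactly the decoupling the slide describes.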
Core Concepts: Source
Component that receives events and places them onto one or more channels.
➢ Different types of sources:
 - Specialized sources for integrating with well-known systems, e.g. Syslog, Netcat
 - Auto-generating sources: Exec, SEQ
 - IPC sources for agent-to-agent communication: Avro, Thrift
➢ Requires at least one channel to function.
Core Concept: Channel
Component that buffers incoming events until they are consumed by sinks.

➢ Different channels: Memory, File, Database
➢ Channels are fully transactional.
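The transactional behavior of a channel can be sketched as follows: a take is only final once committed, and a rollback puts the event back. This is a simplified single-threaded model for illustration, not Flume's actual channel implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a transactional in-memory channel: a taken event is held
// in flight until commit; rollback returns it to the front of the
// buffer, so a failed delivery never loses the event.
public class MemoryChannelSketch {
    private final Deque<String> buffer = new ArrayDeque<>();
    private final int capacity;
    private String inFlight;   // event taken but not yet committed

    public MemoryChannelSketch(int capacity) { this.capacity = capacity; }

    public boolean put(String event) {
        if (buffer.size() >= capacity) return false;  // channel full
        buffer.addLast(event);
        return true;
    }

    public String take() {                 // begin the take transaction
        inFlight = buffer.pollFirst();
        return inFlight;
    }

    public void commit() { inFlight = null; }  // event permanently removed

    public void rollback() {               // delivery failed: restore event
        if (inFlight != null) buffer.addFirst(inFlight);
        inFlight = null;
    }

    public int size() { return buffer.size(); }

    public static void main(String[] args) {
        MemoryChannelSketch ch = new MemoryChannelSketch(10000);
        ch.put("GET /index.html 200");
        ch.take();
        ch.rollback();                 // e.g. the HDFS write failed
        System.out.println(ch.size()); // the event is back in the channel
    }
}
```

This is the property the later "Assured Delivery" slide relies on: events leave a channel only when the consuming transaction commits.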
Core Concepts: Sink
Component that takes events from a channel and transmits them to the next-hop destination.

➢ Different types of sinks:
 - Terminal sinks: HDFS, HBase
 - Auto-consuming sinks: Null sink
 - IPC sinks for agent-to-agent communication: Avro, Thrift
Core Concepts: Interceptor

Interceptors are applied to sources in a predetermined order to add information to events or to filter them out.

➢ Built-in interceptors: allow adding headers such as timestamps, static markers, etc.
➢ Custom interceptors: create headers by inspecting the Event.
Channel Selector
It facilitates selection of one or more Channels, based on preset criteria.
➢ Built-in Channel Selectors:
 - Replicating: for duplicating events
 - Multiplexing: for routing based on headers.
➢ Custom selectors can be written for dynamic criteria.
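The multiplexing behavior described above amounts to a lookup from a header value to a list of channels, with a default route for unmatched values. A minimal sketch, with illustrative header and channel names:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a multiplexing channel selector: the value of one chosen
// header decides which channels receive the event; unmatched values
// fall through to a default channel list.
public class MultiplexingSelectorSketch {
    private final String headerKey;
    private final Map<String, List<String>> routes = new HashMap<>();
    private final List<String> defaultChannels;

    public MultiplexingSelectorSketch(String headerKey, List<String> defaultChannels) {
        this.headerKey = headerKey;
        this.defaultChannels = defaultChannels;
    }

    public void mapValue(String value, List<String> channels) {
        routes.put(value, channels);
    }

    public List<String> select(Map<String, String> headers) {
        return routes.getOrDefault(headers.get(headerKey), defaultChannels);
    }

    public static void main(String[] args) {
        MultiplexingSelectorSketch sel =
            new MultiplexingSelectorSketch("host", List.of("defaultChannel"));
        sel.mapValue("web-01", List.of("mchannel1", "mchannel2"));
        System.out.println(sel.select(Map.of("host", "web-01")));
    }
}
```

A replicating selector is the degenerate case: every event maps to all configured channels regardless of headers.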
Sink Processor
Sink Processor is responsible for invoking one sink from a specified group of Sinks.
➢ Built-in Sink Processors:
 - Load Balancing Sink Processor
 - Failover Sink Processor
 - Default Sink Processor
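The failover variant can be sketched as trying sinks in priority order until one succeeds. Sinks are modeled here as a name plus a health check; this is a simplification of the real processor, which also tracks cool-down periods for failed sinks.

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch of a failover sink processor: sinks are tried in priority
// order and the first healthy one handles the event.
public class FailoverSinkProcessorSketch {
    public static String process(String event, List<String> sinkNames,
                                 Predicate<String> isHealthy) {
        for (String sink : sinkNames) {
            if (isHealthy.test(sink)) {
                return sink;   // this sink delivers the event
            }
        }
        throw new IllegalStateException("all sinks failed");
    }

    public static void main(String[] args) {
        // Simulate the primary being down: only "backup" accepts events.
        String chosen = process("event-1", List.of("primary", "backup"),
                                sink -> sink.equals("backup"));
        System.out.println(chosen);
    }
}
```

A load-balancing processor differs only in the selection rule (round-robin or random over the healthy sinks) rather than strict priority order.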
Data Ingest

[Diagram: clients deliver events to the source; the channel processor runs each event through the interceptor chain, which may filter events or add headers, and the channel selector then decides which channel(s) receive each event]
Data Drain
➢ Event Removal from Channel is transactional.
[Diagram: the sink runner drives the sink processor, which selects and invokes one sink; the selected sink takes events from the channel and sends them to the next hop]
Agent Pipeline
* Credits: http://archive.apachecon.com/na2013/presentations/27-Wednesday/Big_Data/11:45-Mastering_Sqoop_for_Data_Transfer_for_Big_Data-Arvind_Prabhakar/Arvind%20Prabhakar%20-%20Planning%20and%20Deploying%20Apache%20Flume.pdf
Assured Delivery
Agents use transactional exchange to guarantee delivery across hops.
[Diagram: on each hop the sending sink starts a transaction, takes events from its channel, and sends them; the receiving source stores the events in its own channel inside its own transaction, and both transactions end only once the handoff has succeeded]
Setting up a simple agent for HDFS

agent.sources = netcat-collect
agent.sinks = hdfs-write
agent.channels = memoryChannel

agent.sources.netcat-collect.type = netcat
agent.sources.netcat-collect.bind = 127.0.0.1
agent.sources.netcat-collect.port = 11111

agent.sinks.hdfs-write.type = hdfs
agent.sinks.hdfs-write.hdfs.path = hdfs://namenode_address:8020/path/to/flume_test
agent.sinks.hdfs-write.hdfs.rollInterval = 30
agent.sinks.hdfs-write.hdfs.writeFormat = Text
agent.sinks.hdfs-write.hdfs.fileType = DataStream

agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000

agent.sources.netcat-collect.channels = memoryChannel
agent.sinks.hdfs-write.channel = memoryChannel
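To try the configuration above, it can be saved to a file and launched with the `flume-ng` command; the agent name passed with `--name` must match the `agent` prefix used in the properties (the file name `example.conf` is an assumption):

```shell
# Start the agent defined above (name must match the "agent" prefix):
flume-ng agent --conf conf --conf-file example.conf --name agent \
    -Dflume.root.logger=INFO,console

# In another terminal, send a test event to the netcat source:
echo "hello flume" | nc 127.0.0.1 11111
```

Each line sent to port 11111 becomes one event, lands in the memory channel, and is rolled into a file under the configured HDFS path roughly every 30 seconds.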
Advanced Features
Fan-In and Fan-Out

hdfs-agent.channels = mchannel1 mchannel2
hdfs-agent.sources.netcat-collect.selector.type = replicating
hdfs-agent.sources.netcat-collect.channels = mchannel1 mchannel2
Interceptors

hdfs-agent.sources.netcat-collect.interceptors = filt_int
hdfs-agent.sources.netcat-collect.interceptors.filt_int.type = regex_filter
hdfs-agent.sources.netcat-collect.interceptors.filt_int.regex = ^echo.*
hdfs-agent.sources.netcat-collect.interceptors.filt_int.excludeEvents = true
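The effect of the `regex_filter` configuration above can be sketched in a few lines: with `excludeEvents=true`, any event whose body matches the pattern is dropped. This models the observable behavior rather than reproducing Flume's interceptor code.

```java
import java.util.regex.Pattern;

// Sketch of regex_filter semantics: with excludeEvents=true, events
// whose body matches the pattern are dropped; otherwise only
// matching events are kept.
public class RegexFilterSketch {
    private final Pattern pattern;
    private final boolean excludeEvents;

    public RegexFilterSketch(String regex, boolean excludeEvents) {
        this.pattern = Pattern.compile(regex);
        this.excludeEvents = excludeEvents;
    }

    /** Returns true if the event should be kept. */
    public boolean accept(String body) {
        boolean matches = pattern.matcher(body).find();
        return excludeEvents ? !matches : matches;
    }

    public static void main(String[] args) {
        RegexFilterSketch filter = new RegexFilterSketch("^echo.*", true);
        System.out.println(filter.accept("echo test"));   // dropped
        System.out.println(filter.accept("access log"));  // kept
    }
}
```

With the configuration above, lines typed into the netcat source that begin with `echo` never reach the channel, while everything else passes through to HDFS.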