© Copyright 2014 GetInData. All rights reserved. Not to be reproduced without prior written consent.
Apache Flume
Motivation
You have a lot of servers and systems
■ network devices
■ operating systems
■ web servers
■ applications
They generate a lot of logs and other data
I have a business idea for how to use this data!
You have a Hadoop cluster running
You want to move the logs to Hadoop
Traditional Solutions
■ Own scripts
● Probably a combination of several tools
● Cron or start/stop manually
● Hardcoded or missing configuration
● Tightly coupled with the data that is transferred
Complications
■ High delays
■ Limited manageability
● Compression, encryption, various file formats
● Throughput
● Configuration and monitoring
■ Limited scalability
● Data explosion, failover, load balancing
Apache Flume
■ Aims to solve this problem!
■ It can move large amounts of streaming event data from one place to another
● e.g. from web servers to a Hadoop cluster
Overview
Various systems that constantly generate data in the form of events
Flume Agent
■ Installed on each node
■ Collects events
Interceptor
■ Filters useless events
■ Decorates events by adding metadata
● e.g. timestamp, hostname, UUID, static markers
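As a rough sketch, filtering and decorating could be configured on a source as below. The agent name `agent1`, the source name `src1`, the interceptor labels and the `datacenter` marker are all hypothetical; the interceptor types are the ones shipped with Flume.

```properties
# Hypothetical names: agent1 (agent), src1 (source), i_* (interceptor labels).
agent1.sources.src1.interceptors = i_filter i_ts i_host i_static

# Filter: drop events whose body matches the pattern (here: DEBUG lines).
agent1.sources.src1.interceptors.i_filter.type = regex_filter
agent1.sources.src1.interceptors.i_filter.regex = .*DEBUG.*
agent1.sources.src1.interceptors.i_filter.excludeEvents = true

# Decorate: add a "timestamp" header with the processing time.
agent1.sources.src1.interceptors.i_ts.type = timestamp

# Decorate: add a "host" header with the hostname rather than the IP address.
agent1.sources.src1.interceptors.i_host.type = host
agent1.sources.src1.interceptors.i_host.useIP = false

# Decorate: add a fixed marker header to every event.
agent1.sources.src1.interceptors.i_static.type = static
agent1.sources.src1.interceptors.i_static.key = datacenter
agent1.sources.src1.interceptors.i_static.value = dc1
```

Interceptors run in the order in which they are listed, so the filter is placed first to avoid decorating events that are dropped anyway.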
Encrypt
Encrypts events in a file on disk
Sends events to the next-hop Flume agent
Compression is supported
Events can be multiplexed to multiple agents (to spread the load) or replicated for redundancy, to survive a permanent failure of an agent, disk or node.
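Both behaviours are chosen with a channel selector on the source. The fragment below is only a sketch: the names `agent1`, `src1`, `chA`, `chB` and the header `category` are hypothetical, and each channel is assumed to be drained by a sink that forwards to a different next-hop agent.

```properties
# Hypothetical names: agent1, src1, channels chA / chB, header "category".
agent1.sources.src1.channels = chA chB

# Option 1: multiplexing, route each event by the value of a header.
agent1.sources.src1.selector.type = multiplexing
agent1.sources.src1.selector.header = category
agent1.sources.src1.selector.mapping.A = chA
agent1.sources.src1.selector.mapping.B = chB
agent1.sources.src1.selector.default = chA

# Option 2: replicating (the default), copy every event to all channels.
# agent1.sources.src1.selector.type = replicating
```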
Events can also be delivered in “failover” mode: in case of a failure of the next-hop agent, we try the next Agent(s) on a prioritized list.
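A minimal sketch of such a prioritized list, assuming two already-defined sinks with the hypothetical names `sinkC` and `sinkD` that point at different next-hop agents:

```properties
# Hypothetical names: agent1, sink group g1, sinks sinkC / sinkD.
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = sinkC sinkD
agent1.sinkgroups.g1.processor.type = failover

# The sink with the larger priority value is tried first.
agent1.sinkgroups.g1.processor.priority.sinkC = 10
agent1.sinkgroups.g1.processor.priority.sinkD = 5

# Upper bound (in milliseconds) on the back-off applied to a failed sink.
agent1.sinkgroups.g1.processor.maxpenalty = 10000
```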
Events can also be load-balanced (round robin, random or custom) and go to different next-hop Agents to spread the load.
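Load balancing uses the same sink-group mechanism with a different processor. In this sketch the sink names `sinkE` and `sinkG` are hypothetical and are assumed to be defined elsewhere in the file.

```properties
# Hypothetical names: agent1, sink group g1, sinks sinkE / sinkG.
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = sinkE sinkG
agent1.sinkgroups.g1.processor.type = load_balance

# round_robin or random; a custom selector is given as a class name.
agent1.sinkgroups.g1.processor.selector = round_robin

# Temporarily skip sinks that have failed.
agent1.sinkgroups.g1.processor.backoff = true
```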
Events can be stored
■ in memory (for performance)
■ on disk (for durability)
Finally, events can be transferred to HDFS
■ Multiple file formats e.g. Text, JSON, Avro
■ Compression supported
■ Flexible names of HDFS paths
However,
■ Many destinations are supported
■ One can implement custom ones
Flume
■ Distributed
● Agents installed on many machines
■ Scalable
● Add more machines to transfer more events
■ Reliable
● Durable storage, failover and/or replication
■ Manageable
● Easy to install, configure, reconfigure and run
■ Nicely integrated with the Hadoop Ecosystem
● Various destinations e.g. HDFS, HBase
● Various file formats e.g. Avro, SequenceFile
■ Extensible
● Possibility to add new functionality e.g. sources and destinations for events
Architecture
Event
■ Unit of data transported by Flume
■ Consists of headers and a payload
● Generally small
● You can add your own headers e.g. hostname, timestamp
Flume Agent
■ Responsible for transferring events
■ Runs in a JVM
■ Consists of Source(s), Channel(s) and Sink(s)
Source
■ Collects events and forwards them to channels
● HTTP, JMS, RPC, NetCat
● Exec
● Spooling directory
Exec Source
■ Runs a given Unix command on startup
● The command should run continuously and produce data on standard output
■ If the process exits, the source also exits and will NOT produce any further data
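A typical use is tailing a log file. The fragment below is a sketch with hypothetical names (`agent1`, `src1`, `ch1`) and a hypothetical log path; the optional `restart` settings ask Flume to relaunch the command if it dies, which softens but does not remove the data-loss risk.

```properties
# Hypothetical names and path.
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app/access.log

# Optional: relaunch the command if it exits, after a 10 s pause.
agent1.sources.src1.restart = true
agent1.sources.src1.restartThrottle = 10000

agent1.sources.src1.channels = ch1
```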
Spooling Directory Source
■ Watches a specified directory for new files
■ Parses events out of new files as they appear
■ After a file has been fully processed, it is renamed to indicate completion (or optionally deleted)
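The behaviour above could be configured roughly as follows; the names `agent1`, `src1`, `ch1` and the directory are hypothetical.

```properties
# Hypothetical names and directory.
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /var/spool/flume/incoming

# Suffix appended to a file once it has been fully processed.
agent1.sources.src1.fileSuffix = .COMPLETED

# "never" keeps processed files; "immediate" deletes them instead.
agent1.sources.src1.deletePolicy = never

agent1.sources.src1.channels = ch1
```

Files are expected to be complete and immutable once they are placed in the spooling directory.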
Channel
■ Buffers incoming events until they are extracted by Sinks
■ Tradeoff between durability and throughput
● Memory
● File
● JDBC
Memory Channel
■ Events stored in an in-memory queue
■ Configurable capacity
● The maximum number of events and/or bytes in memory
■ Nondurable, but faster
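The two capacity limits map onto separate settings. The values below are illustrative only, and the names `agent1` and `ch1` are hypothetical.

```properties
# Hypothetical names; values are illustrative, not recommendations.
agent1.channels.ch1.type = memory

# Maximum number of events held in the queue.
agent1.channels.ch1.capacity = 10000

# Maximum number of events per put/take transaction.
agent1.channels.ch1.transactionCapacity = 1000

# Maximum total size of event bodies in memory, in bytes (here 128 MiB).
agent1.channels.ch1.byteCapacity = 134217728
```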
File Channel
■ Events stored in a file on disk
● Durable
■ Flushes to disk at the end of each transaction
● Supports encryption
■ Configurable capacity
■ The more disks
● The better the performance
● The higher the capacity
■ Can be limited by the amount of memory for the in-memory queue that keeps pointers to all events stored in log files
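Spreading the channel over several disks is a matter of listing several data directories. A sketch, with hypothetical names and mount points:

```properties
# Hypothetical names and mount points.
agent1.channels.ch1.type = file

# Checkpoint of the in-memory queue, ideally on its own disk.
agent1.channels.ch1.checkpointDir = /disk1/flume/checkpoint

# Comma-separated list of directories for the log files, one per disk.
agent1.channels.ch1.dataDirs = /disk2/flume/data,/disk3/flume/data

# Maximum number of events in the channel.
agent1.channels.ch1.capacity = 1000000
```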
Sink
■ Removes events from a Channel and forwards them to their next destination
● HDFS, HBase, Solr, ElasticSearch
● File, Logger
● Flume Agent
HDFS Sink
■ Writes events to HDFS
● Flexible naming of HDFS paths
■ Multiple file formats are supported e.g. Text, Avro
■ Rollover properties for files generated by HDFS Sink
● Number of seconds to wait before rolling a file (0 deactivates this feature)
● File size, in bytes, to trigger roll of a file (0 deactivates this feature)
● Number of events written to file before rolling a file (0 deactivates this feature)
● Timeout after which inactive files get closed (0 deactivates this feature)
● Number of events written to file before they are flushed to HDFS
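The property names did not survive on the slide; the sketch below assumes the standard HDFS Sink settings that match the five descriptions, in the same order. The names `agent1` and `sink1` and all values are illustrative.

```properties
# Hypothetical names; values are illustrative, not recommendations.
agent1.sinks.sink1.type = hdfs

# Seconds to wait before rolling a file (0 = never roll on time).
agent1.sinks.sink1.hdfs.rollInterval = 300

# File size in bytes that triggers a roll (0 = never roll on size).
agent1.sinks.sink1.hdfs.rollSize = 134217728

# Number of events that triggers a roll (0 = never roll on count).
agent1.sinks.sink1.hdfs.rollCount = 0

# Seconds after which an inactive file is closed (0 = never).
agent1.sinks.sink1.hdfs.idleTimeout = 60

# Number of events written before they are flushed to HDFS.
agent1.sinks.sink1.hdfs.batchSize = 1000
```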
■ Rolling files frequently will generate many small files
● Need to compact them to avoid an explosion of HDFS metadata
■ Often, you also want to deduplicate, filter and split events
Avro Source And Sink
■ Avro Sink
● Sends a batch of Avro events to a configured hostname:port
■ Avro Source
● Listens for events on a given port
■ Optionally: compression and encryption
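A two-agent hop might be wired as sketched below. The agent names `edge` and `collector`, the component names, the host `collector.example.com` and port 4545 are all hypothetical. Compression has to be enabled on both ends; encryption (SSL) needs additional keystore settings that are left out here.

```properties
# Sending agent (hypothetical name: edge).
edge.sinks.toCollector.type = avro
edge.sinks.toCollector.hostname = collector.example.com
edge.sinks.toCollector.port = 4545
edge.sinks.toCollector.batch-size = 100
edge.sinks.toCollector.compression-type = deflate
edge.sinks.toCollector.channel = ch1

# Receiving agent (hypothetical name: collector).
collector.sources.fromEdge.type = avro
collector.sources.fromEdge.bind = 0.0.0.0
collector.sources.fromEdge.port = 4545
collector.sources.fromEdge.compression-type = deflate
collector.sources.fromEdge.channels = ch1
```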
Reliability
■ Durable channels
● Survive agent failures, machine restarts and other non-disk-related failures
■ Redundant paths in a workflow topology
● Survive the failure of a node
● Achieved via replication or failover
■ Sufficient capacity of the channels
● Minimize the back pressure on earlier points in the flow
● Some sources might not be able to resend the data, e.g. Exec Source does not handle failures and might lose the data
● Spooling Directory Source offers reliability guarantees
■ Transactional hand-off of a batch of events from an Avro Sink to the Avro Source of the next agent
1. Start the transaction (sending agent)
2. Take a batch of events from the channel
3. Send the batch of events
4. Start the transaction (receiving agent)
5. Put the events into a channel
6. Stop the transaction (receiving agent)
7. Stop the transaction (sending agent)
When Flume Is Not A Good Fit
■ Very large events
● An event cannot be larger than the memory or the disk on an agent’s machine
■ Infrequent bulk loads
● Other tools might be better e.g. HDFS File Slurper
Configuration
■ Simple format
■ A configuration file can contain configuration settings for many Agents
● Only settings needed by the Agent will be loaded
■ Agent automatically reloads configuration if it changes
Configuration Example
■ We configure Flume to run a single agent that
1. listens for data on a given port
2. turns each line of incoming text into an event
3. and sends it to HDFS via the in-memory channel.
Agent
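The original listing for this slide is not available; a minimal sketch of the agent-level part, with the hypothetical names `agent1`, `src1`, `ch1` and `sink1`, could look like this:

```properties
# Every key is prefixed with the agent name (hypothetical: agent1).
# These three lines declare the components that the agent consists of.
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1
```

Because every key starts with the agent name, settings for several agents can live in one file and each agent picks up only its own.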
Source
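For a source that listens on a port and turns each line of text into an event, the NetCat source is a natural fit. This is a sketch only: the names `agent1`, `src1`, `ch1` and port 44444 are hypothetical.

```properties
# Hypothetical names and port.
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = 0.0.0.0
agent1.sources.src1.port = 44444

# A source can feed one or more channels, hence the plural key.
agent1.sources.src1.channels = ch1
```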
Channel
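The in-memory channel of the example needs little more than its type. The names `agent1` and `ch1` are hypothetical and the sizes are illustrative.

```properties
# Hypothetical names; sizes are illustrative.
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000
agent1.channels.ch1.transactionCapacity = 100
```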
Sink
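A sketch of the HDFS sink for the example, assuming plain text output into one directory per day. The names `agent1`, `sink1`, `ch1` and the target path are hypothetical.

```properties
# Hypothetical names and target path.
agent1.sinks.sink1.type = hdfs

# Escape sequences such as %Y-%m-%d are expanded from the event timestamp.
agent1.sinks.sink1.hdfs.path = /flume/events/%Y-%m-%d

# Write events as plain text rather than as a SequenceFile.
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text

# Use the agent's clock instead of requiring a timestamp header.
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

# A sink drains exactly one channel, hence the singular key.
agent1.sinks.sink1.channel = ch1
```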
Starting Agent
● An Agent is started with the flume-ng agent command, given the configuration file and the name of the Agent to run
There Is More!
GetInData
■ Data consulting company
■ We help you benefit from data
● Look at our portfolio: http://getindata.com/portfolio
● Find our trainings: http://getindata.com/trainings
● Learn more about our team: http://getindata.com/team