Apache Flume

Date post: 22-Nov-2014
Upload: getindata-instructor
Description:
An introduction to Apache Flume that comes from Hadoop Administrator Training delivered by GetInData. Apache Flume is a distributed, reliable, and available service for collecting, aggregating, and moving large amounts of log data. By reading these slides, you will learn about Apache Flume, its motivation, the most important features, architecture of Flume, its reliability guarantees, Agent's configuration, integration with the Apache Hadoop Ecosystem and more.
© Copyright 2014 GetInData. All rights reserved. Not to be reproduced without prior written consent.
Transcript
Page 1: Apache Flume

Apache Flume

Page 2: Apache Flume

Motivation

Page 3: Apache Flume

Motivation

You have a lot of servers and systems
■ network devices
■ operating systems
■ web servers
■ applications

Page 4: Apache Flume

Motivation

They generate a lot of logs and other data

Page 5: Apache Flume

Motivation

I have a business idea for how to use this data!

Page 6: Apache Flume

Motivation

You have a Hadoop cluster running

Page 7: Apache Flume

Motivation

You want to move the logs to Hadoop

Page 8: Apache Flume

Traditional Solutions

■ Own scripts
  ● Probably a combination of several tools (the tool logos were not captured in this transcript)
  ● Cron or start/stop manually
  ● Hardcoded or missing configuration
  ● Tightly coupled with the data that is transferred

Page 9: Apache Flume

Complications

■ High delays
■ Limited manageability
  ● Compression, encryption, various file formats
  ● Throughput
  ● Configuration and monitoring
■ Limited scalability
  ● Data explosion, failover, load balancing

Page 10: Apache Flume

Apache Flume

■ Aims to solve this problem!
■ It can move large amounts of streaming event data from one place to another
  ● e.g. from web servers to a Hadoop cluster

Page 11: Apache Flume

Overview

Page 12: Apache Flume

Various systems that constantly generate data in the form of events

Page 13: Apache Flume

Flume Agent

■ Installed on each node
■ Collects events

Page 14: Apache Flume

Interceptor

■ Filters useless events
■ Decorates events by adding metadata
  ● e.g. timestamp, hostname, UUID, static markers
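The two interceptor roles on this slide, filtering and decorating, can be sketched in a few lines of Python. This is a conceptual model only, not Flume's Java interceptor API: the `Event` class, the function names and the `datacenter` marker are all hypothetical and exist just to show an interceptor chain at work.

```python
import socket
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Event:
    """Hypothetical stand-in for a Flume event: headers plus a byte payload."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


# An interceptor returns the (possibly modified) event, or None to drop it.
Interceptor = Callable[[Event], Optional[Event]]


def drop_debug_lines(event: Event) -> Optional[Event]:
    """Filter: discard events we consider useless (here: DEBUG log lines)."""
    return None if event.body.startswith(b"DEBUG") else event


def add_metadata(event: Event) -> Optional[Event]:
    """Decorator: attach timestamp, hostname and a UUID as headers."""
    event.headers.setdefault("timestamp", str(int(time.time() * 1000)))
    event.headers.setdefault("host", socket.gethostname())
    event.headers.setdefault("id", str(uuid.uuid4()))
    return event


def add_static(key: str, value: str) -> Interceptor:
    """Decorator factory: attach a fixed marker to every event."""
    def intercept(event: Event) -> Optional[Event]:
        event.headers[key] = value
        return event
    return intercept


def run_chain(events: List[Event], chain: List[Interceptor]) -> List[Event]:
    """Apply interceptors in order; a dropped event is not passed further."""
    kept = []
    for event in events:
        current: Optional[Event] = event
        for interceptor in chain:
            current = interceptor(current)
            if current is None:
                break
        if current is not None:
            kept.append(current)
    return kept


chain = [drop_debug_lines, add_metadata, add_static("datacenter", "dc1")]
out = run_chain([Event(b"DEBUG noisy"), Event(b"GET /index.html 200")], chain)
```

Here the DEBUG line is filtered out and the remaining event leaves the chain with four extra headers. Putting the filter first avoids decorating events that are dropped anyway.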

Page 15: Apache Flume

Encrypt

Encrypts events in a file on disk

Page 16: Apache Flume

Sends events to the next-hop Flume agent

Page 17: Apache Flume

Compress

Compression is supported

Page 18: Apache Flume

Events can be multiplexed to multiple agents (to spread the load) …

Page 19: Apache Flume

… or replicated for redundancy

Page 20: Apache Flume

… to survive a permanent failure of an agent, disk or node.
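The difference between multiplexing and replicating on the last three slides can be illustrated with a small Python model. It is a sketch under assumptions, not Flume code: events are plain `(headers, payload)` tuples, the agent names are made up, and routing on a header value is assumed because that is how Flume's multiplexing channel selector is usually described.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Event = Tuple[Dict[str, str], str]  # (headers, payload); a simplified stand-in


def replicate(events: List[Event], targets: List[str]) -> Dict[str, List[Event]]:
    """Every target receives a copy of every event (redundancy)."""
    return {target: list(events) for target in targets}


def multiplex(events: List[Event], header: str,
              mapping: Dict[str, List[str]],
              default: List[str]) -> Dict[str, List[Event]]:
    """Each event goes only to the targets mapped to its header value."""
    routed: Dict[str, List[Event]] = defaultdict(list)
    for event in events:
        headers, _ = event
        for target in mapping.get(headers.get(header, ""), default):
            routed[target].append(event)
    return dict(routed)


events = [({"type": "A"}, "event A"), ({"type": "B"}, "event B")]
replicated = replicate(events, ["agent1", "agent2", "agent3"])
split = multiplex(events, "type", {"A": ["agent1"], "B": ["agent2"]},
                  default=["agent3"])
```

With replication each of the three hypothetical agents holds both events, so losing one agent loses no data. With multiplexing each event travels along one path only, which spreads the load but provides no redundancy.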

Page 21: Apache Flume

Events can also be delivered in “failover” mode where …

Page 22: Apache Flume

… in case of a failure of the next-hop agent …

Page 23: Apache Flume

… we try the next Agent(s) on a prioritized list.
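Failover delivery as described on these slides amounts to walking a prioritized list until one next-hop agent accepts the batch. The Python sketch below models that idea. The sink names, the priority numbers and the convention that a higher number is tried first are assumptions made for illustration.

```python
from typing import Callable, Dict, List

Sender = Callable[[List[str]], None]  # raises ConnectionError when delivery fails


def deliver_with_failover(batch: List[str], sinks: Dict[str, Sender],
                          priorities: Dict[str, int]) -> str:
    """Try next-hop sinks from highest to lowest priority; return the one used."""
    ordered = sorted(sinks, key=lambda name: priorities[name], reverse=True)
    errors = []
    for name in ordered:
        try:
            sinks[name](batch)
            return name
        except ConnectionError as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all next-hop agents failed: " + "; ".join(errors))


received: List[str] = []


def broken(batch: List[str]) -> None:
    """Simulates a next-hop agent that is down."""
    raise ConnectionError("next-hop agent is down")


def healthy(batch: List[str]) -> None:
    """Simulates a next-hop agent that accepts the batch."""
    received.extend(batch)


used = deliver_with_failover(
    ["C", "D"],
    sinks={"primary": broken, "backup": healthy},
    priorities={"primary": 10, "backup": 5},
)
```

The batch `C, D` is first offered to `primary`, which fails, and then lands on `backup`. If every agent on the list fails, the sketch raises an error so that the caller can keep the events and retry later.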

Page 24: Apache Flume

Events can also be load-balanced (round robin, random and custom) …

Page 25: Apache Flume

… and go to different next-hop Agents to spread the load
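The round-robin and random strategies named on the slide can be modelled as selectors that pick a next-hop agent for each batch. This is a minimal sketch with hypothetical agent names. It ignores what a real load balancer also has to handle, such as temporarily skipping agents that have failed.

```python
import itertools
import random
from typing import Iterator, List


def round_robin(targets: List[str]) -> Iterator[str]:
    """Cycle through the next-hop agents in order."""
    return itertools.cycle(targets)


def random_choice(targets: List[str], seed: int = 0) -> Iterator[str]:
    """Pick a next-hop agent at random for each batch (seeded for repeatability)."""
    rng = random.Random(seed)
    while True:
        yield rng.choice(targets)


batches = [["E", "F"], ["G", "H"], ["I", "J"]]
selector = round_robin(["agent1", "agent2"])
assignment = [(next(selector), batch) for batch in batches]
```

With two agents and round robin, the batches alternate: `E, F` goes to `agent1`, `G, H` to `agent2`, and the third batch goes back to `agent1`. A custom strategy would be another generator with the same shape.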

Page 26: Apache Flume

Events can be stored
■ in memory (for performance)
■ on disk (for durability)

Page 27: Apache Flume

Finally, events can be transferred to HDFS
■ Multiple file formats e.g. Text, JSON, Avro
■ Compression supported
■ Flexible naming of HDFS paths

Page 28: Apache Flume

However,
■ Many destinations are supported
■ One can implement custom ones

Page 29: Apache Flume

Flume

■ Distributed
  ● Agents installed on many machines
■ Scalable
  ● Add more machines to transfer more events
■ Reliable
  ● Durable storage, failover and/or replication
■ Manageable
  ● Easy to install, configure, reconfigure and run

Page 30: Apache Flume

Flume

■ Nicely integrated with the Hadoop Ecosystem
  ● Various destinations e.g. HDFS, HBase
  ● Various file formats e.g. Avro, SequenceFile
■ Extensible
  ● Possibility to add new functionality e.g. source and destination for events

Page 31: Apache Flume

Architecture

Page 32: Apache Flume

Event

■ Unit of data transported by Flume
■ Consists of headers and a payload
  ● Generally small
  ● You can add your own headers e.g. hostname, timestamp

Page 33: Apache Flume

Flume Agent

■ Responsible for transferring events
■ Runs in a JVM
■ Consists of Source(s), Channel(s) and Sink(s)

Source → Channel → Sink

Page 34: Apache Flume

Source

■ Collects events and forwards them to channels
  ● HTTP, JMS, RPC, NetCat
  ● Exec
  ● Spooling directory

Page 35: Apache Flume

Exec Source

■ Runs a given Unix command on startup
  ● Should run continuously and produce data
■ If the process exits, the source also exits and will NOT produce any further data

Page 36: Apache Flume

Spooling Directory Source

■ Watches a specified directory for new files
■ Parses events out of new files as they appear
■ After a file has been fully processed, it is renamed to indicate completion (or optionally deleted)
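The watch, parse and rename cycle of the Spooling Directory Source can be imitated with the Python standard library. The sketch below is not how Flume implements it. It treats each line as one event, and the `.COMPLETED` suffix is an assumption modelled on the marker Flume is commonly configured with.

```python
import os
import tempfile
from typing import List

COMPLETED_SUFFIX = ".COMPLETED"  # assumed completion marker, configurable in practice


def drain_spool_dir(spool_dir: str) -> List[str]:
    """Read each unprocessed file line by line, then rename it as completed."""
    events: List[str] = []
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        if name.endswith(COMPLETED_SUFFIX) or not os.path.isfile(path):
            continue  # already processed, or not a regular file
        with open(path, encoding="utf-8") as handle:
            events.extend(line.rstrip("\n") for line in handle)
        os.rename(path, path + COMPLETED_SUFFIX)
    return events


with tempfile.TemporaryDirectory() as spool:
    with open(os.path.join(spool, "access.log"), "w", encoding="utf-8") as f:
        f.write("GET /a\nGET /b\n")
    first_pass = drain_spool_dir(spool)
    second_pass = drain_spool_dir(spool)  # nothing new to read
    remaining = os.listdir(spool)
```

The first pass yields two events and renames the file, so the second pass finds nothing to do. Because the rename happens only after the whole file has been read, a crash in the middle leaves the file unmarked and it is read again after a restart, which is the basis of the reliability guarantee mentioned later in the deck.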

Page 37: Apache Flume

Channel

■ Buffers incoming events until they are extracted by Sinks
■ Tradeoff between durability and throughput
  ● Memory
  ● File
  ● JDBC

Page 38: Apache Flume

Memory Channel

■ Events stored in an in-memory queue
■ Configurable capacity
  ● The maximum number of events and/or bytes in memory
■ Nondurable, but faster
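A bounded in-memory queue captures the essence of the Memory Channel: it is fast, it has a fixed capacity, and its contents are lost when the process stops. The class below is a hypothetical Python model. It limits only the number of events, not bytes, and returns `False` when full so that the producer can slow down.

```python
import queue
from typing import Optional


class MemoryChannel:
    """Toy model of a bounded, non-durable channel."""

    def __init__(self, capacity: int) -> None:
        # capacity must be positive; queue.Queue treats 0 as "unbounded"
        self._events: "queue.Queue[str]" = queue.Queue(maxsize=capacity)

    def put(self, event: str) -> bool:
        """Return False instead of blocking when the channel is full."""
        try:
            self._events.put_nowait(event)
            return True
        except queue.Full:
            return False

    def take(self) -> Optional[str]:
        """Return the oldest event, or None when the channel is empty."""
        try:
            return self._events.get_nowait()
        except queue.Empty:
            return None


channel = MemoryChannel(capacity=2)
accepted = [channel.put(e) for e in ("a", "b", "c")]
oldest = channel.take()
```

The third `put` is rejected because the capacity of two is exhausted. Rejecting events in this way pushes the pressure back onto the source that feeds the channel.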

Page 39: Apache Flume

File Channel

■ Events stored in a file on disk
  ● Durable
■ Flushes to disk at the end of each transaction
  ● Supports encryption
■ Configurable capacity

Page 40: Apache Flume

File Channel

■ The more disks
  ● The better performance
  ● The higher capacity
■ Can be limited by the amount of memory for the in-memory queue that keeps pointers to all events stored in log files

Page 41: Apache Flume

Sink

■ Removes events from a Channel and forwards them to their next destination
  ● HDFS, HBase, Solr, ElasticSearch
  ● File, Logger
  ● Flume Agent

Page 42: Apache Flume

HDFS Sink

■ Writes events to HDFS
  ● Flexible naming of HDFS paths
■ Multiple file formats are supported e.g. Text, Avro

Page 43: Apache Flume

HDFS Sink

■ Rollover properties for files generated by HDFS Sink
  ● Number of seconds to wait before rolling a file (0 deactivates this feature)
  ● File size, in bytes, to trigger roll of a file (0 deactivates this feature)
  ● Number of events written to file before rolling a file (0 deactivates this feature)
  ● Timeout after which inactive files get closed (0 deactivates this feature)
  ● Number of events written to file before they are flushed to HDFS
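The first three rollover triggers combine with a logical OR: a file is rolled as soon as any enabled threshold is reached, and a value of 0 disables that trigger. The Python sketch below models this decision. The class and field names are hypothetical and the default values are purely illustrative. In Flume's HDFS sink the corresponding settings appear to be `hdfs.rollInterval`, `hdfs.rollSize` and `hdfs.rollCount`, but the property names did not survive in this transcript, so treat that mapping as an assumption.

```python
from dataclasses import dataclass


@dataclass
class RollPolicy:
    """Thresholds modelled on the slide; 0 switches a trigger off."""
    roll_interval_s: int = 30
    roll_size_bytes: int = 1024
    roll_count: int = 10


@dataclass
class OpenFile:
    """Bookkeeping for the file currently being written."""
    opened_at: float
    bytes_written: int = 0
    events_written: int = 0


def should_roll(f: OpenFile, policy: RollPolicy, now: float) -> bool:
    """Return True as soon as any enabled trigger is reached."""
    by_time = (policy.roll_interval_s > 0
               and now - f.opened_at >= policy.roll_interval_s)
    by_size = (policy.roll_size_bytes > 0
               and f.bytes_written >= policy.roll_size_bytes)
    by_count = (policy.roll_count > 0
                and f.events_written >= policy.roll_count)
    return by_time or by_size or by_count


current = OpenFile(opened_at=0.0, bytes_written=100, events_written=3)
roll_early = should_roll(current, RollPolicy(), now=5.0)
roll_on_time = should_roll(current, RollPolicy(), now=30.0)
```

Small thresholds roll often and produce many small files. Setting two of the three triggers to 0 and raising the third is one common way to get fewer, larger files.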

Page 44: Apache Flume

HDFS Sink

■ Rolling files will generate many small files
  ● Need to compact them to avoid an explosion of HDFS metadata
■ Often, you also want to deduplicate, filter and split events

Page 45: Apache Flume

Avro Source And Sink

■ Avro Sink: sends a batch of Avro events to a configured hostname:port

Page 46: Apache Flume

Avro Source And Sink

■ Avro Source: listens for events on a given port
■ Avro Sink: sends a batch of Avro events to a configured hostname:port

Page 48: Apache Flume

Avro Source And Sink

■ Avro Source: listens for events on a given port
■ Avro Sink: sends a batch of Avro events to a configured hostname:port
■ Optionally: compress, encrypt

Page 49: Apache Flume

Reliability

■ Durable channels
  ● Survive agent failures, machine restarts and other non-disk-related failures
■ Redundant paths in a workflow topology
  ● Survive the failure of a node
  ● Achieved via replication or failover

Page 50: Apache Flume

Reliability

■ Sufficient capacity of the channels
  ● Minimize the back pressure on earlier points in the flow
  ● Some sources might not be able to resend the data e.g.
    ■ Exec Source does not handle failures and might lose the data
    ■ Spooling Directory Source offers reliability guarantees

Page 51: Apache Flume

Reliability

1. Start the transaction

Events in the channel: D, C, B, A

Page 52: Apache Flume

Reliability

2. Take a batch of events (B, A); D, C stay in the channel

Page 53: Apache Flume

Reliability

3. Send a batch of events (B, A)

Page 54: Apache Flume

Reliability

4. Start the transaction (receiving agent)

Page 55: Apache Flume

Reliability

5. Put events into a channel (B, A)

Page 56: Apache Flume

Reliability

6. Stop the transaction (receiving agent)

Page 57: Apache Flume

Reliability

7. Stop the transaction

Events B, A are now in the receiving agent's channel; D, C remain in the sending agent's channel
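The seven steps on the preceding slides describe a hand-off in which the sending agent finishes its transaction only after the receiving agent has stored the batch. The following Python model is a sketch of that ordering, not Flume's transaction API. The `Channel` class, the `hand_off` function and the `send` callback are hypothetical, and the network hop is reduced to a function call.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List


class Channel:
    """Toy channel: events taken in a transaction can be committed or rolled back."""

    def __init__(self, events: Iterable[str] = ()) -> None:
        self.queue: Deque[str] = deque(events)
        self._in_flight: List[str] = []

    def take(self, batch_size: int) -> List[str]:
        """Move up to batch_size events from the queue into the open transaction."""
        while self.queue and len(self._in_flight) < batch_size:
            self._in_flight.append(self.queue.popleft())
        return list(self._in_flight)

    def commit(self) -> None:
        """Stop the transaction: the taken events are removed for good."""
        self._in_flight.clear()

    def rollback(self) -> None:
        """Abort the transaction: put the taken events back in their old order."""
        self.queue.extendleft(reversed(self._in_flight))
        self._in_flight.clear()


def hand_off(upstream: Channel, downstream: Channel, batch_size: int,
             send: Callable[[List[str]], None]) -> bool:
    """Steps 1-7 from the slides; returns True when the batch was delivered."""
    batch = upstream.take(batch_size)       # 1-2: start transaction, take a batch
    try:
        send(batch)                         # 3: send the batch to the next hop
        downstream.queue.extend(batch)      # 4-6: receiver stores it and commits
    except ConnectionError:
        upstream.rollback()                 # failure: events stay upstream
        return False
    upstream.commit()                       # 7: sender commits last
    return True


def unreachable(batch: List[str]) -> None:
    """Simulates a next hop that cannot be reached."""
    raise ConnectionError("next hop down")


first = Channel(["A", "B", "C", "D"])  # A is the oldest event
second = Channel()
failed = hand_off(first, second, 2, unreachable)
state_after_failure = (list(first.queue), list(second.queue))
delivered = hand_off(first, second, 2, lambda batch: None)
```

After the failed attempt all four events are still in the first channel. After the successful one, `A, B` sit in the second channel and `C, D` remain in the first. Because the sender commits last, a crash between steps 6 and 7 would make it resend the batch, so this ordering can produce duplicates but does not lose events.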

Page 58: Apache Flume

When Flume Is Not A Good Fit

■ Very large events
  ● An event cannot be larger than the memory or disk on an agent’s machine
■ Infrequent bulk loads
  ● Other tools might be better e.g. HDFS File Slurper

Page 59: Apache Flume

Configuration

Page 60: Apache Flume

Configuration

■ Simple format
■ A configuration file can contain configuration settings for many Agents
  ● Only settings needed by the Agent will be loaded
■ Agent automatically reloads configuration if it changes

Page 61: Apache Flume

Configuration Example

■ We configure Flume to run a single agent that
  1. listens for data on a given port
  2. turns each line of incoming text into an event
  3. and sends it to HDFS via the in-memory channel.
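A configuration for the agent described above might look like the text embedded in the Python sketch below, which also shows how settings for one agent can be picked out of a file that holds several. The agent and component names (`agent1`, `netcat1`, `mem1`, `hdfs1`), the port, the capacity and the HDFS path are all made up. The property keys follow the `<agent>.<component type>.<name>.<property>` layout of Flume's properties files, and the parser is a simplified illustration, not Flume's own loader.

```python
from typing import Dict

# Hypothetical file content: agent and component names, port and path are made up.
EXAMPLE_CONF = """
agent1.sources = netcat1
agent1.channels = mem1
agent1.sinks = hdfs1

# 1-2. listen on a port; each line of text becomes an event
agent1.sources.netcat1.type = netcat
agent1.sources.netcat1.bind = 0.0.0.0
agent1.sources.netcat1.port = 44444
agent1.sources.netcat1.channels = mem1

# in-memory channel between source and sink
agent1.channels.mem1.type = memory
agent1.channels.mem1.capacity = 1000

# 3. write events to HDFS
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.hdfs.path = /flume/events
agent1.sinks.hdfs1.channel = mem1

# settings of another agent in the same file
agent2.sources = other
"""


def settings_for(agent: str, conf_text: str) -> Dict[str, str]:
    """Parse key = value lines and keep only those that belong to one agent."""
    settings: Dict[str, str] = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key.startswith(agent + "."):
            settings[key[len(agent) + 1:]] = value
    return settings


agent1 = settings_for("agent1", EXAMPLE_CONF)
```

Only the twelve `agent1` settings are returned, and the `agent2` line is ignored, which mirrors the point that an agent loads just the settings it needs. Note that a source lists its `channels` in the plural while a sink names a single `channel`. An agent with such a file is typically started with the `flume-ng agent` command, passing the configuration file and the agent name. The exact invocation is not shown in this transcript.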

Page 62: Apache Flume

Agent

Page 63: Apache Flume

Source

Page 64: Apache Flume

Channel

Page 65: Apache Flume

Sink

Page 66: Apache Flume

Starting Agent

Page 67: Apache Flume

There Is More!

Page 68: Apache Flume

GetInData

■ Data consulting company
■ We help you benefit from data
  ● Look at our portfolio: http://getindata.com/portfolio
  ● Find our trainings: http://getindata.com/trainings
  ● Learn more about our team: http://getindata.com/team

