Apache Flume

Post on 22-Nov-2014

description

An introduction to Apache Flume that comes from Hadoop Administrator Training delivered by GetInData. Apache Flume is a distributed, reliable, and available service for collecting, aggregating, and moving large amounts of log data. By reading these slides, you will learn about Apache Flume, its motivation, the most important features, architecture of Flume, its reliability guarantees, Agent's configuration, integration with the Apache Hadoop Ecosystem and more.

transcript

© Copyright 2014 GetInData. All rights reserved. Not to be reproduced without prior written consent.

Apache Flume

Motivation

■ You have a lot of servers and systems
● network devices
● operating systems
● web servers
● applications
■ They generate a lot of logs and other data
■ You have a business idea for how to use this data!
■ You have a Hadoop cluster running
■ You want to move the logs to Hadoop

Traditional Solutions

■ Own scripts
● Probably a combination of … and/or … (the tool names are missing from this transcript)
● Run by cron or started/stopped manually
● Hardcoded or missing configuration
● Tightly coupled with the data that is transferred

Complications

■ High delays
■ Limited manageability
● Compression, encryption, various file formats
● Throughput
● Configuration and monitoring
■ Limited scalability
● Data explosion, failover, load balancing

Apache Flume

■ Aims to solve this problem!
■ It can move large amounts of streaming event data from one place to another
● e.g. from web servers to a Hadoop cluster

Overview

Various systems constantly generate data in the form of events.

A Flume Agent is installed on each node and collects events.

An Interceptor:
■ Filters out useless events
■ Decorates events by adding metadata
● e.g. timestamp, hostname, UUID, static markers
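
The two interceptor roles above (filtering and decorating) can be sketched in a few lines of Python. This is only an illustrative model, not Flume's Java interceptor API: the `Event` class, the function names and the header keys (`timestamp`, `host`, `id`) are assumptions made for the example.

```python
import socket
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Event:
    """Hypothetical stand-in for a Flume event: headers plus a payload."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


# An interceptor returns the (possibly modified) event, or None to drop it.
Interceptor = Callable[[Event], Optional[Event]]


def drop_debug_lines(event: Event) -> Optional[Event]:
    """Filter: discard events we consider useless (here: DEBUG log lines)."""
    return None if b"DEBUG" in event.body else event


def add_metadata(event: Event) -> Optional[Event]:
    """Decorate: attach timestamp, hostname and a UUID without overwriting."""
    event.headers.setdefault("timestamp", str(int(time.time() * 1000)))
    event.headers.setdefault("host", socket.gethostname())
    event.headers.setdefault("id", str(uuid.uuid4()))
    return event


def run_chain(events: List[Event], chain: List[Interceptor]) -> List[Event]:
    """Pass every event through the interceptors in order."""
    kept = []
    for event in events:
        current: Optional[Event] = event
        for interceptor in chain:
            current = interceptor(current)
            if current is None:  # dropped by a filter, skip the rest of the chain
                break
        if current is not None:
            kept.append(current)
    return kept


events = [Event(b"INFO user logged in"), Event(b"DEBUG cache miss")]
result = run_chain(events, [drop_debug_lines, add_metadata])
```

Here `result` keeps only the INFO event, now carrying the three added headers.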

The agent can encrypt events that it stores in a file on disk.

The agent sends events to the next-hop Flume Agent. Compression is supported.

Events can be multiplexed to multiple agents (to spread the load), e.g. event A goes to one next-hop agent and event B to another.

Alternatively, events can be replicated for redundancy, e.g. both A and B go to every next-hop agent, to survive a permanent failure of an agent, disk or node.
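
The difference between multiplexing and replication can be sketched as follows. This is a simplified model under stated assumptions: events are `(routing_key, payload)` pairs, the routing key stands in for a header value, and the agent names and the `mapping` table are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each event is (routing_key, payload); the key plays the role of a header value.
Event = Tuple[str, str]


def replicate(events: List[Event], destinations: List[str]) -> Dict[str, List[str]]:
    """Replication: every destination receives a copy of every event."""
    out: Dict[str, List[str]] = defaultdict(list)
    for _, payload in events:
        for dest in destinations:
            out[dest].append(payload)
    return dict(out)


def multiplex(events: List[Event], mapping: Dict[str, str],
              default: str) -> Dict[str, List[str]]:
    """Multiplexing: each event goes to exactly one destination, chosen by key."""
    out: Dict[str, List[str]] = defaultdict(list)
    for key, payload in events:
        out[mapping.get(key, default)].append(payload)
    return dict(out)


events = [("web", "A"), ("db", "B")]
replicated = replicate(events, ["agent1", "agent2"])
multiplexed = multiplex(events, {"web": "agent1", "db": "agent2"}, default="agent1")
```

With these inputs, `replicated` gives both agents A and B, while `multiplexed` sends A to `agent1` and B to `agent2`.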

Events can also be delivered in “failover” mode: if the next-hop agent fails, the next Agent(s) on a prioritized list are tried. For example, a batch C,D that cannot be delivered to agent 1 is sent to agent 2.
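
A minimal sketch of failover delivery, assuming each next-hop agent is modelled as a `(name, send_function)` pair and that a failed hop raises `ConnectionError`. Both are assumptions made for the illustration; this is not how Flume exposes failover internally.

```python
from typing import Callable, List, Tuple

# A next hop is modelled as (name, send function); send raises on failure.
NextHop = Tuple[str, Callable[[List[str]], None]]


def send_with_failover(batch: List[str], hops_by_priority: List[NextHop]) -> str:
    """Try each next hop in priority order; return the name that accepted the batch."""
    errors = []
    for name, send in hops_by_priority:
        try:
            send(batch)
            return name
        except ConnectionError as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all next-hop agents failed: " + "; ".join(errors))


delivered: List[str] = []


def broken_agent(batch: List[str]) -> None:
    raise ConnectionError("agent is down")


def healthy_agent(batch: List[str]) -> None:
    delivered.extend(batch)


used = send_with_failover(
    ["C", "D"], [("agent1", broken_agent), ("agent2", healthy_agent)]
)
```

The batch is offered to `agent1` first and only reaches `agent2` because the first attempt fails.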

Events can also be load-balanced (round robin, random or custom) and go to different next-hop Agents to spread the load. For example, batch E,F goes to agent 1 and the next batch G,H goes to agent 2.
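
Round-robin load balancing, the simplest of the strategies named above, can be sketched like this. The agent names and the third batch (`I`, `J`) are made up for the example; a random strategy would only replace the selector.

```python
import itertools
from typing import Iterator, List, Sequence, Tuple


def round_robin(agents: Sequence[str]) -> Iterator[str]:
    """Yield next-hop agents in a fixed rotating order."""
    return itertools.cycle(agents)


batches: List[List[str]] = [["E", "F"], ["G", "H"], ["I", "J"]]
selector = round_robin(["agent1", "agent2"])

# Assign each batch to whichever agent the selector picks next.
assignment: List[Tuple[str, List[str]]] = [
    (next(selector), batch) for batch in batches
]
```

The third batch wraps around to `agent1` again, so the load alternates between the two next hops.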

Events can be stored
■ in memory (for performance)
■ on disk (for durability)

Events can be finally transferred to HDFS
■ Multiple file formats e.g. Text, JSON, Avro
■ Compression supported
■ Flexible naming of HDFS paths

However, HDFS is not the only option
■ Many destinations are supported
■ One can implement custom ones

Flume

■ Distributed
● Agents installed on many machines
■ Scalable
● Add more machines to transfer more events
■ Reliable
● Durable storage, failover and/or replication
■ Manageable
● Easy to install, configure, reconfigure and run
■ Nicely integrated with the Hadoop Ecosystem
● Various destinations e.g. HDFS, HBase
● Various file formats e.g. Avro, SequenceFile
■ Extensible
● Possibility to add new functionality e.g. a new source or destination for events

Architecture

Event

■ Unit of data transported by Flume
■ Consists of headers and a payload
■ Generally small
■ You can add your own headers e.g. hostname, timestamp

Flume Agent

■ Responsible for transferring events
■ Runs in a JVM
■ Consists of Source(s), Channel(s) and Sink(s)

Source → Channel → Sink

Source

■ Collects events and forwards them to Channels
● HTTP, JMS, RPC, NetCat
● Exec
● Spooling directory

Exec Source

■ Runs a given Unix command on startup
● The command should run continuously and keep producing data
■ If the process exits, the source also exits and will NOT produce any further data

Spooling Directory Source

■ Watches a specified directory for new files
■ Parses events out of new files as they appear
■ After a file has been fully processed, it is renamed to indicate completion (or optionally deleted)

Channel

■ Buffers incoming events until they are extracted by Sinks
■ Tradeoff between durability and throughput
● Memory
● File
● JDBC

Memory Channel

■ Events stored in an in-memory queue
■ Configurable capacity
● The maximum number of events and/or bytes in memory
■ Nondurable, but faster
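
To make "configurable capacity" concrete, here is a small model of a bounded in-memory channel. The class and method names are hypothetical, and only an event-count limit is modelled (not a byte limit); a full channel rejects new events, which is what pushes back on the source.

```python
from collections import deque
from typing import Deque, List


class ChannelFullError(Exception):
    """Raised when the channel has no room for another event."""


class MemoryChannel:
    """Hypothetical bounded in-memory queue of events (nondurable)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._queue: Deque[bytes] = deque()

    def put(self, event: bytes) -> None:
        """Called by a source; fails instead of growing beyond capacity."""
        if len(self._queue) >= self.capacity:
            raise ChannelFullError(f"capacity of {self.capacity} events reached")
        self._queue.append(event)

    def take(self, max_events: int) -> List[bytes]:
        """Called by a sink; removes up to max_events in arrival order."""
        batch: List[bytes] = []
        while self._queue and len(batch) < max_events:
            batch.append(self._queue.popleft())
        return batch

    def __len__(self) -> int:
        return len(self._queue)


channel = MemoryChannel(capacity=2)
channel.put(b"A")
channel.put(b"B")
try:
    channel.put(b"C")          # the channel is full at this point
    overflowed = False
except ChannelFullError:
    overflowed = True
first_batch = channel.take(1)  # a sink drains one event
channel.put(b"C")              # now there is room again
```

Everything in this queue lives in process memory, so it is lost if the process stops; that is the price paid for speed.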

File Channel

■ Events stored in a file on disk
● Durable
■ Flushes to disk at the end of each transaction
● Supports encryption
■ Configurable capacity
■ The more disks
● The better the performance
● The higher the capacity
■ Can be limited by the amount of memory for the in-memory queue that keeps pointers to all events stored in log files

Sink

■ Removes events from a Channel and forwards them to their next destination
● HDFS, HBase, Solr, ElasticSearch
● File, Logger
● Flume Agent

HDFS Sink

■ Writes events to HDFS
● Flexible naming of HDFS paths
■ Multiple file formats are supported e.g. Text, Avro
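
"Flexible naming of HDFS paths" means the path can contain placeholders that are filled in per event. The sketch below imitates the escape style used in Flume path templates (`%{header}` for a header value, `%Y-%m-%d` for date parts), but `expand_path` is a hypothetical helper, and it assumes a `timestamp` header holding milliseconds since the epoch.

```python
import re
from datetime import datetime, timezone
from typing import Dict


def expand_path(template: str, headers: Dict[str, str]) -> str:
    """Fill %{header} placeholders, then strftime-style date escapes (UTC)."""
    # Replace %{name} with the value of the header called "name".
    with_headers = re.sub(
        r"%\{(\w+)\}", lambda match: headers[match.group(1)], template
    )
    # Interpret the event's timestamp header as milliseconds since the epoch.
    moment = datetime.fromtimestamp(int(headers["timestamp"]) / 1000, tz=timezone.utc)
    return moment.strftime(with_headers)


path = expand_path(
    "/flume/events/%{host}/%Y-%m-%d",
    {"host": "web01", "timestamp": "1416614400000"},
)
```

Events from different hosts and days end up in different directories without any change to the configuration.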

HDFS Sink

■ Rollover properties for files generated by HDFS Sink
● Number of seconds to wait before rolling a file (0 deactivates this feature)
● File size, in bytes, that triggers a roll of a file (0 deactivates this feature)
● Number of events written to a file before rolling it (0 deactivates this feature)
● Timeout after which inactive files get closed (0 deactivates this feature)
● Number of events written to a file before they are flushed to HDFS
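
The first three rollover settings combine as "roll when any enabled limit is reached". The sketch below models that rule. The property names are missing from this transcript; the Flume User Guide documents them as `hdfs.rollInterval`, `hdfs.rollSize`, `hdfs.rollCount`, `hdfs.idleTimeout` and `hdfs.batchSize`, and the Python field names and values here are illustrative assumptions, not Flume code.

```python
from dataclasses import dataclass


@dataclass
class RollPolicy:
    """Illustrative roll limits; a value of 0 disables that limit."""
    roll_interval_s: int   # cf. hdfs.rollInterval
    roll_size_bytes: int   # cf. hdfs.rollSize
    roll_count: int        # cf. hdfs.rollCount


def should_roll(policy: RollPolicy, age_s: float,
                size_bytes: int, event_count: int) -> bool:
    """Return True when any enabled limit has been reached for the open file."""
    by_time = policy.roll_interval_s > 0 and age_s >= policy.roll_interval_s
    by_size = policy.roll_size_bytes > 0 and size_bytes >= policy.roll_size_bytes
    by_count = policy.roll_count > 0 and event_count >= policy.roll_count
    return by_time or by_size or by_count


# Roll only by size (128 MiB); time-based and count-based rolling are disabled.
size_only = RollPolicy(roll_interval_s=0, roll_size_bytes=128 * 1024 * 1024, roll_count=0)

old_but_small = should_roll(size_only, age_s=3600, size_bytes=1024, event_count=10**6)
big_enough = should_roll(size_only, age_s=10, size_bytes=128 * 1024 * 1024, event_count=5)
```

Disabling the time and count limits, as in `size_only`, is one way to avoid producing many small files.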

HDFS Sink

■ Rolling files frequently generates many small files
● Need to compact them to avoid an explosion of HDFS metadata
■ Often, you also want to deduplicate, filter and split events

Avro Source And Sink

Source → Channel → Avro Sink → Avro Source → Channel → Sink

■ Avro Sink: sends a batch of Avro events to a configured hostname:port
■ Avro Source: listens for events on a given port
■ Optionally, the events sent between them can be compressed and encrypted

Reliability

■ Durable channels
● Survive agent failures, machine restarts and other non-disk-related failures
■ Redundant paths in a workflow topology
● Survive the failure of a node
● Achieved via replication or failover
■ Sufficient capacity of the channels
● Minimizes the back pressure on earlier points in the flow
● Some sources might not be able to resend the data
○ Exec Source does not handle failures and might lose the data
○ Spooling Directory Source, by contrast, offers reliability guarantees

Reliability

Events are handed over between two agents in transactions (Source → Channel → Avro Sink → Avro Source → Channel → Sink). With events D, C, B, A waiting in the first agent's channel:

1. The Avro Sink starts a transaction
2. The Avro Sink takes a batch of events (B, A) from its channel
3. The Avro Sink sends the batch of events to the Avro Source
4. The Avro Source starts a transaction
5. The Avro Source puts the events (B, A) into its channel
6. The Avro Source stops its transaction
7. The Avro Sink stops its transaction; B, A are now in the second agent's channel and D, C remain in the first
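
The point of this handshake is that the sending side gives up a batch only after the receiving side has safely stored it. A minimal sketch of that idea, assuming a hypothetical `Channel` class with an explicit rollback and a receiver that raises `ConnectionError` on failure (none of these names come from Flume):

```python
from collections import deque
from typing import Callable, Deque, Iterable, List


class Channel:
    """Hypothetical channel holding events in arrival order."""

    def __init__(self, events: Iterable[str] = ()) -> None:
        self.events: Deque[str] = deque(events)

    def take_batch(self, size: int) -> List[str]:
        """Remove up to `size` events; the caller must commit or roll back."""
        return [self.events.popleft() for _ in range(min(size, len(self.events)))]

    def rollback(self, batch: List[str]) -> None:
        """Return an uncommitted batch to the head of the queue, in order."""
        self.events.extendleft(reversed(batch))

    def put_batch(self, batch: List[str]) -> None:
        """Store a received batch (the receiving side of the hand-over)."""
        self.events.extend(batch)


def hand_over(upstream: Channel, receive: Callable[[List[str]], None],
              batch_size: int) -> bool:
    """Move one batch downstream; keep it upstream if the receiver fails."""
    batch = upstream.take_batch(batch_size)   # sender: start transaction, take
    try:
        receive(batch)                        # receiver: start, put, stop
    except ConnectionError:
        upstream.rollback(batch)              # nothing is lost on failure
        return False
    return True                               # sender: stop (commit)


def unreachable(batch: List[str]) -> None:
    raise ConnectionError("next hop is down")


first = Channel(["A", "B", "C", "D"])
second = Channel()

failed_attempt = hand_over(first, unreachable, batch_size=2)
after_failure = list(first.events)            # batch was put back
successful_attempt = hand_over(first, second.put_batch, batch_size=2)
```

After the failed attempt all four events are still in the first channel; after the successful one, A and B are in the second channel and C and D remain in the first.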

When Flume Is Not A Good Fit

■ Very large events
● An event cannot be larger than the memory or disk on an agent’s machine
■ Infrequent bulk loads
● Other tools might be better e.g. HDFS File Slurper

Configuration

■ Simple format
■ A configuration file can contain configuration settings for many Agents
● Only settings needed by the Agent will be loaded
■ Agent automatically reloads configuration if it changes

Configuration Example

■ We configure Flume to run a single agent that
1. listens for data on a given port
2. turns each line of incoming text into an event
3. sends the events to HDFS via an in-memory channel

Agent, Source, Channel, Sink, Starting Agent

(The configuration listings shown on these five slides are not preserved in this transcript.)
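
As a rough reconstruction of the single-agent example described under "Configuration Example" (listen on a port, one event per line, in-memory channel, HDFS sink), the Python sketch below embeds a Flume-style properties file and shows how only one agent's settings would be picked out of a shared file. The component names (`agent1`, `src1`, `ch1`, `sink1`, `agent2`), the port and the HDFS path are invented for the illustration, and `settings_for` is a hypothetical helper, not part of Flume.

```python
from typing import Dict

# Flume-style properties for two agents in one file (names and values are made up).
CONFIG = """
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: listens on a port, one event per line of text
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = 0.0.0.0
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# Channel: in-memory buffer
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Sink: writes to HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events/%Y-%m-%d
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
agent1.sinks.sink1.channel = ch1

# Settings of another agent sharing the same file
agent2.sources = other
"""


def settings_for(agent: str, text: str) -> Dict[str, str]:
    """Keep only the key/value pairs whose key starts with '<agent>.'."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and comments
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key.startswith(agent + "."):
            settings[key[len(agent) + 1:]] = value
    return settings


agent1 = settings_for("agent1", CONFIG)
```

An agent is normally started by name with the `flume-ng` launcher, along the lines of `flume-ng agent --conf conf --conf-file example.conf --name agent1`, where `example.conf` is a hypothetical file holding the properties above.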

There Is More!


GetInData

■ Data consulting company
■ We help you benefit from data
● Look at our portfolio: http://getindata.com/portfolio
● Find our trainings: http://getindata.com/trainings
● Learn more about our team: http://getindata.com/team