Date posted: 27-Jul-2015
Uploaded by: xebia-france
#backdaybyxebia
Basics: Kafka
A distributed messaging system
… multi-queues (called “topics”) split into partitions
… multi-clients
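To illustrate how a topic is split into partitions, here is a minimal Python sketch (not Kafka's actual code) of key-hash partition assignment, the usual strategy for keyed messages: every message with the same key lands in the same partition, which preserves per-key ordering.

```python
# Illustrative sketch of keyed partition assignment; not Kafka's real code.
def assign_partition(key: str, num_partitions: int) -> int:
    """Map a message key to one of the topic's partitions."""
    return hash(key) % num_partitions

# Messages with the same key always land in the same partition.
topic_partitions = 3
p1 = assign_partition("user-42", topic_partitions)
p2 = assign_partition("user-42", topic_partitions)
assert p1 == p2
```

Different clients (producers and consumers) can then work on different partitions in parallel, which is what makes the topic "multi-client".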
Basics: Hadoop Distributed File System
Distributed & scalable
Highly fault-tolerant
Standard storage for Big Data jobs to run on
“Moving computation is cheaper than moving data”
Flume
1. An “item” exists somewhere. Initially, “items” were log files
2. A source is a way to pull this data and transform it into Flume events
3. A channel is a way to transport data (memory, file)
4. A sink is a way to put a Flume event somewhere
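The four pieces above can be sketched in a few lines of Python (an illustration of the model, not Flume's API): a source turns raw items into events, a channel buffers them, and a sink delivers them somewhere.

```python
from queue import Queue

# Illustrative sketch of Flume's source -> channel -> sink model.
def source(lines):
    """Turn raw items (e.g. log lines) into events (headers + body)."""
    for line in lines:
        yield {"headers": {}, "body": line}

channel = Queue()  # stands in for Flume's memory or file channel

def sink(event, store):
    """Deliver an event somewhere (here: a list standing in for HDFS)."""
    store.append(event["body"])

store = []
for event in source(["log line 1", "log line 2"]):
    channel.put(event)              # the source feeds the channel
while not channel.empty():
    sink(channel.get(), store)      # the sink drains the channel

assert store == ["log line 1", "log line 2"]
```

The channel decouples the two ends: a slow sink only fills the buffer instead of blocking the source.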
Simple flume configuration file

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
1 - Define the agent
2 - Define the sources
3 - Define the sinks
4 - Define the channels
5 - Connect the pieces
Flafka source configuration
# Mandatory config
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.zookeeperConnect = localhost:2181
a1.sources.r1.topic = MyTopic

# Optional config
a1.sources.r1.batchSize = 1000
a1.sources.r1.batchDurationMillis = 1000
a1.sources.r1.consumer.timeout.ms = 10
a1.sources.r1.auto.commit.enabled = false
a1.sources.r1.groupId = flume
HDFS Sink configuration
# Mandatory config
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://localhost:54310/data/flume/ \
%{topic}/%y-%m-%d

# A lot of optional configs, including:
# Compression
# File types: SequenceFile, Avro, etc.
# Possibility to use a custom Writable
# Kerberos configuration
Flume Interceptors
# Mandatory config
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = .....

➔ Transformation executed
◆ After the event is generated
◆ Before sending it to the channel
➔ Some predefined interceptors
◆ Timestamp
◆ UUID
◆ Filtering
◆ Morphline
◆ ...
➔ You can write your own (pure Java)
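Real interceptors are Java classes plugged into Flume; the Python sketch below only illustrates the concept (all names here are hypothetical): each interceptor transforms, or drops, an event after the source generates it and before it reaches the channel.

```python
import time
import uuid

# Conceptual sketch of interceptors; Flume's real ones are Java classes.
def timestamp_interceptor(event):
    """Stamp the event header, like the predefined Timestamp interceptor."""
    event["headers"]["timestamp"] = str(int(time.time() * 1000))
    return event

def uuid_interceptor(event):
    """Give the event a unique id, like the predefined UUID interceptor."""
    event["headers"]["id"] = str(uuid.uuid4())
    return event

def apply_interceptors(event, interceptors):
    """Run the chain; a filtering interceptor may return None to drop."""
    for icpt in interceptors:
        event = icpt(event)
        if event is None:
            return None
    return event

event = {"headers": {}, "body": "a log line"}
event = apply_interceptors(event, [timestamp_interceptor, uuid_interceptor])
assert "timestamp" in event["headers"] and "id" in event["headers"]
```

Headers added this way are what the HDFS sink's path escapes (e.g. the timestamp behind %y-%m-%d) rely on.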
Flume: how to run it?
Included in distribs. Command line:

flume-ng agent \
-n a1 \
-c /usr/lib/flume-ng/conf/ \
-f /usr/lib/flume-ng/conf/flume-kafka.conf &
Camus
A batch consists of three phases:
- P1: get metadata: topics & partitions, latest offsets
- P2: pull new events
- P3: update the local metadata
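The three phases can be sketched in Python (purely illustrative; all names are hypothetical, not Camus's API): each batch pulls everything between the locally stored offsets and the broker's latest offsets, then persists the new offsets so the next batch resumes where this one stopped.

```python
# Illustrative sketch of Camus's three batch phases; names are hypothetical.
def run_batch(broker_metadata, local_offsets, pull):
    # P1: get metadata: topics & partitions, latest offsets
    latest = dict(broker_metadata)
    # P2: pull every event between the stored and the latest offset
    events = []
    for tp, last in latest.items():
        start = local_offsets.get(tp, 0)
        events.extend(pull(tp, start, last))
    # P3: update the local metadata so the next batch resumes from here
    local_offsets.update(latest)
    return events

offsets = {("MyTopic", 0): 0}
meta = {("MyTopic", 0): 3}
pulled = run_batch(meta, offsets,
                   lambda tp, a, b: [f"msg{i}" for i in range(a, b)])
assert pulled == ["msg0", "msg1", "msg2"]
assert offsets[("MyTopic", 0)] == 3   # next batch starts at offset 3
```

In real Camus, P2 runs as MapReduce tasks, which is where the fixed job-setup cost measured below comes from.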
Round 1: getting started
Flume: just a simple configuration file makes it work
Camus: needs a complete dev environment (including Maven) to use it
Flume: the Morphline interceptor’s syntax is quite complex
Camus: devs should understand MapReduce concepts
Round 2: running time
Flume: events are ingested with no delay
Camus: MapReduce setup adds incompressible time:
31 sec incompressible +
~ 1 sec / 500 messages / node
=> 111 sec for 1 message
=> 117 sec for 50 messages
=> 116 sec for 1,000 messages
=> 127 sec for 10,000 messages
Round 3: maintainability
Flume: when used via Cloudera Manager (CM), the server is easy to maintain, but the config is not
Camus: a full Maven project; just use a version control system (Git, SVN, etc.)
Round 4: customization
Flume: interceptors are fully customizable
Camus: morphing data can be done easily
Flume: event headers make the HDFS path highly configurable
Round 5: deployment
Flume: when used via Cloudera Manager, just include your conf, and that’s it
Camus: MapReduce jobs may be included in any MR orchestrator (Oozie, for instance)
Flume: without a manager, everything needs to be done manually
Round 6: state of the project
Flume: released 1.0.0 in 2012
Camus: currently in v0.1.0-SNAPSHOT
Flume: included by default with Hadoop distributions
Camus: highly used by LinkedIn in production
Camus: almost no documentation
Summary
Flume Camus
Getting Started
Running time
Maintainability
Customization
Deployment
State of the project
Debugging
Debugging Flume is quite complex.
There are some really critical bugs, like [FLUME-2578].
Documentation
Flume has really good-quality documentation.
Camus only has a README file, and it is not up to date!
Camus & M/R
Camus suffers from its use of MapReduce.
Using another engine, such as Spark, might yield better performance.
Flume quantity of files
Flume needs a very precise configuration to avoid generating a multitude of files.
It is easy to make it generate lots of small files, which is problematic for Big Data storage.
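As a sketch of the fix, the HDFS sink's rollover properties below are the ones to tune (the keys are real Flume settings; the values shown are only illustrative). Left at their small defaults, the sink rolls a new file every 30 seconds or every 10 events, which quickly produces a multitude of small files:

```properties
# Roll a new HDFS file only once per hour, or once it reaches ~128 MB;
# a value of 0 disables that rolling criterion.
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
```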