© Cloudera, Inc. All rights reserved.
Kafka 101: Everything you wanted to know about Kafka but were afraid to ask
Scott Crawford | Sr. System Engineer, Cloudera
Cloudera Enterprise: Making Hadoop Fast, Easy, and Secure
A new kind of data platform:
• One place for unlimited data
• Unified, multi-framework data access
Cloudera makes it:
• Fast for business
• Easy to manage
• Secure without compromise
[Diagram: platform stack. Operations and Data Management span the layers Process, Analyze, Serve (Batch, Stream, SQL, Search, SDK); Unified Services (Resource Management, Security); Store (Filesystem, Relational, NoSQL, for structured and unstructured data); and Integrate.]
Ingest: The Foundation of Hadoop’s Potential
Data can come from a variety of “siloed” sources:
• Existing databases
• Sensor data
• Server logs
• Chat transcripts
Value of data is multiplied when combined and correlated with other data:
• “40% value improvement from combining data from multiple IoT sources” (McKinsey Global Institute)
Apache Sqoop: SQL to Hadoop
• Efficiently exchange data between database and Hadoop
• Bidirectional
• Import all or partial/new data
• Export for shared data access across systems
• Easily get started with high performance connectors
• Free to use
• Optimized connectors for popular RDBMS, EDW, and NoSQL options
Apache Flume: Log & Event Aggregation for Hadoop
• Efficiently move large amounts of streaming/log data
• Easily collect data from multiple systems (sources)
• Built-in sources, sinks, and channels
• Customize data flow to transform data on-the-fly
• Reliable, scalable, and extensible for production
• Manage and monitor with Cloudera Manager
Apache Kafka: Pub-Sub Messaging for Hadoop
• Backbone for real-time architectures
• Fast, flexible messaging for a wide range of use cases
• Scale to support more data sources and growing data volumes
• Zero data loss durability and always-on fault-tolerance
• Built-in security and data protection
• Seamless integration across the platform
• Connect to Flume, Spark Streaming, HBase, and more
• Manage and monitor with Cloudera Manager
Why Kafka? A brief history
Can you make Flume do pull instead of push?
No.
Why Kafka? (Or rather, why didn’t Flume suffice for LinkedIn?)
• No ability to replay events
• Multiple sinks require event replication (via multiple channels)
• Sinks that share a source (mostly) process events in sync
• This is temporal coupling (which kinda smells bad)
[Diagram: Logs and More Logs feed two Flume agents (Spool Source, Channel, Avro Sink). Both forward to an Avro Source that writes through separate Channels to an HBase Sink (HBase) and an HDFS Sink (HDFS).]
Why Kafka? Increasing complexity
[Diagram: in 2009, Web logs feed Hadoop directly: Connections = O(1). By 2014, Transactions, Metrics, Web logs, Hadoop, Warehouse, Alerting, Audit Logs, and Security are wired point to point: Connections = O(Systems²).]
Why Kafka? Decoupling
[Diagram: in 2014, Transactions, Metrics, Web logs, Hadoop, Warehouse, Alerting, Audit Logs, and Security are wired point to point: Connections = O(Systems²). In 2015+?, the same systems each connect only to Kafka: Connections = O(Systems).]
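The growth in connections can be checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the split into an equal number of producing and consuming systems is an assumption rather than something the slide specifies.

```python
def point_to_point_links(producers: int, consumers: int) -> int:
    """Every producing system is wired directly to every consuming system."""
    return producers * consumers


def hub_links(producers: int, consumers: int) -> int:
    """Every system holds a single connection to the shared Kafka cluster."""
    return producers + consumers


if __name__ == "__main__":
    # Assumed example sizes; n producers and n consumers in each row.
    for n in (1, 4, 8):
        print(n, point_to_point_links(n, n), hub_links(n, n))
```

With four producers and four consumers, the direct wiring needs 16 links while the hub needs 8, and the gap widens as systems are added.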
Kafka Benefits
High-Throughput & Low Latency
• Single broker handles 100s of MB of reads/writes per second, from 1000s of clients
• Messages delivered in milliseconds
Scalable & Flexible
• Elastically scale and dynamically add Producers/Consumers without downtime
• Reliable multi-tenant operations with client throttling
Durable & Reliable
• Messages persisted on disk for zero data loss
• Highly available with built-in fault tolerance and replication
Enterprise-Grade
• Over-the-wire encryption and Kerberos authentication for secure streaming
• Robust monitoring and troubleshooting through Cloudera Manager
Simple Integration
• Easy development with APIs to connect with other tools and systems
• Extend capabilities with Cloudera’s partner network
Cost-Efficient
• Modest cluster optimized to handle millions of messages per second
• Single system for multiple, complete workloads with Cloudera
What is Kafka?
• Kafka is…
[Diagram: Kafka sits between Transactions, Metrics, Web logs, Hadoop, Warehouse, Alerting, Audit Logs, and Security.]
What is Kafka?
• Kafka is a distributed, …
[Diagram: the same systems connect to Kafka, which is now shown as three Brokers.]
What is Kafka?
• Kafka is a distributed, topic-oriented, …
[Diagram: Sources 1 to 3 write to Topic 1 and Topic 2 on a Broker; Sink 1 reads Topic 1 and Sink 2 reads Topic 2.]
What is Kafka?
• Kafka is a distributed, topic-oriented, partitioned, …
[Diagram: Sources 1 to 3 and Sinks 1 and 2 as before, with two Brokers. One Broker holds Partition 1 of Topic 1 and Topic 2; the other holds Partition 2 of Topic 1 and Topic 2.]
What is Kafka?
• Kafka is a distributed, topic-oriented, partitioned, replicated commit log.
[Diagram: the same two Brokers, each now also holding a replica of the partitions hosted on the other, so Partitions 1 and 2 of Topic 1 and Topic 2 appear on both.]
What is Kafka?
• Kafka is a distributed, topic-oriented, partitioned, replicated commit log.
• Kafka is also a pub-sub messaging system.
• Messages can be text (e.g. syslog), but binary is best (preferably Avro!).
[Diagram: Sources 1 to 3, Sinks 1 and 2, and two Brokers that each hold Partitions 1 and 2 of Topic 1 and Topic 2.]
Architectural Overview
• Each machine is called a Broker
• Data written belongs to Topics (analogous to a Table in a database)
• Each Topic is partitioned
• Partitions are distributed across the Brokers
• Partitions are also replicated (one replica per partition is the Leader Partition)
• Producers and Consumers talk to the Leader Partition
[Diagram: a Kafka Cluster of Broker 1, Broker 2, and Broker 3. Every Broker holds a replica of Partitions 1, 2, and 3; Broker 1 leads Partition 1, Broker 2 leads Partition 2, and Broker 3 leads Partition 3. Producers and Consumers connect to the cluster.]
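The three-broker layout described on this slide can be reproduced with a small placement sketch. This is an illustration under simple assumptions (round-robin placement, a hypothetical `assign_replicas` helper); Kafka's own replica assignment logic is more involved and is not shown here.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on consecutive brokers, round-robin.

    Illustrative only. Returns
    {partition_id: {"leader": broker, "followers": [brokers]}}.
    """
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the broker count")
    layout = {}
    for partition in range(1, num_partitions + 1):
        # Start each partition on a different broker so leaders are spread out.
        first = (partition - 1) % len(brokers)
        replicas = [brokers[(first + i) % len(brokers)]
                    for i in range(replication_factor)]
        layout[partition] = {"leader": replicas[0], "followers": replicas[1:]}
    return layout


if __name__ == "__main__":
    cluster = ["Broker1", "Broker2", "Broker3"]
    for partition, placement in assign_replicas(3, cluster, 3).items():
        print(partition, placement)
```

Each broker ends up leading one partition and following the other two, which spreads producer and consumer traffic across the cluster.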
Kafka Concepts: Offsets
• An offset is a “message sequence number”
• They always start at 0 (initially)
• Messages expire, the first valid offset is likely > 0
• Offsets are an API-visible feature
• Correct offset handling is key to durable processing
• Turn off auto-commit, commit once messages are processed durably
• Consumers can choose to read from any offset
• Most consumers just read from their last committed offset
• Kafka stores offsets for you
[Diagram: Topic 1, Partition 0 holding offsets 0 to 9, with two Consumers each reading from the log.]
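The "commit once messages are processed durably" advice can be illustrated without a broker. The sketch below is an in-memory stand-in, not a Kafka client: `PartitionLog` and `Consumer` are hypothetical classes, and the committed offset is modelled as the next offset to read.

```python
class PartitionLog:
    """In-memory stand-in for one topic-partition (not a Kafka client)."""

    def __init__(self, messages, first_offset=0):
        self.first_offset = first_offset  # > 0 once old messages have expired
        self.messages = list(messages)

    def read(self, offset):
        """Return (offset, message) pairs from `offset` to the end of the log."""
        start = max(offset, self.first_offset)
        index = start - self.first_offset
        return list(enumerate(self.messages[index:], start=start))


class Consumer:
    """Commits an offset only after the message has been processed."""

    def __init__(self, log):
        self.log = log
        self.committed = log.first_offset  # next offset to read

    def poll_and_process(self, process):
        for offset, message in self.log.read(self.committed):
            process(message)             # if this raises, nothing is committed
            self.committed = offset + 1  # commit only after success


if __name__ == "__main__":
    consumer = Consumer(PartitionLog(["a", "b", "c"]))
    consumer.poll_and_process(print)
    print("committed:", consumer.committed)
```

If processing fails partway through, the next poll resumes at the first unprocessed message, so a message may be seen twice but is not skipped.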
Kafka Concepts: Partitioning
• Partitioning based on partition key
• Writes are interleaved between producers
• Enables high write throughput
• Strategy can be customised
• Writes from a producer stay in order (within a partition)
[Diagram: two Producers append writes to the new end of Partition 0 (offsets 0 to 9), Partition 1 (offsets 0 to 6), and Partition 2 (offsets 0 to 8); offsets run from Old to New.]
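Key-based partitioning can be sketched as a hash of the key modulo the partition count. The example below is a minimal illustration: `partition_for` and `append_all` are hypothetical helpers, and CRC32 is a stand-in hash chosen for determinism, not the hash Kafka's own partitioners use.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition; the same key always lands in the same place."""
    # CRC32 is an assumed stand-in hash, used here only for illustration.
    return zlib.crc32(key) % num_partitions


def append_all(records, num_partitions):
    """Append (key, value) records to per-partition lists in arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


if __name__ == "__main__":
    records = [(b"user-1", 1), (b"user-2", 1), (b"user-1", 2), (b"user-1", 3)]
    for number, contents in enumerate(append_all(records, 3)):
        print(number, contents)
```

Because every record with a given key goes to one partition, the per-key write order is preserved there even though writes for other keys are interleaved.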
Kafka Concepts: Consumer Groups
• A consumer group is a “logical subscriber”
• Each downstream system/topic gets a group
• A consumer group will see every message in a topic
• But a given member will only see a subset
• Consumer Groups handle read resiliency
• If a consumer dies, others pick up the load
[Diagram: Partitions 0, 1, and 2 are read by four consumer groups: A (Consumer A), B (Consumers B1, B2), C (Consumers C1 to C3), and D (Consumers D1 to D4).]
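How a group divides partitions among its members, and what happens when one dies, can be sketched as a simple round-robin assignment. The `assign` helper is hypothetical and only illustrates the idea; Kafka's actual assignment strategies are not reproduced here.

```python
def assign(partitions, members):
    """Spread partitions over group members round-robin.

    Members beyond the partition count receive nothing and sit idle.
    """
    assignment = {member: [] for member in members}
    for i, partition in enumerate(partitions):
        assignment[members[i % len(members)]].append(partition)
    return assignment


if __name__ == "__main__":
    partitions = [0, 1, 2]
    print(assign(partitions, ["C1", "C2", "C3"]))        # one partition each
    print(assign(partitions, ["D1", "D2", "D3", "D4"]))  # D4 is idle
    print(assign(partitions, ["C1", "C3"]))              # after C2 dies
```

Re-running the assignment with the surviving members is what lets the rest of the group pick up a dead consumer's partitions.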
Replication
• Partitions have a leader and (possibly multiple) followers
• All writes go to the leader
• Leader considers the write committed once all replicas (that are currently in-sync) have it
[Diagram: Topic 1 Partition 1 is held on three Brokers, one Leader and two Followers. 1. The Producer writes to the Leader. 2. Each Follower reads from the Leader. 3. The Leader commits. 4. The Leader acks the Producer.]
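The commit rule can be expressed as a minimum over the in-sync replicas. In this sketch the broker names and the `committed_offset` helper are hypothetical, and each replica is represented only by the number of messages it holds.

```python
def committed_offset(log_end_offsets, in_sync):
    """Return how many messages every in-sync replica already has.

    log_end_offsets maps replica name -> number of messages in its log.
    Messages below the returned offset count as committed.
    """
    return min(log_end_offsets[replica] for replica in in_sync)


if __name__ == "__main__":
    ends = {"leader": 10, "follower1": 10, "follower2": 7}
    # A lagging follower that is still in sync holds the commit point back.
    print(committed_offset(ends, {"leader", "follower1", "follower2"}))
    # Once it drops out of the in-sync set, the commit point can advance.
    print(committed_offset(ends, {"leader", "follower1"}))
```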
Durable Writes
• Producers can choose to trade throughput for durability of writes:
Durability  Behaviour                                    Throughput         Required Acknowledgements (request.required.acks)
Highest     ACK once min.insync.replicas have received   Lowest (2.05 ms)   -1
Medium      ACK once the leader has received             Medium (1.05 ms)   1
Lowest      No ACKs required                             Highest (0.29 ms)  0
• Since throughput can also be raised with more brokers… (so do this instead)!
• A sane configuration:
Property               Value
replication            3
min.insync.replicas    2
request.required.acks  -1
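A quick way to reason about the configuration above is to count how many replicas can be lost before fully acknowledged writes stop being accepted. The helper below is a hypothetical sketch and assumes every surviving replica is in sync.

```python
def accepts_durable_writes(replication, min_insync_replicas, replicas_down):
    """True if writes needing full acknowledgement can still succeed.

    Assumes every surviving replica is in sync with the leader.
    """
    in_sync = replication - replicas_down
    return in_sync >= min_insync_replicas


if __name__ == "__main__":
    # replication = 3, min.insync.replicas = 2, as in the table above.
    for down in range(3):
        print(down, accepts_durable_writes(3, 2, down))
```

With replication 3 and min.insync.replicas 2, one replica can be lost while durable writes continue; losing a second one blocks them rather than risking data loss.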
Resilience: What happens when a broker dies?
• Uncommitted messages are lost… :(
• For topic-partitions where the lost broker was leader:
• Leader election takes place between followers
• Producers didn’t receive an ACK for the uncommitted messages
• Writes retried against newly available leader post-election
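The election step can be sketched as picking a surviving replica that was in sync with the lost leader. The `elect_leader` helper and broker names are hypothetical, and the "first eligible replica wins" rule is an assumption made for illustration.

```python
def elect_leader(replicas, in_sync, dead_broker):
    """Return the first surviving in-sync replica, or None if there is none."""
    for broker in replicas:
        if broker != dead_broker and broker in in_sync:
            return broker
    return None


if __name__ == "__main__":
    replicas = ["broker1", "broker2", "broker3"]
    # broker2 had fallen behind, so only broker3 is eligible to take over.
    print(elect_leader(replicas, {"broker1", "broker3"}, "broker1"))
```

Producers that never received an ACK would then retry their writes against the broker returned here.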
Usage
• Create a topic*:
bin/kafka-topics.sh --zookeeper zkhost:2181/kafka --create --topic foo --replication-factor 1 --partitions 1
• List topics:
bin/kafka-topics.sh --zookeeper zkhost:2181/kafka --list
• Write data:
cat data | bin/kafka-console-producer.sh --broker-list brokerhost:9092 --topic test
• Read data:
bin/kafka-console-consumer.sh --zookeeper zkhost:2181/kafka --topic test --from-beginning
* Deleting a topic is not possible in 0.8.2
Flume-Kafka Integration (Flafka)
• Flume Sink (Producer):
flume1.sinks.kafka-sink-1.type = org.apache.flume.sink.kafka.KafkaSink
flume1.sinks.kafka-sink-1.batchSize = 5
flume1.sinks.kafka-sink-1.brokerList = kafka1.ent.cloudera.com:9092
flume1.sinks.kafka-sink-1.topic = test
[Diagram: Logs, Spool Source, Channel, Kafka Sink, Kafka.]
• Flume Source (Consumer):
flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
flume1.sources.kafka-source-1.zookeeperConnect = flume1.ent.cloudera.com:2181/kafka
flume1.sources.kafka-source-1.topic = test
flume1.sources.kafka-source-1.batchSize = 100
[Diagram: Kafka, Kafka Source, Channel, HDFS Sink, HDFS.]
Flume-Kafka Integration (Flafka)
• Flume Channel (Producer/Consumer)
• A distributed replacement for File Channel (sweet!)
• Available in CDH 5.3
• Enables some funky Flume-ness:
[Diagrams: three Flume topologies built on a Kafka Channel: (1) Logs, Kafka Source, Kafka Channel, HDFS Sink, HDFS; (2) Logs, Kafka Source, Kafka Channel, with no sink; (3) Kafka Channel, HDFS Sink, HDFS, with no source.]
Canonical Use Cases
• Real-Time Stream Processing
• General-Purpose Message Bus
• User Activity Data Collection
• Operational Metrics Collection (applications, servers, or devices)
• Log Aggregation
• Change Data Capture
• Distributed Systems Commit Log
© 2014 Cloudera, Inc. All rights reserved.
Numbers from LinkedIn
• ~300 TB of log data per cluster
• Tens of thousands of producers
• Thousands of consumers
• 7 million messages written/sec
• 35 million messages consumed/sec
• Supports both Streaming and Batch clients
Stream Processing Integration: Canonical Architecture with Kafka
• Integrate with Apache Spark Streaming for real-time analysis of data
• Write back to Kafka for further processing or to send to an application layer
[Diagram: Data Sources feed Kafka/Flume, then Spark Streaming, with outputs to HDFS, HBase/Kudu, Solr, and back to Kafka for the Application and Real-Time Notifications.]
Cloudera + Kafka
Community involvement and contribution:
• Spearheading added enterprise features to Kafka
• Identified and fixed core architectural issues to make Kafka fully reliable
• Strong relationship with Confluent.io and other Kafka committers
Support expertise and experience:
• Multiple production customers
• Support team trained by Kafka committers
Integrated with Cloudera’s production-ready platform:
• Cloudera Manager CSD makes it easy to deploy, configure, and monitor Kafka clusters
• End-to-end workloads with other components, all on a single system
• Leading security, governance, administration, and partner network
Cloudera Leads Delivery of Real-Time Capabilities
[Timeline of Real-Time Capabilities:]
• 2011: Real-Time Ingest. Developed Flume.
• 2012: Real-Time Ingest and Serving. First commercial HBase offering.
• 2013: Real-Time Ingest, Serving, Processing, and Querying. Developed Impala; first commercial Spark & Spark Streaming offering.
• 2015: Real-Time Continuous Ingest, Serving, Processing, Querying, and On-the-Fly Analytics. Kafka integration.