Date post: 02-May-2020
Kafka – 101
Everything you wanted to know about Kafka – but were afraid to ask
Scott Crawford | Sr. System Engineer, Cloudera

© Cloudera, Inc. All rights reserved.

Cloudera Enterprise
Making Hadoop Fast, Easy, and Secure

A new kind of data platform:
•  One place for unlimited data
•  Unified, multi-framework data access

Cloudera makes it:
•  Fast for business
•  Easy to manage
•  Secure without compromise

[Diagram: Cloudera Enterprise platform. Layers: Process, Analyze, Serve (Batch, Stream, SQL, Search, SDK); Unified Services (Resource Management, Security); Store (Filesystem, Relational, NoSQL); Integrate (Structured, Unstructured); with Operations and Data Management alongside]

Ingest
The Foundation of Hadoop’s Potential

Data can come from a variety of “siloed” sources
•  Existing databases
•  Sensor data
•  Server logs
•  Chat transcripts

Value of data is multiplied when combined and correlated with other data
•  “40% value improvement from combining data from multiple IoT sources” (McKinsey Global Institute)

Apache Sqoop
SQL to Hadoop

•  Efficiently exchange data between database and Hadoop
   •  Bidirectional
   •  Import all or partial/new data
   •  Export for shared data access across systems

•  Easily get started with high performance connectors
   •  Free to use
   •  Optimized connectors for popular RDBMS, EDW, and NoSQL options

Apache Flume
Log & Event Aggregation for Hadoop

•  Efficiently move large amounts of streaming/log data
   •  Easily collect data from multiple systems (sources)
   •  Built-in sources, sinks, and channels
   •  Customize data flow to transform data on-the-fly

•  Reliable, scalable, and extensible for production
   •  Manage and monitor with Cloudera Manager

Apache Kafka
Pub-Sub Messaging for Hadoop

•  Backbone for real-time architectures
   •  Fast, flexible messaging for a wide range of use cases
   •  Scale to support more data sources and growing data volumes
   •  Zero data loss durability and always-on fault-tolerance
   •  Built-in security and data protection

•  Seamless integration across the platform
   •  Connect to Flume, Spark Streaming, HBase, and more
   •  Manage and monitor with Cloudera Manager

A Deeper Look at Kafka

Why Kafka? A brief history

Can you make Flume do pull instead of push?

No.

Why Kafka? (Or rather, why didn’t Flume suffice for LinkedIn?)

•  No ability to replay events
•  Multiple sinks require event replication (via multiple channels)
•  Sinks that share a source (mostly) process events in sync
   •  This is temporal coupling (which kinda smells bad)

[Diagram: two Flume agents (Spool Source → Channel → Avro Sink) read Logs and More Logs and forward to a third agent, whose Avro Source feeds two Channels: one drained by an HBase Sink into HBase, the other by an HDFS Sink into HDFS]

Why Kafka?

[Diagram, 2009: Web logs → Hadoop. Connections = O(1)]

Why Kafka? Increasing complexity

[Diagram, 2009: Web logs → Hadoop. Connections = O(1)]
[Diagram, 2014: Transactions, Metrics, Web logs, and Audit Logs are wired point to point to Hadoop, Warehouse, Alerting, and Security. Connections = O(Systems²)]

Why Kafka? Decoupling

[Diagram, 2014: Transactions, Metrics, Web logs, and Audit Logs are wired point to point to Hadoop, Warehouse, Alerting, and Security. Connections = O(Systems²)]
[Diagram, 2015+?: the same systems each connect only to Kafka in the middle. Connections = O(Systems)]
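
The scaling claim on the decoupling slide can be checked with simple arithmetic. The sketch below is an illustration that is not part of the deck: it takes the four sources and four destinations named in the diagram, assumes the worst case where every source feeds every destination, and counts the links needed with and without a central hub. The function names are made up for this example.

```python
# Systems named in the slide's diagram: four data sources, four destinations.
sources = ["Transactions", "Metrics", "Web logs", "Audit Logs"]
destinations = ["Hadoop", "Warehouse", "Alerting", "Security"]


def point_to_point_links(n_sources, n_destinations):
    """Worst case without a hub: every source connects to every destination."""
    return n_sources * n_destinations


def hub_links(n_sources, n_destinations):
    """With a hub (Kafka): every system connects exactly once, to the hub."""
    return n_sources + n_destinations


direct = point_to_point_links(len(sources), len(destinations))
via_hub = hub_links(len(sources), len(destinations))
print(f"point-to-point: {direct} links, via Kafka: {via_hub} links")
```

Adding a ninth system costs up to four new links in the point-to-point layout but only one when everything goes through Kafka.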

Kafka Benefits

High-Throughput & Low Latency
•  Single broker handles 100 MBs of reads/writes per second, from 1000s of clients
•  Messages delivered in milliseconds

Scalable & Flexible
•  Elastically scale and dynamically add Producers/Consumers without downtime
•  Reliable multi-tenant operations with client throttling

Durable & Reliable
•  Messages persisted on disk for zero data loss
•  Highly-available with built-in fault-tolerance and replication

Enterprise-Grade
•  Over-the-wire encryption and Kerberos authentication for secure streaming
•  Robust monitoring and troubleshooting through Cloudera Manager

Simple Integration
•  Easy development with APIs to connect with other tools and systems
•  Extend capabilities with Cloudera’s partner network

Cost-Efficient
•  Modest cluster optimized to handle millions of messages per second
•  Single system for multiple, complete workloads with Cloudera

What is Kafka?

•  Kafka is…

[Diagram: Transactions, Metrics, Web logs, and Audit Logs → Kafka → Hadoop, Warehouse, Alerting, and Security]

What is Kafka?

•  Kafka is a distributed, …

[Diagram: as on the previous slide, with Kafka shown as three Brokers]

What is Kafka?

•  Kafka is a distributed, topic-oriented, …

[Diagram: Source 1, Source 2, and Source 3 write to Topic 1 and Topic 2 on a Broker; Sink 1 and Sink 2 read from the topics]

What is Kafka?

•  Kafka is a distributed, topic-oriented, partitioned, …

[Diagram: as on the previous slide, with each topic split into Partition 1 and Partition 2, spread over two Brokers]

What is Kafka?

•  Kafka is a distributed, topic-oriented, partitioned, replicated commit log.

[Diagram: as on the previous slide, with every partition of Topic 1 and Topic 2 also copied to the other Broker]

What is Kafka?

•  Kafka is a distributed, topic-oriented, partitioned, replicated commit log.
•  Kafka is also a pub-sub messaging system.
•  Messages can be text (e.g. syslog), but binary is best (preferably Avro!).

[Diagram: Source 1, Source 2, and Source 3 write to Topic 1 and Topic 2; each topic has Partition 1 and Partition 2, replicated across two Brokers; Sink 1 and Sink 2 read from the topics]

Architectural Overview

•  Each machine is called a Broker
•  Data written belongs to Topics (analogous to a Table in a database)
•  Each Topic is partitioned
•  Partitions are distributed across the Brokers
•  Partitions are also replicated (one replica per partition is Leader Partition)
•  Producers and Consumers talk to the Leader Partition

[Diagram: Kafka Cluster of three Brokers. Broker 1 holds Partition 1 (Leader), Partition 2, and Partition 3; Broker 2 holds Partition 2 (Leader), Partition 1, and Partition 3; Broker 3 holds Partition 3 (Leader), Partition 1, and Partition 2. Producers and Consumers connect to the cluster]

Kafka Concepts: Offsets

•  An offset is a “message sequence number”
   •  They always start at 0 (initially)
   •  Messages expire, so the first valid offset is likely > 0

•  Offsets are an API-visible feature
   •  Correct offset handling is key to durable processing
   •  Turn off auto-commit, commit once messages are processed durably

•  Consumers can choose to read from any offset
   •  Most consumers just read from their last committed offset
   •  Kafka stores offsets for you

[Diagram: Topic 1, Partition 0 holds offsets 0 to 9; two Consumers each Read at their own position in the log]
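
The offset rules above can be illustrated with a toy, in-memory model. This is a sketch rather than Kafka's client API: `PartitionLog`, `consume`, and the group name `etl` are invented for the example. It shows offsets starting at 0, the first valid offset moving forward as messages expire, and a consumer that commits only after a message has been processed.

```python
class PartitionLog:
    """Toy stand-in for one topic partition: an append-only list with offsets."""

    def __init__(self):
        self.first_offset = 0   # offset of the oldest message still retained
        self.messages = []      # messages[i] has offset first_offset + i

    def append(self, message):
        self.messages.append(message)
        return self.first_offset + len(self.messages) - 1  # offset just assigned

    def expire(self, count):
        """Drop the oldest `count` messages, as a retention policy would."""
        count = min(count, len(self.messages))
        del self.messages[:count]
        self.first_offset += count

    def read(self, offset):
        """Return (offset, message) pairs from `offset` to the end of the log."""
        start = max(offset, self.first_offset)
        end = self.first_offset + len(self.messages)
        return [(o, self.messages[o - self.first_offset]) for o in range(start, end)]


def consume(group, log, committed, sink, limit):
    """Process up to `limit` messages, committing each offset after processing."""
    position = committed.get(group, log.first_offset)
    for offset, message in log.read(position)[:limit]:
        sink.append(message)           # stand-in for durable processing
        committed[group] = offset + 1  # commit only once the message is stored


log = PartitionLog()
for i in range(10):
    log.append(f"event-{i}")           # offsets 0..9, as in the slide's diagram
log.expire(3)                          # first valid offset is now 3, not 0

committed = {}                         # consumer group -> next offset to read
processed = []
consume("etl", log, committed, processed, limit=4)
after_first_run = committed["etl"]
consume("etl", log, committed, processed, limit=10)  # a restart resumes at the commit
```

Because the commit follows the processing step, a crash between the two would replay a message rather than skip it.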

Kafka Concepts: Partitioning

•  Partitioning based on partition key
•  Writes are interleaved between producers
•  Enables high write throughput
•  Strategy can be customised
•  Writes from a producer stay in order (within a partition)

[Diagram: Partition 0 (offsets 0 to 9), Partition 1 (offsets 0 to 6), and Partition 2 (offsets 0 to 8), each ordered Old to New; two Producers send Writes to the New end of the partitions]
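
Key-based partitioning can be sketched in a few lines. The hash below (CRC32) is a stand-in chosen for this illustration and is not the function Kafka's own partitioner uses; `choose_partition` and `send` are hypothetical names. The point is only that one key always maps to the same partition, so its messages stay in order there.

```python
import zlib

NUM_PARTITIONS = 3


def choose_partition(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition. CRC32 is an assumed stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# One append-only list per partition.
partitions = {p: [] for p in range(NUM_PARTITIONS)}


def send(key, value):
    """Append to the key's partition and return (partition, offset)."""
    p = choose_partition(key)
    partitions[p].append((key, value))
    return p, len(partitions[p]) - 1


# Two producers' writes interleave, but each key keeps its own order.
for i in range(5):
    send("user-a", f"a{i}")
    send("user-b", f"b{i}")

partition_of_a = choose_partition("user-a")
values_for_a = [v for k, v in partitions[partition_of_a] if k == "user-a"]
```

Swapping `choose_partition` for another function is what "strategy can be customised" amounts to in this model.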

Kafka Concepts: Consumer Groups

•  A consumer group is a “logical subscriber”
   •  Each downstream system/topic gets a group
•  A consumer group will see every message in a topic
   •  But a given member will only see a subset
•  Consumer Groups handle read resiliency
   •  If a consumer dies, others pick up the load

[Diagram: Partition 0, Partition 1, and Partition 2 are read by four consumer groups: A (Consumer A), B (Consumers B1 and B2), C (Consumers C1 to C3), and D (Consumers D1 to D4)]
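
A rough model of how a group shares partitions among its members, and what happens when one dies. The round-robin rule in `assign` is a simplification picked for this sketch, not necessarily the assignment strategy a real Kafka consumer uses; the member names follow the diagram.

```python
def assign(partitions, members):
    """Spread partitions over the live members of one consumer group (round-robin)."""
    assignment = {m: [] for m in members}
    for i, partition in enumerate(partitions):
        assignment[members[i % len(members)]].append(partition)
    return assignment


partitions = [0, 1, 2]

# Group B: two members share three partitions, so each sees only a subset.
group_b = assign(partitions, ["B1", "B2"])

# Group D: four members, three partitions. In this model one member sits idle.
group_d = assign(partitions, ["D1", "D2", "D3", "D4"])

# B1 dies: reassigning over the survivors lets B2 pick up the whole load.
group_b_after_failure = assign(partitions, ["B2"])
```

Every group is assigned all three partitions between its members, which is why each group as a whole still sees every message.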

Replication

•  Partitions have a leader and (possibly multiple) followers
•  All writes go to the leader
•  Leader considers the write committed once all replicas (that are currently in-sync) have it

[Diagram: Topic 1 Partition 1 is held on three Brokers, one Leader and two Followers. 1. The Producer writes to the Leader; 2. each Follower reads from the Leader; 3. Commit; 4. Ack to the Producer]
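
The four numbered steps in the diagram can be traced with a small model. This is a sketch under simplifying assumptions (followers copy the leader's whole log, and both followers count as in-sync); the class and method names are invented for the example.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []


class Partition:
    def __init__(self, leader, followers):
        self.leader = leader
        self.in_sync = [leader] + followers  # in-sync replica set, leader included
        self.committed = 0                   # messages present on every in-sync replica

    def write(self, message):
        """Step 1: the producer writes to the leader."""
        self.leader.log.append(message)

    def fetch(self, follower):
        """Step 2: a follower reads from the leader (copies its log here)."""
        follower.log = list(self.leader.log)

    def commit(self):
        """Step 3: committed means all in-sync replicas have the message."""
        self.committed = min(len(r.log) for r in self.in_sync)
        return self.committed

    def acked(self, offset):
        """Step 4: the producer is acked only for committed offsets."""
        return offset < self.committed


leader, f1, f2 = Replica("broker1"), Replica("broker2"), Replica("broker3")
partition = Partition(leader, [f1, f2])

partition.write("m0")
partition.fetch(f1)
partition.commit()
ack_before = partition.acked(0)   # one follower still lacks m0

partition.fetch(f2)
partition.commit()
ack_after = partition.acked(0)    # now every in-sync replica has m0
```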

Durable Writes

•  Producers can choose to trade throughput for durability of writes:

   Durability   Behaviour                                     Throughput          Required Acknowledgements (request.required.acks)
   Highest      ACK once min.insync.replicas have received    Lowest (2.05 ms)    -1
   Medium       ACK once the leader has received              Medium (1.05 ms)    1
   Lowest       No ACKs required                              Highest (0.29 ms)   0

•  Since throughput can also be raised with more brokers… (so do this instead)!
•  A sane configuration:

   Property                 Value
   replication              3
   min.insync.replicas      2
   request.required.acks    -1
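
To see what the sane configuration buys, the sketch below walks through zero, one, and two failed brokers. It models only the acknowledgement rule: `write_outcome` is a hypothetical helper, and `replication` is used as a dictionary key simply because that is the label on the slide.

```python
# Values from the slide's "sane configuration" table.
SANE = {"replication": 3, "min.insync.replicas": 2, "request.required.acks": -1}


def write_outcome(acks, in_sync_replicas, min_insync_replicas):
    """Describe what a producer waits for under each request.required.acks value."""
    if acks == 0:
        return "sent, no acknowledgement awaited"
    if acks == 1:
        return "acknowledged once the leader has the message"
    if acks == -1:
        if in_sync_replicas < min_insync_replicas:
            return "rejected: not enough in-sync replicas"
        return f"acknowledged once all {in_sync_replicas} in-sync replicas have the message"
    raise ValueError(f"unsupported acks value: {acks}")


# Outcome of a write with 0, 1, and 2 of the three replicas out of sync.
outcomes = [
    write_outcome(
        SANE["request.required.acks"],
        SANE["replication"] - down,
        SANE["min.insync.replicas"],
    )
    for down in range(3)
]
for down, outcome in enumerate(outcomes):
    print(f"{down} replica(s) down: {outcome}")
```

With these values a single broker failure does not stop writes, while a second one makes the partition refuse writes instead of accepting data that only one copy would hold.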

Resilience – What happens when a broker dies?

•  Uncommitted messages are lost… :(
•  For topic-partitions where the lost broker was leader:
   •  Leader election takes place between followers
•  Producers didn’t receive an ACK for the uncommitted messages
   •  Writes retried against newly available leader post-election
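
The failure sequence above can be played out on a toy set of replica logs. This is a sketch: the election rule (first surviving follower by name) is an arbitrary choice for the example, both followers are assumed to be in sync, and the broker and message names are made up.

```python
def elect_leader(logs, failed_leader):
    """Pick a new leader among the surviving followers (arbitrary rule in this sketch)."""
    survivors = sorted(name for name in logs if name != failed_leader)
    if not survivors:
        raise RuntimeError("no surviving replica to elect")
    return survivors[0]


# Replica logs for one topic-partition; broker1 is the leader.
logs = {
    "broker1": ["m0", "m1", "m2"],  # m2 reached only the leader: uncommitted
    "broker2": ["m0", "m1"],
    "broker3": ["m0", "m1"],
}
unacked = ["m2"]                    # the producer never received an ACK for m2

new_leader = elect_leader(logs, "broker1")
del logs["broker1"]                 # broker1 dies, taking its copy of m2 with it
lost_from_log = "m2" not in logs[new_leader]

# The producer retries everything it holds no ACK for, against the new leader.
for message in unacked:
    logs[new_leader].append(message)
```

The uncommitted message is lost from the cluster's point of view, but because the producer never saw an ACK it still has the message and can resend it.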

Performance

Usage

•  Create a topic*:
     bin/kafka-topics.sh --zookeeper zkhost:2181/kafka --create --topic foo --replication-factor 1 --partitions 1

•  List topics:
     bin/kafka-topics.sh --zookeeper zkhost:2181/kafka --list

•  Write data:
     cat data | bin/kafka-console-producer.sh --broker-list brokerhost:9092 --topic test

•  Read data:
     bin/kafka-console-consumer.sh --zookeeper zkhost:2181/kafka --topic test --from-beginning

* Deleting a topic is not possible in 0.8.2

Flume-Kafka Integration (Flafka)

•  Flume Source (Consumer):

     flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
     flume1.sources.kafka-source-1.zookeeperConnect = flume1.ent.cloudera.com:2181/kafka
     flume1.sources.kafka-source-1.topic = test
     flume1.sources.kafka-source-1.batchSize = 100

   [Diagram: Kafka → Kafka Source → Channel → HDFS Sink → HDFS]

•  Flume Sink (Producer):

     flume1.sinks.kafka-sink-1.type = org.apache.flume.sink.kafka.KafkaSink
     flume1.sinks.kafka-sink-1.batchSize = 5
     flume1.sinks.kafka-sink-1.brokerList = kafka1.ent.cloudera.com:9092
     flume1.sinks.kafka-sink-1.topic = test

   [Diagram: Logs → Spool Source → Channel → Kafka Sink → Kafka]

Flume-Kafka Integration (Flafka)

•  Flume Channel (Producer/Consumer)
   •  A distributed replacement for File Channel (sweet!)
   •  Available in CDH 5.3
   •  Enables some funky Flume-ness:

[Diagrams: three Flume agent layouts that use a Kafka Channel (backed by Kafka), variously combined with a source reading Logs and an HDFS Sink writing to HDFS]

Canonical Use Cases

•  Real-Time Stream Processing
•  General-Purpose Message Bus
•  User Activity Data Collection
•  Operational Metrics Collection (applications, servers, or devices)
•  Log Aggregation
•  Change Data Capture
•  Distributed Systems Commit Log

Numbers from LinkedIn

•  ~300 TB of log data per cluster
•  Tens of thousands of producers
•  Thousands of consumers
•  7 million messages written/sec
•  35 million messages consumed/sec
•  Supports both Streaming and Batch clients

© 2014 Cloudera, Inc. All rights reserved.

Stream Processing Integration
Canonical Architecture with Kafka

•  Integrate with Apache Spark Streaming for real-time analysis of data
•  Write back to Kafka for further processing or to send to an application layer

[Diagram: Data Sources flow through Kafka/Flume into Spark Streaming, with HDFS, HBase/Kudu, Solr, Kafka, an Application, and Real-Time Notifications downstream]

Cloudera + Kafka

Community involvement and contribution:
•  Spearheading added enterprise features to Kafka
•  Identified and fixed core architectural issues to make Kafka fully reliable
•  Strong relationship with Confluent.io and other Kafka Committers

Support expertise and experience:
•  Multiple production customers
•  Support team trained by Kafka Committers

Integrated with Cloudera’s production-ready platform:
•  Cloudera Manager CSD makes it easy to deploy, configure, and monitor Kafka clusters
•  End-to-end workloads with other components, all on a single system
•  Leading security, governance, administration, and partner network

Cloudera Leads Delivery of Real-Time Capabilities

[Timeline: Real-Time Capabilities by year]
2011  Developed Flume. Real-Time: Ingest
2012  First commercial HBase offering. Real-Time: Ingest, Serving
2013  Developed Impala; first commercial Spark & Spark Streaming offering. Real-Time: Ingest, Serving, Processing, Querying
2015  Kafka integration. Real-Time: Continuous Ingest, Serving, Processing, Querying, On-the-Fly Analytics

Thank You