Real-Time Data Ingestion and Hybrid Cloud

Posted on 08-Jan-2017

Great Ideas…. Simple Solutions

Data Ingestion Platform (DiP)
Neeraj Sabharwal, @allaboutbdata

About me

Xavient Corporate Overview

• Head of Cloud, Data & Analytics @ Xavient
• Spent a couple of years @ Hortonworks
• Over a decade in the Cloud & Data domain
• Started career as an Oracle DBA

Disclosure: more memes coming up…

Agenda

Platform

Data Access

Hybrid Cloud

Data Ingestion Platform (DiP)

Before we start…

** Near real time is OK, as I am easygoing, but no more waiting hours or days for data

Problem

UI/API, Platform, Data Access

No… near-real-time access

Cloud

Shifting gears: let's get technical

Streaming Blueprint

Data Collection, Messaging Tier, Streaming Engine, Analysis Tier, In-memory Data Store, Data Access

Messaging Bus

• Open-source message broker
• Unified, high-throughput, low-latency platform for handling real-time data feeds
• Massively scalable pub/sub message queue architected as a distributed transaction log
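
To make the "pub/sub queue as a transaction log" idea concrete, here is a minimal in-memory sketch. It is not the broker's API: `TopicLog`, `publish`, `poll` and the group names are hypothetical, and a real broker adds partitions, replication and persistence.

```python
from collections import defaultdict


class TopicLog:
    """Append-only log for one topic; a toy stand-in for a broker partition."""

    def __init__(self):
        self._records = []                 # records are only ever appended
        self._offsets = defaultdict(int)   # next offset to read, per consumer group

    def publish(self, record):
        """Append a record and return the offset it was written at."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group, max_records=10):
        """Return unread records for a group and advance that group's offset."""
        start = self._offsets[group]
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch


log = TopicLog()
for event in ("click", "view", "purchase"):
    log.publish(event)

# Two independent groups each see the full stream, because reading does not
# remove records from the log; each group only moves its own offset.
analytics = log.poll("analytics")
archiver = log.poll("archiver", max_records=2)
```

Because consumers track offsets instead of deleting messages, a slow consumer (here `archiver`) can catch up later without affecting the others.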

Emotions

Streaming engines

Storm - Distributed real-time computation system for processing large volumes of high-velocity data

Flink - Streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams

Apex - Enterprise-grade unified stream and batch processing engine

Spark Streaming - Apache Spark's language-integrated API for stream processing, letting you write streaming jobs the same way you write batch jobs. It supports Java, Scala and Python

CTM

Platform (DiP)

Features

Easy-to-use UI

Multiple streaming engines

Supports XML, JSON and TSV data formats

Manual data entry via UI

Upload files for batch processing

Hybrid Cloud

Batch and real-time views of data

Data visualization and analytics

YARN features

Data Ingestion Platform
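
As an illustration of the XML, JSON and TSV support listed above, the sketch below normalizes one message of each format into the same flat record. This is an assumption about how such a step could look, not DiP's actual code: `parse_record`, the field names and the "first TSV line is a header" convention are all hypothetical.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def parse_record(payload, fmt):
    """Turn one incoming message into a flat dict of field name -> string value."""
    if fmt == "json":
        return {key: str(value) for key, value in json.loads(payload).items()}
    if fmt == "xml":
        # Each child element of the root becomes one field.
        root = ET.fromstring(payload)
        return {child.tag: (child.text or "") for child in root}
    if fmt == "tsv":
        # Hypothetical convention: the first line holds the column names.
        reader = csv.DictReader(io.StringIO(payload), delimiter="\t")
        return dict(next(reader))
    raise ValueError(f"unsupported format: {fmt}")


json_record = parse_record('{"id": 1, "status": "ok"}', "json")
xml_record = parse_record("<event><id>1</id><status>ok</status></event>", "xml")
tsv_record = parse_record("id\tstatus\n1\tok\n", "tsv")
```

Normalizing early means whichever streaming engine is chosen downstream only has to handle one record shape.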

Use Cases: Any Data

Sentiment Analysis, Log Analysis

Clickstream Analysis, Analyze Machine and Sensor Data

Social Media and Customer Sentiment

UI

https://techblog.xavient.com/

What was in the previous slide? Is that for real?

No more memes… Enough now :)

DiP Technology Stack

Messaging System: Apache Kafka
Target System: HDFS, NoSQL, Apache Hive
Reporting System: Apache Phoenix, Apache Zeppelin
Source System: Web Client
Streaming APIs: Apache Apex, Apache Flink, Apache Spark and Apache Storm
Programming Language: Java
IDE: Eclipse
Build Tool: Apache Maven
Operating System: CentOS 7

DiP High-Level Architecture

DiP using Storm

Apache Storm features

• Multiple processing paradigms - real-time, interactive and batch processes
• Reliable - each unit of data (tuple) will be processed at least once or exactly once
• Fast and scalable - parallel calculations are run across a cluster of machines
• Fault-tolerant - workers are automatically restarted if they die
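
The "at least once" guarantee can be sketched as an ack-and-replay loop: a tuple is only considered done once its handler succeeds, and a failed tuple is queued again. This is a simplified illustration in plain Python, not Storm's API; `process_at_least_once`, `flaky_handler` and `max_attempts` are hypothetical names.

```python
from collections import deque


def process_at_least_once(tuples, handler, max_attempts=3):
    """Run handler on every tuple, replaying a tuple whenever handler fails.

    Returns the list of (tuple, attempts) pairs that were acknowledged.
    """
    pending = deque((item, 1) for item in tuples)
    acked = []
    while pending:
        item, attempt = pending.popleft()
        try:
            handler(item)
        except Exception:
            if attempt >= max_attempts:
                raise                               # give up instead of looping forever
            pending.append((item, attempt + 1))     # replay the tuple later
        else:
            acked.append((item, attempt))           # ack: the tuple is done
    return acked


seen = []


def flaky_handler(item):
    """Hypothetical handler that fails the first time it sees 'b'."""
    seen.append(item)
    if item == "b" and seen.count("b") == 1:
        raise RuntimeError("simulated worker failure")


result = process_at_least_once(["a", "b", "c"], flaky_handler)
```

Note that `seen` records "b" twice: with at-least-once delivery, a handler may observe the same tuple more than once, so side effects should be idempotent unless exactly-once semantics are used.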

DiP using Spark Streaming

Spark Streaming features

• Multiple processing paradigms - batch and interactive
• Ease of use - contains high-level operators written in Java, Scala and Python
• Fault tolerance - lost work and operator state can be recovered with no extra code
• Code reusability - the same code can be used for batch processing, to join streams against historical data, or to run ad-hoc queries on stream state
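
The code-reusability point can be illustrated without any cluster: write the logic once, then call it on a whole dataset or on small batches as they arrive. The sketch below is plain Python and only mimics the micro-batch idea; it does not use the Spark API, and `count_words`, `run_batch` and `run_streaming` are hypothetical names.

```python
from collections import Counter


def count_words(lines):
    """Business logic written once: count words in an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def run_batch(lines):
    """Batch mode: the whole dataset is available up front."""
    return count_words(lines)


def run_streaming(micro_batches):
    """Streaming mode: apply the same function to each small batch as it
    arrives and fold the result into running state."""
    state = Counter()
    for batch in micro_batches:
        state.update(count_words(batch))
        yield dict(state)


history = ["error disk full", "info job done", "error timeout"]
batch_result = run_batch(history)
stream_results = list(run_streaming([history[:2], history[2:]]))
```

After the last micro-batch, the streaming state matches the batch result, which is the property that lets one codebase serve both views of the data.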

DiP using Apex

Apache Apex features

• Modular - Malhar, a library of operators, comes bundled with Apex for quick development cycles
• Supports both stream and batch processing
• Supports operator exchange at runtime
• Supports fault tolerance and dynamic scaling

DiP using Flink

Apache Flink features

Multiple processing paradigms - distributed stream and batch processing. Several APIs for creating applications are supported:

• DataStream API for unbounded streams, embedded in Java and Scala
• DataSet API for static data, embedded in Java, Scala, and Python
• Table API with a SQL-like expression language, embedded in Java and Scala

Fault tolerance for distributed computations over data streams

DiP-Druid Architecture (High Level)

Credit: https://imply.io/docs/latest/

https://techblog.xavient.com/kafka-druid-integration-with-ingestion-dip-real-time-data

Data Access

Apache Zeppelin / Custom UI

• Data stored on HDFS as Hive external tables
• Data stored on HBase as Phoenix views

Custom UI “Co-Dev”

• Integrated with Elasticsearch
• Enterprise security and SSO
• Recommendation model based on user profile, tags and activity
• Chat
• Blog/Droplet features
• Task creation and follow-up
• Notifications
• Smartphone app

DiP @ Hallwaze.com

Get involved

https://github.com/XavientInformationSystems/Data-Ingestion-Platform

Co-Dev: Reach out if you want to customize the platform or choose the right streaming engine based on latency, use case and custom UI/reporting needs.

Hybrid Cloud

Hadoop and Cloud

Apache Falcon

DiP Hadoop

On-prem Cloud

Apache Falcon is a data management tool for overseeing data pipelines in Hadoop clusters. It can be used to replicate data from one cluster to another.

Hadoop

Kafka Mirroring

The Kafka mirroring feature is used to create a replica of an existing cluster, for example to replicate an active datacenter into a passive one. Kafka provides the MirrorMaker tool for mirroring a source cluster into a target cluster.
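
Conceptually, mirroring is a consume-from-source, produce-to-target loop that remembers how far it has copied. The sketch below shows that loop with plain dictionaries standing in for the two clusters. It is not MirrorMaker's implementation or configuration; `mirror`, the topic names and the record values are hypothetical.

```python
def mirror(source, target, offsets, topics):
    """Copy records not yet mirrored from source to target for the given topics.

    source, target: dict mapping topic name -> list of records (toy "clusters")
    offsets: dict mapping topic name -> next source offset to copy (mutated)
    Returns the number of records copied in this pass.
    """
    copied = 0
    for topic in topics:
        start = offsets.get(topic, 0)
        new_records = source.get(topic, [])[start:]
        target.setdefault(topic, []).extend(new_records)
        offsets[topic] = start + len(new_records)
        copied += len(new_records)
    return copied


on_prem = {"clicks": ["c1", "c2"], "logs": ["l1"]}   # active (source) cluster
cloud = {}                                           # passive (target) cluster
offsets = {}

first_pass = mirror(on_prem, cloud, offsets, ["clicks", "logs"])
on_prem["clicks"].append("c3")          # new data arrives on the active cluster
second_pass = mirror(on_prem, cloud, offsets, ["clicks", "logs"])
```

Tracking offsets per topic is what lets the second pass copy only the new record instead of re-sending everything.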

Kafka Mirroring: Hybrid Cloud Environment

Cassandra

DiP

Cassandra

Cassandra

On-prem Cloud

• RDBMS migration
• DSE advanced replication
• Kafka

WIP

• Integration with Kafka Connect and Kafka Streams
• Data munging, validation
• Machine learning
• Search - Elastic, Solr

Thanks!
@allaboutbdata
nsabharwal@xavient.com