
Real time data ingestion and Hybrid Cloud

Date post: 08-Jan-2017
Upload: neeraj-sabharwal
Transcript
Page 1: Real time data ingestion and Hybrid Cloud

Great Ideas… Simple Solutions

Data Ingestion Platform (DiP)
Neeraj Sabharwal, @allaboutbdata

Page 2: Real time data ingestion and Hybrid Cloud

About me

Xavient Corporate Overview

• Head of Cloud, Data & Analytics @ Xavient
• Spent a couple of years @ Hortonworks
• Over a decade in the Cloud & Data domain
• Started career as an Oracle DBA

Disclosure: more memes coming up…

Page 3: Real time data ingestion and Hybrid Cloud

Agenda

• Platform
• Data Access
• Hybrid Cloud

Page 4: Real time data ingestion and Hybrid Cloud

Data Ingestion Platform (DiP)

Before we start…

** Near real time is OK, as I am easygoing, but no more waiting hours or days on data

Page 5: Real time data ingestion and Hybrid Cloud

Problem

UI/API, Platform, Data Access

No… near real-time access

Cloud

Page 6: Real time data ingestion and Hybrid Cloud

Shifting gears: let's get technical

Page 7: Real time data ingestion and Hybrid Cloud

Streaming Blueprint

Data Collection, Messaging Tier, Streaming Engine, Analysis Tier, In-memory Data Store, Data Access

** Near real time is OK, as I am easygoing, but no more waiting hours or days on data

Page 8: Real time data ingestion and Hybrid Cloud

Messaging Bus

• Open-source message broker
• Unified, high-throughput, low-latency platform for handling real-time data feeds
• Massively scalable pub/sub message queue architected as a distributed transaction log
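
The "pub/sub queue architected as a transaction log" idea can be pictured with a toy, single-process sketch. This is not the broker's API: `TopicLog`, `publish`, `poll` and the group names are hypothetical, and a real broker adds partitioning, replication and persistence. The sketch only shows that messages are appended once and that each subscriber group tracks its own read offset.

```python
from collections import defaultdict


class TopicLog:
    """Toy in-memory stand-in for one partition of a log-based broker (hypothetical)."""

    def __init__(self):
        self._records = []                 # append-only list of messages
        self._offsets = defaultdict(int)   # consumer group -> next offset to read

    def publish(self, message):
        """Append a message and return the offset it was stored at."""
        self._records.append(message)
        return len(self._records) - 1

    def poll(self, group, max_records=10):
        """Return unread messages for a group and advance that group's offset."""
        start = self._offsets[group]
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch


log = TopicLog()
for event in ("click", "view", "purchase"):
    log.publish(event)

# Two independent subscriber groups read the same log at their own pace.
dashboard_events = log.poll("dashboard")
archiver_first = log.poll("archiver", max_records=2)
archiver_rest = log.poll("archiver")
```

Because the log is never mutated by readers, adding another subscriber group costs only one more offset entry.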

Page 9: Real time data ingestion and Hybrid Cloud

Emotions


Page 10: Real time data ingestion and Hybrid Cloud

Streaming engines

Storm - Distributed real-time computation system for processing large volumes of high-velocity data

Flink - Streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams

Apex - Enterprise-grade unified stream and batch processing engine

Spark Streaming - Apache Spark's language-integrated API for stream processing, letting you write streaming jobs the same way you write batch jobs. It supports Java, Scala and Python

Page 11: Real time data ingestion and Hybrid Cloud

CTM


Page 12: Real time data ingestion and Hybrid Cloud

Platform (DiP)

Page 13: Real time data ingestion and Hybrid Cloud

Features

• Easy-to-use UI
• Multiple streaming engines
• Supports XML, JSON and TSV data formats
• Manual data entry via UI
• Upload files for batch processing
• Hybrid cloud
• Batch and real-time views of data
• Data visualization and analytics
• YARN features

Data Ingestion Platform
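
To make the XML, JSON and TSV support listed among the features concrete, here is a minimal sketch of normalizing the three formats into one record shape before handing them to a streaming engine. It is not DiP's actual parser: the function name `parse_payload`, the field names, and the assumed XML layout (a root element whose children are records) are all illustrative assumptions.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def parse_payload(payload, fmt):
    """Turn one raw payload into a list of flat dict records (hypothetical helper)."""
    if fmt == "json":
        data = json.loads(payload)
        return data if isinstance(data, list) else [data]
    if fmt == "xml":
        root = ET.fromstring(payload)
        # Assumes each child of the root is a record whose children are fields.
        return [{field.tag: field.text for field in record} for record in root]
    if fmt == "tsv":
        # Assumes the first line is a header row.
        reader = csv.DictReader(io.StringIO(payload), delimiter="\t")
        return [dict(row) for row in reader]
    raise ValueError(f"unsupported format: {fmt}")


json_rows = parse_payload('{"user": "a1", "action": "click"}', "json")
xml_rows = parse_payload(
    "<events><event><user>a1</user><action>click</action></event></events>", "xml"
)
tsv_rows = parse_payload("user\taction\na1\tclick\n", "tsv")
```

All three calls produce the same list of dictionaries, so downstream processing does not need to know which format the data arrived in.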

Page 14: Real time data ingestion and Hybrid Cloud

Use Cases: Any Data

• Sentiment analysis
• Log analysis
• Clickstream analysis
• Machine and sensor data analysis
• Social media and customer sentiment

Page 15: Real time data ingestion and Hybrid Cloud

UI


https://techblog.xavient.com/

Page 16: Real time data ingestion and Hybrid Cloud

What was in the previous slide? Is that for real?

No more memes… Enough now :)

Page 17: Real time data ingestion and Hybrid Cloud

DiP Technology Stack

• Messaging System: Apache Kafka
• Target System: HDFS, NoSQL, Apache Hive
• Reporting System: Apache Phoenix, Apache Zeppelin
• Source System: Web Client
• Streaming APIs: Apache Apex, Apache Flink, Apache Spark and Apache Storm
• Programming Language: Java
• IDE: Eclipse
• Build Tool: Apache Maven
• Operating System: CentOS 7

Page 18: Real time data ingestion and Hybrid Cloud

DiP High-Level Architecture

Page 19: Real time data ingestion and Hybrid Cloud

DiP using Storm

Apache Storm features

• Multiple processing paradigms: real-time, interactive and batch processes
• Reliable: each unit of data (tuple) is processed at least once or exactly once
• Fast and scalable: parallel calculations run across a cluster of machines
• Fault-tolerant: workers are restarted automatically if they die
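
The at-least-once guarantee in the Storm feature list can be illustrated with a small replay loop. This is a single-process sketch, not Storm's API: `process_at_least_once`, `flaky_handler` and the retry limit are hypothetical names and values. A tuple whose handler fails is queued again, so it may be seen more than once but is not silently dropped.

```python
from collections import deque


def process_at_least_once(tuples, handler, max_attempts=3):
    """Replay each tuple until the handler succeeds (acks) or attempts run out."""
    pending = deque((item, 1) for item in tuples)
    acked, failed = [], []
    while pending:
        item, attempt = pending.popleft()
        try:
            handler(item)
            acked.append(item)
        except Exception:
            if attempt < max_attempts:
                pending.append((item, attempt + 1))  # replay later
            else:
                failed.append(item)
    return acked, failed


seen = []


def flaky_handler(item):
    """Hypothetical handler: records every delivery, fails on the first 'b'."""
    seen.append(item)
    if item == "b" and seen.count("b") == 1:
        raise RuntimeError("simulated worker failure")


acked, failed = process_at_least_once(["a", "b", "c"], flaky_handler)
```

Here `seen` ends up containing "b" twice, which is why handlers in an at-least-once pipeline are usually written to tolerate duplicate deliveries.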

Page 20: Real time data ingestion and Hybrid Cloud

DiP using Spark Streaming

Spark Streaming features

• Multiple processing paradigms: batch and interactive
• Ease of use: offers high-level operators in Java, Scala and Python
• Fault tolerance: lost work and operator state can be recovered with no extra code
• Code reusability: the same code can be used for batch processing, to join streams against historical data, or to run ad-hoc queries on stream state
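
The code-reusability point can be sketched without Spark itself: one function holds the business logic, and it is applied either to the whole data set (batch) or to small slices of an incoming stream (micro-batches). The names `count_words` and `micro_batches`, the sample lines and the batch size are illustrative assumptions, not Spark APIs.

```python
from collections import Counter
from itertools import islice


def count_words(lines):
    """Shared business logic: word counts for any iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def micro_batches(stream, batch_size):
    """Chop an iterator into small lists, mimicking micro-batch intervals."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


lines = ["error disk full", "info started", "error timeout"]

# Batch job: one pass over the whole data set.
batch_result = count_words(lines)

# Streaming job: the same function applied to each micro-batch, with a running total.
running = Counter()
for batch in micro_batches(lines, batch_size=2):
    running += count_words(batch)
```

Both paths arrive at the same counts, which is the property that lets a team test the logic once in batch mode and reuse it on live data.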

Page 21: Real time data ingestion and Hybrid Cloud

DiP using Apex

Apache Apex features

• Modular: Malhar, a library of operators, comes bundled with Apex for quick development cycles
• Supports both stream and batch processing
• Supports operator exchange at runtime
• Supports fault tolerance and dynamic scaling

Page 22: Real time data ingestion and Hybrid Cloud

DiP using Flink

Apache Flink features

Multiple processing paradigms: distributed, stream and batch processing. Several APIs for creating applications are supported:

• DataStream API for unbounded streams, embedded in Java and Scala
• DataSet API for static data, embedded in Java, Scala and Python
• Table API with a SQL-like expression language, embedded in Java and Scala

Fault tolerance for distributed computations over data streams

Page 23: Real time data ingestion and Hybrid Cloud

DiP-Druid Architecture (High Level)

Credit: https://imply.io/docs/latest/

https://techblog.xavient.com/kafka-druid-integration-with-ingestion-dip-real-time-data

Page 24: Real time data ingestion and Hybrid Cloud

Data Access

Apache Zeppelin / Custom UI

• Data stored on HDFS as Hive external tables

• Data stored on HBase as Phoenix views

Page 25: Real time data ingestion and Hybrid Cloud

Custom UI “Co-Dev”

• Integrated with Elasticsearch
• Enterprise security and SSO
• Recommendation model based on user profile, tags and activity
• Chat
• Blog/Droplet features
• Task creation and follow-up
• Notifications
• Smartphone app

Page 26: Real time data ingestion and Hybrid Cloud


[email protected]

Page 27: Real time data ingestion and Hybrid Cloud

Get involved

https://github.com/XavientInformationSystems/Data-Ingestion-Platform

Co-Dev: Reach out if you want to customize the platform or choose the right streaming engine based on latency, use case and custom UI/reporting needs.

Page 28: Real time data ingestion and Hybrid Cloud

Hybrid Cloud

Page 29: Real time data ingestion and Hybrid Cloud

Hadoop and Cloud

Page 30: Real time data ingestion and Hybrid Cloud

Apache Falcon

Apache Falcon is a data management tool for overseeing data pipelines in Hadoop clusters. It can be used to replicate data from one cluster to another.

Diagram labels: DiP, Hadoop (shown twice), On-prem, Cloud

Page 31: Real time data ingestion and Hybrid Cloud

Kafka Mirroring

Kafka's mirroring feature creates a replica of an existing cluster, for example to replicate an active datacenter into a passive one. Kafka provides the MirrorMaker tool for mirroring a source cluster into a target cluster.
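
Conceptually, mirroring is a consumer on the source cluster feeding a producer on the target cluster. The sketch below shows only that idea with plain Python lists standing in for topics; it is not MirrorMaker's interface, and the names `mirror`, `on_prem_topic`, `cloud_topic` and `mirror_state` are hypothetical.

```python
def mirror(source, target, state):
    """Copy records not yet mirrored from source to target; return how many were copied."""
    start = state.get("offset", 0)
    new_records = source[start:]
    target.extend(new_records)
    state["offset"] = start + len(new_records)  # remember progress for the next pass
    return len(new_records)


on_prem_topic = ["e1", "e2"]   # stands in for the active (source) cluster
cloud_topic = []               # stands in for the passive (target) cluster
mirror_state = {}

first_pass = mirror(on_prem_topic, cloud_topic, mirror_state)
on_prem_topic.append("e3")     # new data arrives on the source
second_pass = mirror(on_prem_topic, cloud_topic, mirror_state)
```

Tracking the last mirrored offset is what lets each pass copy only new records instead of re-sending the whole topic.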

Page 32: Real time data ingestion and Hybrid Cloud

Kafka Mirroring: Hybrid Cloud Environment

Page 33: Real time data ingestion and Hybrid Cloud

Cassandra

Diagram labels: DiP, Cassandra (shown twice), On-prem, Cloud

• RDBMS migration
• DSE Advanced Replication
• Kafka

Page 34: Real time data ingestion and Hybrid Cloud

WIP

• Integration with Kafka Connect and Kafka Streams
• Data munging, validation
• Machine learning
• Search: Elastic, Solr

Page 35: Real time data ingestion and Hybrid Cloud

Thanks!@[email protected]

