Pentaho Data Integration Best Architecture Practices
Matt Casters, Pentaho Chief Architect of Data Integration, Hitachi Vantara
Introduction
• What is “data integration architecture”?
  – High level view on a (potential) DI solution
  – Describes components and their relationships
  – Taking into account all parts
  – Avoiding details without skipping anything
Introduction
• Why do you need an architecture?
  – Solutions get very complex
  – Teams of engineers get large
  – Conscious decisions on use of solution components
  – Holistic views on security, quality, transparency, performance
  – Allows for validation of high level requirements
  – Allows for the creation and validation of scenarios
  – Clearly defines stakeholders
General Advice – Don’t Forget the Details…
• Learn the basics of the building blocks…
  – PDI Best Practices #PWorld14
    • Standards, naming, …
  – PDI Best Governance Practices #PWorld15
    • PM, CI, VCS, Testing, …
  – Get expertise for all software components you use
General Advice – Whiteboarding
• Whiteboarding
  – Is done with interested stakeholders
  – Tries to combine knowledge from various parties
  – Allows for quick high level design
  – It is just a starting point!
  – Needs to get followed up, validated against scenarios
  – Forget conviction: time to change your mind
General Advice – Scalability
• Parallelize on a high level
  – Aggressive low level parallelization can get you into trouble
• Remember to allow data to flow in swim lanes
  – Parallelization of as much as possible
  – “Sharding” and so on should be architected in
• Identify time window early on, assess HW needs
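One way to picture swim lanes and sharding is a stable hash on a business key, so that every record for a given key always travels down the same lane. The sketch below is only an illustration under assumptions: the field name `customer_id`, the lane count of 4, and the helper names `lane_for` and `shard` are hypothetical and not part of PDI.

```python
import zlib
from collections import defaultdict


def lane_for(key: str, lane_count: int) -> int:
    """Map a key to a lane number using CRC32, which is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % lane_count


def shard(records, key_field, lane_count):
    """Group records into lanes so each lane can be processed independently."""
    lanes = defaultdict(list)
    for record in records:
        lanes[lane_for(str(record[key_field]), lane_count)].append(record)
    return dict(lanes)


# Hypothetical input: 20 records keyed by customer_id.
records = [{"customer_id": f"C{i:03d}", "amount": i} for i in range(20)]
lanes = shard(records, "customer_id", lane_count=4)

for lane_number in sorted(lanes):
    print(f"lane {lane_number}: {len(lanes[lane_number])} records")
```

Because the lane is derived from the key rather than from arrival order, the lane count becomes an architectural decision: changing it later moves keys between lanes.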
General Advice – Transparency
• Great complexity requires transparency
  – Something will always go wrong
  – At the worst possible time
• As a rule:
  – Always trace data moving between parts of the architecture
  – When in doubt: add more logging, tracking and tracing
• Use components in the architecture that allow for monitoring
  – Prefer servers that allow you to see what’s going on
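As a rough illustration of tracing data as it moves between parts of an architecture, the sketch below wraps each stage and records how many rows went in and came out. The stage functions, field names and the `traced` helper are hypothetical; in a real solution this role would be played by whatever logging and monitoring the chosen components provide.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("di.trace")


def traced(stage_name, stage_fn, rows, trace):
    """Run one stage and record row counts on both sides of it."""
    rows = list(rows)
    result = list(stage_fn(rows))
    entry = {"stage": stage_name, "rows_in": len(rows), "rows_out": len(result)}
    trace.append(entry)
    log.info("stage=%s rows_in=%d rows_out=%d",
             entry["stage"], entry["rows_in"], entry["rows_out"])
    return result


def drop_incomplete(rows):
    """Hypothetical stage: discard rows without an id."""
    return [r for r in rows if r.get("id") is not None]


def add_total(rows):
    """Hypothetical stage: derive a total from quantity and price."""
    return [{**r, "total": r["qty"] * r["price"]} for r in rows]


trace = []
rows = [{"id": 1, "qty": 2, "price": 3.0}, {"id": None, "qty": 1, "price": 9.0}]
rows = traced("drop_incomplete", drop_incomplete, rows, trace)
rows = traced("add_total", add_total, rows, trace)
```

A mismatch between the `rows_out` of one stage and the `rows_in` of the next is exactly the kind of problem that is hard to find afterwards without such a trail.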
General Advice – Predictability
• Enormous workloads, batch jobs, put systems under stress
• Batches tend to grow bigger over time, causing more stress
• As a rule:
  – If you can in any way, use micro-batching
  – Chop up 1 large nightly workload into hundreds of small ones throughout the day
• Advantages:
  – More frequent updates
  – Predictable workload
  – Fail early scenario: problems are detected earlier
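The micro-batching rule can be sketched as a generator that cuts one large workload into fixed-size chunks, each validated and loaded on its own. This is a minimal sketch under assumptions: the batch size of 100, the `amount` field and the `load` function are hypothetical stand-ins for a real load step.

```python
from itertools import islice


def micro_batches(rows, batch_size):
    """Yield consecutive lists of at most batch_size rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def load(batch):
    """Hypothetical load step: validate first so a bad batch fails early."""
    for row in batch:
        if row["amount"] < 0:
            raise ValueError(f"negative amount in row {row['id']}")
    return len(batch)


# Hypothetical workload of 1050 rows, handled in batches of 100.
rows = [{"id": i, "amount": i} for i in range(1050)]
loaded = [load(batch) for batch in micro_batches(rows, 100)]
print(f"{len(loaded)} batches, {sum(loaded)} rows")
```

If a bad row appears, only the batch containing it fails, and it fails when that batch runs rather than at the end of a long nightly job.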
Specific Advice – Hadoop
• Hadoop has itself become an ecosystem of software
• Select the software in the ecosystem to fit your ideal architecture
• Only select properly supported components, avoid bleeding edge
• Combat lack of transparency with extensive logging
• Follow the right sizing for your architecture, balance correctly
• Use it as a scalable part, not just as a “Database”
Specific Advice – IoT
• IoT is Messy
  – Data quality varying
  – Data connectivity problems
  – Late arriving data
  – Flash-floods of data (low predictability)
  – High complexity
  – Varying data formats and versions
  – Number of different devices can be high
Hitachi Vantara IoT Offerings
[Diagram. Labels: Connected Things; Edge; Core; Analytics; Foundry; Operational Insights; Asset Intelligence; Maintenance Optimization; Manufacturing Optimization; Asset Avatar State; Data Collection; Asset Management; Asset Avatar; Artificial Intelligence; Batch/Stream/Analytics; Data Blending/Orchestration; Asset Integration; Edge Analytics; Data Filtering; Data Transformation; Dashboard; Alerts/Notifications; Application Enablement]
Specific Advice – IoT
• Plan ahead for failure
• Use modern techniques like Metadata Injection
• Make extensive use of queues in any format
• Assume that things will go wrong in every scenario
• Design the architecture to cope with failures
• Design the architecture to report on statistics
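To make the combination of queues, failure handling and statistics concrete, here is a small sketch using Python's standard `queue` module. Messages that cannot be processed are retried a limited number of times and then parked on a dead-letter queue, while counters keep track of what happened. The message layout, `MAX_ATTEMPTS` and the `process` function are assumptions made for illustration only.

```python
import queue
from collections import Counter

MAX_ATTEMPTS = 3  # hypothetical retry limit


def process(message):
    """Hypothetical handler: parse a sensor reading, raising on bad data."""
    return float(message["payload"])


def drain(inbox, dead_letters, stats):
    """Process everything in the inbox, retrying failures up to MAX_ATTEMPTS."""
    while True:
        try:
            message = inbox.get_nowait()
        except queue.Empty:
            return
        try:
            process(message)
            stats["ok"] += 1
        except (KeyError, TypeError, ValueError):
            message["attempts"] = message.get("attempts", 0) + 1
            if message["attempts"] < MAX_ATTEMPTS:
                inbox.put(message)          # try again later
                stats["retried"] += 1
            else:
                dead_letters.put(message)   # keep it for inspection
                stats["dead"] += 1


inbox, dead_letters, stats = queue.Queue(), queue.Queue(), Counter()
for payload in ["21.5", "n/a", "19.0"]:
    inbox.put({"payload": payload})

drain(inbox, dead_letters, stats)
print(dict(stats))
```

Nothing is silently dropped: a message either succeeds or ends up on the dead-letter queue, and the counters give the statistics to report on.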
Examples – Large Services Vendor
• Moving large amounts of small data packets around
• Picked the right tools, didn’t pick an overall architecture
• Different teams “working together” in different countries
• Architecture became secondary to the overall solution
• Technology was selected, not architecture
Examples – Large Services Vendor
• Carte servers got hammered thousands of times per second
  – Use of a specific scheduler was mandated
  – Running out of sockets, HTTP server buckling under the load
• Complaints about PDI startup times
• Overall performance too low
• Services called in to solve “critical” issues in our software
Examples – Large Services Vendor
• Don’t allow internal organizational needs to drive the architecture
• Don’t allow technology choices to drive the architecture
  – And if you do, handle the implications
• To scale and ramp up performance, always queue and intelligently handle queued tasks (not one at a time, for example)
• The performance of the whole is determined by the slowest link
  – Consider this up-front in the architecture
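Handling queued tasks "not one at a time" can be sketched as draining the queue in batches, so that a thousand small tasks result in a handful of requests instead of a thousand. The batch size of 250 and the `submit` function are hypothetical; `submit` only stands in for one call to an execution server and is not a Carte API.

```python
import queue


def take_batch(tasks, max_items):
    """Remove up to max_items tasks from the queue without blocking."""
    batch = []
    while len(batch) < max_items:
        try:
            batch.append(tasks.get_nowait())
        except queue.Empty:
            break
    return batch


def submit(batch):
    """Hypothetical stand-in for one request that handles a whole batch."""
    return len(batch)


tasks = queue.Queue()
for i in range(1000):
    tasks.put({"packet_id": i})

requests = 0
handled = 0
while True:
    batch = take_batch(tasks, 250)
    if not batch:
        break
    handled += submit(batch)
    requests += 1

print(f"{handled} tasks handled in {requests} requests")
```

The per-request overhead (connections, sockets, startup) is paid once per batch rather than once per task, which eases pressure on the slowest link.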
Examples – Handling TV Set-top Data
• Periodic in nature, handling clicks
• Reading from MQTT, dumping data into Oracle for analysis
• Reported PDI performance trouble, services called in
• Small scale test, predicted ten-fold increase in size, already in trouble
Examples – Handling TV Set-top Data
• MQTT: great for queuing and IoT
• Not always possible to read in parallel from queues!
• Oracle is an RDBMS, kills parallelism in architecture
Examples – Handling TV Set-top Data
• Consider partitioning large amounts of clients
• Consider data extraction for any data storage mechanism
Examples – Big Bank
• Processed a gazillion records every night
• Had a batch window of 2 hours
• Got a monster computer to do the job with 64 cores
• Ran complex data quality validations in PDI, hundreds of steps
• Got into a performance problem
• Needed extensive performance tuning
Examples – Big Bank
• Consider up-front whether HW choices will pin you down later
• Weigh the importance of specific requirements into the architecture
  – Time vs complexity vs hardware in this case
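The trade-off between time, complexity and hardware can be checked with simple arithmetic before any hardware is bought. In the sketch below only the 2 hour window and the 64 cores come from the example. The record count, the per-core rate and the parallel efficiency are invented numbers, chosen purely to show the calculation.

```python
def required_rate(record_count, window_seconds):
    """Records per second needed to finish inside the batch window."""
    return record_count / window_seconds


def achievable_rate(rate_per_core, cores, parallel_efficiency):
    """Rough estimate: per-core rate scaled by cores and an efficiency factor."""
    return rate_per_core * cores * parallel_efficiency


window_seconds = 2 * 60 * 60   # batch window of 2 hours (from the example)
cores = 64                     # machine size (from the example)
record_count = 2_000_000_000   # hypothetical nightly volume
rate_per_core = 5_000          # hypothetical rows/s per core with complex validations
parallel_efficiency = 0.7      # hypothetical: scaling is rarely perfectly linear

required = required_rate(record_count, window_seconds)
achievable = achievable_rate(rate_per_core, cores, parallel_efficiency)
fits = achievable >= required

print(f"required {required:,.0f} rows/s, achievable {achievable:,.0f} rows/s, fits: {fits}")
```

With these assumed numbers the workload does not fit the window, so something has to give: a longer window, simpler validations, or an architecture that can scale beyond one machine.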
Recap
• Make an architecture up-front, not as part of the documentation
• Be critical
• Be detailed
• Run scenarios against it
• Be ready to change your mind
• Get stakeholders involved
• Use PDI: Pessimistic Data Integration