Legal Message
THE INFORMATION PROVIDED IN THIS PRESENTATION IS INTENDED TO BE GENERAL IN NATURE AND IS NOT SPECIFIC GUIDANCE. RECOMMENDATIONS (INCLUDING POTENTIAL COST SAVINGS) ARE BASED UPON INTEL'S EXPERIENCE AND ARE ESTIMATES ONLY. INTEL DOES NOT GUARANTEE OR WARRANT OTHERS WILL OBTAIN SIMILAR RESULTS.
This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
Copyright © 2016, Intel Corporation. All rights reserved.
Outline
• Integrated Analytics Vision
• Data Ingestion Challenges
• Solution
• What we set out to do
• What we did
• Gobblin Challenges
• Need Gobblin Community Help
• Summary
Integrated Analytics Vision & Mission
Our Vision: Customers are empowered to easily make rapid, impactful business decisions and uncover new revenue channels through connected data & analytics.
Our Mission: Provide clean, relatable, integrated data using a consistent approach to deliver business recommendations and insights through visual and interactive usage.
[Diagram: Raw Data → Transformed and Connected Data → Advanced Analytics]
As Is – Data Ingestion Architecture
[Architecture diagram: internal source systems (RDBMS, flat/CSV files, EDW, data marts, logs) and external source systems feed the IT BI Hadoop cluster through firewall and proxy channels. Each pipeline lands data on the gateway node through its own mechanism (Camel, SFTP, vendor utilities, custom utilities, Python scripts, hadoop put commands) into Hadoop storage (HDFS/Hive). From there, data flows through transformation to consumption by visualization tools and client tools such as sales CRM, marketing campaign management, content tagging, and webinars.]
Data Ingestion Challenges
We ingest a variety of internal and external data sources, such as the enterprise data warehouse, enterprise master data, spreadsheets, social media feeds, marketing data, and retailer data. This resulted in a variety of challenges, including:
• Individual project teams implementing their own methods for ingesting data from various sources and building their own data pipelines
• Operational complexity in managing the individual pipelines
• No reusability, as each project team created redundant methods and codebases for ingesting data sources
• High development cost, as each team built its own data ingestion pipelines
• Inconsistency in the quality of project teams’ data ingestion codebases, impacting data quality and reliability
• Job failures resulting from data format, quality, schema evolution, and availability issues
• Skillset challenges
The net result: no standardized, reusable framework for data ingestion.
Solution: Data Ingestion Architecture with Gobblin/Kite
[Architecture diagram: internal and external source systems (RDBMS, flat/CSV files, SFTP, vendor APIs, EDW, data marts, logs, RESTful APIs, Kafka, social media feeds, retailer data, and many more) feed a single reusable data ingestion framework built on Gobblin and Kite, running on the gateway node of the IT BI Hadoop cluster behind firewall and proxy channels. The framework provides file and CSV adapters, a JDBC connector for RDBMS sources, config files, validation, and alerting, and publishes to Hadoop storage (Hive/HDFS/HBase) for consumption by visualization tools and client tools such as sales CRM, marketing campaign management, content tagging, and webinars.]
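Kite handles dataset management on the Hadoop storage side of this picture. As a rough sketch of how a Hive-backed dataset can be created and written through the Kite SDK (the dataset URI, schema file, and names below are hypothetical illustrations, not the ones we used):

import org.apache.avro.generic.GenericRecord;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetDescriptor;
import org.kitesdk.data.DatasetWriter;
import org.kitesdk.data.Datasets;

public class KiteHivePublishSketch {
    public static void main(String[] args) throws Exception {
        // Describe the dataset: an Avro schema on the classpath, stored as Parquet.
        DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
                .schemaUri("resource:contact.avsc") // hypothetical schema file
                .format("parquet")
                .build();

        // Create a Hive-backed dataset; Kite registers the table in the metastore.
        Dataset<GenericRecord> contacts =
                Datasets.create("dataset:hive:marketing/contacts", descriptor, GenericRecord.class);

        // Write ingested records; Kite manages the file layout and metadata.
        try (DatasetWriter<GenericRecord> writer = contacts.newWriter()) {
            // for (GenericRecord record : ingestedRecords) { writer.write(record); }
        }
    }
}

Registering tables this way is what could let Hive publishing become native rather than relying on offline compaction (see the community asks later in this deck).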
UI
[screenshot of the framework user interface]
What we set out to do
Functionally evaluate Gobblin for ingesting and integrating data.
Prototype a non-out-of-the-box source to extract data from an “online campaign automation provider” (a sketch of the required source skeleton follows the acceptance criteria below).
Acceptance Criteria
• Bulk REST API
• Validate the correctness of data
• End-to-end data consistency
• Notification, status, and error logging
• Ability to log kickout (rejected) records
• Training plan for implementation and adoption
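For context, a custom Gobblin source boils down to two pieces: a Source that partitions the work into work units, and an Extractor that pulls records for one work unit. The following is a minimal sketch, assuming the pre-Apache gobblin.* packages we evaluated; BulkRestSource, the "campaign" namespace, and the source.entity property are hypothetical names, not Gobblin built-ins:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import gobblin.configuration.SourceState;
import gobblin.configuration.WorkUnitState;
import gobblin.source.Source;
import gobblin.source.extractor.Extractor;
import gobblin.source.workunit.Extract;
import gobblin.source.workunit.WorkUnit;

// Skeleton of a custom source for a bulk REST endpoint: one work unit per exported object.
public class BulkRestSource implements Source<String, byte[]> {

    @Override
    public List<WorkUnit> getWorkunits(SourceState state) {
        List<WorkUnit> workUnits = new ArrayList<>();
        // The real object list comes from the provider's metadata API (accounts, contacts, ...).
        for (String object : new String[] {"accounts", "contacts"}) {
            Extract extract = state.createExtract(Extract.TableType.SNAPSHOT_ONLY, "campaign", object);
            WorkUnit workUnit = WorkUnit.create(extract);
            workUnit.setProp("source.entity", object);
            workUnits.add(workUnit);
        }
        return workUnits;
    }

    @Override
    public Extractor<String, byte[]> getExtractor(WorkUnitState state) throws IOException {
        // A real extractor authenticates, creates the export/sync, and streams batches.
        return new Extractor<String, byte[]>() {
            @Override public String getSchema() { return "{\"type\":\"record\"}"; } // placeholder schema
            @Override public byte[] readRecord(byte[] reuse) { return null; }       // null = end of data
            @Override public long getExpectedRecordCount() { return 0; }
            @Override public long getHighWatermark() { return 0; }
            @Override public void close() { }
        };
    }

    @Override
    public void shutdown(SourceState state) {
        // Release shared resources (HTTP connection pools, auth tokens).
    }
}

The framework then handles scheduling, state management, conversion, quality checking, and publishing around these two classes.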
What we did
Data scope
• 4 objects: accounts, contacts, 9 activities, 59 custom objects

Parallel data load
• Hive (not using compaction) *
• HDFS (BaseDataPublisher)

Functional UI ready
• Scheduling
• Job history
• Authoring job configurations

Functional backend ready
• Enterprise scheduler
• Gobblin standalone
• Gobblin Map-Reduce *

Quality checking policies (see the sketch after this list)
• Row level
• Task level

Enterprise features
• Alerting
• Monitoring
• Profiling *
• Logging

* Needs more attention
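Row-level quality checking policies plug in as subclasses of Gobblin's RowLevelPolicy, registered through the qualitychecker.row.policies job property. A minimal sketch, assuming the pre-Apache gobblin.qualitychecker packages (NonEmptyKeyPolicy and the record shape it checks are hypothetical):

import gobblin.configuration.State;
import gobblin.qualitychecker.row.RowLevelPolicy;

// Row-level policy: fail any record whose first (key) field is missing.
// Depending on the policy Type, FAILED rows can abort the task or be
// diverted to an error file (our "kickout records" requirement).
public class NonEmptyKeyPolicy extends RowLevelPolicy {

    public NonEmptyKeyPolicy(State state, RowLevelPolicy.Type type) {
        super(state, type);
    }

    @Override
    public Result executePolicy(Object record) {
        // Hypothetical record shape: our converter emits String[] rows.
        if (record instanceof String[]) {
            String[] fields = (String[]) record;
            if (fields.length > 0 && fields[0] != null && !fields[0].isEmpty()) {
                return Result.PASSED;
            }
        }
        return Result.FAILED;
    }
}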
Process Flow

Establish connection
• Authentication
• Endpoint indirection

Object determination
• Get object listing
• Get schema definition
• Slice schema

Create intent
• Create exports

Establish size boundaries
• Create syncs
• Poll syncs
• Slice batches

Download
• Parallel batches

Rebuild data
• Reassemble
• Schema inferencing
• Data conversion

Data publishing
• Hive/Impala load
• View definition
• Quality enforcement

Parallel download and reassembly of data blocks (sketched below)
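To make the middle of this flow concrete, here is a hedged sketch of the create-sync / poll / parallel-download loop. The BulkClient interface and its methods are hypothetical stand-ins for the provider's bulk REST API, not its actual endpoints:

import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

public class BulkDownloadSketch {

    /** Hypothetical facade over the provider's bulk REST API. */
    interface BulkClient {
        String createSync(String exportId);                 // start a server-side export
        boolean isSyncComplete(String syncId);              // poll until the data is staged
        List<String> sliceBatches(String syncId, int size); // batch offsets/URIs
        byte[] downloadBatch(String syncId, String batch);  // fetch one data block
    }

    static List<byte[]> download(BulkClient client, String exportId, int batchSize)
            throws Exception {
        // Create the sync and poll until the provider has staged the data.
        String syncId = client.createSync(exportId);
        while (!client.isSyncComplete(syncId)) {
            Thread.sleep(5_000); // real code uses backoff and a timeout
        }

        // Download batches in parallel, then reassemble in order.
        List<String> batches = client.sliceBatches(syncId, batchSize);
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            List<Future<byte[]>> futures = batches.stream()
                    .map(b -> (Callable<byte[]>) () -> client.downloadBatch(syncId, b))
                    .map(pool::submit)
                    .collect(Collectors.toList());
            List<byte[]> blocks = new java.util.ArrayList<>();
            for (Future<byte[]> f : futures) {
                blocks.add(f.get()); // futures are in batch order, so assembly preserves order
            }
            return blocks;
        } finally {
            pool.shutdown();
        }
    }
}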
Gobblin Challenges
User Interface – visual execution and evaluation of jobs

Data Routing – complex enterprise integration pattern routing is challenging to implement

For example, quality-check outcomes are strictly binary, as the row-level policy result type shows:

public enum Result {
    PASSED, // The test passed
    FAILED  // The test failed
}
Need Gobblin Community Help
Address adoption challenges:
Intake process for third-party contributions:
– New source: “online campaign automation provider”
– Spark-based ingestion candidates (Parquet, Avro, JSON, JDBC, S3) and runtime
– Kite SDK
Partnership with key big data vendors (CDH, HDP, MapR) to internalize Gobblin capability:
– Deployment, management, metrics, and lineage integration
Implement queuing or pluggable schedulers that do not rely on PID and workdir state, with better integration with enterprise schedulers.
Make Hive publishers native, rather than relying on offline compactions.
Publish documentation for the user community.
Summary
Gobblin is a robust data integration framework that meets the scale, quality, and enterprise-readiness imperatives we expected.
However, features such as usability, enterprise integration patterns, scheduling, profiling, lineage, deployment, and documentation could be improved.