Date posted: 11-Jan-2017
Category: Software
Uploaded by: pat-patterson
Agenda
● Ingest, Data Drift and StreamSets
● Short Demo
● Building a custom integration
● Real-world integration: Salesforce Wave Analytics
Traditional and Big Data Founders
Company Background
Top tier Investors
Momentum to Date
Strategic Partners
● Launched 2014; exited stealth 9/15
● ~30 employees
● Double-digit enterprise customers
● 10,000 downloads
Market Trends
Past: ETL, ETL
Emerging: Ingest, Analyze
Data Sources, Data Stores, Data Consumers
Problem: Data Drift
The unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data
Structure Drift
Semantic Drift
Infrastructure Drift
Delayed and False Insights
Solving Data Drift
Tools
Applications
Data Sources, Data Stores, Data Consumers
Poor Data Quality
Data Drift
Custom code
Fixed-schema
Trusted Insights
Data KPIs
Solving Data Drift
Tools
Applications
Data Sources, Data Stores, Data Consumers
Data Drift
Intent-Driven
Drift-Handling
Demo
Let’s build a simple pipeline to answer a real question:
What’s the biggest city lot in San Francisco?
Customizing StreamSets
Currently 25 standard StreamSets destinations, covering a wide variety of target systems, from flat files to S3 to Kafka
But… there’s always some system not on the list
Solution: DIY!
Create Your Own Destination
Five Step Process:
○ Create template from Maven archetype
○ Add logging
○ Create a record buffer
○ Add configuration parameters
○ Send data to external system
bit.ly/sdc-dest
Your System Here!
Create Template from Archetype
mvn archetype:generate \
  -DarchetypeGroupId=com.streamsets \
  -DarchetypeArtifactId=streamsets-datacollector-stage-lib-tutorial \
  -DarchetypeVersion=1.3.0.0 \
  -DinteractiveMode=true
Add Logging
Not 100% necessary, but VERY helpful
StreamSets uses SLF4J
$ tail -f streamsets-datacollector-1.3.0.0/log/sdc.log
Create a Record Buffer
Leverage existing code where possible!
StreamSets includes generators for CSV, JSON, Avro, Protocol Buffers, etc.
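To make the record buffer step concrete, here is a minimal sketch that serializes one batch of records into an in-memory byte buffer as CSV. It is a standard-library stand-in only: it does not use the StreamSets generators mentioned above, and the class name `RecordBufferSketch`, its methods and the map-based record shape are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the StreamSets API.
class RecordBufferSketch {

  // Quote a field only when it contains a comma, a quote or a line break.
  static String escape(String field) {
    if (field.contains(",") || field.contains("\"")
        || field.contains("\n") || field.contains("\r")) {
      return "\"" + field.replace("\"", "\"\"") + "\"";
    }
    return field;
  }

  // Serialize a batch into an in-memory buffer, one CSV line per record.
  static byte[] toCsv(List<String> columns, List<Map<String, String>> batch) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    for (Map<String, String> record : batch) {
      StringBuilder line = new StringBuilder();
      for (int i = 0; i < columns.size(); i++) {
        if (i > 0) {
          line.append(',');
        }
        // A field missing from the record becomes an empty cell.
        String value = record.get(columns.get(i));
        line.append(escape(value == null ? "" : value));
      }
      line.append('\n');
      byte[] bytes = line.toString().getBytes(StandardCharsets.UTF_8);
      buffer.write(bytes, 0, bytes.length);
    }
    return buffer.toByteArray();
  }

  public static void main(String[] args) {
    Map<String, String> record = new HashMap<>();
    record.put("city", "Tucson");
    record.put("state", "AZ");
    byte[] payload = toCsv(Arrays.asList("city", "state"), Arrays.asList(record));
    System.out.print(new String(payload, StandardCharsets.UTF_8)); // Tucson,AZ
  }
}
```

Buffering the whole batch before sending keeps the serialization step separate from the network call, so a failed send can be retried without re-reading the records.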
Configuration
Separate configuration and code
DON’T PUT CREDENTIALS IN CODE!!!
DON’T PUT CREDENTIALS IN CODE!!!
Make your users’ and your lives easier!
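One way to keep credentials out of code is to read them from an external source at startup. The sketch below loads them from a properties file using only the Java standard library. It is not the Data Collector configuration mechanism; the class name `DestinationConfigSketch` and the property keys `endpoint`, `username` and `password` are assumptions made for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical configuration holder; the property keys are illustrative.
class DestinationConfigSketch {
  final String endpoint;
  final String username;
  final String password;

  private DestinationConfigSketch(String endpoint, String username, String password) {
    this.endpoint = endpoint;
    this.username = username;
    this.password = password;
  }

  // Read settings from any Reader so the source (file, test string) stays outside the code.
  static DestinationConfigSketch load(Reader source) {
    Properties props = new Properties();
    try {
      props.load(source);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return new DestinationConfigSketch(
        require(props, "endpoint"), require(props, "username"), require(props, "password"));
  }

  // Fail fast with a clear message when a setting is missing or blank.
  private static String require(Properties props, String key) {
    String value = props.getProperty(key);
    if (value == null || value.trim().isEmpty()) {
      throw new IllegalArgumentException("Missing configuration value: " + key);
    }
    return value.trim();
  }

  public static void main(String[] args) throws IOException {
    // The path to the properties file is passed on the command line.
    try (Reader reader = Files.newBufferedReader(Paths.get(args[0]))) {
      DestinationConfigSketch config = load(reader);
      System.out.println("Sending to " + config.endpoint + " as " + config.username);
    }
  }
}
```

Failing on a missing key at load time surfaces configuration mistakes before any data is sent.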
Send Data to the External System
Don’t forget security policy!
streamsets-datacollector/etc/sdc-security.policy
grant codebase "file://${sdc.dist.dir}/user-libs/sampletest/-" {
  permission java.net.SocketPermission "requestb.in", "connect, resolve";
};
A Real Custom Destination
Salesforce Wave Analytics
● Adapt to batch processing model
○ Configure wait time before ‘closing’ a batch
● External Data API
○ Create new dataset
○ Write to dataset
○ Close dataset on timeout
○ Trigger dataflow execution
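The "close dataset on timeout" idea can be sketched as a small timer that tracks the last write and reports when the configured wait time has passed with no new data. This is an illustration under stated assumptions, not the actual Wave Analytics destination: `BatchCloseTimerSketch` and its methods are hypothetical, and the calls that would create, write to and close a dataset are left out.

```java
import java.util.function.LongSupplier;

// Hypothetical timer for illustration; it makes no calls to any external API.
class BatchCloseTimerSketch {
  private final long waitMillis;
  private final LongSupplier clock;
  private long lastWriteMillis;
  private boolean open;

  // The clock is injected so the behaviour can be checked without sleeping.
  BatchCloseTimerSketch(long waitMillis, LongSupplier clock) {
    this.waitMillis = waitMillis;
    this.clock = clock;
  }

  // Call after each batch is written; marks the dataset open and restarts the wait.
  void recordWrite() {
    open = true;
    lastWriteMillis = clock.getAsLong();
  }

  // True when a dataset is open and nothing has been written for waitMillis.
  boolean shouldClose() {
    return open && clock.getAsLong() - lastWriteMillis >= waitMillis;
  }

  // Call once the dataset has been closed so the next write starts a new one.
  void markClosed() {
    open = false;
  }

  public static void main(String[] args) {
    long[] now = {0L}; // fake clock so the example is deterministic
    BatchCloseTimerSketch timer = new BatchCloseTimerSketch(5_000L, () -> now[0]);
    timer.recordWrite();
    now[0] = 2_000L;
    System.out.println(timer.shouldClose()); // false
    now[0] = 7_000L;
    System.out.println(timer.shouldClose()); // true
  }
}
```

In a real stage the clock would typically be `System::currentTimeMillis`, with `shouldClose()` polled between batches.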
Conclusion
StreamSets Data Collector makes simple tasks easy, complex tasks possible
Use ‘off the shelf’ stages for simple tasks
Leverage script processors (Jython, JavaScript, Groovy) for more complex work
Build custom stages for ultimate performance, flexibility
Thank You!
Structure Drift
Data structures and formats evolve and change unexpectedly
Implication: Data Loss
Data Squandering
Delimited Data
107.3.137.195 fe80::21b:21ff:fe83:90fa
Attribute Format Changes
{
  "first": "jon",
  "last": "smith",
  "email": "[email protected]",
  "add1": "123 Washington",
  "add2": "",
  "city": "Tucson",
  "state": "AZ",
  "zip": "85756"
}
{
  "first": "jane",
  "last": "smith",
  "email": "[email protected]",
  "add1": "456 Fillmore",
  "add2": "Apt 120",
  "city": "Fairfield",
  "state": "VA",
  "zip": "24435-1001",
  "phone": "401-555-1212"
}
Data Structure Evolution
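The two records above differ by one field: the second adds "phone". A minimal sketch of spotting that kind of structure drift is to compare the field names of consecutive records. The class `StructureDriftSketch` and its methods are hypothetical and say nothing about how StreamSets detects drift internally.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only.
class StructureDriftSketch {

  // Field names present in the current record but not in the previous one.
  static Set<String> addedFields(Map<String, ?> previous, Map<String, ?> current) {
    Set<String> added = new TreeSet<>(current.keySet());
    added.removeAll(previous.keySet());
    return added;
  }

  // Field names present in the previous record but missing from the current one.
  static Set<String> removedFields(Map<String, ?> previous, Map<String, ?> current) {
    return addedFields(current, previous);
  }

  public static void main(String[] args) {
    Map<String, String> first = new HashMap<>();
    for (String field : Arrays.asList(
        "first", "last", "email", "add1", "add2", "city", "state", "zip")) {
      first.put(field, "");
    }
    Map<String, String> second = new HashMap<>(first);
    second.put("phone", "401-555-1212");

    System.out.println(addedFields(first, second));   // [phone]
    System.out.println(removedFields(first, second)); // []
  }
}
```

This only catches added or removed fields; a change of format within a field, such as the five-digit zip becoming "24435-1001", would need a separate per-field check.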
Structure Drift
Semantic Drift
Data semantics change with evolving applications
Implication: Data Corrosion
Data Loss
Semantic Drift
24122-52172 00-24122-52172
Account Number Expansion
M134: user {jsmith} read access granted {ac:24122-52172}
M134: user {jsmith} read access granted {ca.ac:24122-52172}
Namespace Qualification
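A parser that expects exactly `{ac:...}` silently stops matching once the field becomes `{ca.ac:...}`. As a sketch of a more tolerant approach, the pattern below accepts an optional namespace qualifier in front of `ac`. The class name `AccountFieldParserSketch` is hypothetical, and the pattern assumes the qualifier is a single word followed by a dot, as in the two log lines above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for illustration only.
class AccountFieldParserSketch {

  // Matches {ac:NUMBER} and {NAMESPACE.ac:NUMBER}; group 2 is the account number.
  private static final Pattern ACCOUNT =
      Pattern.compile("\\{(?:(\\w+)\\.)?ac:([0-9-]+)\\}");

  // Returns the account number if the line carries one, with or without a namespace.
  static Optional<String> accountNumber(String logLine) {
    Matcher matcher = ACCOUNT.matcher(logLine);
    return matcher.find() ? Optional.of(matcher.group(2)) : Optional.empty();
  }

  public static void main(String[] args) {
    String before = "M134: user {jsmith} read access granted {ac:24122-52172}";
    String after = "M134: user {jsmith} read access granted {ca.ac:24122-52172}";
    System.out.println(accountNumber(before).orElse("none")); // 24122-52172
    System.out.println(accountNumber(after).orElse("none"));  // 24122-52172
  }
}
```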
…,3588310669797950,$91.41,jcb,K1088-W#9,…
…,6759006011936944,$155.04,switch,A6504-Y#9,…
…,6771111111151415,$37.78,laser,Q9936-T#9,…
…,3585905063294299,$164.48,jcb,S4643-H#9,…
…,5363527828638736,$117.52,mastercard,X3286-P#9,…
…,4903080150282806,$168.03,switch,I9133-W#3,…
Outlier / Anomaly Detection
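The slide does not say which value is the anomaly. Assuming it is the last column, where five rows end in "#9" and one ends in "#3", a very simple check is to flag any code whose suffix differs from the most common one. This is a sketch under that assumption: `SuffixOutlierSketch` is a hypothetical name, and a real detector would need thresholds and more data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical outlier check for illustration only.
class SuffixOutlierSketch {

  // Text after the last '#', or an empty string when there is no '#'.
  static String suffix(String code) {
    int hash = code.lastIndexOf('#');
    return hash < 0 ? "" : code.substring(hash + 1);
  }

  // Flags codes whose suffix differs from the most frequent suffix.
  // If two suffixes tie for most frequent, which one wins is unspecified.
  static List<String> outliers(List<String> codes) {
    List<String> flagged = new ArrayList<>();
    if (codes.isEmpty()) {
      return flagged;
    }
    Map<String, Integer> counts = new HashMap<>();
    for (String code : codes) {
      counts.merge(suffix(code), 1, Integer::sum);
    }
    String dominant =
        Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    for (String code : codes) {
      if (!suffix(code).equals(dominant)) {
        flagged.add(code);
      }
    }
    return flagged;
  }

  public static void main(String[] args) {
    List<String> codes = Arrays.asList(
        "K1088-W#9", "A6504-Y#9", "Q9936-T#9", "S4643-H#9", "X3286-P#9", "I9133-W#3");
    System.out.println(outliers(codes)); // [I9133-W#3]
  }
}
```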
Infrastructure Drift
Physical and logical infrastructure changes rapidly
Implication: Poor Agility
Operational Downtime
Data Center 1 Data Center 2 Data Center n
3rd Party Service Provider
App a App k
App q
Cloud Infrastructure
Infrastructure Drift