August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector

Date post: 06-Jan-2017
Upload: yahoo-developer-network
Open Source Big Data Ingest with StreamSets Data Collector Pat Patterson Community Champion @metadaddy [email protected]
Transcript
Page 1: August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector

Open Source Big Data Ingest with StreamSets Data Collector

Pat Patterson, Community Champion

@metadaddy [email protected]

Page 2

Company Background

Traditional and Big Data Founders

Top-tier Investors

Strategic Partners

Momentum to Date

Launched 2014; exited stealth 9/15

~30 employees

Double-digit enterprise customers

10,000 downloads

Page 3

Market Trends

[Diagram: Data Sources, Data Stores, Data Consumers]

Past: ETL, ETL

Emerging: Ingest, Analyze

Page 4

Data Drift

The unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data

Structure Drift

Semantic Drift

Infrastructure Drift

Page 5

Solving Data Drift

Data Sources

Data Stores

Data Consumers

Tools

Applications

Custom code

Fixed-schema

Data Drift

Poor Data Quality

Delayed and False Insights

Page 6

Solving Data Drift

Data Sources

Data Stores

Data Consumers

Tools

Applications

Intent-Driven

Drift-Handling

Data Drift

Data KPIs

Trusted Insights

Page 7

Example: Data Loss and Corrosion

SQL on Hadoop (Hive)

Y/Y Click-Through Rate

80% of analyst time is spent preparing and validating data, while the remaining 20% is actual data analysis

Page 8

StreamSets Data Collector

Open source software for the rapid development and reliable operation of complex data flows.

➢ Efficiency

➢ Control

➢ Agility

Page 9

SDC Demo

StreamSets Data Collector

Apache Kafka

Apache Kudu

Page 10

Upcoming Events

SF Bay Area Data Ingest Meetup - Aug 25, Palo Alto, CA

MapR Big Data Everywhere - Aug 30, San Francisco, CA

Strata + Hadoop World - Sep 27-29, New York, NY

Page 11

Thank You!

Page 12

Structure Drift

Data structures and formats evolve and change unexpectedly

Implication: Data Loss

Data Squandering

Delimited Data

Attribute Format Changes

107.3.137.195

fe80::21b:21ff:fe83:90fa

Data Structure Evolution

{
  "first": "jon",
  "last": "smith",
  "email": "[email protected]",
  "add1": "123 Washington",
  "add2": "",
  "city": "Tucson",
  "state": "AZ",
  "zip": "85756"
}

{
  "first": "jane",
  "last": "smith",
  "email": "[email protected]",
  "add1": "456 Fillmore",
  "add2": "Apt 120",
  "city": "Fairfield",
  "state": "VA",
  "zip": "24435-1001",
  "phone": "401-555-1212"
}
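The two structure-drift cases on this slide (an address attribute switching from IPv4 to IPv6, and a record gaining a field) can be illustrated with a minimal Python sketch. This is not how StreamSets Data Collector handles drift; the helper names `ip_version` and `added_fields` are hypothetical, and the sample records omit the email field because its value is not legible in the transcript.

```python
import ipaddress


def ip_version(value):
    """Return 4 or 6 if value parses as an IP address, otherwise None."""
    try:
        return ipaddress.ip_address(value).version
    except ValueError:
        return None


def added_fields(baseline, record):
    """Return, sorted, the field names in record that baseline lacks."""
    return sorted(set(record) - set(baseline))


# Records modeled on the slide (email omitted, see the note above).
old_record = {
    "first": "jon", "last": "smith", "add1": "123 Washington",
    "add2": "", "city": "Tucson", "state": "AZ", "zip": "85756",
}
new_record = {
    "first": "jane", "last": "smith", "add1": "456 Fillmore",
    "add2": "Apt 120", "city": "Fairfield", "state": "VA",
    "zip": "24435-1001", "phone": "401-555-1212",
}

print(ip_version("107.3.137.195"))             # 4
print(ip_version("fe80::21b:21ff:fe83:90fa"))  # 6
print(added_fields(old_record, new_record))    # ['phone']
```

A fixed-schema pipeline that expects only IPv4 strings and the original eight fields would either reject the second record or silently drop `phone`, which is the data loss the slide refers to.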

Page 13

Semantic Drift

Data semantics change with evolving applications

Implication: Data Corrosion

Data Loss

Account Number Expansion

24122-52172

00-24122-52172

Namespace Qualification

M134: user {jsmith} read access granted {ac:24122-52172}

M134: user {jsmith} read access granted {ca.ac:24122-52172}

Outlier / Anomaly Detection

………,3588310669797950,$91.41,jcb,K1088-W#9,…
…,6759006011936944,$155.04,switch,A6504-Y#9,…
…,6771111111151415,$37.78,laser,Q9936-T#9,…
…,3585905063294299,$164.48,jcb,S4643-H#9,…
…,5363527828638736,$117.52,mastercard,X3286-P#9,…
…,4903080150282806,$168.03,switch,I9133-W#3,………
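As a rough illustration of the account-number and namespace examples on this slide, the Python sketch below parses both log-line variants into the same fields so that downstream logic keeps working when the format changes. The pattern `ACCOUNT_RE` and the function `extract_account` are hypothetical, and the assumption that an expanded account number such as 00-24122-52172 could appear inside the braces is mine, not the slide's.

```python
import re

# Hypothetical pattern: an optional dotted namespace (e.g. "ca."), the key
# "ac", an optional two-digit prefix, then the 5-5 digit account number.
ACCOUNT_RE = re.compile(
    r"\{(?:(?P<ns>[a-z]+)\.)?ac:(?:(?P<prefix>\d{2})-)?(?P<acct>\d{5}-\d{5})\}"
)


def extract_account(line):
    """Return namespace, prefix and account from a log line, or None."""
    match = ACCOUNT_RE.search(line)
    if match is None:
        return None
    return {
        "namespace": match.group("ns"),
        "prefix": match.group("prefix"),
        "account": match.group("acct"),
    }


before = "M134: user {jsmith} read access granted {ac:24122-52172}"
after = "M134: user {jsmith} read access granted {ca.ac:24122-52172}"

print(extract_account(before))
print(extract_account(after))
```

Both lines yield the account `24122-52172`; only the second carries a namespace (`ca`). A parser written against the first format alone would miss the second, which is how semantic drift turns into data loss or corrosion.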

Page 14

Infrastructure Drift

Physical and logical infrastructure changes rapidly

Implication: Poor Agility

Operational Downtime

Data Center 1

Data Center 2

Data Center n

3rd Party Service Provider

App a

App k

App q

Cloud Infrastructure

