
Introducing Databricks Delta

Date post: 21-Jan-2018
Upload: databricks
Unifying Data Warehousing with Data Lakes Ali Ghodsi, Co-Founder & CEO Oct 25, 2017

Many enterprises are undergoing a data transformation

Databricks Customers Across Industries

Financial Services, Healthcare & Pharma, Media & Entertainment, Technology, Public Sector, Retail & CPG, Consumer Services, Energy & Industrial IoT, Marketing & AdTech, Data & Analytics Services

Health care AI cloud dataset use case

Correlate the EMR of 50,000 patients with their DNA

Enterprise AI use case

Provide recommendations to sales using NLP and deep learning

Real-time AI use case

Curb abusive behavior across gamers globally

Big Data was the Missing Link for AI

BIG DATA
• Customer Data
• Emails/Web pages
• Click Streams
• Sensor data (IoT)
• Video/Speech

Most companies are Struggling with Big Data

GREAT RESULTS

Hardest part of AI isn’t AI

“Hidden Technical Debt in Machine Learning Systems”, Google, NIPS 2015

The hardest part of AI is Big Data

ML Code

Building Predictive Applications is really Hard!

Unified Analytics Platform

UNIFIED EXPERIENCE ACROSS TEAMS

UNIFIED PROCESSING ENGINE

The Evolution of Big Data

The Era of the Data Warehouse

Data Warehouse (DW)

ETL important data to central DW and get Business Intelligence (BI)

THE GOOD
• Pristine Data
• Fast Queries
• Transactional

THE BAD
• Expensive to Scale, not Elastic
• Requires ETL, Stale Data, No Real-Time
• No Predictions, No ML
• Closed formats (lock in)

Not Future Proof – Missing Predictions, Real-time, Scale

The Era of the Data Lake

Hadoop Data Lake

ETL all data to central scalable open lake for all use cases

THE GOOD
• Massive scale
• Inexpensive Storage
• Open Formats (Parquet, ORC)
• Promise of ML & Real Time Streaming

THE BAD
• Inconsistent Data
• Unreliable for Analytics
• Lack of Schema
• Poor Performance

Becomes a cheap, messy data store with poor performance

The Current State of Data Platforms

Info Sec at a Fortune 100 Company

Billions of records a day

HADOOP DATA LAKE
• Messy data not ready for analytics

Complex ETL

ENTERPRISE DATA WAREHOUSE
• Only 2 weeks of data
• Very expensive to scale
• Proprietary Formats
• No Predictions (ML)

EDW: Incident Response

EDW: Alerting

EDW: Reports

DISADVANTAGES OF ARCHITECTURE
• Poor agility in responding to new threats
• Scale Limitations, no historical data
• 6 months and 20 people to build

The Next Generation Data Platform

First UNIFIED data management system that delivers:

• The SCALE of data lake
• The LOW-LATENCY of streaming
• The RELIABILITY & PERFORMANCE of data warehouse

Announcing Databricks Delta

• The SCALE of data lake
• The LOW-LATENCY of streaming
• The RELIABILITY & PERFORMANCE of data warehouse

Databricks Delta

Enables Predictions, Real-time and Ad Hoc Analytics at Massive Scale

THE GOOD OF DATA LAKES
• Massive scale on Amazon S3
• Open Formats (Parquet, ORC)
• Predictions (ML) & Real Time Streaming

THE GOOD OF DATA WAREHOUSES
• Pristine Data
• Transactional Reliability
• Fast Queries (10-100x)

Databricks Delta Under the Hood

• MASSIVE SCALE: Decouple Compute & Storage
• RELIABILITY: ACID Transactions & Data Validation
• PERFORMANCE: Data Indexing & Caching (10-100x)
• LOW-LATENCY: Real-Time Streaming Ingest
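The slides do not say how Delta implements ACID transactions. As a rough illustration of the idea only, here is a minimal single-process sketch of optimistic concurrency over an append-only commit log: a writer records the version it read and its commit is rejected if another writer got in first. The `CommitLog` class and its method names are hypothetical and are not Databricks Delta's API.

```python
class CommitLog:
    """Hypothetical in-memory commit log; entry i lists the files added in version i."""

    def __init__(self):
        self._entries = []

    @property
    def version(self):
        # The number of committed entries doubles as the current table version.
        return len(self._entries)

    def try_commit(self, read_version, added_files):
        # Reject the commit if the table changed since the writer read it.
        if read_version != self.version:
            return False
        self._entries.append(tuple(added_files))
        return True

    def snapshot(self):
        # Readers only ever see files from fully committed entries.
        return [f for entry in self._entries for f in entry]


log = CommitLog()
start = log.version
first = log.try_commit(start, ["part-0001.parquet"])   # wins the race
second = log.try_commit(start, ["part-0002.parquet"])  # stale version, rejected
retry = log.try_commit(log.version, ["part-0002.parquet"])
```

A rejected writer re-reads the current version and retries, so readers never observe a half-written commit.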

Info Sec with Databricks Delta

Trillion Records a Day

DATABRICKS DELTA

powered by DATABRICKS RUNTIME

ETL, Schema Validation

SQL, ML, Stream

ADVANTAGES
• AI capable data warehouse at the scale of a data lake
• Interactive analysis on 2 years of data
• 2 weeks to build with a 5 person data platform team

Unified Analytics Platform

UNIFIED EXPERIENCE ACROSS TEAMS: Notebooks, Dashboards, Reports

Unified Analytics Platform

+ UNIFIED DATA MANAGEMENT: Reliable Transactions, Performance

UNIFIED EXPERIENCE ACROSS TEAMS: Notebooks, Dashboards, Reports

Demo by Michael Armbrust

Evolution of a Cutting-Edge Data Pipeline

[Diagram labels: Events, ?, Reporting, Streaming Analytics, Data Lake]

Challenge #1: Historical Queries?

[Diagram labels: Events, Data Lake, Streaming Analytics, Reporting, λ-arch; numbered callout 1]
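The λ-arch label refers to the lambda architecture, where a batch view over the historical data is combined with a streaming view over recent events at query time. A minimal sketch of that idea follows; the event shape and the function names `batch_view`, `speed_view`, and `query` are illustrative assumptions, not taken from the talk.

```python
from collections import Counter


def batch_view(historical_events):
    # Batch layer: recompute counts from every event stored in the data lake.
    return Counter(e["user"] for e in historical_events)


def speed_view(recent_events):
    # Speed layer: count only events the batch layer has not processed yet.
    return Counter(e["user"] for e in recent_events)


def query(user, batch, speed):
    # Serving layer: a historical query merges both views.
    return batch[user] + speed[user]


historical = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
recent = [{"user": "a"}]
batch = batch_view(historical)
speed = speed_view(recent)
total_a = query("a", batch, speed)
```

The cost this slide points at is that the same logic has to be maintained twice, once per layer.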

Challenge #2: Messy Data?

[Diagram labels: Events, Data Lake, Streaming Analytics, Reporting, λ-arch, Validation, Reprocessing; numbered callouts 1 and 2]
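The Validation boxes can be pictured as a check that splits incoming records into clean rows and rejected rows before anything is written downstream. The sketch below assumes a hypothetical three-field schema; `EXPECTED_SCHEMA`, `validate`, and `split_valid` are illustrative names only.

```python
# Hypothetical schema: field name -> required Python type.
EXPECTED_SCHEMA = {"event_id": int, "user": str, "timestamp": float}


def validate(record, schema=EXPECTED_SCHEMA):
    # Return a list of problems; an empty list means the record is clean.
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    return errors


def split_valid(records):
    # Keep clean records; set aside bad ones with the reasons they failed.
    valid, rejected = [], []
    for record in records:
        errors = validate(record)
        if errors:
            rejected.append((record, errors))
        else:
            valid.append(record)
    return valid, rejected


records = [
    {"event_id": 1, "user": "a", "timestamp": 1.0},
    {"event_id": "2", "user": "b", "timestamp": 2.0},
    {"user": "c", "timestamp": 3.0},
]
valid, rejected = split_valid(records)
```

Keeping the rejected records with their error messages makes later reprocessing possible once the upstream problem is fixed.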

Challenge #3: Mistakes and Failures?

[Diagram labels: Events, Data Lake, Streaming Analytics, Reporting, λ-arch, Validation, Reprocessing, Partitioned; numbered callouts 1 to 3]
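One common reason to partition a data lake is that a mistake can be repaired by rewriting only the affected partition. The following sketch assumes events carry a `date` field used as the partition key; the function names are hypothetical and the "table" is just a dictionary standing in for partition directories.

```python
from collections import defaultdict


def partition_by_date(events):
    # Group events by their date, one list per partition.
    partitions = defaultdict(list)
    for event in events:
        partitions[event["date"]].append(event)
    return dict(partitions)


def reprocess_partition(table, date, corrected_events):
    # Return a new table where only the given partition is replaced.
    new_table = dict(table)
    new_table[date] = list(corrected_events)
    return new_table


events = [
    {"date": "2017-10-24", "value": 1},
    {"date": "2017-10-25", "value": -1},  # bad record from a failed job
    {"date": "2017-10-25", "value": 2},
]
table = partition_by_date(events)
fixed = reprocess_partition(
    table,
    "2017-10-25",
    [{"date": "2017-10-25", "value": 1}, {"date": "2017-10-25", "value": 2}],
)
```

Other partitions are carried over untouched, which keeps the cost of a fix proportional to the damaged data rather than the whole lake.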

Challenge #4: Query Performance?

[Diagram labels: Events, Data Lake, Streaming Analytics, Reporting, λ-arch, Validation, Reprocessing, Partitioned, Compaction, Compact Small Files, Scheduled to Avoid Compaction; numbered callouts 1 to 4]
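Compacting small files usually means grouping many small files into fewer, larger ones so that queries open fewer files. The sketch below is a simple greedy planner under assumed inputs (a mapping of file name to size and a target size); `plan_compaction` is a hypothetical helper, not a Databricks function.

```python
def plan_compaction(file_sizes, target_size):
    # Greedily group small files so each group stays within target_size.
    groups, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        if size >= target_size:
            continue  # already large enough, leave it alone
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    # A group with a single file gains nothing from being rewritten.
    return [group for group in groups if len(group) > 1]


sizes_mb = {"a": 10, "b": 20, "c": 100, "d": 30, "e": 50}
plan = plan_compaction(sizes_mb, target_size=64)
```

In this example files a, b, and d fit together under the 64 MB target, c is skipped as already large, and e is left alone because it has no partner.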

Let’s try it instead with DELTA

The Canonical Data Pipeline

[Diagram labels: Events, Data Lake, Streaming Analytics, Reporting, λ-arch, Validation, Reprocessing, Partitioned, Compaction, Compact Small Files, Scheduled to Avoid Compaction; numbered callouts 1 to 4]

The Delta Architecture

[Diagram labels: Challenge, DELTA, DATA LAKE, Reporting, Streaming Analytics]

• The LOW-LATENCY of streaming
• The RELIABILITY & PERFORMANCE of data warehouse
• The SCALE of data lake

Sign up for the Private Beta

visit databricks.

