Date posted: 16-Apr-2017
Category: Technology
Uploaded by: databricks
Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read
Jason Pohl, Data Solutions Engineer
Denny Lee, Technology Evangelist
About the speaker: Jason Pohl
Jason Pohl is a solutions engineer with Databricks, focused on helping customers become successful with their data initiatives. Jason has spent his career building data-driven products and solutions.
About the moderator: Denny Lee
Denny Lee is a Technology Evangelist with Databricks; he is a hands-on data sciences engineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems for both on-premises and cloud. Prior to joining Databricks, Denny worked as a Senior Director of Data Sciences Engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).
We are Databricks, the company behind Apache Spark
Founded by the creators of Apache Spark in 2013
Share of Spark code contributed by Databricks in 2014: 75%
Data Value
Created Databricks on top of Spark to make big data simple.
…
Apache Spark Engine
Spark Core
Spark Streaming
Spark SQL
MLlib
GraphX
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, and R APIs
Standard libraries
NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO
Source: Slide 5 of Spark Community Update
Traditional Data Warehousing Pain Points: Inelasticity of compute and storage resources
• Burst workloads require capacity planning for maximum load
• Fixed-size DW = compute and storage must scale linearly together (these are orthogonal problems)
• Expensive conundrum:
  • If your DW is successful, you cannot easily expand
  • If there is overcapacity, resources sit idle
Traditional Data Warehousing Pain Points: Rigid architecture that’s difficult to change
• Traditional DWs are schema-on-write, requiring schemas, partitions, and indexes to be pre-built.
• Rigidity = maintaining costly ETL pipelines
• Finite resources are spent continually augmenting pipelines to absorb new data.
Traditional Data Warehousing Pain Points: Limited advanced analytics capabilities
• Want more than what business intelligence and data warehousing provide
• More than just counts, aggregates, and trends
• Desire forecasting using ML, segmentation, graph processing, etc.
Just-in-Time Data Warehousing: Scale resources on demand
• Scale resources based on query load
• Separate compute and storage to scale either independently
• Easily set up multiple clusters against the same data sources
Just-in-Time Data Warehousing: Direct access to data sources
• Scale resources based on query load
• Separate compute and storage to scale either independently
• Easily set up multiple clusters against the same data sources
Change Data Capture: What is it?
• A system that automatically captures changes in a source system (e.g. a transactional database) and applies those changes to a target system (e.g. a data warehouse).
• Important for data warehouses because it allows them to record (and ultimately report) any changes, e.g.:
  • Customer A buys a pair of skis for $250 on 1/2/2015
  • On 1/5/2015, realize that the purchase was $350, not $250
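As a rough illustration of the idea, the sketch below compares a source snapshot with a target snapshot and reports which rows are new and which have changed. It is a minimal in-memory example, not the method used in the webinar notebooks; the function name `detect_changes` and the dictionary layout are assumptions made for this sketch.

```python
from typing import Dict, Tuple

# Each row is keyed by ID and holds (date, product, price).
Row = Tuple[str, str, float]


def detect_changes(source: Dict[int, Row], target: Dict[int, Row]):
    """Return (inserts, updates) needed to bring the target in line with the source.

    Hypothetical helper for illustration only.
    """
    # Rows whose ID does not exist in the target yet.
    inserts = {k: v for k, v in source.items() if k not in target}
    # Rows whose ID exists in both but whose values differ.
    updates = {k: v for k, v in source.items() if k in target and target[k] != v}
    return inserts, updates


source = {
    101: ("1/1/2016", "Skates", 80.00),
    102: ("1/2/2016", "Skis", 350.00),  # price corrected from 250.00
    103: ("1/3/2016", "Disc", 15.00),   # row added after the last sync
}
target = {
    101: ("1/1/2016", "Skates", 80.00),
    102: ("1/2/2016", "Skis", 250.00),
}

inserts, updates = detect_changes(source, target)
print("inserts:", inserts)
print("updates:", updates)
```

Comparing full snapshots is only practical for small tables; it is used here because it makes the two kinds of change (new row, changed row) easy to see.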
Change Data Capture: Source to Target
Source
ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
Target
ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
Change Data Capture: Add new row
Source
ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
103  1/3/2016  Disc     $15.00
Target
ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
103  1/3/2016  Disc     $15.00
Change Data Capture: Update an existing row
Source
ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
103  1/3/2016  Disc     $15.00
Target
ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
103  1/3/2016  Disc     $15.00
After the update to row 102 (price $250.00 → $350.00)
ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $350.00
103  1/3/2016  Disc     $15.00
Change Data Capture: Update an existing row
Source (before the update)
ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $250.00  1/2/2016
103  1/3/2016  Disc     $15.00   1/3/2016
Source (after the update)
ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $350.00  1/5/2016
103  1/3/2016  Disc     $15.00   1/3/2016
Target (before the update is captured)
ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $250.00  1/2/2016
103  1/3/2016  Disc     $15.00   1/3/2016
Target (after the update is captured; the changed row is appended)
ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $250.00  1/5/2016
103  1/3/2016  Disc     $15.00   1/3/2016
102  1/2/2016  Skis     $350.00  1/5/2016
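When changed rows are appended to the target instead of overwriting the old ones, a report on current values has to pick the most recent version of each ID. The following sketch shows one way to do that in plain Python. The `history` list and the `current_view` helper are hypothetical names used only for this example, and the rows are simplified sample data, not a copy of the slide.

```python
from datetime import date

# Append-only target: row 102 appears twice, once per version.
history = [
    {"id": 101, "product": "Skates", "price": 80.00, "last_updated": date(2016, 1, 1)},
    {"id": 102, "product": "Skis", "price": 250.00, "last_updated": date(2016, 1, 2)},
    {"id": 103, "product": "Disc", "price": 15.00, "last_updated": date(2016, 1, 3)},
    {"id": 102, "product": "Skis", "price": 350.00, "last_updated": date(2016, 1, 5)},
]


def current_view(rows):
    """Return the latest version of each ID, ordered by ID (illustrative helper)."""
    latest = {}
    for row in rows:
        seen = latest.get(row["id"])
        # Keep the row with the greatest last_updated value per ID.
        if seen is None or row["last_updated"] > seen["last_updated"]:
            latest[row["id"]] = row
    return [latest[key] for key in sorted(latest)]


for row in current_view(history):
    print(row["id"], row["product"], row["price"], row["last_updated"])
```

The full history stays available for auditing, while `current_view` gives the deduplicated picture a report would normally use.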
Demo: High Watermark with LastUpdatedDate
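The high-watermark pattern can be sketched as follows: the job looks up the largest LastUpdated value already in the target, then copies only source rows that are newer. This example uses Python's built-in `sqlite3` module with in-memory tables so it runs anywhere; it is not the Databricks notebook from the demo, and the table names, column names, and `sync` function are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source (id INTEGER, product TEXT, price REAL, last_updated TEXT);
CREATE TABLE target (id INTEGER, product TEXT, price REAL, last_updated TEXT);
INSERT INTO source VALUES
  (101, 'Skates', 80.0, '2016-01-01'),
  (102, 'Skis', 250.0, '2016-01-02'),
  (103, 'Disc', 15.0, '2016-01-03');
""")


def sync(conn):
    """Append source rows newer than the target's high watermark.

    Returns the number of rows copied. Hypothetical helper for this sketch.
    """
    # ISO-8601 date strings sort correctly as text, so MAX() works on them.
    # An empty target yields '' and therefore a full initial load.
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(last_updated), '') FROM target"
    ).fetchone()
    # Note: a strict '>' can miss rows that share the watermark's timestamp;
    # real jobs usually need finer-grained timestamps or an overlap window.
    cur = conn.execute(
        "INSERT INTO target SELECT * FROM source WHERE last_updated > ?",
        (watermark,),
    )
    conn.commit()
    return cur.rowcount


first = sync(conn)  # initial load: all three rows
conn.execute(
    "UPDATE source SET price = 350.0, last_updated = '2016-01-05' WHERE id = 102"
)
second = sync(conn)  # only the updated row is newer than the watermark
third = sync(conn)   # nothing has changed since the last run
print(first, second, third)
```

This only works if every insert and update in the source also sets the LastUpdated column; rows changed without touching it are invisible to the job.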
Stage Data from Employee Database
Update Records in Employee Source Database
UPDATE employees SET last_name = 'Spark' WHERE emp_no = 16894
Job to Automate CDC
Source → Jobs → Target
Source (with the Tag column)
ID   Date      Product  Tag    Price    LastUpdated
101  1/1/2016  Skates   ice    $80.00   1/1/2016
102  1/2/2016  Skis     snow   $250.00  1/2/2016
103  1/3/2016  Disc     field  $15.00   1/3/2016
Target (before the job runs)
ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $250.00  1/2/2016
103  1/3/2016  Disc     $15.00   1/3/2016
Target (after the job runs)
ID   Date      Product  Tag    Price    LastUpdated
101  1/1/2016  Skates   ice    $80.00   1/1/2016
102  1/2/2016  Skis     snow   $250.00  1/2/2016
103  1/3/2016  Disc     field  $15.00   1/3/2016
Add a column to the Departments table
ALTER TABLE departments ADD COLUMN dept_desc VARCHAR(50)
UPDATE departments SET dept_desc = dept_name
Job to Automate CDC
Source → Jobs → Target
[Diagram: Source, Jobs, and Target, showing the departments columns dept_no and dept_name, plus the added dept_desc]
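One way to picture schema on read for this case: data staged before and after the `ALTER TABLE` has different columns, and the reader reconciles them at query time instead of requiring the target to be altered first. The sketch below unions the columns across batches and fills missing values with `None`. The function `read_with_merged_schema` and the sample department values are hypothetical, chosen only to illustrate the idea.

```python
def read_with_merged_schema(batches):
    """Union the columns of all batches (first-seen order) and fill gaps with None.

    Hypothetical helper for illustration; each batch is a list of dict rows.
    """
    columns = []
    for batch in batches:
        for row in batch:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Re-shape every row to the merged column list.
    rows = [
        {name: row.get(name) for name in columns}
        for batch in batches
        for row in batch
    ]
    return columns, rows


# Batch staged before the source table gained dept_desc (sample values).
old_batch = [{"dept_no": "d001", "dept_name": "Marketing"}]
# Batch staged after ALTER TABLE ... ADD COLUMN dept_desc.
new_batch = [{"dept_no": "d001", "dept_name": "Marketing", "dept_desc": "Marketing"}]

columns, rows = read_with_merged_schema([old_batch, new_batch])
print(columns)
for row in rows:
    print(row)
```

Older rows simply show no value for the new column, so existing queries keep working while new queries can start using `dept_desc`.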
Notebooks
To access the notebooks, please reference the attachments in the Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read webinar.
• Stage Data From Employee Database:
  • Notebook that starts the process
  • Defines the ETL process
• Change Schema in Employee Source Database
• Update Records in Employee Source Database
• Validate Departments
Resources
• Just-in-Time Data Warehousing Solution Brief
• Building a Turbo-fast Data Warehousing Platform with Databricks
• Spark DataFrames: Simple and Fast Analysis of Structured Data
• Transitioning from Traditional DW to Spark in OR Predictive Modeling
• Advertising Technology Sample Notebook (Part 1)
More resources
• Databricks Guide
• Apache Spark User Guide
• Databricks Community Forum
• Training courses: public classes, MOOCs, & private training
• Databricks Community Edition: Free hosted Apache Spark. Join the waitlist for the beta release!
Thanks!