
Data Warehouse Data Integration

Date post: 14-Aug-2015
Upload: christian-stade-schuldt
Transcript
Page 1: Data Warehouse Data Integration

DWH Data Integration

Christian Stade-Schuldt, Project-A Ventures

BI Team Knowledge Transfer

Page 2: Data Warehouse Data Integration

Outline

Motivation

Import

Data Quality

Performance

Monitoring

Project-A, DWH Data Integration, 2014 2

Page 3: Data Warehouse Data Integration

What is data integration?

• combination of technical and business processes used to combine data from disparate sources into meaningful and valuable information

• encompasses discovery, cleansing, monitoring, transforming and delivery of data from a variety of sources

• by far the largest portion of building a data warehouse

Page 4: Data Warehouse Data Integration

The ETL Process

Extract data from homogeneous or heterogeneous data sources

Transform the data into a format or structure suitable for querying and analysis

Load it into the final target
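
The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the inline CSV text, the `orders` table and the in-memory SQLite target are hypothetical stand-ins, not the sources or the warehouse described in these slides.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be a file or an export.
RAW_CSV = """order_id,amount,country
1, 19.90 ,de
2,5.00,FR
"""

def extract(text):
    """Extract: read rows from a comma-separated source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: cast types and normalize values for querying."""
    return [
        (int(r["order_id"]), float(r["amount"]), r["country"].strip().upper())
        for r in rows
    ]

def load(conn, rows):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for the target
load(conn, transform(extract(RAW_CSV)))
```

Keeping each step a separate function makes it possible to test the transformation without touching a database.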

Page 5: Data Warehouse Data Integration

Processes and Jobs

• Process → set of jobs in a particular order
• Different processes for separation
  • can run at different time intervals
• File-dependency management
• Visualize graph

Page 6: Data Warehouse Data Integration

Processes and Jobs

• Job → set of commands; can depend on other jobs

• Command → specific action (e.g. run an SQL file)

• ⇒ developer friendly (plain text files)
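
One way to picture the process/job/command hierarchy is as plain data plus a dependency sort. The sketch below is an assumption for illustration only: the `Command` and `Job` classes, the `run_sql_file` action name and the file names are hypothetical, and the ordering uses Python's standard `graphlib` rather than whatever tool the slides describe.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+

@dataclass
class Command:
    action: str    # hypothetical action name, e.g. "run_sql_file"
    argument: str  # e.g. the path of a plain-text SQL file

@dataclass
class Job:
    name: str
    commands: list = field(default_factory=list)
    depends_on: list = field(default_factory=list)

def execution_order(jobs):
    """Return job names so that every job comes after its dependencies."""
    graph = {job.name: set(job.depends_on) for job in jobs}
    return list(TopologicalSorter(graph).static_order())

# A process is a set of jobs; the order is derived from the dependencies.
process = [
    Job("transform_orders", [Command("run_sql_file", "transform_orders.sql")],
        depends_on=["load_orders", "load_customers"]),
    Job("load_orders", [Command("run_sql_file", "load_orders.sql")]),
    Job("load_customers", [Command("run_sql_file", "load_customers.sql")]),
]
order = execution_order(process)
```

Here `transform_orders` is always placed last, while the two load jobs have no dependency on each other and could run in parallel.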

Page 7: Data Warehouse Data Integration

Sources

• Comma-separated files
• JSON files
• various databases (MySQL, PostgreSQL, Microsoft SQL Server)
• via project code
• external APIs (usually export to CSV via cronjob)

Page 8: Data Warehouse Data Integration

The Schema Life-Cycle

• Data warehouse can be rebuilt from scratch with every import
• Import runs on a next schema
• Switch schemata in the last step
• Failure does not impact the current data warehouse
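
The build-then-switch idea can be simulated in a small, runnable form. The following is a sketch under clear assumptions: SQLite has no named schemas, so a `customer_next` table stands in for the next schema and `customer` for the current one; the table and column names are hypothetical, and the slides do not say how the real switch is implemented.

```python
import sqlite3

def rebuild_and_switch(conn, rows):
    """Rebuild customer_next from scratch and swap it in as the last step."""
    try:
        conn.execute("DROP TABLE IF EXISTS customer_next")
        conn.execute(
            "CREATE TABLE customer_next (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
        )
        conn.executemany("INSERT INTO customer_next VALUES (?, ?)", rows)
        # Last step: replace the current table with the freshly built one.
        conn.execute("DROP TABLE IF EXISTS customer")
        conn.execute("ALTER TABLE customer_next RENAME TO customer")
        conn.commit()
    except sqlite3.Error:
        conn.rollback()  # the current table is left untouched
        raise

conn = sqlite3.connect(":memory:")
rebuild_and_switch(conn, [(1, "Ada"), (2, "Grace")])

try:
    rebuild_and_switch(conn, [(1, "Ada"), (2, None)])  # violates NOT NULL
except sqlite3.IntegrityError:
    pass

current = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```

The second import fails while loading, so the switch never happens and readers of `customer` still see the two rows from the previous successful run.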

Page 9: Data Warehouse Data Integration

Data Quality

• Real-world data is dirty
• Data quality is critical to data warehouse and business intelligence solutions
• Goal:
  • single point of truth
  • cleaned-up and validated data
  • easily accessible for users

Page 10: Data Warehouse Data Integration

Data Quality 2

• Referential integrity → requires every value of one attribute (column) of a relation (table) to exist as a value of another attribute in a different (or the same) relation (table)

• Check constraints (ADD CHECK)
• Unique constraints
• Consistency checks → What goes in has to come out. No one's left behind. Some are. :(
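
The constraint types listed above can be tried out against an in-memory SQLite database. This is a sketch, not the warehouse's actual schema: the `country` and `payment` tables are hypothetical, and the constraints are declared inline at table creation rather than added later with ADD CHECK. The final row-count comparison is one simple form of a "what goes in has to come out" check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE country (
        country_id INTEGER PRIMARY KEY,
        iso_code   TEXT NOT NULL UNIQUE                 -- unique constraint
    );
    CREATE TABLE payment (
        payment_id INTEGER PRIMARY KEY,
        country_id INTEGER NOT NULL
                   REFERENCES country (country_id),     -- referential integrity
        amount     REAL NOT NULL CHECK (amount >= 0)    -- check constraint
    );
    INSERT INTO country VALUES (1, 'DE');
""")

def is_rejected(sql):
    """Return True if the statement violates a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        conn.rollback()
        return True

rejected = [
    is_rejected("INSERT INTO payment VALUES (1, 99, 10.0)"),  # unknown country
    is_rejected("INSERT INTO payment VALUES (2, 1, -5.0)"),   # negative amount
    is_rejected("INSERT INTO country VALUES (2, 'DE')"),      # duplicate code
]

# Consistency check: the number of rows loaded must match the source.
source_rows = [(10, 1, 3.5), (11, 1, 7.0)]
conn.executemany("INSERT INTO payment VALUES (?, ?, ?)", source_rows)
conn.commit()
loaded = conn.execute("SELECT COUNT(*) FROM payment").fetchone()[0]
consistent = loaded == len(source_rows)
```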

Page 11: Data Warehouse Data Integration

Improving performance

• Cost-based scheduling for jobs (priority queue)
• Incremental loads
• Parallel jobs
• Compute keys (e.g. date, corridor_id → (1000 * sender_country_id + receiver_country_id))
• Index relevant columns
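
Two of the points above lend themselves to a short sketch. `corridor_id` follows the formula from the slide; the range check is an added assumption (the formula only yields distinct keys if receiver ids stay below 1000). `costliest_first` illustrates one possible reading of cost-based scheduling with a priority queue: it assumes the cost of a job is its previous runtime in seconds and that expensive jobs should start first. The job names and costs are hypothetical.

```python
import heapq

def corridor_id(sender_country_id, receiver_country_id):
    """Combine two country ids into one integer key.

    Assumes receiver_country_id is in [0, 1000); otherwise keys would collide.
    """
    if not 0 <= receiver_country_id < 1000:
        raise ValueError("receiver_country_id must be between 0 and 999")
    return 1000 * sender_country_id + receiver_country_id

def costliest_first(job_costs):
    """Yield job names from a priority queue, most expensive job first."""
    # heapq is a min-heap, so costs are negated to pop the largest first.
    queue = [(-cost, name) for name, cost in job_costs.items()]
    heapq.heapify(queue)
    while queue:
        _, name = heapq.heappop(queue)
        yield name

key = corridor_id(49, 33)
order = list(costliest_first(
    {"load_orders": 120.0, "load_countries": 2.5, "load_payments": 300.0}
))
```

A single integer key like this can be joined and indexed more cheaply than a pair of columns, and the original ids remain recoverable with `divmod(key, 1000)`.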

Page 12: Data Warehouse Data Integration

Monitoring

Runtime stats: How long does each job/process run?

Timeline graph: How parallel is a process?
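
Runtime stats of this kind can be collected with very little code. The sketch below is an assumption about how one might do it in Python, not the tooling behind the slides: `timed` and `runtimes` are hypothetical names, and the summation is a stand-in for a real job.

```python
import time
from contextlib import contextmanager

runtimes = {}  # job name -> duration in seconds

@contextmanager
def timed(job_name, clock=time.perf_counter):
    """Record how long the wrapped block runs under job_name."""
    start = clock()
    try:
        yield
    finally:
        # Recorded even if the job raises, so failures show up in the stats too.
        runtimes[job_name] = clock() - start

with timed("load_orders"):
    sum(range(100_000))  # stand-in for the real job
```

Storing the start time alongside the duration would be enough to draw a timeline of which jobs overlapped.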

Page 13: Data Warehouse Data Integration

Monitoring 2

DB schema: Visualize schema

Relation sizes: Visualize growth over time

Page 14: Data Warehouse Data Integration

Monitoring 3

Index usage: Are indexes used or unnecessary?
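
The slide does not say how index usage is measured. As a related but different technique, the sketch below asks the database whether a given query would use an index at all, using SQLite's `EXPLAIN QUERY PLAN`. The `orders` table, the index name and the `uses_index` helper are hypothetical, and this inspects a single query plan rather than usage statistics collected over time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    CREATE INDEX idx_orders_customer ON orders (customer_id);
""")

def uses_index(sql, index_name):
    """Return True if SQLite's query plan for sql mentions index_name."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each plan row is a human-readable description.
    return any(index_name in row[-1] for row in plan)

by_customer = uses_index(
    "SELECT * FROM orders WHERE customer_id = 42", "idx_orders_customer"
)
by_amount = uses_index(
    "SELECT * FROM orders WHERE amount > 10", "idx_orders_customer"
)
```

An index that no relevant query plan ever mentions is a candidate for removal, since it still has to be maintained on every load.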

Page 15: Data Warehouse Data Integration

Naming conventions

• prefix schemata (e.g. os_, om_)

• schema names (e.g. dim_next, dim, tmp, data)

Page 16: Data Warehouse Data Integration

Naming conventions 2

Jobs follow a pattern:

load       loads data into the data schema

transform  transforms data into the dim schema

copy       copies data into the dim schema (no transformation)

flatten    creates flattened tables for faster access

constrain  applies foreign key constraints

Page 17: Data Warehouse Data Integration

Summary

• Data integration is the largest portion of building a data warehouse
• Ensure data quality by applying constraints and tests
• Monitor your data integration process

Page 18: Data Warehouse Data Integration

For Further Reading I

Ralph Kimball. The Data Warehouse Toolkit. Wiley, 2013.
