Data Warehouse Data Integration

Post on 14-Aug-2015

DWH Data Integration

Christian Stade-Schuldt, Project-A Ventures

BI Team Knowledge Transfer

Outline

Motivation

Import

Data Quality

Performance

Monitoring

Project-A, DWH Data Integration, 2014 2

What is data integration?

- combination of technical and business processes used to combine data from disparate sources into meaningful and valuable information
- encompasses discovery, cleansing, monitoring, transforming and delivery of data from a variety of sources
- by far the largest portion of building a data warehouse


The ETL Process

Extract data from homogeneous or heterogeneous data sources

Transform the data into a proper format or structure for querying and analysis

Load it into the final target
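
The three steps can be sketched in a few lines of Python. This is only an illustration, not the pipeline described in the talk: the inline CSV data, the `orders` table and the in-memory SQLite target are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be a file or an API export.
RAW_CSV = """order_id,amount,country
1,10.50,de
2,7.25,FR
3,,de
"""

def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows without an amount, convert types, normalize casing."""
    return [
        (int(r["order_id"]), float(r["amount"]), r["country"].upper())
        for r in rows
        if r["amount"]
    ]

def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the real warehouse
load(conn, transform(extract(RAW_CSV)))
result = conn.execute(
    "SELECT order_id, amount, country FROM orders ORDER BY order_id"
).fetchall()
print(result)
```

The row without an amount is dropped in the transform step, so only two of the three source rows reach the target.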


Processes and Jobs

- Process → set of jobs in a particular order
- Different processes for separation
  - can run at different time intervals
- File-dependency management
- Visualize graph


Processes and Jobs

- Job → set of commands; depends on other jobs
- Command → specific action (e.g. run SQL file)
- ⇒ developer friendly (plain text files)
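
One way to picture "jobs that depend on other jobs" is a dependency graph that is sorted before execution. The sketch below uses Python's standard `graphlib` module; the job names and the dictionary layout are hypothetical and do not reflect the actual file format used by the tool in the talk.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical process: each job maps to the set of jobs it depends on.
process = {
    "load_orders": set(),
    "load_customers": set(),
    "transform_orders": {"load_orders", "load_customers"},
    "flatten_orders": {"transform_orders"},
    "constrain_orders": {"transform_orders"},
}

def execution_order(jobs):
    """Return one valid order in which every job runs after its dependencies."""
    return list(TopologicalSorter(jobs).static_order())

order = execution_order(process)
print(order)
```

Jobs without a dependency between them (here the two load jobs) may appear in either order, which is also what makes them candidates for running in parallel.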


Sources

- Comma-separated files
- JSON files
- various databases (MySQL, PostgreSQL, Microsoft SQL Server)
- via project code
- external APIs (usually export to CSV via cron job)


The Schema Life-Cycle

- Data warehouse can be rebuilt from scratch with every import
- Import runs on a "next" schema
- Switch schemata in the last step
- Failure does not impact current data warehouse
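
The switch in the last step could look roughly like the following. The slides do not say how the switch is implemented, so this is a sketch under assumptions: PostgreSQL as the warehouse database (it supports `ALTER SCHEMA ... RENAME TO` inside a transaction), a live schema that already exists, and hypothetical `_next` / `_old` suffixes. The function only builds the statements; executing them is left to whatever database client is in use.

```python
def schema_switch_statements(live="dim", next_suffix="_next", old_suffix="_old"):
    """Build the SQL statements that swap a freshly built schema into place.

    Assumes PostgreSQL syntax and that both the live schema and the
    "next" schema exist. The suffixes are illustrative defaults.
    """
    next_schema = live + next_suffix
    old_schema = live + old_suffix
    return [
        "BEGIN",
        # Remove the backup left over from the previous switch, if any.
        f"DROP SCHEMA IF EXISTS {old_schema} CASCADE",
        # Keep the current schema as a backup ...
        f"ALTER SCHEMA {live} RENAME TO {old_schema}",
        # ... and promote the newly imported schema.
        f"ALTER SCHEMA {next_schema} RENAME TO {live}",
        "COMMIT",
    ]

statements = schema_switch_statements()
for statement in statements:
    print(statement + ";")
```

Because the import only ever writes to the "next" schema, a failed run never reaches these statements and the live schema stays untouched.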


Data Quality

- Real-world data is dirty
- Data quality is critical to data warehouse and business intelligence solutions
- Goal:
  - single point of truth
  - cleaned-up and validated data
  - easily accessible for users


Data Quality 2

- Referential integrity → requires every value of one attribute (column) of a relation (table) to exist as a value of another attribute in a different (or the same) relation (table)
- Check constraints (ADD CHECK)
- Unique constraints
- Consistency checks → what goes in has to come out. No one's left behind; some are. :(
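
The constraint types above can be tried out with a small, self-contained example. It uses an in-memory SQLite database purely for illustration (the talk does not name the warehouse database), and the `country` and `transfer` tables are made up. The last lines show one simple form of consistency check: every source row is either loaded or explicitly rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE country (
    country_id INTEGER PRIMARY KEY,
    iso_code   TEXT NOT NULL UNIQUE                                   -- unique constraint
);
CREATE TABLE transfer (
    transfer_id       INTEGER PRIMARY KEY,
    sender_country_id INTEGER NOT NULL REFERENCES country (country_id), -- referential integrity
    amount            REAL NOT NULL CHECK (amount > 0)                  -- check constraint
);
INSERT INTO country VALUES (1, 'DE'), (2, 'FR');
""")

# Hypothetical source rows: unknown country, negative amount, valid row.
source = [(1, 99, 10.0), (2, 1, -5.0), (3, 1, 25.0)]
rejected_rows = []
for row in source:
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO transfer VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError as exc:
        rejected_rows.append((row, str(exc)))

# The unique constraint rejects a second country with the same ISO code.
try:
    with conn:
        conn.execute("INSERT INTO country VALUES (3, 'DE')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Consistency check: what goes in has to come out (loaded or rejected).
loaded = conn.execute("SELECT COUNT(*) FROM transfer").fetchone()[0]
consistent = len(source) == loaded + len(rejected_rows)
print(loaded, len(rejected_rows), duplicate_rejected, consistent)
```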


Improving performance

- Cost-based scheduling for jobs (priority queue)
- Incremental loads
- Parallel jobs
- Compute keys (e.g. date, corridor_id → 1000 * sender_country_id + receiver_country_id)
- Index relevant columns
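
Two of these points lend themselves to a short sketch. The `corridor_id` formula is taken from the slide; decoding it with `divmod` assumes every receiver country id is below 1000. The scheduling part is an assumption: the slides do not say what the cost is or in which direction the queue is ordered, so the example takes a previously measured runtime as cost and starts the most expensive job first. The job names and runtimes are invented.

```python
import heapq

def corridor_id(sender_country_id, receiver_country_id):
    """Pack two country ids into one integer key (assumes receiver id < 1000)."""
    return 1000 * sender_country_id + receiver_country_id

def split_corridor_id(key):
    """Recover (sender_country_id, receiver_country_id) from a computed key."""
    return divmod(key, 1000)

def schedule(costs):
    """Yield job names from a priority queue, most expensive job first."""
    # heapq is a min-heap, so costs are negated to pop the largest first.
    heap = [(-cost, name) for name, cost in costs.items()]
    heapq.heapify(heap)
    while heap:
        _, name = heapq.heappop(heap)
        yield name

key = corridor_id(49, 33)
print(key, split_corridor_id(key))

# Hypothetical runtimes in seconds from a previous run.
job_order = list(schedule({"load_orders": 5, "transform_orders": 120, "flatten_orders": 30}))
print(job_order)
```

A single integer key like this can be joined and indexed more cheaply than a pair of columns, which is presumably the motivation for computing it.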


Monitoring

Runtime stats: How long does each job/process run?

Timeline graph: How parallel is a process?
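
Collecting the raw numbers for both views can be as simple as recording a start and end time per job. The following is a minimal sketch, not the monitoring code behind the slides; the `timed` helper, the `runtimes` dictionary and the job name are hypothetical.

```python
import time
from contextlib import contextmanager

# job name -> (start, end) in seconds from a monotonic clock
runtimes = {}

@contextmanager
def timed(job_name):
    """Record start and end time of the wrapped block, even if it fails."""
    start = time.perf_counter()
    try:
        yield
    finally:
        runtimes[job_name] = (start, time.perf_counter())

def duration(job_name):
    """Runtime of a finished job in seconds."""
    start, end = runtimes[job_name]
    return end - start

with timed("load_orders"):
    sum(range(100_000))  # stand-in for the real work of a job

print(f"load_orders took {duration('load_orders'):.4f}s")
```

Keeping start and end rather than just the duration is what allows a timeline to be drawn later: overlapping intervals show which jobs actually ran in parallel.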


Monitoring 2

DB schema: Visualize Schema

Relation sizes: Visualize growth over time


Monitoring 3

Index usage: Are indexes used or unnecessary?


Naming conventions

- prefix schemata (e.g. os_, om_)
- schema names (e.g. dim_next, dim, tmp, data)


Naming conventions 2

Jobs follow a pattern:

load: load data into the data schema

transform: transform data into the dim schema

copy: copy data into the dim schema (no transformation)

flatten: create flattened tables for faster access

constrain: apply foreign key constraints


Summary

- Data integration is the largest portion of building a data warehouse
- Ensure data quality by applying constraints and tests
- Monitor your data integration process


For Further Reading I

Ralph Kimball. The Data Warehouse Toolkit. Wiley, 2013.
