Post on 22-Dec-2015
transcript
Extract Transform InsertExtract Transform Insert
Extract data from operational system, Extract data from operational system, transform and insert into data transform and insert into data warehousewarehouse
Why ETI?Why ETI? Will your warehouse produce correct Will your warehouse produce correct
information with the current data?information with the current data? How ho w can I ensure warehouse How ho w can I ensure warehouse
credibility?credibility?
Excuses for NOT Excuses for NOT Transforming Legacy DataTransforming Legacy Data Old data works fine, new will work as well.Old data works fine, new will work as well. Data will be Data will be fixed fixed at point of entry through at point of entry through
GUI.GUI. If needed, data will be cleaned after new If needed, data will be cleaned after new
system populated; After proof-of-concept system populated; After proof-of-concept pilot.pilot.
Keys join the data most of the time.Keys join the data most of the time. Users will not agree to modifying or Users will not agree to modifying or
standardizing their data.standardizing their data.
Levels of Migration ProblemLevels of Migration Problem
Existing metadata is insufficient and unreliableExisting metadata is insufficient and unreliable Metadata must hold for all occurrencesMetadata must hold for all occurrences Metadata must represent business and technical Metadata must represent business and technical
attributesattributes
Data values incorrectly typed and accessibleData values incorrectly typed and accessible Values form extracted from storageValues form extracted from storage Values meaning inferred from its contentValues meaning inferred from its content
Entity keys unreliable or unavailableEntity keys unreliable or unavailable Inferred from related valuesInferred from related values
Metadata ChallengeMetadata Challenge
Metadata gets out of synch with details it Metadata gets out of synch with details it summarizessummarizes Business grows faster than systems designed to Business grows faster than systems designed to
capture business infocapture business info Not at the right level of detailNot at the right level of detail
Multiple values in a single fieldMultiple values in a single field Multiple meanings to a single fieldMultiple meanings to a single field No fixed format for valueNo fixed format for value
Expressed in awkward of limited termsExpressed in awkward of limited terms Program/compiler view rather than business viewProgram/compiler view rather than business view
Character-level ChallengeCharacter-level Challenge
Value Value instanceinstance level level Spelling, aliasesSpelling, aliases Abbreviations, truncations, transpositionsAbbreviations, truncations, transpositions Inconsistent storage formatsInconsistent storage formats
Named Named typetype level level Multiple meanings, contextual meaningsMultiple meanings, contextual meanings Synonyms, homonymsSynonyms, homonyms
EntityEntity level level No common keys or representationNo common keys or representation No integrated view across records, files, systemsNo integrated view across records, files, systems
Some Data Quality examplesSome Data Quality examples
The magic shrinking vendor fileThe magic shrinking vendor file
127 ways to spell...127 ways to spell...
Data surprises in individual fieldsData surprises in individual fields
Cowbirds and Data FieldsCowbirds and Data Fields
Magic numbers and embedded intelligenceMagic numbers and embedded intelligence
The Magic Shrinking Vendor The Magic Shrinking Vendor FileFileA Medical claims processor was A Medical claims processor was
having trouble with their Insurance having trouble with their Insurance Vendor file. They thought they had Vendor file. They thought they had 300,000 Insurance Vendors. 300,000 Insurance Vendors.
When they cleaned up their data, When they cleaned up their data, they discovered they had only they discovered they had only 27,000 unique Insurance Vendors.27,000 unique Insurance Vendors.
127 ways to spell...127 ways to spell...
Have over 127 different ways to spell Have over 127 different ways to spell AT&TAT&T
Have over 1000 ways to spell duPontHave over 1000 ways to spell duPont
NAME SOC. SEC. # TELEPHONE
Data surprises in individual Data surprises in individual fieldsfieldsData surprises in individual Data surprises in individual fieldsfields
Source: ValitySource: Vality
NAME SOC. SEC. # TELEPHONEMeta
Data surprises in individual Data surprises in individual fieldsfieldsData surprises in individual Data surprises in individual fieldsfields
Source: ValitySource: Vality
NAME SOC. SEC. # TELEPHONE
Denise Mario DBA
Marc Di Lorenzo ETAL
Tom & Mary Roberts
First Natl Provident
Digital 15 State St.
Astorial Fedrl Savings
Kevin Cooke, Receiver
John Doe Trustee for K
Meta
Actual Data Values
Data surprises in individualData surprises in individual fieldsfieldsData surprises in individualData surprises in individual fieldsfields
Source: ValitySource: Vality
NAME SOC. SEC. # TELEPHONE
Denise Mario DBA
Marc Di Lorenzo ETAL
Tom & Mary Roberts
First Natl Provident
Digital 15 State St.
Astorial Fedrl Savings
Kevin Cooke, Receiver
John Doe Trustee for K
228-02-1975
999999999
025-37-1888
34-2671434
101010101
LN#12-756
18-7534216
111111111
Meta
Actual Data Values
Data surprises in individualData surprises in individual fieldsfieldsData surprises in individualData surprises in individual fieldsfields
Source: ValitySource: Vality
NAME SOC. SEC. # TELEPHONE
Denise Mario DBA
Marc Di Lorenzo ETAL
Tom & Mary Roberts
First Natl Provident
Digital 15 State St.
Astorial Fedrl Savings
Kevin Cooke, Receiver
John Doe Trustee for K
228-02-1975
999999999
025-37-1888
34-2671434
101010101
LN#12-756
18-7534216
111111111
6173380300
3380321
415-392-2000
508-466-1200
212-235-1000
FAX 528-9825
5436
Meta
Actual Data Values
Data surprises in individualData surprises in individual fieldsfieldsData surprises in individualData surprises in individual fieldsfields
Source: ValitySource: Vality
Cowbirds and Data FieldsCowbirds and Data Fields
Cowbirds lay their eggs in other birds Cowbirds lay their eggs in other birds netsnets
Users use data fields that are not used Users use data fields that are not used for other purposesfor other purposes
Magic Numbers and Magic Numbers and Embedded IntelligenceEmbedded Intelligence
Customer Number = Customer Number = XXXXXXXX--YYYY--ZZZZ
XXXXXXXX = 1st 4 Positions of Zip Code = 1st 4 Positions of Zip Code
If If YYYY = 00-70 Then Cust = Pharmacy = 00-70 Then Cust = Pharmacy
If If YY YY = 80-89 Then Cust = Hospital = 80-89 Then Cust = Hospital
Except if Except if YYYY = 82 and = 82 and ZZZZ = ** Which Means... = ** Which Means...
Orr's Laws of Data QualityOrr's Laws of Data QualityLaw #1 - “Data that is not used cannot be correct!”Law #1 - “Data that is not used cannot be correct!”
Law #2 - “Data quality is a function of its use, not its Law #2 - “Data quality is a function of its use, not its collection!”collection!”
Law #3 - “Data will be no better than its most stringent Law #3 - “Data will be no better than its most stringent use!”use!”
Law #4 - “Data quality problems increase with the age of Law #4 - “Data quality problems increase with the age of the system!”the system!”
Law #5 - “Data quality laws apply equally to meta-data!”Law #5 - “Data quality laws apply equally to meta-data!”
Law #6 - The less likely something is to occur, the more Law #6 - The less likely something is to occur, the more traumatic it will be when it happens!traumatic it will be when it happens!
Legacy Data Contaminants Legacy Data Contaminants Found in MigrationsFound in Migrations Lack of standardsLack of standards Data surprises in individual fieldsData surprises in individual fields Legacy information buried in free form Legacy information buried in free form
fieldsfields Legacy myopia – multiple account Legacy myopia – multiple account
numbers block consolidated viewnumbers block consolidated view Anomaly nightmare – complex matching Anomaly nightmare – complex matching
and consolidationand consolidation
4 Fundamental Types of 4 Fundamental Types of TransformationTransformation Simple TransformationSimple Transformation
Fundamental building blocks of all data Fundamental building blocks of all data transformationstransformations
One field at a timeOne field at a time Cleansing and ScrubbingCleansing and Scrubbing
Ensure consistent formatting and usage of Ensure consistent formatting and usage of field or related group of filedsfield or related group of fileds
Checks valid valuesChecks valid values
4 Fundamental Types of 4 Fundamental Types of Transformation (con’t)Transformation (con’t) IntegrationIntegration
Takes operational data from one or more Takes operational data from one or more sources and maps it, field by field to new sources and maps it, field by field to new data structuredata structure
Aggregation and SummarizationAggregation and Summarization Remove low level of detailRemove low level of detail Data for data martData for data mart
Simple TransformationSimple Transformation
Convert data element from one type to Convert data element from one type to anotheranother semantic value samesemantic value same rename elementsrename elements
Date time conversionDate time conversion standard warehouse formatstandard warehouse format
Decode encoded fieldsDecode encoded fields M F vs C S MMM F vs C S MM
Cleansing and ScrubbingCleansing and Scrubbing
Actual content examinedActual content examined Range checking, Range checking, enumerated lists, enumerated lists, dependency checkingdependency checking
Uniform representation for dwUniform representation for dw address informationaddress information
parse to componentsparse to components
IntegrationIntegration Simple field level mappingsSimple field level mappings -80-90% -80-90% Complex integrationComplex integration
No common identifierNo common identifier probable matchesprobable matches 2-stage process, isolation/reconciliation2-stage process, isolation/reconciliation
Multiple sources for same target elementMultiple sources for same target element contradictorycontradictory
Missing dataMissing data Derived/calculated dataDerived/calculated data
redundant?redundant?
Aggregation and SummarizationAggregation and Summarization Summarization Summarization is the addition of like values is the addition of like values
along one or more business dimensionsalong one or more business dimensions add daily sales by stores for monthly sales by add daily sales by stores for monthly sales by
regionregion Aggregation Aggregation is the addition of different is the addition of different
business elements into common totalbusiness elements into common total daily product sales plus monthly consulting sales daily product sales plus monthly consulting sales
give monthly combined sales amountgive monthly combined sales amount Details of process available in metadataDetails of process available in metadata
Data Re-engineering Data Re-engineering ProblemProblem
Programming for the unknownProgramming for the unknown Unanticipated values, structures and patternsUnanticipated values, structures and patterns
Programming for noise and uncertaintyProgramming for noise and uncertainty Conflicting and missing valuesConflicting and missing values
Programming for productivity and efficiencyProgramming for productivity and efficiency Changing data values, changing user requirementsChanging data values, changing user requirements High volumes, non-linear searchesHigh volumes, non-linear searchesConventional data transformation methods do not solve Conventional data transformation methods do not solve
the metadata and data value challenges – need data the metadata and data value challenges – need data re-engineeringre-engineering Stephen Brown, Vality Corp.Stephen Brown, Vality Corp.
Data Re-engineering ProcessData Re-engineering Process
External Files
Legacy Applications
Historical Extracts
•Customer Information Systems
•Data Warehouses
•Client/server Applications
•Consolidations
• Data investigation and Metadata Mining
• Data Standardization
• Data Integration
• Data Survivorship and Formating
Natural Laws of Natural Laws of Data Re-engineeringData Re-engineering
Data has no standardData has no standard You can’t predict or legislate format or contentYou can’t predict or legislate format or content Data will evolve faster than its capture and storage systemsData will evolve faster than its capture and storage systems You can’t write rules for what you don’t know and can’t seeYou can’t write rules for what you don’t know and can’t see Instructions for handling data are within the dataInstructions for handling data are within the data Don’t trust the metadata, make the data reveal itselfDon’t trust the metadata, make the data reveal itself Revealed metadata is knowledge about the businessRevealed metadata is knowledge about the business Revealed metadata validates warehouse designRevealed metadata validates warehouse design Revealed metadata supports conversion project managementRevealed metadata supports conversion project management Revealed metadata is insurance against misinformationRevealed metadata is insurance against misinformation
3 - DW Tools3 - DW Tools
1st Generation ETL
Job Schedulers
RDBMS Utilities
Replication/Distribution Tools
Universal Repositories
Q&R/MQE/MRE
MOLAP/ROLAP/LowLAP
Data Mining
CASE
DB Design
Repositories
DB & System Monitors
TechnologiesTechnologies
ProcessesProcesses
EIS
Data Visualization
Meta Data Browsers
•Design•Mapping
•Design•Mapping
•Extract•Scrub•Transform
•Extract•Scrub•Transform
•Load•Index•Aggregation
•Load•Index•Aggregation
•Replication•Data Set Distribution
•Replication•Data Set Distribution
•Access & Analysis•Resource Scheduling & Distribution
•Access & Analysis•Resource Scheduling & Distribution
Meta DataMeta Data
System MonitoringSystem Monitoring
2nd Generation ETL Suites / Environments
TransformationTransformation Choosing between tool and manually coded Choosing between tool and manually coded
programsprograms Time frames - tools take longerTime frames - tools take longer
select, configure, learnselect, configure, learn
Budgets - short term or long termBudgets - short term or long term Size of warehouse - initial project small enough for Size of warehouse - initial project small enough for
codingcoding Size and skills of warehouse teamSize and skills of warehouse team
Tool automatically generates and maintains Tool automatically generates and maintains metadatametadata
Hand Generated CodeHand Generated Code
UpsideUpside No learning curveNo learning curve Inherent skillsInherent skills In house capabilitiesIn house capabilities Usually simpleUsually simple No culture change/mandate (CASE)No culture change/mandate (CASE)
DownsideDownside Manual meta dataManual meta data Maintenance challenge when talent level changesMaintenance challenge when talent level changes No automationNo automation
ToolsTools
UpsideUpside Easy to maintain as talent level changesEasy to maintain as talent level changes Automatic meta dataAutomatic meta data May gain efficienciesMay gain efficiencies Integration with repositoriesIntegration with repositories Integration with other toolsIntegration with other tools
SchedulersSchedulers MonitorsMonitors Meta data managementMeta data management
ToolsTools
DownsideDownside Cost (1st generation tools very high $)Cost (1st generation tools very high $) Learning curveLearning curve Enforced culture changeEnforced culture change
Must use tool for all changesMust use tool for all changes
Speed, may be slower to implementSpeed, may be slower to implement May require additional resourcesMay require additional resources
Copyright © 1997, Enterprise Group, Ltd.
Manual Code / 1st Generation Manual Code / 1st Generation ETL Tools ProcessETL Tools Process
ExtractProgram
SourceSourceOLTPOLTP
SystemsSystems
TransformProgram
FileTransferProgram
FileLoad
Program
Source Mainframe or C/S System Data Warehouse Client/Server System
IndexProgram
AggregationProgram
External Job Scheduling and Control - External Meta Data Load/Maintenance
Copyright © 1997, Enterprise Group, Ltd.
2nd Generation ETL Tools 2nd Generation ETL Tools ProcessProcess
Source OLTPSource OLTPSystemsSystems
TransformationEngine
•Monitoring•Scheduling•Extraction•Scrubbing•Transformation•Load•Index•Aggregation•Meta Data Load•Meta Data Maint.
CachingCaching
Source Mainframeor C/S System
Transformation EngineC/S System
Data Warehouse/MartC/S System
Data WarehouseData Warehouseor Data Martor Data Mart
Copyright © 1997, Enterprise Group, Ltd.
2nd Generation ETL Environment 2nd Generation ETL Environment ProcessProcess
Source OLTPSource OLTPSystemsSystems
TransformationEngine
•Monitoring•Scheduling•Extraction•Scrubbing•Transformation•Load•Index•Aggregation•Meta Data Load•Meta Data Maint.•Request Broker
CachingCaching
Source Mainframeor C/S System
Transformation EngineC/S System
Data WarehouseC/S System
Enterprise Meta Data
Data MartC/S System
Data MartData Mart
Data MartC/S System
Data MartData Mart
•Surf Meta Data•Request Resource•Schedule Delivery
User Process
DataWarehouse
1st Generation ETL Tools 1st Generation ETL Tools Hampered by:Hampered by:
High cost (average deal prices in the $250-400k range)High cost (average deal prices in the $250-400k range) Long learning curvesLong learning curves Perceived value (most teams felt they could write better code)Perceived value (most teams felt they could write better code) Cultural challenges (like a CASE tool, the team must use the Cultural challenges (like a CASE tool, the team must use the
code generator for all creation and changes, no matter how code generator for all creation and changes, no matter how minor)minor)
Core capabilities (complex transformations still required manual Core capabilities (complex transformations still required manual code)code)
Management requirements (users still had to manage all the Management requirements (users still had to manage all the programs generated)programs generated)
Performance issues (the resulting programs could not leverage Performance issues (the resulting programs could not leverage parallelism)parallelism)
Important 2nd Generation ETL Important 2nd Generation ETL tool features:tool features:
Transformation engine designTransformation engine design Ability to leverage parallel server technologyAbility to leverage parallel server technology CDC (Change Data Capture, which allows only the CDC (Change Data Capture, which allows only the
new data to be extracted)new data to be extracted) Incremental aggregation (ability to add CDC Incremental aggregation (ability to add CDC
incremental data to existing aggregations)incremental data to existing aggregations) Limited or no use of temporary files or data base tables Limited or no use of temporary files or data base tables
(virtual caching only)(virtual caching only) Common, open and extensible meta data repositoryCommon, open and extensible meta data repository Enterprise scalabilityEnterprise scalability
Important 2nd Generation ETL Important 2nd Generation ETL tool features:tool features:
Common UI (User Interface) across all toolsCommon UI (User Interface) across all tools Extensive selection of transformation algorithmsExtensive selection of transformation algorithms Easily extensible scrub and transform algorithm library Easily extensible scrub and transform algorithm library Extensive heterogeneous source and target supportExtensive heterogeneous source and target support Native OLAP data set target supportNative OLAP data set target support System monitoring & management System monitoring & management Enterprise meta data repository (content, resources, Enterprise meta data repository (content, resources,
structure, etc.)structure, etc.) Transform once, populate many (populate multiple targets Transform once, populate many (populate multiple targets
with a single transformation output)with a single transformation output)
Important 2nd Generation ETL Important 2nd Generation ETL tool features:tool features:
Integrated enterprise scale scrubbing capabilitiesIntegrated enterprise scale scrubbing capabilities Seamless interoperability with external point solution Seamless interoperability with external point solution
toolstools Integrated information access, analysis, scheduling Integrated information access, analysis, scheduling
and deliveryand delivery Aggregate aware information request broker (enables Aggregate aware information request broker (enables
virtual data warehouse)virtual data warehouse) Ad hoc aggregation monitoring and managementAd hoc aggregation monitoring and management Pipeline parallelism / very high throughputPipeline parallelism / very high throughput Native drivers (source and target)Native drivers (source and target)
OLTP <> OLAPOLTP <> OLAP
OLTPOLTP normalizednormalized
OLAPOLAP tools must provide multidimensional tools must provide multidimensional
conceptual view of data ??????conceptual view of data ??????Providing OLAP to User Analysts, Providing OLAP to User Analysts, E.F.CoddE.F.Codd
redundant dataredundant data