Post on 06-Feb-2018

© 2006 IBM Corporation

IBM Information Integration Capabilities

2

The IBM Solution: IBM Information Server
Delivering information you can trust

IBM Information Server

Understand: Discover, model, and govern information structure and content

Cleanse: Standardize, merge, and correct information

Transform: Combine and restructure information for new uses

Deliver: Synchronize, virtualize, and move information for in-line delivery

Parallel Processing
Rich Connectivity to Applications, Data, and Content

Unified Deployment

Unified Metadata Management

3

Why Is It Important to Start with Understanding?

Where is my information?

How do I get it when I need it?

What does it mean?

Can I trust it?

How do I get it in the form I need?

How do I get it where it needs to go?

How do I control it?

4

Physical Metadata: WebSphere Information Analyzer

Data-centric analysis of application, database and file-based sources

Secure, detailed profiling of fields, across fields, and across sources

Creation of metadata from profiling results

Results instantly promotable across IBM Information Server

Understand: Analyze source data structures, and monitor adherence to integration and quality rules

WebSphere Information Analyzer

Data Analysts

Subject Matter Experts

Physical View

5

WebSphere Information Analyzer

What is it?
Next generation data profiling and analysis tool for heterogeneous enterprise data sources

• Integrates profiling capabilities from three distinct products

What does it do?
Analyzes data sources to discover structure, contents and quality of information

• Infers the “reality” of the data, not just the data definition
• Finds and reports missing, inaccurate and inconsistent data
• Allows review of the quality of data throughout the life cycle

Who uses it?
Business and Data Analysts, Data Quality Specialists, Data Architects and Data Stewards

6

WebSphere Information Analyzer

End-to-End Data Profiling and Content Analysis

– Combines data profiling, data audit, and data format investigation technologies

– Provides column, primary key, foreign key, and cross-domain analysis

– Incorporates comparative analysis against established baselines over time

– Leverages central repository for analysis results with project- and role-level data security

Driven by Business

– Intuitive and Collaborative Environment

– Visualization of data analysis

– Extensive Reporting of analytical results

Exploiting Unique Information Integration Platform Advantages

– Shared metadata and connectivity services

– Shared analytical results with WebSphere DataStage/QualityStage

– Parallel Engine technology for highly scalable performance

7

A single unified and integrated framework

A new and exciting visual design

Pillar menu focused on methodology and user-based tasks, not products

Environment that promotes collaboration

Personalization and customization

Information Analyzer Home Screen

8

Full graphical enablement and display of key analytical data

Potential problems flagged for easy identification

Multiple open workspaces and tabs for easy navigation to facilitate review

Ability to filter results to quickly focus on business issues

Information Analyzer Drill Down

9

Quality Controls for Completeness and Validity of data values

Incomplete or Invalid values set by value, range, or reference sources

Consistency checks for data formats

Information Analyzer Validation

10

Information Analyzer Spotlight: Column Analysis

•Domain Values & Validation

•Data Classification

•Data Properties

•Formats

11

Frequencies of data values and format patterns

Classification of data by system and user

Inferences of data properties (e.g. data type, length, uniqueness)

Information Analyzer Spotlight
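
The column analysis outputs listed on this slide (value frequencies, format patterns, inferred properties) can be illustrated with a minimal Python sketch. This is not Information Analyzer's implementation: the function names, the pattern alphabet ('A' for a letter, '9' for a digit), and the sample values are assumptions made for illustration only.

```python
from collections import Counter

def format_pattern(value: str) -> str:
    """Map letters to 'A' and digits to '9'; keep other characters as-is."""
    return "".join(
        "A" if ch.isalpha() else "9" if ch.isdigit() else ch
        for ch in value
    )

def profile_column(values: list[str]) -> dict:
    """Summarize one column: frequencies, format patterns, inferred properties."""
    non_empty = [v for v in values if v != ""]
    all_digits = bool(non_empty) and all(v.isdigit() for v in non_empty)
    return {
        "value_frequencies": Counter(values),
        "format_frequencies": Counter(format_pattern(v) for v in non_empty),
        "inferred_type": "integer" if all_digits else "string",
        "max_length": max((len(v) for v in values), default=0),
        "is_unique": len(set(values)) == len(values),
    }

# Hypothetical ZIP code column with one malformed and one empty value.
zip_codes = ["02116", "02116", "01456", "0415A", ""]
profile = profile_column(zip_codes)
print(profile["format_frequencies"].most_common())
```

A rare format pattern (here the single value that is not five digits) is the kind of outlier a frequency view makes easy to spot.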

12

Quality Controls for Completeness and Validity of data values

Incomplete or Invalid values set by value, range, or reference sources

Conformity checks for data formats

Information Analyzer Spotlight
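
As a rough illustration of flagging incomplete or invalid values by value, range, or reference source, the sketch below classifies single values. The function name, its parameters, and the sample data are hypothetical and do not come from Information Analyzer.

```python
def classify(value, invalid_values=(), valid_range=None, reference=None):
    """Return 'incomplete', 'invalid', or 'valid' for one data value."""
    # Completeness: missing or blank values.
    if value is None or str(value).strip() == "":
        return "incomplete"
    # Validity by value: an explicit list of known-bad values.
    if value in invalid_values:
        return "invalid"
    # Validity by range: inclusive lower and upper bounds.
    if valid_range is not None:
        low, high = valid_range
        if not (low <= value <= high):
            return "invalid"
    # Validity by reference: the value must appear in a reference set.
    if reference is not None and value not in reference:
        return "invalid"
    return "valid"

# Hypothetical checks: ages by value and range, state codes by reference.
ages = [34, 999, 150, None]
age_results = [classify(a, invalid_values=(999,), valid_range=(0, 120)) for a in ages]

states = ["MA", "Mass", ""]
state_results = [classify(s, reference={"MA", "NH"}) for s in states]

print(age_results)
print(state_results)
```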

13

Easily generate reference tables of default, valid, or invalid data

Incorporate transformation mapping values

Preview table output

Export reference tables to desired location for ongoing use

Leverage in WebSphere DataStage or QualityStage jobs

Information Analyzer Spotlight

14

Drilldown to underlying data

Review exception conditions from profiling or data rules

View in workspace with associated information

Filter drilldown results to enhance understanding

Information Analyzer Spotlight

15

Information Analyzer Spotlight: Table Analysis

•Primary Keys (single or multi-column)

•Key Duplicates

16

Evaluate single or multi-column primary keys

Summary and detail of column uniqueness

Details of primary key duplicates

Review of frequency distribution

Information Analyzer Spotlight
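
A minimal sketch of what evaluating a single or multi-column primary key candidate involves: measure how unique the key values are and list the duplicates. The function and the sample rows are assumptions for illustration, not the product's algorithm.

```python
from collections import Counter

def key_analysis(rows, key_columns):
    """Return (uniqueness ratio, duplicate key values) for a candidate key."""
    keys = [tuple(row[c] for c in key_columns) for row in rows]
    counts = Counter(keys)
    duplicates = {k: n for k, n in counts.items() if n > 1}
    uniqueness = len(counts) / len(keys) if keys else 0.0
    return uniqueness, duplicates

# Hypothetical order-line rows; the last row repeats (2, 1).
rows = [
    {"order_id": 1, "line_no": 1},
    {"order_id": 1, "line_no": 2},
    {"order_id": 2, "line_no": 1},
    {"order_id": 2, "line_no": 1},
]

single = key_analysis(rows, ["order_id"])
multi = key_analysis(rows, ["order_id", "line_no"])
print(single)
print(multi)
```

In this sample the two-column candidate is closer to unique than the single column, but the one remaining duplicate still rules it out as a primary key until the data is corrected.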

17

Information Analyzer Spotlight: Cross Table Analysis

•Foreign Key Relationships

•Referential Integrity

•Cross-Domain Relationships

•Data Redundancy

18

Evaluate single or multi-column foreign keys across any number of tables and sources

Summary of referential integrity

Details of key violations including orphaned values

Test any set of common domains for compatibility or redundancy

Information Analyzer Spotlight
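
The foreign key checks on this slide can be sketched as a set comparison between child and parent key values. The function name, table layout, and sample rows below are hypothetical; the sketch only shows the idea of a referential integrity summary plus a list of orphaned values.

```python
def referential_integrity(child_rows, parent_rows, child_cols, parent_cols):
    """Return (share of child rows with a parent, sorted orphaned key values)."""
    parent_keys = {tuple(r[c] for c in parent_cols) for r in parent_rows}
    child_keys = [tuple(r[c] for c in child_cols) for r in child_rows]
    orphans = sorted({k for k in child_keys if k not in parent_keys})
    matched = sum(1 for k in child_keys if k in parent_keys)
    integrity = matched / len(child_keys) if child_keys else 1.0
    return integrity, orphans

# Hypothetical tables: one order refers to a customer that does not exist.
customers = [{"id": 1}, {"id": 2}, {"id": 3}]
orders = [{"cust_id": 1}, {"cust_id": 1}, {"cust_id": 2}, {"cust_id": 4}]

integrity, orphans = referential_integrity(orders, customers, ["cust_id"], ["id"])
print(integrity, orphans)
```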

19

Information Analyzer Spotlight: Baseline Analysis

•Current-to-Prior Comparison

•Content & Structural Variation

20

Compare a checkpoint or current analysis to a baseline

Table-level summary & column-level details

Identify changes in structure or content

Includes changes in quality measures

Turns data profiling into an ongoing event throughout project lifecycle

Information Analyzer Spotlight
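
Comparing a current analysis to a baseline amounts to diffing two sets of profiling results. The sketch below assumes a simple, hypothetical representation (a dictionary of per-column measures) and reports structural changes (columns added or removed) and content changes (measures that differ).

```python
def compare_to_baseline(baseline: dict, current: dict) -> dict:
    """Diff two profiles shaped as {column: {measure: value}}."""
    changes = {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "changed": {},
    }
    for column in sorted(baseline.keys() & current.keys()):
        measures = sorted(baseline[column].keys() | current[column].keys())
        diffs = {
            m: (baseline[column].get(m), current[column].get(m))
            for m in measures
            if baseline[column].get(m) != current[column].get(m)
        }
        if diffs:
            changes["changed"][column] = diffs
    return changes

# Hypothetical profiles: 'phone' was dropped, 'email' added, 'zip' got nulls.
baseline = {
    "zip": {"type": "string", "null_pct": 0.0},
    "phone": {"type": "string", "null_pct": 5.0},
}
current = {
    "zip": {"type": "string", "null_pct": 2.5},
    "email": {"type": "string", "null_pct": 0.0},
}

report = compare_to_baseline(baseline, current)
print(report)
```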

21

All analytical processes can be scheduled

Scheduling supports: start date or delay, repeating definitions, end date or delay, and repeat count to stop schedule

Information Analyzer Spotlight
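
To make the scheduling options concrete, here is a small sketch of how a start date, a repeat interval, an end date, and a repeat count could interact. It is an illustration under assumed semantics, not the product's scheduler; the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta

def run_times(start, interval, end=None, repeat_count=None):
    """Yield run times from start, every interval, until end or repeat_count.

    With neither end nor repeat_count set, the schedule never stops.
    """
    current, emitted = start, 0
    while (end is None or current <= end) and (
        repeat_count is None or emitted < repeat_count
    ):
        yield current
        current += interval
        emitted += 1

start = datetime(2006, 1, 1, 2, 0)

# Stop after three runs.
by_count = list(run_times(start, timedelta(days=1), repeat_count=3))

# Stop at an end date instead.
by_end = list(run_times(start, timedelta(days=1), end=datetime(2006, 1, 2, 12, 0)))

print(by_count)
print(by_end)
```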

22

Information Analyzer Spotlight

Notes on any Repository Object

– Metadata Information

– Any Analytical Result

Supports user-defined Status and Type for subsequent reporting

23

Multi-level security and administration framework:

Suite

Product

Project

Data source

Standard Authentication controls

User, role, and privilege assignment

Environment that supports critical compliance regulations

Information Analyzer Highlights

24

Metadata discovery shared across Suite

Projects register interest only in Data Sources of concern

Metadata Import focused on user interest

Analytical results published in secured framework

Information Analyzer Spotlight

26

Why Should I Care About Cleansing Information?

Lack of information standards
– Different formats & structures across different systems

Data surprises in individual fields
– Data misplaced in the database

Information buried in free-form fields

Data myopia
– Lack of consistent identifiers inhibits a single view

The redundancy nightmare
– Duplicate records with a lack of standards

Kate A. Roberts 416 Columbus Ave #2, Boston, Mass 02116

Catherine Roberts Four sixteen Columbus APT2, Boston, MA 02116

Mrs. K. Roberts 416 Columbus Suite #2, Suffolk County 02116

Name                      Tax ID         Telephone
J Smith DBA Lime Cons.    228-02-1975    6173380300
Williams & Co. C/O Bill   025-37-1888    415-392-2000
1st Natl Provident        34-2671434     3380321
HP 15 State St.           508-466-1200   Orlando

WING ASSY DRILL 4 HOLE USE 5J868A HEXBOLT 1/4 INCH

WING ASSEMBY, USE 5J868-A HEX BOLT .25” - DRILL FOUR HOLES

USE 4 5J868A BOLTS (HEX .25) - DRILL HOLES FOR EA ON WING ASSEM

RUDER, TAP 6 WHOLES, SECURE W/KL2301 RIVETS (10 CM)

19-84-103 RS232 Cable 6' M-F CandS

CS-89641 6 ft. Cable Male-F, RS232 #87951

C&SUCH6 Male/Female 25 PIN 6 Foot Cable

90328574 IBM 187 N.Pk. Str. Salem NH 01456
90328575 I.B.M. Inc. 187 N.Pk. St. Salem NH 01456
90238495 Int. Bus. Machines 187 No. Park St Salem NH 04156
90233479 International Bus. M. 187 Park Ave Salem NH 04156
90233489 Inter-Nation Consults 15 Main Street Andover MA 02341
90345672 I.B. Manufacturing Park Blvd. Bostno MA 04106

27

Data Cleansing: WebSphere QualityStage

Specialized data quality functions seamlessly integrated with DataStage

Visual tools for defining complex matching and survivorship logic

Ensures clean, standardized, de-duplicated information

Enables a single version of the truth

Cleanse: Standardize and correct source data fields, and match records together across sources to create a single view

WebSphere QualityStage™

Visual Match Rule Design

Data Analysts

Subject Matter Experts

28

Integrated Approach - QualityStage & Information Analyzer

Sharing metadata

Both Information Analyzer and QualityStage store Table metadata in the common repository

• Allows sharing of metadata definitions
• Provides single metadata import from data source for use in both tools

– Analytical information available in QS Designer

• Enables QualityStage user to see analysis data for shared tables
• “Analytical Information” tab on the EditRow dialog when looking at the details of an individual column from…
  – …a Table Definition
  – …a stage editor
• “Analytical Information” tab on the TableDefinition dialog

29

Standardization Benefits

Direct from DB or flat file

Optimize disk

Rules are now ‘first class’ objects

30

Introduction to New Match Design Environment - Features

The Major Components

Holding Area

Histogram

Data Viewer

Decision Rules

Pass Composer

Cutoff Tuning

31

Introduction to New Match Design Environment - Features

The Major Components (cont.)

Statistics

Baseline Analysis

Customizable Graphics

32

QualityStage Process

Data Quality Assessment (DQA): Investigation

Data Re-Engineering (DRE): Standardization, Matching, Survivorship

Blk 1, 1 St, 05-00
05-00 Frist St, Block 1
1 First Str, #05-00
1, St, #05-00

Blk 1|First St|05-00
Blk 1|First St|05-00
1|First St|#05-00
1|St|#05-00

Blk 1|First St|05-00
Blk 1|First St|05-00
1|First St|#05-00
1|St|#05-00

#05-00, Blk 1, First St
#05-00, 1, St

0001 25.0% L^^T^-^
0001 25.0% ^-^+TL^
0001 25.0% ^OT#^-^
0001 25.0% ^T#^-^

33

Investigation - Character

1. Double Click

34

Investigation - Character

2. Select Column 3. Add

35

Investigation - Character

9. Define output as desired

36

Standardization

1. Double Click

Job: Tech Symposium\QualityStage\2.Standardarize\StanAndGenMatchFreqODBC

37

Standardization

1. Double Click

38

Standardization

6. Stage Properties

39

Standardization

7. Output tab to map columns

40

Standardization

8. OK
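
For readers new to standardization, the sketch below shows the basic idea on address variants like those on the QualityStage Process slide: replace known variant tokens with one standard form. It is a deliberately simplified lookup, not QualityStage's rule-set mechanism; the table contents and function name are hypothetical.

```python
import re

# Hypothetical lookup table: variant token (upper case) -> standard form.
STANDARD_FORMS = {
    "BLOCK": "Blk",
    "BLK": "Blk",
    "STR": "St",
    "STREET": "St",
    "ST": "St",
    "FRIST": "First",
    "FIRST": "First",
}

def standardize(text: str) -> str:
    """Replace each alphabetic token that has a known standard form."""
    def replace(match):
        token = match.group(0)
        return STANDARD_FORMS.get(token.upper(), token)
    return re.sub(r"[A-Za-z]+", replace, text)

first = standardize("05-00 Frist St, Block 1")
second = standardize("1 First Str, #05-00")
print(first)
print(second)
```

A lookup like this only normalizes individual tokens; parsing the address into separate fields, as in the pipe-delimited records shown earlier, needs additional rules.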

41

Match Design - Unduplicate

42

Match Design – Unduplicate - Overview

The Major Components

Holding Area

Histogram

Data Viewer

Decision Rules

Pass Composer

Cutoff Tuning

43

Match Design - Unduplicate

1. Create Specification

44

Match Design - Unduplicate

Blank Specification

45

Match Design - Unduplicate

2. Select Match Type

46

Match Design - Unduplicate

3. Double click on link to load metadata

4. Load

5. Navigate and OK

47

Match Design - Unduplicate

OK

48

Match Design - Unduplicate

6. Click on ‘MyPass’

‘Blocking’

‘Match Commands’

49

Match Design - Unduplicate

8. Save Match Specification

50

Match Design - Unduplicate

9. Give Name and ‘Save’

51

Match Design - Unduplicate

10. Configuration

52

Match Design - Unduplicate

11. Data Sample

12. Data Frequency

13. Data Source Name

14. User Name (qsmatch)

15. Password (qsmatch)

53

Match Design - Unduplicate

16. Add Blocking Columns

54

Match Design - Unduplicate

17. Select Column

55

Match Design - Unduplicate

18. Add MATCH Column

56

Match Design - Unduplicate

19. Business Name

57

Match Design - Unduplicate

20. Compare Type

58

Match Design - Unduplicate

21. Data Column: Right-Click

59

Match Design - Unduplicate

Frequencies

60

Match Design - Unduplicate

22. Select

23. Parameter

61

Match Design – Unduplicate (Fully Configured)
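
The configuration steps above set up blocking columns, match columns, and compare types. As a conceptual aid, the sketch below shows how those pieces typically fit together in probabilistic matching: records are compared only within a block, each match column contributes an agreement or disagreement weight, and pairs scoring at or above a cutoff are kept. The m/u probabilities, column names, cutoff, and records are hypothetical, and exact equality stands in for whatever compare type is selected in the designer; this is not the product's scoring code.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical probabilities per match column:
# m = P(values agree | same entity), u = P(values agree | different entities).
MATCH_COLUMNS = {"business_name": (0.9, 0.01), "city": (0.8, 0.2)}

def pair_weight(a, b):
    """Sum agreement/disagreement weights over all match columns."""
    total = 0.0
    for column, (m, u) in MATCH_COLUMNS.items():
        if a[column] == b[column]:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total

def match_pairs(records, blocking_column, cutoff):
    """Compare records within each block; keep pairs at or above the cutoff."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[blocking_column]].append(record)
    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            weight = pair_weight(a, b)
            if weight >= cutoff:
                pairs.append((a["id"], b["id"], round(weight, 2)))
    return pairs

records = [
    {"id": 1, "business_name": "IBM", "city": "Salem", "zip": "01456"},
    {"id": 2, "business_name": "IBM", "city": "Salem", "zip": "01456"},
    {"id": 3, "business_name": "Inter-Nation Consults", "city": "Andover", "zip": "02341"},
    {"id": 4, "business_name": "IBM", "city": "Salem", "zip": "04156"},
]

pairs = match_pairs(records, blocking_column="zip", cutoff=5.0)
print(pairs)
```

Record 4 is never compared with records 1 and 2 because its ZIP code places it in a different block, which is why the choice of blocking columns matters as much as the match columns.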

62

Match Design – Unduplicate

Grouping option:

Match Sets: See all matches and duplicates together

Match Pairs+Sort: See the master record repeated

63

Match Design – Unduplicate

Default Display (Grouped by Match Sets)

Grouped by Match Pairs and then sorted Ascending by Weight

64

Match Design – Unduplicate

Compare Weights: See how any two records score

65

Match Design – Unduplicate

Statistics Tab

Change What Shows

66

Match Design – Unduplicate

Change How Shows

67

Match Design – Unduplicate

TOTAL Statistics Tab

Change What Shows

Change How Shows

68

Match Implementation - Unduplicate

69

Unduplication Implementation

Job: Tech Symposium\QualityStage\3.Unduplicate\Unduplicate

1. Double Click

70

Unduplication Implementation

2. Click ‘…’

71

Unduplication Implementation

8. Output Tab to map columns

72

Survive

73

Survive

Job: Tech Symposium\QualityStage\4.Survive\Survive

1. Double Click

74

Survive

2. Select Group Identification Column

3. Highlight and ‘Modify Rule’

75

Survive

4. Output Column

5. Technique

76

Survive

Out-of-the-box Techniques

77

Survive

‘Complex’ available
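
The Survive steps above pick a group identification column and then a technique per output column. A minimal sketch of that idea follows: group the matched records, then apply one technique to each output column to build the surviving record. The two techniques, the column names, and the sample records are hypothetical illustrations and are not taken from the product's list of techniques.

```python
from collections import Counter, defaultdict

def longest(values):
    """Keep the longest value (first one wins on ties)."""
    return max(values, key=len)

def most_frequent(values):
    """Keep the value that occurs most often."""
    return Counter(values).most_common(1)[0][0]

def survive(records, group_column, rules):
    """Build one surviving record per group using a technique per column."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_column]].append(record)
    survivors = []
    for group_id, members in groups.items():
        survivor = {group_column: group_id}
        for column, technique in rules.items():
            survivor[column] = technique([m[column] for m in members])
        survivors.append(survivor)
    return survivors

# Hypothetical match set: three variants of the same person.
matched = [
    {"set_id": 1, "name": "Kate A. Roberts", "state": "MA"},
    {"set_id": 1, "name": "Catherine Roberts", "state": "MA"},
    {"set_id": 1, "name": "Mrs. K. Roberts", "state": "Mass"},
]

survivors = survive(matched, "set_id", {"name": longest, "state": most_frequent})
print(survivors)
```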

78

Single Design Environment

All phases of data quality:

– Investigate

– Standardize

– Match
  • Unduplicate
  • Reference

– Survive

79