Post on 05-Jun-2018
© 2006 IBM Corporation
IBM Information Integration Capabilities
Toon Latinne, IBM Software Group
Grimbergen, November 2006
Overview of IBM Information Server
2
Customer Business Issues
• Too much information and not knowing what’s important
– Not using demand signals to drive supply chain
– Not using customer analysis to tailor marketing and sales
– Not leveraging valuable unstructured information
• Multiple versions of the truth
– Problems managing customer, product and partner interactions
– Regulatory compliance inhibited by poor transparency
• Lack of trusted information
– Incomplete, out-of-date, inaccurate, misinterpreted data
– Difficult to understand or control how information is used
• Lack of agility
– Inability to take advantage of opportunities for innovation
– Escalating costs due to inflexible systems and changing needs
3
Why Is It Important to Start with Understanding?
• Where is my information?
• How do I get it when I need it?
• What does it mean?
• Can I trust it?
• How do I get it in the form I need?
• How do I get it where it needs to go?
• How do I control it?
4
Multiple information silos across company
Customer data spread across many databases:
• eCommerce: Michael Johnson; User id: mjohnson; JP Morgan; Contract: JP987
• ERP: JP Morgan, USA; Cust ID: JP003
• Customer Support: Mike Johnson; JP Morgan Chase; Last Interaction: 4/11/03 (product not received)
• Sales: JP Morgan & Chase; Contact: Michael A Johnson, CIO; 270 West St, NY
• Portal: Michael Johnson; User ID: Mjohnso; Personalized access; Gold Customer; Sub: Newsletter 1
• Marketing: Michael Johnson; Opt-Out flag; No Promotion flag
• Corp DW
Specific systems: Online ordering; Payments; Service & Support; Orders; Online processing; Seminars, Newsletter, Promotion
5
The IBM Solution: IBM Information Server
Delivering information you can trust
• Understand: Discover, model, and govern information structure and content
• Cleanse: Standardize, merge, and correct information
• Transform: Combine and restructure information for new uses
• Deliver: Synchronize, virtualize, and move information for in-line delivery
Platform Services: Parallel Processing Services; Connectivity Services; Metadata Services; Deployment Services; Administration Services
6
Introducing the IBM Information Server
• Understand: WebSphere Information Analyzer; WebSphere Business Glossary; Rational Data Architect
• Cleanse: WebSphere QualityStage
• Transform: WebSphere DataStage
• Federate: WebSphere Federation
Platform Services: Parallel Processing Services; Connectivity Services; Metadata Services; Deployment Services; Administration Services
8
Data Profiling
Critical Problems:
• You don’t know what data is really in your legacy systems
• Sources have changed or are new and unknown
Why?
• Data values and relationships are inconsistent and diverge from documented rules
• Documentation is incomplete or missing
• Data sources are never static and frequently change without warning
Alternative Approach
• Labor-intensive, resource-devouring process
• Never reviews 100% of data elements
• No standardized approach across projects
• First-generation tools document problems but don’t address their resolution
Data Sources: Mainframe manufacturing system; Demographic; Contact; Billing / Accounts; External Lists; Distribution; ERP from acquisition; Parts BOM
9
IBM’s solution: Information Analyzer
What is it?
Next-generation data profiling and analysis tool for heterogeneous enterprise data sources
• Integrates profiling capabilities from three distinct products
What does it do?
Analyzes data sources to discover structure, contents and quality of information
• Infers the “reality” of the data, not just the data definition
• Finds and reports missing, inaccurate and inconsistent data
• Allows review of the quality of data throughout the life cycle
• Reviews 100% of data elements
• Standardized approach
• Automated
Who uses it?
Business Analysts & Data Analysts
10
Column Analysis – Example
Status Code is ‘I’, ‘A’ or ‘ACCT_STATUS’. Why?
11
Column Analysis – Example
ZipCode has different formats. Why?
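The two column-analysis examples can be illustrated with a small sketch. This is not how Information Analyzer is implemented; the helper names `frequency_distribution` and `format_pattern` and all sample values are assumptions made for illustration. The idea is to count every distinct value in a column and to reduce each value to a format mask (digit becomes 9, letter becomes A), so that stray values such as a header loaded as data, or mixed ZIP code formats, stand out.

```python
from collections import Counter

def frequency_distribution(values):
    """Count how often each distinct value occurs in a column."""
    return Counter(values)

def format_pattern(value):
    """Reduce a value to a format mask: digits become 9, letters become A."""
    mask = []
    for ch in value:
        if ch.isdigit():
            mask.append("9")
        elif ch.isalpha():
            mask.append("A")
        else:
            mask.append(ch)  # keep punctuation and spaces unchanged
    return "".join(mask)

# Hypothetical sample data: a column header that was loaded as a data value.
status_codes = ["A", "I", "A", "ACCT_STATUS", "A", "I"]
# Hypothetical ZIP codes stored in several formats.
zip_codes = ["02111", "02111-1234", "2111", "30809", "OH 43606"]

status_freq = frequency_distribution(status_codes)
zip_formats = Counter(format_pattern(z) for z in zip_codes)

print(status_freq.most_common())
print(zip_formats.most_common())
```

A value that occurs once in a column with only two expected codes, or a format mask shared by only a few rows, is a natural candidate for review.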
12
What does Information Analyzer do?
1. Source System Analysis - provides the key understanding of the source data
a) Column analysis
b) Primary Key analysis
c) Foreign Key analysis
d) Cross Domain analysis
2. Monitoring and Integration - leverages the analysis to expedite the development
a) Baseline Analysis to track changes
b) Reference tables for Transformation, Completeness and Validity enforcement
c) Integrated Metadata – results visible in rest of Suite
[Diagram: Column Analysis, Primary Key Analysis, and Foreign Key & Cross Domain Analysis applied to Source 1 and Source 2]
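As a rough illustration of the primary key and foreign key analysis steps listed above, the sketch below tests whether a column is unique and fully populated, and measures how many values in a child column also occur in a parent column. The function names, table layouts and sample rows are all hypothetical; the product's actual algorithms are not described in this deck.

```python
def is_primary_key_candidate(rows, column):
    """A column is a key candidate if it has no empty values and no duplicates."""
    values = [row[column] for row in rows]
    if any(v is None or v == "" for v in values):
        return False
    return len(set(values)) == len(values)

def foreign_key_match_rate(child_rows, child_col, parent_rows, parent_col):
    """Fraction of non-empty child values that also occur in the parent column."""
    parent_values = {row[parent_col] for row in parent_rows}
    child_values = [r[child_col] for r in child_rows
                    if r[child_col] not in (None, "")]
    if not child_values:
        return 0.0
    hits = sum(1 for v in child_values if v in parent_values)
    return hits / len(child_values)

# Hypothetical source tables.
customers = [
    {"cust_id": "JP003", "name": "JP Morgan"},
    {"cust_id": "AC017", "name": "Acme"},
]
orders = [
    {"order_id": 1, "cust_id": "JP003"},
    {"order_id": 2, "cust_id": "JP003"},
    {"order_id": 3, "cust_id": "ZZ999"},  # orphan: no such customer
    {"order_id": 4, "cust_id": "AC017"},
]

print(is_primary_key_candidate(customers, "cust_id"))
print(foreign_key_match_rate(orders, "cust_id", customers, "cust_id"))
```

A match rate below 1.0 points at orphaned rows, which is the kind of finding that would be annotated and flagged for review.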
13
How Does WebSphere Information Analyzer Work?
• Column Analysis: Frequency Distribution Analysis; Class, Properties, Format and Domain & Completeness Analysis; Annotate & Flag Fields for Review
• Table Analysis: Primary Key Analysis
• Cross-Table Analysis: Foreign Key & Similarity Analysis
• Run Baseline Comparisons
• Report Results
16
Need for Data Quality
Critical Problems
• Need to create & maintain 360 degree views of customers, suppliers, products, locations, events
• Need to leverage data - make reliable decisions, comply with regulations, meet service agreements
Why?
• No common standards across organization
• Unexpected values stored in fields
• Required information buried in free-form fields
• Fields evolve - used for multiple purposes
• No reliable keys for consolidated views
• Operational data degrades 2% per month
Alternative Approaches
• Denial – problem misunderstood and ignored until too late; load and explode
• Hand-coding - clerical exception processing; very time consuming and resource intensive
• Simplistic cleansing apps - evolved from direct marketing & list hygiene, lack flexibility
Data Sources / Data Values (examples of the same entity stored in different ways):
• Kent Fried Chick
• Kentucky Fried
• Kentucky Fried Chicken
• KFC
• Molly Talber DBA KFC
• Mrs. M. Talber
• John & Molly Talber
• Talber, KFC, ATIMA
• 227G CB&NATURAL STICK MOZZ WRAPPER
• 227G CB&NAT STICK P QUE/MOZZ WRAPP.
17
WebSphere QualityStage Process
How will you get an accurate, consolidated view of your business?
Sources: Customers; Transactions; Vendors / Suppliers; Products / Materials
1. Free Form Investigation
2. Data Standardization
3. Data Matching
4. Data Survivorship
Target: Database with Consolidated Views
18
• Investigation - Free Form
“The instructions for handling the data are inherent within the data itself.”
Parsing: Separating multi-valued fields into individual pieces
123 | St. | Cecilia | St.
Lexical analysis: Determining business significance of individual pieces
123 | St. | Cecilia | St.
number | street type | City | street type
Context Sensitive: Identifying various data structures and content
123 | St. Cecilia | St.
House Number | Street Name | Street Type
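The three investigation steps (parsing, lexical analysis, context-sensitive interpretation) can be sketched on the slide's own example, "123 St. Cecilia St.". This is a simplified illustration, not QualityStage's rule language: the lookup table `STREET_TYPES`, the function names, and the rule "only the last street-type token is the street type" are assumptions chosen to reproduce the result shown on the slide.

```python
# Hypothetical, tiny lookup table; a real rule set would be much larger.
STREET_TYPES = {"ST.", "ST", "AVE", "AVE.", "RD", "RD."}

def parse(address):
    """Parsing: split a free-form field into individual tokens."""
    return address.split()

def classify(token):
    """Lexical analysis: classify one token without looking at its neighbours."""
    if token.isdigit():
        return "number"
    if token.upper() in STREET_TYPES:
        return "street type"
    return "word"

def resolve(tokens):
    """Context-sensitive step: only the LAST street-type token is the type;
    an earlier one (such as 'St.' for Saint) becomes part of the street name."""
    classes = [classify(t) for t in tokens]
    type_positions = [i for i, c in enumerate(classes) if c == "street type"]
    result = {"house_number": "", "street_name": "", "street_type": ""}
    name_parts = []
    for i, (tok, cls) in enumerate(zip(tokens, classes)):
        if cls == "number" and not result["house_number"]:
            result["house_number"] = tok
        elif cls == "street type" and i == type_positions[-1]:
            result["street_type"] = tok
        else:
            name_parts.append(tok)
    result["street_name"] = " ".join(name_parts)
    return result

tokens = parse("123 St. Cecilia St.")
print([classify(t) for t in tokens])
print(resolve(tokens))
```

Lexical analysis alone labels both "St." tokens as street types; the position-aware step is what recovers "St. Cecilia" as the street name.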
19
• Free Form Investigation – Word Structure
Pattern Code | Frequency | Percent | Data Sample
FI+ | 428,700 | 42.9% | Eileen M. Rutherford
F+ | 278,900 | 27.9% | Robert Johnson
FI+G | 103,700 | 10.4% | Charles S. Horton Jr
F+G | 60,200 | 6.0% | Ralph Gabrik Ii
FI+II | 47,000 | 4.7% | Ben C Tancino M.D.
++&F+ | 35,000 | 3.5% | Maryl Danta and Kay Longo
F&F+ | 29,000 | 2.9% | Nicolas & Claire Moore
PFI+ | 1,000 | 0.1% | Rev. Pascal L Acquavia
FI+BBB | 570 | 0.1% | Richard H Carr et al Trust
TBB | 1 | 0.0% | Attn: Accounts Payable
Legend:
F = First Name
+ = Alpha
I = Initial
P = Name Prefix
G = Generation
& = And
S = Name Suffix
T = Attention
B = Business Words
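A word-structure report like the one above can be approximated by mapping each token of a name to a one-character class code and counting the resulting patterns. The sketch below covers only a few of the legend's codes (F, I, G, &, +); the mini-lexicons and function names are hypothetical and the sample list is invented for illustration.

```python
from collections import Counter

# Hypothetical mini-lexicons; a real classification table would be far larger.
FIRST_NAMES = {"EILEEN", "ROBERT", "CHARLES", "NICOLAS", "CLAIRE"}
GENERATIONS = {"JR", "SR", "II", "III"}
CONJUNCTIONS = {"&", "AND"}

def token_code(token):
    """Map one token to a class code from the legend (subset only)."""
    word = token.strip(".,").upper()
    if word in CONJUNCTIONS:
        return "&"
    if word in GENERATIONS:
        return "G"
    if word in FIRST_NAMES:
        return "F"
    if len(word) == 1 and word.isalpha():
        return "I"  # a single letter is treated as an initial
    return "+"      # any other alphabetic word

def pattern_code(name):
    """Concatenate the class codes of all tokens in a name field."""
    return "".join(token_code(t) for t in name.split())

names = [
    "Eileen M. Rutherford",
    "Robert Johnson",
    "Charles S. Horton Jr",
    "Nicolas & Claire Moore",
    "Robert Smith",
]
report = Counter(pattern_code(n) for n in names)
for pattern, count in report.most_common():
    print(pattern, count, f"{100 * count / len(names):.1f}%")
```

Sorting by frequency shows which few patterns cover most records, so rules can be written for the common structures first and the rare ones reviewed by hand.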
20
• Standardization - Example
Input File (Address Line 1, Address Line 2):
639 N MILLS AVENUE ORLANDO, FLA 32803
306 W MAIN STR, CUMMING, GA 30130
3142 WEST CENTRAL AV TOLEDO OH 43606
843 HEARD AVE AUGUSTA-GA-30904
1139 GREENE ST ACCT #1234 AUGUSTA GEORGIA 30901
4275 OWENS ROAD SUITE 536 EVANS GA 30809
Result File:
House # | Dir | Str. Name | Type | Unit No. | NYSIIS | City | SOUNDEX | State | Zip | ACCT#
639 | N | MILLS | AVE | | MAL | ORLANDO | O645 | FL | 32803 |
306 | W | MAIN | ST | | MAN | CUMMING | C552 | GA | 30130 |
3142 | W | CENTRAL | AVE | | CANTRAL | TOLEDO | T430 | OH | 43606 |
843 | | HEARD | AVE | | HAD | AUGUSTA | A223 | GA | 30904 |
1139 | | GREENE | ST | | GRAN | AUGUSTA | A223 | GA | 30901 | 1234
4275 | | OWENS | RD | STE 536 | ON | EVANS | E152 | GA | 30809 |
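The SOUNDEX column in the result file holds phonetic codes for the city names. The sketch below implements the standard American Soundex algorithm, which reproduces the city codes listed on the slide; whether QualityStage uses exactly this variant is an assumption. The function name `soundex` and the table name `SOUNDEX_CODES` are illustrative.

```python
# Consonant groups of standard American Soundex.
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit

def soundex(word):
    """Return the 4-character Soundex code: first letter plus three digits."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "HW":  # H and W do not separate two equal codes
            previous = digit
    return (code + "000")[:4]  # pad with zeros, truncate to four characters

for city in ["ORLANDO", "CUMMING", "TOLEDO", "AUGUSTA", "EVANS"]:
    print(city, soundex(city))
```

Because similar-sounding spellings share a code, such codes are useful as blocking or comparison keys in the matching step that follows standardization.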
21
• Matching
What Constitutes a Good Match?
Which of the following record pairs is a match? And how do you know?
Pair 1:
W HOLDEN 12 MAIN ST
W HOLDEN 12 MAINE ST
Pair 2:
W HOLDEN 128 MAIN PL 02111 12/8/62
W HOLDEN 128 MAINE PL 02110 12/8/62
Pair 3:
WM HOLDEN 128A MAIN SQ 02111 12/8/62 338-0824
WILL HOLDEN 128A MAINE SQ 02110 12/8/62 338-0824
• Do you compare all the shared or common fields?
• Do you give partial credit?
• Are some fields (or some values) more important to you than others? Why?
• Do more fields increase your confidence?
• By how much? What is enough?
22
• Two Methods to Decide a Match
WILLIAM J KAZANGIAN 128 MAIN ST 02111 12/8/62
WILLAIM JOHN KAZANGIAN 128 MAINE AVE 02110 12/8/62
Are these two records a match?
Deterministic Decision Tables:
• Fields are compared
• Letter grade assigned
• Combined letter grades are compared to a vendor-delivered file
• Result: Match; Fail; Suspect
B B A A B D B A = BBAABDBA
Probabilistic Record Linkage:
• Fields are evaluated for degree-of-match
• Weight assigned: represents the “information content” by value
• Weights are summed to derive a total score
• Result: Statistical probability of a match
+5 +2 +20 +3 +4 -1 +7 +9 = +49
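The deck does not say how the field weights are derived. One widely used formulation of probabilistic record linkage (the Fellegi-Sunter model) computes them from two probabilities per field: m, the chance that the field agrees when two records truly match, and u, the chance that it agrees when they do not. The sketch below is offered under that assumption; the function names and the example probabilities are illustrative, while the list of field weights is the one shown on the slide.

```python
import math

def agreement_weight(m, u):
    """Weight added when a field agrees: log2(m / u).
    m: P(field agrees | records match); u: P(field agrees | no match)."""
    return math.log2(m / u)

def disagreement_weight(m, u):
    """Weight added when a field disagrees: log2((1 - m) / (1 - u)),
    normally a negative number."""
    return math.log2((1 - m) / (1 - u))

# A rare value (small u) carries more information than a common one.
rare = agreement_weight(0.95, 0.001)
common = agreement_weight(0.95, 0.1)
print(round(rare, 2), round(common, 2))

# Field weights from the slide, summed into the total score.
slide_weights = [5, 2, 20, 3, 4, -1, 7, 9]
total_score = sum(slide_weights)
print(total_score)
```

In this reading, the +20 on the slide would belong to a field whose agreeing value is rare, and the -1 to a field that disagrees.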
23
• Probabilistic Scoring Yields More Matches (Less Under-Matching)
In the following household match, the deterministic pattern ABBCB is a non-match (Fail), but the probabilistic cutoff score for 95% certainty is any weighted score > 21.
| L-Name | Hse# | Street | Apt# | Zip | Total | Result
Rec-1 | SMITH | 123 | BEECH | 18A | 02112 | |
Rec-2 | SMITH | 132 | BEACH | 18 | 02111 | |
Pattern | A | B | B | C | B | ABBCB | Reject
Weight | 5 | 2 | 7 | 1 | 4 | 19 | Reject
Rec-3 | YUSKA | 5401 | VETCH | 818A | 02112 | |
Rec-4 | YUSKA | 5410 | VEECH | 81A | 02111 | |
Pattern | A | B | B | C | B | ABBCB | Erroneous Reject
Weight | 7 | 3 | 8 | 2 | 4 | 24 | Pass
Deterministic Decision Tables apply the same “rule” regardless of the difference in information content; to be safe, decision tables must forgo many good matches.
But Probabilistic Linkage “sees” the difference between these two pairs. Rare values can compensate for missing and conflicting fields. The 2nd pair is a good household match, the first is not.
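The contrast between the two methods can be replayed with the numbers on this slide. In the sketch below, the grade patterns, field weights and the cutoff of 21 come from the slide; the contents of `MATCH_PATTERNS` (the decision table) and the function names are hypothetical, chosen only so that ABBCB is a non-match as the slide states.

```python
CUTOFF = 21  # from the slide: a weighted score > 21 counts as a match

# Hypothetical decision table: only these grade patterns count as a match.
MATCH_PATTERNS = {"AAAAA", "AAAAB", "AABAA"}

def deterministic_decision(pattern):
    """Same rule for every pair with the same letter-grade pattern."""
    return "Pass" if pattern in MATCH_PATTERNS else "Reject"

def probabilistic_decision(weights, cutoff=CUTOFF):
    """Sum the per-field weights and compare the total with the cutoff."""
    return "Pass" if sum(weights) > cutoff else "Reject"

pairs = {
    "SMITH (Rec-1/Rec-2)": ("ABBCB", [5, 2, 7, 1, 4]),
    "YUSKA (Rec-3/Rec-4)": ("ABBCB", [7, 3, 8, 2, 4]),
}
for name, (pattern, weights) in pairs.items():
    print(name,
          "| deterministic:", deterministic_decision(pattern),
          "| score:", sum(weights),
          "| probabilistic:", probabilistic_decision(weights))
```

Both pairs share the pattern ABBCB and are rejected by the table, but the rarer values in the YUSKA pair lift its score to 24, above the cutoff.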
24
• A Closer Look at Probabilistic Matching
WILLIAM J KAZANGIAN 128 MAIN ST 02111 12/8/62
WILLAIM JOHN KAZANGIAN 128 MAINE AVE 02110 12/8/62
+5 +2 +20 +3 +4 -1 +7 +9 = 49
The weighted score is a relative measure of the probability of a match; it expresses the amount of information content for all of the fields compared.
The CUTOFF is the score above which good matches are found.
[Figure: Histogram of Weights. x-axis: weight (-50 to 60); y-axis: # of Pairs (0 to 4000); series: UnMatched, Matched]
25
• Survivorship - Example
Survivorship Input (Match Output):
Group | Legacy | First | Middle | Last | No. | Dir. | Str. Name | Type | Unit No.
1 | D150 | Bob | | Dixon | 1500 | SE | ROSS CLARK | CIR |
1 | A1367 | Robert | | Dickson | 1500 | | ROSS CLARK | CIR |
23 | D689 | William | A | Obrian | 5901 | SW | 74TH | ST | STE 202
23 | A436 | Billy | Alex | O’Brian | 5901 | SW | 74TH | ST |
23 | D352 | William | | Obrian | 5901 | | 74 | ST | # 202
Consolidated Output:
Group | First | Middle | Last | No. | Dir. | Str. Name | Type | Unit No.
1 | Robert | | Dickson | 1500 | SE | ROSS CLARK | CIR |
23 | William | Alex | O’Brian | 5901 | SW | 74TH | ST | STE 202
Cross-reference:
Group | Legacy
1 | D150
1 | A1367
23 | D689
23 | A436
23 | D352
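One simple survivorship rule that happens to reproduce the consolidated output above is "for each field, keep the longest non-empty value in the match group". The deck does not state which rules were actually configured, so the rule, the field names in `FIELDS`, and the helpers `rec` and `survive` are assumptions for illustration.

```python
from collections import defaultdict

FIELDS = ["first", "middle", "last", "no", "dir", "street", "type", "unit"]

def rec(group, legacy, *values):
    """Build one input record from positional field values."""
    return {"group": group, "legacy": legacy, **dict(zip(FIELDS, values))}

def survive(records):
    """Per match group, keep the longest value of each field
    (the first record wins ties)."""
    groups = defaultdict(list)
    for record in records:
        groups[record["group"]].append(record)
    consolidated = {}
    for group, members in groups.items():
        consolidated[group] = {
            field: max((m[field] for m in members), key=len)
            for field in FIELDS
        }
    return consolidated

records = [
    rec(1, "D150", "Bob", "", "Dixon", "1500", "SE", "ROSS CLARK", "CIR", ""),
    rec(1, "A1367", "Robert", "", "Dickson", "1500", "", "ROSS CLARK", "CIR", ""),
    rec(23, "D689", "William", "A", "Obrian", "5901", "SW", "74TH", "ST", "STE 202"),
    rec(23, "A436", "Billy", "Alex", "O'Brian", "5901", "SW", "74TH", "ST", ""),
    rec(23, "D352", "William", "", "Obrian", "5901", "", "74", "ST", "# 202"),
]
golden = survive(records)
for group, row in golden.items():
    print(group, row)
```

Real rule sets usually mix criteria (most recent, most frequent, most trusted source); the longest-value rule is just the smallest one that fits this example.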
26
Industry Leading Parsing & Matching
• Probabilistic Matching Engine
• Intuitive match designer, with data sampling, visual fine tuning and extensive reporting
• Parsing handles any number of fields & free-form fields
• Unified parallel processing framework
27
Seamless Integration of Data Quality
• Single design paradigm across data quality and ETL
• Granular design integration of data quality logic with data processing functions
• Unified metamodel with ETL and profiling and active metadata sharing
• Easy SOA deployment of data quality logic
29
What is WebSphere DataStage?
• Provides codeless visual design of data flows with hundreds of built-in transformation functions
– Optimized reuse of integration objects
– Supports batch & real-time operations
– Produces reusable components that can be shared across projects
• Complete ETL functionality with metadata-driven productivity
• Supports team-based development and collaboration
• Provides integration from across the broadest range of sources
• Extreme data volumes!
Transform: Transform and aggregate any volume of information in batch or real time through visually designed logic
Hundreds of Built-in Transformation Functions
[Diagram labels: WebSphere DataStage®; Transform; Deliver; Architects; Developers]
30
Easy Design of Complex Data Processing
• Graphical, top-down design metaphor with extensive pre-built functions
• Extensible, component-based architecture
• Strong reuse capabilities
• Broad and deep connectivity
• Rapid SOA deployment capability
31
Massive Scalability
• Design serially, deploy in parallel
• Parallel design managed at deployment time
• Supports both parallel pipelining and partitioning
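The two forms of parallelism named above can be sketched in a few lines. Pipelining is shown with generator stages that pass rows along one at a time; partitioning is shown by hash-splitting the input and running the same flow on each partition. This is only a conceptual sketch: all names and sample rows are hypothetical, and Python threads are used for brevity even though they do not give true CPU parallelism.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def partition(rows, key, n):
    """Hash-partition rows so that equal keys always land in the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        index = zlib.crc32(str(row[key]).encode("utf-8")) % n
        parts[index].append(row)
    return parts

def clean(rows):
    """Pipeline stage 1: normalise the customer name."""
    for row in rows:
        yield {**row, "name": row["name"].strip().upper()}

def add_total(rows):
    """Pipeline stage 2: consumes stage 1 row by row and derives a value."""
    for row in rows:
        yield {**row, "total": row["qty"] * row["price"]}

def run_flow(rows):
    """The flow is written once, with no reference to the degree of parallelism."""
    return list(add_total(clean(rows)))

def run_parallel(rows, key, n):
    """The degree of parallelism n is chosen when the flow is run, not designed."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = pool.map(run_flow, partition(rows, key, n))
        return [row for part in results for row in part]

rows = [
    {"cust": "JP003", "name": " jp morgan ", "qty": 2, "price": 10.0},
    {"cust": "AC017", "name": "acme", "qty": 1, "price": 5.0},
    {"cust": "JP003", "name": "JP Morgan", "qty": 4, "price": 2.5},
]
serial = run_flow(rows)
parallel = run_parallel(rows, "cust", 4)
print(len(serial), len(parallel))
```

The same `run_flow` produces the same rows whether it runs on one partition or four, which is the point of "design serially, deploy in parallel".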
32
Metadata-Driven Integration
• Unified metamodel across IBM Information Server
• Active metadata analysis, including diff, impact, and lineage
• Carries forward annotations and analysis from profiling
• Provides instant in-tool access to metadata
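Impact and lineage analysis can both be viewed as graph traversals over "derived from" relationships held in the metadata. The sketch below assumes a hypothetical dictionary `DERIVED_FROM` with made-up asset names; it does not reflect the actual metamodel of IBM Information Server.

```python
from collections import deque

# Hypothetical metadata: each entry reads "target is derived from these sources".
DERIVED_FROM = {
    "dw.customer.full_name": ["stage.customer.first", "stage.customer.last"],
    "stage.customer.first": ["crm.contact.fname"],
    "stage.customer.last": ["crm.contact.lname"],
    "report.top_customers": ["dw.customer.full_name"],
}

def _walk(start, edges):
    """Breadth-first traversal returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def lineage(asset):
    """Lineage: everything the asset is derived from (upstream)."""
    return _walk(asset, DERIVED_FROM)

def impact(asset):
    """Impact: everything that depends on the asset (downstream)."""
    feeds = {}
    for target, sources in DERIVED_FROM.items():
        for source in sources:
            feeds.setdefault(source, []).append(target)
    return _walk(asset, feeds)

print(sorted(lineage("report.top_customers")))
print(sorted(impact("crm.contact.fname")))
```

Lineage answers "where did this report value come from?", while impact answers "what breaks if this source column changes?".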
34
What is IBM WebSphere Federation Server?
• Provides access to diverse & distributed information as if it were in one system
– Single SQL query access to diverse sources
– Provides visual tools for defining federated queries
• Includes industry-leading query optimization with single sign-on, unified views, and function compensation
• Supports transactional write capabilities across heterogeneous sources
• Enables bi-directional data access services to be published in a SOA
Deliver (IBM WebSphere Federation Server):
• Access and integrate heterogeneous information across multiple sources as if they were a single source
• Extend value of existing analytical applications by providing real-time access to integrated information
35
Federated Queries Make Integration as Easy as SQL
SELECT parameters_return_billto_key as BILL_TO_KEY,
       billto_company_name,
       parameters_return_shipto_key as SHIP_TO_KEY,
       CASES_SHIPPED,
       GROSS_SALES,
       URL
FROM GETKEYSSOAP_GETKEYSREALTIME_NN,
     GLOBAL_SALES_TRAN_NN,
     BILLTO_DIMENSION,
     URL_INVOICES
WHERE getkeysrealtime_ship_to_number = '13546'
  and getkeysrealtime_ship_to_number = URL_INVOICES.shipno
  and ltrim(rtrim(translate(ship_to_number, ' ', x'0a'))) = getkeysrealtime_ship_to_number
  and parameters_return_billto_key = billto_key
  and ltrim(rtrim(translate(sales_order_number, ' ', x'0a'))) = URL_INVOICES.orderno;
Single SQL Query Joins:
• Web Service
• XML Documents
• Data Warehouse
• Unstructured Data
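The idea of one SQL statement joining several sources can be tried out on a small scale with SQLite, whose `ATTACH DATABASE` command lets one connection query tables from more than one database. This is only an analogy and not Federation Server: the table names, columns and rows below are invented, and the two in-memory databases merely stand in for a data warehouse and a second source. As in the slide's query, the join value is cleaned of a stray newline before it is compared.

```python
import sqlite3

# The main in-memory database stands in for the data warehouse.
conn = sqlite3.connect(":memory:")
# A second, separate in-memory database stands in for another source.
conn.execute("ATTACH DATABASE ':memory:' AS billing")

conn.execute(
    "CREATE TABLE sales (ship_to_number TEXT, billto_key INTEGER, gross_sales REAL)"
)
conn.execute(
    "CREATE TABLE billing.billto (billto_key INTEGER, company_name TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("13546\n", 1, 1200.0), ("99999", 2, 50.0)],  # first key has a trailing newline
)
conn.executemany(
    "INSERT INTO billing.billto VALUES (?, ?)",
    [(1, "JP Morgan"), (2, "Acme")],
)

# One query joins both databases; newlines are replaced and blanks trimmed
# before the ship-to number is compared.
rows = conn.execute(
    """
    SELECT b.company_name, s.gross_sales
    FROM sales AS s
    JOIN billing.billto AS b ON b.billto_key = s.billto_key
    WHERE trim(replace(s.ship_to_number, char(10), ' ')) = ?
    """,
    ("13546",),
).fetchall()
print(rows)
conn.close()
```

A federation layer extends this idea to heterogeneous sources (web services, XML documents, warehouses) and adds query optimization across them.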
37
Scalability: Three Elements
Three Elements of a Scalable Enterprise
• Scalable Relational Database: The relational database vendors have offered scalable, parallel relational databases for more than 5 years.
• Scalable Hardware: The hardware vendors have offered scalable parallel computers for more than 5 years.
• Scalable Data Integration and Applications: Other than the relational database vendors, IBM is now the only other mainstream software vendor in the data warehousing market offering a scalable software platform with no limitations on throughput and performance.
38
Develop Once - Deploy for Maximum Performance w/Enterprise Edition
39
How Do Parallel Processing Services Work?
[Diagram: the same job deployed on a Uniprocessor; an SMP System; MPP, GRID, and Clustered Systems]
41
How Do Connectivity Services Work?
General Access
Sequential File
Complex Flat File
File / Data Sets
Named Pipe
FTP
Compressed / Encoded Data
External Command Call
Parallel/wrapped 3rd party apps
EMC InfoMover
Web logs
Unstructured: e-mail, docs, etc.
Content Management Systems
Life Sciences
Standards & Real Time
WebSphere MQ
Java Messaging Services (JMS)
Java
XML & XSL-T
EBXML
Web Services (SOAP)
Enterprise Java Beans (EJB)
EDI
FIX
SWIFT
HIPAA
Enterprise Applications
JDE/PeopleSoft EnterpriseOne
Oracle Applications
PeopleSoft Enterprise
SAS
SAP R/3 & BI
SAP XI
Siebel
JDA
Ariba
Manugistics
I2
And more…
Legacy
Allbase/SQL
C-ISAM
D-ISAM
Datacom/DB
DS Mumps
Enscribe
Essbase
FOCUS
IDMS/SQL
ImageSQL
Infoman
KSAM
M204
MS Analysis
Nomad
Nucleus
RMS S2000
Supra
TOTAL
TurboImage
Unify
And many more….
RDBMS
DB2 (on Z, I, P or X series)
Oracle
Informix (IDS and XPS)
Ingres
MySQL
Netezza
Progress
RDB
RedBrick
SQL/DS
SQL Server
Sybase (ASE & IQ)
Teradata
Universe
UniData
NonStopSQL
And more…..
CDC / Replication
DB2 (on Z, I, P, X series)
Oracle
SQL Server
Sybase
Informix
IMS
VSAM
ADABAS
IDMS
NonStopSQL
Enscribe
43
What are Metadata Services?
• A set of shared services that actively manage metadata through the development and runtime process
• Provide shared understanding, acceleration, automation, and operational visibility across roles and technologies
• Provide in-tool metadata visibility and analysis
Roles: Business Users; Data Analysts; Architects; Developers
Shared metadata: Terms & Taxonomies; Business Rules; Data Analysis; Sources & Data Rules; Models; Target Tables; Table Definitions, Transformation Logic, Data Flows
Results: Shared Understanding, Acceleration, Automation; Impact Analysis, Lineage, Operational Insight
45
What Makes Administration Services Different?
Unified Administration
• Unified user management. Benefits: Reduces risk, Reduces administration costs
• Unified logging. Benefits: Reduces debugging time, Reduces administration costs
Powerful Reporting
• Common graphical reporting tool with options for on-screen, HTML, or PDF formats. Benefits: Provides stronger collaboration capabilities, Reduces training requirements
Common Security Model
• Strong, granular security control. Benefits: Reduces risk, Improves control
• LDAP and Active Directory integration. Benefits: Reduces risk, Leverages existing investments, Reduces administration burden
47
What is WebSphere Information Services Director?
• Packages information integration logic as services that insulate developers from underlying sources
• Allows these services to be invoked as Enterprise Java Beans or Web services
• Provides load balancing & fault tolerance for requests across multiple Information Servers
• Provides foundation infrastructure for Information Services
Rapid SOA Deployment: Flexibly deploy and manage reusable information services without hand coding
[Diagram labels: WebSphere Information Services Director; Architects; Developers]
48
Easy, Flexible Service Deployment
• Easy, quick service deployment
• Unified deployment for transformation, quality, and federation
• Supports Web services and EJB bindings with a single service definition, single point of maintenance
• Flexibility in defining service interface, including support for arrays and field defaults and overrides