1 June 2018 © MARKLOGIC CORPORATION
Kasey Alderete, Senior Product Manager, MarkLogic, @kaseya
Damon Feldman, Solutions Director, MarkLogic, @damonfeldman
Master Data In Minutes with Smart Mastering
SLIDE: 2
Data Integration Aspects: SOLVING THE SILOS PROBLEM
Match different inputs to a single 360° record
Merge duplicates
Increase data quality
Refining the system and correcting mistakes
DISCOVERY, FEEDBACK, ADJUSTMENT
SLIDE: 3
Why MDM?
• Valid reporting and analytics
• Find correct values for Data Quality
• More data beats a better algorithm
• So does correct data
Before mastering:
Equipment | Location | Inspection Failures
PSV-11 | North Sea 1 | 2
PSV-NS-11 | North Sea 1 | 2
PSV-11 | N-Sea 1 | 3
After mastering:
Equipment | Location | Failures
PSV-11 | North Sea 1 | 7
Type 4 pressure sensor valves fail 3x more often in operating temperatures below 15 °C
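The equipment rows on this slide show why unmastered duplicates distort reporting: the same valve appears under three spellings, so no single row carries the true failure count. The sketch below rolls the three rows up to one master record. The alias tables are hypothetical stand-ins for what a matching step would produce; they are not part of the original slides.

```python
from collections import defaultdict

# Rows as they appear in the source systems: (equipment, location, inspection failures).
raw_rows = [
    ("PSV-11", "North Sea 1", 2),
    ("PSV-NS-11", "North Sea 1", 2),
    ("PSV-11", "N-Sea 1", 3),
]

# Hypothetical alias tables; in practice these would come out of the matching step.
EQUIPMENT_ALIASES = {"PSV-NS-11": "PSV-11"}
LOCATION_ALIASES = {"N-Sea 1": "North Sea 1"}


def master_key(equipment, location):
    """Map a raw (equipment, location) pair to its mastered form."""
    return (
        EQUIPMENT_ALIASES.get(equipment, equipment),
        LOCATION_ALIASES.get(location, location),
    )


def failures_by_master(rows):
    """Sum failure counts per mastered (equipment, location) pair."""
    totals = defaultdict(int)
    for equipment, location, failures in rows:
        totals[master_key(equipment, location)] += failures
    return dict(totals)


mastered = failures_by_master(raw_rows)
print(mastered)  # {('PSV-11', 'North Sea 1'): 7}
```

Only after the roll-up does the count of 7 become visible to reporting and analytics.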
SLIDE: 4
MarkLogic Mastering – How?
Incoming data:
{"Person": {"givenName": "Bob", "familyName": "McDougal", "phone": "415 826-3389", "city": "Laguna Beach", "zip": "92652", "favoritePet": "cat"}}
Existing master:
{"Person": {"givenName": "Robert", "familyName": "MacDougal", "phone": "415 826-3389", "city": "Laguna Beach", "zip": "92652", "favoritePet": "dog"}}
Weight | Incoming Data | Existing Master | Match Type
15 | Bob | Robert |
20 | McDougal | MacDougal |
4 | 92652 | 92652 |
2 | Laguna Beach | Laguna Beach |
45 | 415 826-3389 | 415 826-3389 |
SLIDE: 5
MarkLogic Mastering – How?
Weight | Incoming Data | Existing Master | Match Type
5 (was 15) | Bob | Robert | Nickname
3 (was 20) | McDougal | MacDougal | Metaphone
4 | 92652 | 92652 | Exact
0 (was 2) | Laguna Beach | Laguna Beach | Exact but redundant
45 | 415 826-3389 | 415 826-3389 | Exact
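The weighted comparison on this slide can be sketched as a sum of per-field weights, where each field has its own comparison rule. This is a minimal illustration, not the Smart Mastering implementation. The nickname table is hypothetical, and the family-name check uses a plain string-similarity ratio from Python's difflib as a stand-in for the Metaphone comparison named on the slide.

```python
from difflib import SequenceMatcher

incoming = {"givenName": "Bob", "familyName": "McDougal", "phone": "415 826-3389",
            "city": "Laguna Beach", "zip": "92652", "favoritePet": "cat"}
master = {"givenName": "Robert", "familyName": "MacDougal", "phone": "415 826-3389",
          "city": "Laguna Beach", "zip": "92652", "favoritePet": "dog"}

# Hypothetical nickname table; a real deployment would load a thesaurus.
NICKNAMES = {"bob": "robert", "bill": "william"}


def exact(a, b):
    return a == b


def nickname(a, b):
    """True when both names reduce to the same canonical given name."""
    a, b = a.lower(), b.lower()
    return NICKNAMES.get(a, a) == NICKNAMES.get(b, b)


def sounds_similar(a, b, cutoff=0.8):
    """Stand-in for a phonetic (Metaphone) comparison: plain string similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff


# (field, weight, comparison), using the adjusted weights from the slide.
RULES = [
    ("givenName", 5, nickname),
    ("familyName", 3, sounds_similar),
    ("zip", 4, exact),
    ("city", 0, exact),
    ("phone", 45, exact),
]


def match_score(a, b):
    """Add up the weights of every field whose comparison succeeds."""
    return sum(weight for field, weight, same in RULES if same(a[field], b[field]))


score = match_score(incoming, master)
print(score)  # 5 + 3 + 4 + 0 + 45 = 57
```

The city weight of 0 reflects the slide's note that city is redundant once the zip code already matches, and favoritePet is deliberately left out of the rules.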
SLIDE: 6
MarkLogic Data Hubs
Ingest, Your Data Model, Harmonize, Customized Processes, Query/Search, Use
Ingest as-is
• Silos: busted
Agile Modeling
Harmonize each source
Clean, merge, enrich, refine
Operationalize
SLIDE: 7
MarkLogic and MDM
Ingest, Your Data Model, Harmonize, Match, Merge, Query/Search, Use
Discover, Adjust
Ingest as-is
• Silos: busted
• One 360° view
• Weighted score per field
• Additional smart rules
Merge, retaining context
• All values or best value
• Maintain lineage, history
Operationalize
Adjust and Refine
• Accuracy and Recall
• Evaluate merge results
Agile Modeling
Harmonize each source
SLIDE: 8
MarkLogic and MDM
Deduplication
Source Record 1, Source Record 2 -> Single 360° Record
BI and Reporting
Security Policy + Data Lenses
Data Hub
REST, SQL, SPARQL
Ingest, Your Data Model, Harmonize, Match, Merge, Query/Search, Use
Discover, Adjust
Search and Discovery
SLIDE: 9
Best Practices Have Emerged Over Time
DATA INTEGRATION | DATA MASTERING
Multiple US State Health & Human Services agencies
SLIDE: 10
Unfortunately, Traditional MDM Usually Fails
Traditional MDM | Why?
Slow to implement | Big Modeling Up Front (BMUF); map and ETL all the data
Occasional update, slow query | “Query and Scan” is inherently slow
Weak Provenance | Timestamps on every table are difficult to model … and slow to query
Re-engineer the business | MDM + Operational access + Data Warehouse; more silos
Security risk | Each additional system = more risk
Not trusted | Lineage and history lost
SLIDE: 11
Unfortunately, Traditional MDM Usually Fails
Traditional MDM has a 76% Failure Rate
SLIDE: 13
NEW FEATURE
MarkLogic’s Smart Mastering
Fast integration into a Data Hub
Operational (thousands of TPS)
More data than traditional MDM
Track provenance and lineage
Maintain data security
SLIDE: 15
Use Case – HealthCare.gov Master Person Index
The problem
- Person records from multiple sources
- Reporting? Find people? Analytics?
- At massive scale, in real time
Temporal history tracked for all Person updates
Lessons learned
- Use indexes for speed
- Same algorithms for matching and search
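The two lessons on this slide, indexes for speed and one algorithm for both matching and search, can be illustrated with a small inverted index. This is a sketch under stated assumptions: the sample records, the field choice, and the function names are hypothetical, and a plain Python dictionary stands in for database indexes.

```python
from collections import defaultdict

# Hypothetical mastered person records keyed by id.
people = {
    "p1": {"givenName": "Robert", "familyName": "MacDougal", "zip": "92652"},
    "p2": {"givenName": "Lillian", "familyName": "Pollak", "zip": "10001"},
    "p3": {"givenName": "Bob", "familyName": "Smith", "zip": "92652"},
}

INDEXED_FIELDS = ("familyName", "zip")  # assumption: fields worth indexing


def build_index(records):
    """Inverted index: (field, lowercased value) -> set of record ids."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for field in INDEXED_FIELDS:
            value = rec.get(field)
            if value:
                index[(field, value.lower())].add(rec_id)
    return index


def candidates(index, query):
    """Return ids sharing any indexed value with the query.

    The same lookup serves an incoming record during matching and a
    user's search request, so both paths behave consistently.
    """
    found = set()
    for field in INDEXED_FIELDS:
        value = query.get(field)
        if value:
            found |= index.get((field, value.lower()), set())
    return found


index = build_index(people)
hits = candidates(index, {"givenName": "Bob", "familyName": "McDougal", "zip": "92652"})
print(sorted(hits))  # ['p1', 'p3']
```

Only the returned candidates need the more expensive rule-by-rule comparison, which is what keeps matching fast at scale.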
SLIDE: 16
Use Case – Oil and Gas
Keeping everyone safe
- Equipment, inspections, repairs, risk
- Names and locations vary
- PSV-11 (Argentina, Rig 11)
- PS Valve11 (Ireland, Oil Platform 11)
- Analyze by equipment item
- Or model number, or oil rig
Design docs, inspections, work orders, reports
Lesson learned: Data cleansing + fuzzy match
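The lesson "data cleansing + fuzzy match" can be sketched in two steps: normalize each equipment name first, then compare the cleansed forms with a similarity threshold. The cleansing rule (abbreviating "VALVE" to "V") and the 0.9 threshold are assumptions made for illustration, as is treating the slide's two spellings as the same name.

```python
import re
from difflib import SequenceMatcher


def cleanse(name):
    """Normalize an equipment name: uppercase, abbreviate VALVE, drop punctuation."""
    name = name.upper().replace("VALVE", "V")
    return re.sub(r"[^A-Z0-9]", "", name)


def similarity(a, b):
    """String similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a, b).ratio()


def same_equipment_name(a, b, threshold=0.9):
    """Fuzzy-compare two names after cleansing."""
    return similarity(cleanse(a), cleanse(b)) >= threshold


print(cleanse("PSV-11"), cleanse("PS Valve11"))     # PSV11 PSV11
print(same_equipment_name("PSV-11", "PS Valve11"))  # True
print(same_equipment_name("PSV-11", "PSV-12"))      # False
```

Cleansing does most of the work here: the raw strings are only loosely similar, while the cleansed forms are identical. The fuzzy threshold then absorbs residual typos without merging distinct item numbers.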
SLIDE: 17
Use Case – Insurance with Existing MDM
Existing MDM system predating MarkLogic
Slow updates as insured are added
Low query rate
MarkLogic holds the full data set
Claims, doctors, policies
MDM data is “just another” data source
Used to link all other items
Result: Operational Data Hub + External MDM
Raw:
Demographic row
Demographic row
Demographic row
Demographic row
Demographic row
Harmonized:
MasterID | Sys1-ID | Sys2-ID
MasterID | Sys1-ID
MasterID | Sys1-ID | Sys2-ID
MasterID
MasterID | | Sys2-ID
Diagram: Raw Record (src 1), Raw Record (src 2) -> Harmonized Record -> MasterID Set
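When an external MDM system is treated as just another source, its MasterID rows act as a cross-reference that links records from the other systems. The sketch below shows one way such a lookup might work. All identifiers, field names, and sample rows are hypothetical.

```python
# Hypothetical cross-reference rows exported from the external MDM system.
xref_rows = [
    {"masterId": "M1", "sys1": "A-100", "sys2": "B-7"},
    {"masterId": "M2", "sys1": "A-101", "sys2": None},
    {"masterId": "M3", "sys1": None, "sys2": "B-9"},
]

# Hypothetical records already held in the data hub.
hub_records = [
    {"system": "sys1", "id": "A-100", "type": "claim"},
    {"system": "sys2", "id": "B-7", "type": "policy"},
    {"system": "sys2", "id": "B-9", "type": "claim"},
]


def build_lookup(rows):
    """Map (system, local id) -> MasterID for every populated cell."""
    lookup = {}
    for row in rows:
        for system in ("sys1", "sys2"):
            if row[system]:
                lookup[(system, row[system])] = row["masterId"]
    return lookup


def link(records, lookup):
    """Group hub records by the MasterID their source id resolves to."""
    linked = {}
    for rec in records:
        master_id = lookup.get((rec["system"], rec["id"]))
        linked.setdefault(master_id, []).append(rec["type"])
    return linked


linked = link(hub_records, build_lookup(xref_rows))
print(linked)  # {'M1': ['claim', 'policy'], 'M3': ['claim']}
```

Records whose source id has no MasterID would be grouped under `None` in this sketch, which makes unlinked items easy to spot.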
SLIDE: 18
MarkLogic Mastering – Process
Harmonize first
Diagram: Raw Record (src 1), Raw Record (src 2), Raw Record (src 3) -> Harmonized Record
SLIDE: 19
MarkLogic Mastering – Process
Harmonize first
Fast matching during ingest
- Likely matches from the indexes
- Scan each “candidate” match using rules
- Nickname
- Sounds-alike
- Typo
- Wrong-data penalty
Diagram: Raw Record (src 1), Raw Record (src 2), Raw Record (src 3) -> Harmonized Record -> Candidate Master, Candidate Master
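The four rule types listed on this slide (nickname, sounds-alike, typo, wrong-data penalty) can be sketched as a scoring pass over each candidate. This is an illustration only: the weights, the nickname table, and the birthYear field are hypothetical, Soundex stands in for whatever phonetic algorithm a deployment uses, and edit distance of 1 is an assumed definition of "typo".

```python
# Hypothetical nickname table and weights, chosen only for illustration.
NICKNAMES = {"bob": "robert", "bill": "william"}

SOUNDEX_CODES = {ch: digit
                 for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                                        ("L", "4"), ("MN", "5"), ("R", "6"))
                 for ch in letters}


def soundex(name):
    """American Soundex: first letter plus three digits."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        if ch not in "HW":  # H and W do not separate repeated codes
            previous = code
    return (encoded + "000")[:4]


def levenshtein(a, b):
    """Edit distance: minimum inserts, deletes, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def score_candidate(incoming, candidate):
    """Apply nickname, typo, sounds-alike, and wrong-data rules in turn."""
    score = 0
    given_a, given_b = incoming["givenName"].lower(), candidate["givenName"].lower()
    if given_a == given_b:
        score += 6
    elif NICKNAMES.get(given_a, given_a) == NICKNAMES.get(given_b, given_b):
        score += 5  # nickname rule
    family_a, family_b = incoming["familyName"].lower(), candidate["familyName"].lower()
    if family_a == family_b:
        score += 8
    elif levenshtein(family_a, family_b) <= 1:
        score += 6  # typo rule
    elif soundex(family_a) == soundex(family_b):
        score += 5  # sounds-alike rule
    year_a, year_b = incoming.get("birthYear"), candidate.get("birthYear")
    if year_a and year_b and year_a != year_b:
        score -= 10  # wrong-data penalty: both present but contradictory
    return score


incoming = {"givenName": "Bob", "familyName": "McDougal", "birthYear": 1970}
close = {"givenName": "Robert", "familyName": "MacDougal", "birthYear": 1970}
doubtful = {"givenName": "Robert", "familyName": "McDugall", "birthYear": 1985}
print(score_candidate(incoming, close), score_candidate(incoming, doubtful))  # 11 0
```

The penalty matters as much as the positive rules: a candidate with a similar name but a contradictory birth year drops to zero instead of looking like a weak match.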
SLIDE: 20
MarkLogic Mastering – Process
Harmonize first
Fast query on ingest
- Get likely matches using the indexes
- Scan each “candidate” match using rules
- Nickname
- Sounds-alike
- Typo
- Wrong-data penalty
Thresholds for automatic merging
- With configurable merge rules
Diagram: Raw Record (src 1), Raw Record (src 2), Raw Record (src 3) -> Harmonized Record -> Candidate Master, Candidate Master -> New, Updated Master (with lineage and history)
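A threshold-driven merge that keeps lineage can be sketched as a function that returns a new master and appends a lineage entry for each contributing source. The threshold value, the record shape, and the policy names ("all" and "latest") are hypothetical choices for illustration, not Smart Mastering configuration.

```python
from datetime import datetime, timezone

MERGE_THRESHOLD = 50  # hypothetical score at or above which merging is automatic


def merge(master, incoming, source, score, policy="all", now=None):
    """Return an updated master, or the original if the score is too low.

    policy="all" keeps every distinct value per field;
    policy="latest" keeps only the incoming value.
    """
    if score < MERGE_THRESHOLD:
        return master
    now = now or datetime.now(timezone.utc).isoformat()
    merged = {
        "values": {field: list(values) for field, values in master["values"].items()},
        "lineage": list(master["lineage"]),
    }
    for field, value in incoming.items():
        existing = merged["values"].get(field, [])
        if policy == "latest":
            merged["values"][field] = [value]
        elif value not in existing:
            merged["values"][field] = existing + [value]
    # Record which source contributed, with what score, and when.
    merged["lineage"].append({"source": source, "score": score, "mergedAt": now})
    return merged


master = {
    "values": {"givenName": ["Robert"], "favoritePet": ["dog"]},
    "lineage": [{"source": "src 1", "score": None, "mergedAt": "2018-01-01T00:00:00+00:00"}],
}
updated = merge(master, {"givenName": "Bob", "favoritePet": "cat"}, "src 2", 57,
                now="2018-06-01T00:00:00+00:00")
print(updated["values"])        # {'givenName': ['Robert', 'Bob'], 'favoritePet': ['dog', 'cat']}
print(len(updated["lineage"]))  # 2
```

Because the function builds a new master instead of mutating the old one, the previous version remains available as history.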
SLIDE: 22
MarkLogic Mastering – Process
Harmonize first
Fast query on ingest
- Get likely matches using the indexes
- Scan each “candidate” match using rules
- Nickname
- Sounds-alike
- Typo
- Wrong-data penalty
Thresholds for automatic merging
- With configurable merge rules
Human review of sensitive or low-score matches
Diagram: Raw Record (src 1), Raw Record (src 2), Raw Record (src 3) -> Harmonized Record -> Candidate Master, Candidate Master -> New, Updated Master
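Routing between automatic merging and human review can be sketched as a pair of thresholds plus a sensitivity flag. The threshold values and the outcome labels are hypothetical; a real deployment would tune them against reviewed merge results.

```python
AUTO_MERGE_AT = 50  # hypothetical: scores at or above this merge automatically
REVIEW_AT = 20      # hypothetical: scores below this are treated as no match


def route(score, sensitive=False):
    """Decide what happens to a scored candidate pair."""
    if score < REVIEW_AT:
        return "no-match"
    if sensitive or score < AUTO_MERGE_AT:
        return "human-review"
    return "auto-merge"


scored_pairs = [
    {"pair": ("r1", "m1"), "score": 57, "sensitive": False},
    {"pair": ("r2", "m2"), "score": 57, "sensitive": True},
    {"pair": ("r3", "m3"), "score": 30, "sensitive": False},
    {"pair": ("r4", "m4"), "score": 5, "sensitive": False},
]
decisions = {p["pair"]: route(p["score"], p["sensitive"]) for p in scored_pairs}
review_queue = [pair for pair, decision in decisions.items() if decision == "human-review"]
print(review_queue)  # [('r2', 'm2'), ('r3', 'm3')]
```

A high-scoring pair still goes to review when it is flagged sensitive, which keeps people in the loop for the merges where a mistake costs the most.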
SLIDE: 23
MarkLogic Mastering – API First
Enterprise, Enterprise, Enterprise
APIs and Data Lenses to Use Operationally
- Beyond overnight batch processes – though we do that
- Real-time
- Shipping system
- Customer portal
- Analytic dashboards
- Contact preferences
Diagram: Raw Record (src 1), Raw Record (src 2), Raw Record (src 3) -> Harmonized Record -> Master, Master, Master -> APIs: REST, SQL, SPARQL, Export
SLIDE: 25
Smart Mastering: WHAT IT INCLUDES
Extensible framework to match & merge – efficiently address duplicate, incomplete, and partial entities
Out-of-the-box rule configurations, a set of APIs, and a visual interface to get started
Diagram: MarkLogic, Smart Mastering APIs, Demo GUI
SLIDE: 26
NEW FEATURE
Smart Mastering
Smart: Enable – don’t eliminate – humans.
Trusted: Build confidence in curated data via a trust-based process.
Score based on configurable rules
Keep all data in its original form
Track all processing for auditability
Protect data from end to end
SLIDE: 27
THE GOAL
Determine Deal Eligibility for Each Customer
Create a 360° View to determine the best deal available for each car insurance customer…
Is this a preferred customer?
Are there any driving infractions?
Has this person applied for a car policy in the past, possibly with one of our affiliates?
Lillian
Mr. Pollak
SLIDE: 28
MarkLogic Operational Data Hub
VALIDATION
INDEXING
REFERENCE DATA DENORMALIZATION
HARMONIZATION
POLICY APPLICATION
SMART MASTERING
RELATIONAL VIEWS
SEMANTIC VIEWS
TEMPORAL TRACKING
PROVENANCE & LINEAGE
ACCESS
PRIVILEGES & PERMISSIONS
AS IS DATA
CURATION
INGESTION
CURATED DATA
SOURCE 1 DATA
SOURCE N DATA
METADATA
ENVELOPE (ENTITY 1)
ENVELOPE (ENTITY 2)
ENTITY N
MESSAGE BUS
RDBMS
CONTENT FEED
TRANSACTIONAL APPS: OPERATIONAL, SEARCH, SEMANTIC APPS
ANALYTICS & BI: REAL-TIME TRENDS, BI TOOLS
DOWNSTREAM SYSTEMS: ERP PROCESSING, ARTIFICIAL INTELLIGENCE
BUSINESS PROCESSES: DATA SERVICES, MICROSERVICES
BUSINESS APIs
STANDARD INTERFACES
REST, SQL, SPARQL, OPTIC
EXPORT APIs
SLIDE: 29
Unpacking Smart Mastering
Admin Screens
Built-in APIs & Functions: Weighted Query; Merge Algorithms; Process Match, Merge; Lineage
Custom Functions
Harmonize Flow
Match & Merge Configuration: Match Scoring & Rules; Merge Policies & Thresholds
<scoring>
  <add property-name="last-name" weight="8"/>
  <add property-name="first-name" weight="6"/>
  <add property-name="city" weight="3"/>
  <expand property-name="first-name" algorithm-ref="thesaurus" weight="6">
    <thesaurus>/mdm/config/thesauri/first-name-synonyms.xml</thesaurus>
    <distance-threshold>50</distance-threshold>
  </expand>
</scoring>
Matching Issue | Feature | Weight
Bill, William – Nickname | Thesaurus expansion | 6
Chrissy, Krissy – Sounds like | Double metaphone | 5
Same last name & address – Household | Reduction | -5
<merge property-name="address" max-values="1">
  <postal-code prefer="zip+4" />
</merge>
Merge Algorithms
Length
Max Values
Source
Date Time (Recency)
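To make the scoring configuration on this slide concrete, the sketch below parses the same XML with Python's standard library and applies it to two records. It is an illustration of the idea, not the Smart Mastering engine: the in-memory synonym table is a hypothetical stand-in for the thesaurus file named in the config, the distance-threshold element is ignored, and the rule that thesaurus expansion only applies when the exact comparison fails is an assumption.

```python
import xml.etree.ElementTree as ET

SCORING_XML = """
<scoring>
  <add property-name="last-name" weight="8"/>
  <add property-name="first-name" weight="6"/>
  <add property-name="city" weight="3"/>
  <expand property-name="first-name" algorithm-ref="thesaurus" weight="6">
    <thesaurus>/mdm/config/thesauri/first-name-synonyms.xml</thesaurus>
    <distance-threshold>50</distance-threshold>
  </expand>
</scoring>
"""

# Hypothetical in-memory stand-in for the thesaurus file named in the config.
SYNONYMS = {"bill": {"william"}, "william": {"bill"}}


def load_rules(xml_text):
    """Read (property, weight) pairs for exact and thesaurus-expanded rules."""
    root = ET.fromstring(xml_text)
    exact = [(e.get("property-name"), int(e.get("weight"))) for e in root.findall("add")]
    expand = [(e.get("property-name"), int(e.get("weight"))) for e in root.findall("expand")]
    return exact, expand


def score(a, b, exact, expand):
    """Sum weights for exact matches, then for synonym matches on unmatched fields."""
    total = 0
    matched = set()
    for prop, weight in exact:
        if a.get(prop) and a.get(prop, "").lower() == b.get(prop, "").lower():
            total += weight
            matched.add(prop)
    for prop, weight in expand:
        if prop in matched:
            continue  # assumption: expansion applies only when the exact rule missed
        if b.get(prop, "").lower() in SYNONYMS.get(a.get(prop, "").lower(), set()):
            total += weight
    return total


exact_rules, expand_rules = load_rules(SCORING_XML)
incoming = {"first-name": "Bill", "last-name": "Smith", "city": "Boston"}
existing = {"first-name": "William", "last-name": "Smith", "city": "Boston"}
print(score(incoming, existing, exact_rules, expand_rules))  # 8 + 3 + 6 = 17
```

Keeping the weights in configuration rather than code is what lets them be adjusted after reviewing match results, without redeploying the matching logic.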
SLIDE: 30
SMART MASTERING
Faster MDM Without Buying MDM
Fast integration into a multi-model data hub
Lightweight matching and merging
Configuration-driven, with extension points
Track provenance and lineage
Maintain data security
https://github.com/marklogic-community/smart-mastering-core
Future updates: Roadmap in progress: non-person entities, enhanced scoring intelligence, iterative mastering