Page 1: Andrew Borthwick, PhD § Vikki Papadouka, PhD, MPH* Deborah Walker, PhD*

The NY Citywide Immunization Registry’s MEDD De-Duplication Project

Andrew Borthwick, PhD§

Vikki Papadouka, PhD, MPH*

Deborah Walker, PhD*

*New York City Department of Health

[email protected]@dohlan.cn.ci.nyc.ny.us

§ ChoiceMaker Technologies, Inc.

[email protected]

Adapted from a presentation at the 34th National Immunization Conference, Washington, DC, July 7, 2000

ChoiceMaker Technologies

Page 2: The NYC CIR

New York Citywide Immunization Registry: The MEDD De-duplication Project

The New York Citywide Immunization Registry (CIR) was mandated in January 1997

All health-care providers are required to submit immunizations

Goals of the system:
    Doctors look up kids’ immunization statuses to determine which shots to give
    Notify parents when their children are due for an appointment
    Identify citywide immunization trends

Similar registries are being built at the state and local level around the country

Page 3: NYC CIR Background

About 122,000 children are born in NYC every year

Each month the CIR receives:
    50,000-100,000 patient records and
    80,000-200,000 immunization records
    from more than 1,100 institutions and private providers

Given this volume, hand-matching each new record before it enters the CIR is unrealistic

Page 4: NYC CIR: Background

Contains 1.8 million records

Very high duplication rate, estimated at 3 records for every 2 children, because of very strict criteria for automatic merging

During April-September 1998, CIR staff reviewed and manually de-duplicated about 260,000 record pairs, which took 1,700 hours

Page 5: MEDD: What it is

A system for deciding when two records represent the same child

Fast and accurate

Replicates the human decision-making process

Page 6: MEDD’s Decision-Making Process

For every record pair, MEDD computes a probability between 0 and 100% that the pair should be merged
    High probabilities: “merge”
    Low probabilities: “don’t merge”
    Intermediate probabilities (close to 50%) indicate “don’t know” and require human review

Thresholds dividing the merge / don’t know / don’t merge cases are set by the user
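The three-way decision described on this slide can be sketched in a few lines. This is only an illustration: the function name `classify` and the default threshold values are assumptions (the slides say only that the user sets the thresholds), not MEDD's actual interface.

```python
def classify(p_merge, merge_threshold=0.98, no_merge_threshold=0.05):
    """Map a merge probability to one of three decisions.

    The default thresholds are hypothetical; in MEDD they are user-set.
    """
    if not 0.0 <= p_merge <= 1.0:
        raise ValueError("p_merge must be between 0 and 1")
    if p_merge >= merge_threshold:
        return "merge"
    if p_merge <= no_merge_threshold:
        return "don't merge"
    # Everything between the two thresholds goes to a human reviewer.
    return "don't know"
```

Moving the two thresholds closer together sends fewer pairs to human review at the cost of more automatic errors, which is the effort vs. accuracy trade-off discussed later in the deck.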

Page 7: Maximum Entropy Modeling

MEDD uses “Maximum Entropy Modeling”
    A new statistical decision-making technique
    Learns the human judgment process by training from examples
    Has been used in sentence parsing, computer vision, financial modeling, and proper-name identification
    Has achieved state-of-the-art results on these problems

Page 8: Maximum Entropy Modeling: Features

Maximum Entropy uses “Features”
    Feature = a function which looks at specific fields in the pair of records to make a “merge” or “don’t merge” decision
    MEDD has many different features, each of which is assigned a “weight” during training

Page 9: Sample MEDD Features

Mother’s Birthday
    Match of Mom’s B’day predicts “Merge”
    Mismatch of Mom’s B’day predicts “No-Merge”
    Neither feature fires if Mom’s B’day wasn’t filled in on both records
        We have no evidence in this case

Many other features
    Child’s birthday
    Child’s first and last name
    Medicaid Number
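As a rough illustration of the mother's-birthday features, the sketch below models each feature as a function that returns True when it fires. The dictionary record layout and the field name `mother_dob` are assumptions made for this example, not the CIR's schema.

```python
def _both_present(rec1, rec2, field):
    """True only when the field is filled in on both records."""
    return bool(rec1.get(field)) and bool(rec2.get(field))


def mom_bday_match(rec1, rec2):
    """Fires (predicting "merge") when both records carry the same mother's DOB."""
    return (_both_present(rec1, rec2, "mother_dob")
            and rec1["mother_dob"] == rec2["mother_dob"])


def mom_bday_mismatch(rec1, rec2):
    """Fires (predicting "no-merge") when both records carry different mother's DOBs."""
    return (_both_present(rec1, rec2, "mother_dob")
            and rec1["mother_dob"] != rec2["mother_dob"])
```

When the field is missing on either record, neither function fires, so the pair contributes no evidence from this field in either direction.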

Page 10: Training the System

[Diagram: record pairs hand-marked with merge/no-merge decisions, together with a set of features, feed the Maximum Entropy Parameter Estimator, which produces a weight for each feature]
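The slides do not say which estimation algorithm MEDD uses. As a minimal sketch of what "hand-marked pairs plus features in, one weight per feature out" can look like, the code below fits multiplicative weights by plain gradient ascent on the log-likelihood, a technique swapped in here for illustration. It relies on the fact that, with two outcomes, Merge / (Merge + NoMerge) equals a logistic function of the summed log-weights. The names `train_weights` and `merge_side`, and the hyperparameters, are assumptions.

```python
import math


def train_weights(examples, merge_side, epochs=500, lr=0.5):
    """Fit one multiplicative weight per feature from labeled pairs.

    examples:   list of (fired_feature_indices, label), label 1 = merge.
    merge_side: merge_side[i] is True if feature i predicts "merge",
                False if it predicts "no-merge".
    """
    log_w = [0.0] * len(merge_side)
    for _ in range(epochs):
        grad = [0.0] * len(merge_side)
        for fired, label in examples:
            # log(Merge) - log(NoMerge) for this pair
            score = sum(log_w[i] if merge_side[i] else -log_w[i] for i in fired)
            p_merge = 1.0 / (1.0 + math.exp(-score))
            for i in fired:
                sign = 1.0 if merge_side[i] else -1.0
                grad[i] += sign * (label - p_merge)
        for i in range(len(log_w)):
            log_w[i] += lr * grad[i] / len(examples)
    return [math.exp(v) for v in log_w]
```

For example, if a "merge" feature fires on four training pairs and three of them were hand-marked as merges, its fitted weight approaches 3, so that the feature alone yields a merge probability of 3 / (3 + 1) = 75%.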

Page 11: Probability Computation

Merge = product of weights of all features predicting “merge” for the pair

NoMerge = product of weights of all features predicting “no merge” for the pair

For a pair of records, MEDD computes the probability that the pair should be merged as:

\[ P(\text{merge}) = \frac{\text{Merge}}{\text{NoMerge} + \text{Merge}} \]
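The computation on this slide can be written directly as a short function. The name `merge_probability` is an assumption for illustration; the formula is the one stated above.

```python
from math import prod


def merge_probability(merge_weights, no_merge_weights):
    """Probability that a pair should be merged, given the weights of the
    features that fired on each side. An empty side has product 1."""
    merge = prod(merge_weights)
    no_merge = prod(no_merge_weights)
    return merge / (no_merge + merge)


# Weights from the Smith / Emily-Emely example in this deck.
p = merge_probability(
    [1.153, 4.708, 1.138, 4.342, 1.103, 3.013, 6.587],  # "merge" features
    [1.350, 2.130],                                      # "no-merge" features
)
print(f"{p:.3f}")  # prints 0.995
```

When no feature fires on either side, both products are 1 and the function returns 0.5, which corresponds to the "don't know" region.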

Page 12: High Probability. Human Decision: Merge

| Field Name | Record 1 | Record 2 | Feature | Weight | Prediction |
|---|---|---|---|---|---|
| Last name | Smith | Smith | Match | 1.153 | Merge |
| First name | Emily | Emely | No-match | 1.350 | No-merge |
| | | | Soundex | 4.708 | Merge |
| DOB | [04/28/97] | [04/28/97] | Match | 1.138 | Merge |
| Multiple birth | N | N | | | |
| Mom’s Maiden Name | CRUZ (one record only) | | | | |
| Mother’s DOB | 12/04/76 (one record only) | | | | |
| Street | 4528 3rd Ave | 4528 3rd Ave | Match | 4.342 | Merge |
| City | Bronx | Bronx | Match | 1.103 | Merge |
| State | NY | NY | | | |
| Zip | 10462 | 10462 | Match | 3.013 | Merge |
| Phone | 718-123-4567 | 718-123-6789 | No-match | 2.130 | No-merge |
| Med Rec Number | 11856437503 | 11856437503 | Match | 6.587 | Merge |

Merge total = 587.2

No-merge total = 2.9

\[ \frac{587.2}{2.9 + 587.2} = 0.995 \]

MEDD predicts “Merge” with 99.5% confidence

Page 13: Low Probability. Human Decision: No-Merge

| Field Name | Record 1 | Record 2 | Feature | Weight | Prediction |
|---|---|---|---|---|---|
| Last name | Lopez | Lopez | Match | 1.153 | Merge |
| First name | Girl | Susan | | | |
| DOB | [1/11/97] | [1/2/97] | No-match | 28.949 | No-merge |
| Multiple birth | N | N | | | |
| Mom’s Maiden Name | Lopez (one record only) | | | | |
| Mother’s DOB | | | | | |
| Street | 987 Cornelia | 456 Park | No-match | 2.937 | No-merge |
| City | Brooklyn | Brooklyn | Match | 1.103 | Merge |
| State | NY | NY | | | |
| Zip | 11211 | 11211 | Match | 3.013 | Merge |
| Phone | 718-123-4567 | 718-234-5678 | No-match | 2.130 | No-merge |
| Med Rec Number | 1001002 | 567435 | | | |

Merge total = 3.8

No-merge total = 181.1

\[ \frac{3.8}{181.1 + 3.8} = 0.021 \]

MEDD predicts “No-merge” with 97.9% confidence

Page 14: Intermediate Probability. Human Decision: Merge

| Field Name | Record 1 | Record 2 | Feature | Weight | Prediction |
|---|---|---|---|---|---|
| Last name | Hernandez | Hernandez | Match | 1.153 | Merge |
| First name | Boy | David | | | |
| DOB | [2/14/97] | [2/14/97] | Match | 1.138 | Merge |
| Multiple birth | N | N | | | |
| Mom’s Maiden Name | Hernandez (one record only) | | | | |
| Mother’s DOB | 11/4/78 (one record only) | | | | |
| Street | 142 4th Ave | 142 4th Ave | Match | 4.342 | Merge |
| City | Bronx | Bronx | Match | 1.103 | Merge |
| State | NY | NY | | | |
| Zip | 11051 | 11052 | No-match | 2.551 | No-merge |
| Phone | 718-524-4879 | 718-524-4878 | No-match | 2.130 | No-merge |
| Med Rec Number | 1001002 | 567435 | | | |

Merge total = 6.3

No-merge total = 5.4

\[ \frac{6.3}{5.4 + 6.3} = 0.539 \]

MEDD predicts “Merge” with 53.9% confidence (human review)

Page 15: Sophisticated MEDD features: Name Frequency

Name Frequency
    “Rodriguez” is 9 times more common than “Walker” in NYC
    Fewer than 3 kids per year are born with the names “Borthwick” and “Papadouka”
    Hence we build features categorizing names as “very common”, “somewhat common”, “very rare”, etc.
    Given that we have a name match, the fact that the names are very common is a feature predicting “don’t merge”
    A match between rare names is a feature predicting “merge”
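A name-frequency feature along these lines might look like the sketch below. The bucket cutoffs, the birth counts, and the function names are all hypothetical; the slides give only the category labels and the direction each one predicts.

```python
def frequency_bucket(births_per_year, common_cutoff=500, rare_cutoff=3):
    """Assign a name to a frequency category (cutoffs are made up)."""
    if births_per_year >= common_cutoff:
        return "very common"
    if births_per_year < rare_cutoff:
        return "very rare"
    return "somewhat common"


def name_frequency_feature(name1, name2, births_per_year):
    """Return the prediction of the feature that fires, or None.

    Fires only when the two names match. Names absent from the count
    table are treated as very rare, which is an assumption of this sketch.
    """
    if name1.lower() != name2.lower():
        return None
    bucket = frequency_bucket(births_per_year.get(name1.lower(), 0))
    if bucket == "very common":
        return "don't merge"
    if bucket == "very rare":
        return "merge"
    return None


# Hypothetical annual birth counts, for illustration only.
counts = {"rodriguez": 1800, "walker": 200, "borthwick": 1}
```

With these made-up counts, a match on "Rodriguez" yields a "don't merge" feature, a match on "Borthwick" yields a "merge" feature, and a match on "Walker" fires neither.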

Page 16: Sophisticated MEDD features: Partial Name Match

Soundex: A phonetic representation of names
    Connor = Conor = Conner = CNR
    When the Soundex representations of two names match, a feature fires predicting “merge”

Edit Distance: Features firing based on two names having an edit distance of 1
    Borthwich Borthwick Bortwick
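Both partial-match tests are standard algorithms and can be sketched as follows. The code uses conventional American Soundex, which encodes the three spellings above as "C560" rather than the simplified "CNR" shown on the slide, and Levenshtein edit distance; whether MEDD uses exactly these variants is not stated in the slides.

```python
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Standard four-character American Soundex code."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # vowels reset prev, so a repeated code after a vowel counts
    return (letters[0].upper() + "".join(digits) + "000")[:4]


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]
```

Here "Borthwich" and "Bortwick" are each one edit away from "Borthwick" (a substitution and a deletion, respectively), and "Emily" and "Emely" share a Soundex code, which is the situation in the first worked example of the deck.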

Page 17: Special Situation Features

Every database has its quirks
    HMO XYZ always sends its data to the CIR with Day of Birth = “1”
        Birthday = July 1, 1998, not July 15, 1998

We have a special feature:
    If Provider = “HMO XYZ” AND Day of Birth = 1 AND dates differ only on day of birth, THEN predict merge

We plan to allow users to define these types of features themselves
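The rule above translates almost word for word into code. In this sketch the record layout (a dict with `provider` and `dob` keys) and the function name are assumptions; "HMO XYZ" is the placeholder provider name used on the slide.

```python
from datetime import date


def hmo_day_one_feature(rec1, rec2, provider="HMO XYZ"):
    """Fires (predicting "merge") when one record comes from the provider
    that always reports day 1, and the two DOBs differ only in the day."""
    for defaulted, other in ((rec1, rec2), (rec2, rec1)):
        d1, d2 = defaulted["dob"], other["dob"]
        if (defaulted["provider"] == provider
                and d1.day == 1
                and (d1.year, d1.month) == (d2.year, d2.month)
                and d1.day != d2.day):
            return True
    return False


hmo_record = {"provider": "HMO XYZ", "dob": date(1998, 7, 1)}
clinic_record = {"provider": "Clinic A", "dob": date(1998, 7, 15)}
print(hmo_day_one_feature(hmo_record, clinic_record))  # prints True
```

The check is applied in both directions so that it does not matter which of the two records came from the provider in question.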

Page 18: Test Procedure

MEDD was tested on c. 3,000 pairs under NYC DOH supervision
    Pairs were carefully hand-scored by NYC DOH as Merge/Don’t Merge
    ChoiceMaker never saw the test data

Page 19: MEDD Evaluation Results

| Requested Accuracy | % of Records Needing Human Review |
|---|---|
| 1% False Positive, 1% False Negative | 1.4% |
| 0.5% False Positive, 0.5% False Negative | 2.6% |
| 0.3% False Positive, 0.3% False Negative | 3.2% |

Even with double-checking, human error rate is no better than 0.3%
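One way to compute the quantities in this table from a hand-scored test set is sketched below. The slides do not define the denominators, so this example assumes all three rates are fractions of the total number of test pairs; the function name and data layout are likewise assumptions.

```python
def evaluate(scored_pairs, merge_threshold, no_merge_threshold):
    """scored_pairs: list of (p_merge, human_says_merge) tuples.

    Returns false-positive, false-negative, and human-review rates,
    each as a fraction of all pairs (an assumption of this sketch).
    """
    n = len(scored_pairs)
    false_pos = false_neg = review = 0
    for p_merge, human_says_merge in scored_pairs:
        if p_merge >= merge_threshold:
            false_pos += not human_says_merge   # auto-merged, but human said no
        elif p_merge <= no_merge_threshold:
            false_neg += human_says_merge       # auto-rejected, but human said yes
        else:
            review += 1                         # "don't know": sent to a human
    return {
        "false_positive_rate": false_pos / n,
        "false_negative_rate": false_neg / n,
        "review_rate": review / n,
    }
```

Sweeping the two thresholds until the error rates fall to a requested level, then reading off the review rate, would produce rows like those in the table.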

Page 20: Summary: What MEDD Offers

Can be trained on just 3,000 record pairs

Judges nearly 1,000 record pairs per second

Achieves very high accuracy by finding the optimal weighting of the different clues (“features”) indicating merge/don’t merge

Says “merge”, “don’t merge”, or “I don’t know”

Can be rigorously tested

Registry management can make informed judgments regarding the effort vs. accuracy trade-off

Page 21: The 5 Stages of the De-duplication Process

1. “Blocking”: Identify list of possible duplicates (SmartSearch)

2. “Decision-Making”: Identify a definitive list of duplicate records (MEDD)

3. Human Review of
    a. Records marked as “don’t know” by MEDD
    b. Records held by special filters (twins, scanty records, etc.)

4. Linkage: Link records that belong to the same child together (if A=B and B=C then A=C)

5. Update the CIR
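The linkage stage (if A=B and B=C then A=C) amounts to grouping records into connected components. A union-find structure is one common way to do this; the slides do not say how the CIR implements it, so the sketch below is only an illustration with assumed names.

```python
def link_records(merge_pairs):
    """Group record IDs into clusters, given pairs judged to be the same child."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in merge_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for rec in parent:
        groups.setdefault(find(rec), set()).add(rec)
    return sorted(sorted(group) for group in groups.values())


print(link_records([("A", "B"), ("B", "C"), ("D", "E")]))
# prints [['A', 'B', 'C'], ['D', 'E']]
```

Records A and C end up in the same cluster even though that pair was never compared directly.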

Page 22: Project Avalanche

Project Avalanche: A project by which we systematically de-duplicate the whole CIR by comparing every record to every record meeting certain criteria

Uses our querying tool SmartSearch and our de-duplication tool MEDD

Project Avalanche I: February-April 2000

Project Avalanche II: May-July 2000

Page 23: Project Avalanche I

Used strict blocking criteria for finding possible duplicates to be passed on to MEDD, such as:
    Exact match on DOB + Medical Record, or
    Exact match on Medicaid number, or
    First name + gender + DOB + last name = maiden name (and vice versa), or
    Last name + first name + DOB

Used 98% as the cut-off for automatic merging

Hand-reviewed records produced by the filters
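Blocking of this kind is often implemented by deriving keys from each record and comparing only records that share a key. The sketch below covers three of the four criteria listed (the maiden-name cross-match is left out for brevity). The field names and functions are assumptions, and this is not a description of SmartSearch itself.

```python
from collections import defaultdict
from itertools import combinations


def blocking_keys(rec):
    """Keys under which a record is filed; records sharing a key are compared."""
    keys = []
    if rec.get("dob") and rec.get("med_rec"):
        keys.append(("dob+medrec", rec["dob"], rec["med_rec"]))
    if rec.get("medicaid"):
        keys.append(("medicaid", rec["medicaid"]))
    if rec.get("last") and rec.get("first") and rec.get("dob"):
        keys.append(("name+dob", rec["last"].lower(), rec["first"].lower(), rec["dob"]))
    return keys


def candidate_pairs(records):
    """Return index pairs (i, j), i < j, of records sharing any blocking key."""
    buckets = defaultdict(list)
    for idx, rec in enumerate(records):
        for key in blocking_keys(rec):
            buckets[key].append(idx)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(ids, 2))
    return sorted(pairs)
```

Only the pairs returned by `candidate_pairs` would be passed on for scoring, which avoids comparing every record against every other record in the registry.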

Page 24: Project Avalanche I: Results

| Cohort | # of Records Before A1 | # of Records After A1 | # Dups* | Dups removed (#) | Dups removed (%) |
|---|---|---|---|---|---|
| 1996 | 203,000 | 187,000 | 68,000 | 16,000 | 25 |
| 1997 | 216,000 | 195,000 | 81,000 | 21,000 | 26 |
| 1998 | 208,000 | 184,000 | 73,000 | 24,000 | 32 |
| 1999 | 158,000 | 143,000 | ? | 15,000 | ? |
| TOTAL | 785,000 | 709,000 | 223,000 | 77,000 | avg=28 |

* Estimated

Page 25: Project Avalanche II

In April 2000 we loaded 4 months’ worth of data that were held due to Y2K problems

Used more liberal blocking criteria:
    Medical Record Number +
        month and year of DOB, or
        day and year of DOB, or
        day and month of DOB, or
        first name

Used 90% as the cut-off for automatic merging

Currently hand-reviewing records produced by the filters

Page 26: Project Avalanche II: Results

| Cohort | # of Records Before A2 | # of Records After A2 | # Dups* | Dups removed (#) | Dups removed (%) |
|---|---|---|---|---|---|
| 1996 | 190,000 | 182,000 | 55,000 | 9,000 | 16 |
| 1997 | 196,000 | 183,000 | 61,000 | 13,000 | 22 |
| 1998 | 206,000 | 182,000 | 71,000 | 24,000 | 34 |
| 1999 | 210,000 | 182,000 | 75,000 | 28,000 | 37 |
| TOTAL | 802,000 | 728,000 | 262,000 | 74,000 | avg=27 |

* Estimated

Page 27: Project Avalanche: Discussion

Using a very conservative cut-off for automatic merging, we reduced the duplicates by about 27.5% each time, and by more than 30% including human review

As a result of Project Avalanche, 81% of records now have immunizations vs. 58% six months ago

Since MEDD is not yet implemented on the front end of the CIR, you don’t see the total number of duplicates decreasing over time in these early runs

Page 28: Future of MEDD at the CIR

As part of the Lead and CIR integration, MEDD will be inserted on the front end, thus reducing the number of duplicates being created

Improving MEDD’s performance will enable us to automatically merge more duplicates with the same error rate

We will continue with Project Avalanche until we bring the duplication rate down to an acceptable level

Page 29: Summary: ChoiceMaker Status

Currently have two employees
    Andrew Borthwick, Ph.D.
    Prof. Arthur Goldberg

Have several major contracts with New York City Dept. of Health

Good prospects of finding similar work with other state and municipal health departments

Page 30: Summary: De-duplication Marketplace

Immunization Registries have very difficult duplicate record problems

Many others have similar problems
    Medical researchers (correlating birth certificate and maternal death records)
    Banks, phone companies (correlating clients from different lines of business)
    Direct marketers (merging mailing lists)

Page 31: Summary: ChoiceMaker’s Plans

Do further research to decrease the amount of consulting time needed to deploy MEDD

Seeking first-round investors to fund expansion of R&D and marketing

Have an opening for someone with an M.S. in C.S. or similar qualifications, starting 10/1/2000, and a C.S. Ph.D. starting 11/1/2000

