NLP Tools Multiwords (The MEDLINE N-Gram Set) By: Dr. Chris J. Lu The Lexical Systems Group NLM . LHNCBC . CGSB June, 2015 Lexical Systems Group: http://umlslex.nlm.nih.gov The SPECIALIST NLP Tools: http://specialist.nlm.nih.gov

NLP Tools: Multiwords

(The MEDLINE N-Gram Set)

By: Dr. Chris J. Lu

The Lexical Systems Group

NLM. LHNCBC. CGSB

June, 2015

• Lexical Systems Group: http://umlslex.nlm.nih.gov

• The SPECIALIST NLP Tools: http://specialist.nlm.nih.gov


NLP – Concept Mapping

Normalization:
• A term might have many different variations, such as inflectional variants, spelling variants, synonyms, abbreviations (expansions), cases, ASCII conversion, etc.
• Normalize the different forms of a concept to the same form

Query Expansion:
• Expand a term to its equivalent terms, such as subterm substitution of synonyms, derivational variants, spelling variants, abbreviations, etc.
• To increase recall

POS tagger:
• Assign a part of speech to a single word or multiword in a text
• To increase precision


NLP - Norm

disease hodgkin

• Hodgkin Disease

• HODGKINS DISEASE

• Hodgkin's Disease

• Disease, Hodgkin's

• HODGKIN'S DISEASE

• Hodgkin's disease

• Hodgkins Disease

• Hodgkin's disease NOS

• Hodgkin's disease, NOS

• Disease, Hodgkins

• Diseases, Hodgkins

• Hodgkins Diseases

• Hodgkins disease

• hodgkin's disease

• Disease;Hodgkins

• Disease, Hodgkin

• …

[Diagram: Terms in Corpus → normalize → Normalized String (disease hodgkin) → Index → Indexed Database (C0019829, Hodgkin Disease)]
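The idea behind the variant list above can be illustrated with a deliberately simplified normalizer. This is only a sketch, not the Norm program from the Lexical Tools: the function name `simple_norm`, the "NOS" handling, and the naive plural stripping are assumptions made for illustration.

```python
import re

def simple_norm(term: str) -> str:
    """Toy normalizer (hypothetical): lowercase, drop genitives and
    punctuation, drop 'NOS', strip a plural 's', and sort the words."""
    s = term.lower()
    s = re.sub(r"'s\b", "", s)            # remove genitive: hodgkin's -> hodgkin
    s = re.sub(r"[^a-z0-9\s]", " ", s)    # replace punctuation with spaces
    words = [w for w in s.split() if w != "nos"]  # drop "not otherwise specified"
    # Naive uninflection (assumption): strip one trailing "s" from longer words
    words = [w[:-1] if len(w) > 3 and w.endswith("s") and not w.endswith("ss") else w
             for w in words]
    return " ".join(sorted(words))        # word order is ignored

variants = ["Hodgkin Disease", "HODGKINS DISEASE", "Disease, Hodgkin's",
            "Hodgkin's disease, NOS", "Diseases, Hodgkins", "Disease;Hodgkins"]
normalized = {simple_norm(v) for v in variants}
print(normalized)
```

All six sample variants collapse to the single key "disease hodgkin", which is what allows one index entry to retrieve the concept.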


The SPECIALIST NLP Tools

• The SPECIALIST LEXICON
• Text Tools
• Lexical Tools
• LexBuild
• LVG
• NLP Applications


Lexicon Coverage – by Word Count

Total word count for MEDLINE (2014): 2,725,710,505
The Lexicon covers ~98% of the words in MEDLINE

| Types | Word Count | Percentage | Accumulated % |
|---|---|---|---|
| LEXICON | 2,542,758,048 | 93.2879% | 93.2879% |
| NUMBER | 7,797,019 | 0.2861% | 93.5740% |
| DIGIT | 126,635,190 | 4.6460% | 98.2200% |
| MULTIWORD | 18,549,715 | 0.6805% | 98.9005% |
| NEW | 29,970,533 | 1.0995% | 100.0000% |
| Total | 2,725,710,505 | | |

* Using Element Words to Generate (Multi)Words for the SPECIALIST Lexicon. Lu, Chris J.; Tormey, Destinee; McCreedy, Lynn; and Browne, Allen C. AMIA 2014 Annual Symposium, Washington, DC, November 15-19, 2014, p. 1499


Words in Lexicon

Part of speech, inflection, lexical meaning

• saw|noun|singular|E0054443

• saw|verb|infinitive|E0054444

• saw|verb|past|E0055007

High-frequency co-occurring words? (DC: document count, WC: word count)
• A collocation is a sequence of words or terms that co-occur more often than would be expected by chance
• “non”, DC: 46,138, WC: 46,139
• “study was to”, DC: 592,752, WC: 593,718
• “undergoing cardiac surgery”, DC: 2,589, WC: 3,135
• “adverse cardiac”, DC: 4,405, WC: 5,725
• “in the house”, DC: 1,170, WC: 1,298
• in house|adj|positive|E0555310, DC: 1,681, WC: 2,129


LexBuild Process

Built by linguists

LexBuild: a web-based computer-aided tool

Resources: a list of words (element words)
• Add new lexical records if there is no exact/close match
• Update existing lexical records if related records are found by close match
• Multiwords that contain these words are reviewed through the Essie search engine*, Google Scholar, dictionaries, biomedical publications, domain-specific databases, nomenclature guidelines, books, etc.

* N.C. Ide, R.F. Loane, D. Demner-Fushman, “Essie: A Concept-based Search Engine for Structured Biomedical Text”, JAMIA, Vol. 14, Num. 3, May/June 2007, p. 253-263


N-Gram Model Approach

Get all N-Grams from MEDLINE documents
• No MEDLINE N-Gram set is publicly available

Filter out N-Grams that are invalid words
• Exclusive filter: focus on not dropping recall, and then increase precision

Retrieve word candidates by patterns, rules, etc.
• Inclusive filter: focus on precision

Expert validation
• Very expensive; minimize the manual process

Goal: to bridge the gap between N-grams (statistical co-occurrence) and our term-based Lexicon.


N-Gram

An n-gram is a contiguous sequence of n items from a given sequence of text or speech
• An n-gram of size 1 is referred to as a "unigram"
• Size 2 is a "bigram" (or a "digram")
• Size 3 is a "trigram"
• Larger sizes are sometimes referred to by the value of n, e.g., "four-gram", "five-gram", and so on.

Example: to be or not to be

| N | Name | N-Grams |
|---|---|---|
| 1 | Unigram | to, be, or, not, to, be |
| 2 | Bigram | to be, be or, or not, not to, to be |
| 3 | Trigram | to be or, be or not, or not to, not to be |
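The example can be reproduced with a few lines of code. This is a minimal sketch that assumes whitespace tokenization; the helper name `ngrams` is illustrative.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
```

A sentence of k tokens yields k - n + 1 n-grams, so the six-word example gives 6 unigrams, 5 bigrams, and 4 trigrams.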


MEDLINE N-Gram Set

PMID- 961031
OWN - NLM
STAT- MEDLINE
DA - 19761020
DCOM- 19761020
LR - 20041117
PUBM- Print
IS - 0042-2835 (Print)
VI - 10
IP - 1
DP - 1976 Jan-Feb
TI - Postoperative arrhythmias in open-heart surgery, A study on fifty cases.
PG - 30-7
AB - 50 consecutive patients undergone open heart surgery were analyzed
     regarding postoperative arrhythmias in the first postoperative 3 days.
     Disturbances of rhythm occurred in each case of our group, serious or not
     serious (100%). Ventricular premature beats were the most frequent type of
     arrhythmia in the first and second postoperative days (80%). Two cases
     expired postoperatively. In one of them complete atrioventricular block
     developed after double valvular replacements (mitral and tricuspid). The
     other died of low cardiac output syndrome. Etiology of the arrhythmias
     ...

[Flow diagram: TI & AB → Tokenization → N-Gram Table → Print Out]


MEDLINE N-Gram Set

PMID- 961031
OWN - NLM
STAT- MEDLINE
DP - 1976 Jan-Feb
TI - Postoperative arrhythmias in open-heart surgery, A study on fifty cases.
PG - 30-7
AB - 50 consecutive patients undergone open heart surgery were analyzed
     regarding postoperative arrhythmias in the first postoperative 3 days.
     Disturbances of rhythm occurred in each case of our group, …
JT - Vascular surgery
JID - 0103277

[Flow diagram: TI & AB → Tokenization → N-Gram Table → Print Out]


MEDLINE N-Gram Set

PMID- 961031
TI - Postoperative arrhythmias in open-heart surgery, A study on fifty cases.
AB - 50 consecutive patients undergone open heart surgery were analyzed
     regarding postoperative arrhythmias in the first postoperative 3 days.
     Disturbances of rhythm occurred in each case of our group, …

[Flow diagram: TI & AB → Tokenization → N-Gram Table → Print Out]

| S-ID | Sentences |
|---|---|
| 1 | Postoperative arrhythmias in open-heart surgery, A study on fifty cases. |
| 2 | 50 consecutive patients undergone open heart surgery were analyzed regarding postoperative arrhythmias in the first postoperative 3 days. |
| 3 | Disturbances of rhythm occurred in each case of our group, … |
| … | … |


MEDLINE N-Gram Set


Apply sentence tokenizer on TI and AB
• Check Ending
o Ends with “.”, “?”, “!”
o Not an abbreviation, e.g., U. S. Army
o …
• Check Beginning
o Starts with an uppercase letter or a digit
o Not “:”, “-”, “ ”, “\t”
o ...

Unrecognized sentence pattern (~0.01%)
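The ending and beginning checks can be sketched as a small rule-based splitter. This is an illustration under stated assumptions, not the tokenizer used to build the set: the `ABBREVIATIONS` list and the function name `split_sentences` are hypothetical, and only the rules listed above are modeled.

```python
import re

# Hypothetical abbreviation list; a real tokenizer would need a much larger one.
ABBREVIATIONS = {"U.", "S.", "Dr.", "vs.", "e.g.", "i.e."}

def split_sentences(text):
    """Split at '.', '?' or '!' followed by whitespace, unless the token is a
    known abbreviation or the next sentence does not start with an uppercase
    letter or a digit."""
    sentences, start = [], 0
    for m in re.finditer(r"[.?!]\s+", text):
        end = m.start() + 1                      # include the end punctuation
        last_token = text[start:end].split()[-1]
        next_char = text[m.end():m.end() + 1]
        if last_token in ABBREVIATIONS:          # ending check
            continue
        if not (next_char.isupper() or next_char.isdigit()):  # beginning check
            continue
        sentences.append(text[start:end].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

abstract = ("Two cases expired postoperatively. In one of them complete "
            "atrioventricular block developed after double valvular replacements "
            "(mitral and tricuspid). The other died of low cardiac output syndrome.")
for i, sentence in enumerate(split_sentences(abstract), start=1):
    print(i, sentence)
```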


MEDLINE Unigram

[Flow diagram: TI & AB → Tokenization → N-Gram Table → Print Out]

Sentences (PMID: 961031)

| S-ID | Sentences |
|---|---|
| 1 | Postoperative arrhythmias in open-heart surgery, A study on fifty cases. |
| 2 | 50 consecutive patients undergone open heart surgery were analyzed regarding postoperative arrhythmias in the first postoperative 3 days. |
| 3 | Disturbances of rhythm occurred in each case of our group, … |
| 4 | … |

N-Gram Table: Key, Value (DC, WC)

| Key | DC | WC |
|---|---|---|
| postoperative | 1 | 3 |
| arrhythmias | 1 | 2 |
| in | 1 | 3 |
| open-heart | 1 | 1 |
| surgery | 1 | 1 |
| a | 1 | 1 |
| study | 1 | 1 |
| on | 1 | 1 |
| … | … | … |
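A counting table of this kind can be sketched as follows. The tokenization (lowercasing and stripping punctuation at token edges) and the names `tokenize` and `count_ngrams` are assumptions for illustration; the actual tokenization rules of the MEDLINE N-Gram Set may differ. WC counts every occurrence, while DC counts each document at most once per n-gram.

```python
from collections import Counter
import string

def tokenize(sentence):
    """Lowercase, split on whitespace, strip punctuation at token edges (assumption)."""
    tokens = (w.strip(string.punctuation) for w in sentence.lower().split())
    return [t for t in tokens if t]

def count_ngrams(docs, n=1):
    """docs maps a document id to its list of sentences.
    Returns (DC, WC) counters keyed by n-gram."""
    dc, wc = Counter(), Counter()
    for sentences in docs.values():
        seen_in_doc = set()
        for sentence in sentences:
            toks = tokenize(sentence)
            for i in range(len(toks) - n + 1):   # n-grams never cross sentences
                gram = " ".join(toks[i:i + n])
                wc[gram] += 1
                seen_in_doc.add(gram)
        dc.update(seen_in_doc)                   # one document count per n-gram
    return dc, wc

docs = {"961031": [
    "Postoperative arrhythmias in open-heart surgery, A study on fifty cases.",
    "50 consecutive patients undergone open heart surgery were analyzed regarding "
    "postoperative arrhythmias in the first postoperative 3 days.",
    "Disturbances of rhythm occurred in each case of our group",
]}
dc, wc = count_ngrams(docs, n=1)
for key in ("postoperative", "arrhythmias", "in"):
    print(key, dc[key], wc[key])
```

Calling `count_ngrams(docs, n=2)` on the same input gives the bigram counts, for example DC 1 and WC 2 for "postoperative arrhythmias".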


MEDLINE Bigram

[Flow diagram: TI & AB → Tokenization → N-Gram Table → Print Out]

| S-ID | Sentences |
|---|---|
| 1 | Postoperative arrhythmias in open-heart surgery, A study on fifty cases. |
| 2 | 50 consecutive patients undergone open heart surgery were analyzed regarding postoperative arrhythmias in the first postoperative 3 days. |
| 3 | Disturbances of rhythm occurred in each case of our group, … |
| … | … |

| Key | DC | WC |
|---|---|---|
| postoperative arrhythmias | 1 | 2 |
| arrhythmias in | 1 | 2 |
| in open-heart | 1 | 1 |
| open-heart surgery | 1 | 1 |
| surgery, a | 1 | 1 |
| A study | 1 | 1 |
| study on | 1 | 1 |
| … | … | … |


MEDLINE N-Gram Set

Approach – Split, Group, Filter, Combine, and Sort*

Four-grams (automatic):
• Split MEDLINE documents into 12 sections and get the four-grams for each section
• Group the four-grams from all 12 sections into 10 specified alphabetic ranges, such as a-c, c-e, e-f, etc.
• Apply the WC (> 30) filter to all 10 groups
• Combine all 10 alphabetic-range groups into the N-Gram set
• Sort

[Flow diagram: Split → Group → Filter → Combine → Sort]

* AMIA 2015: Generating the MEDLINE N-Gram Set. Lu, Chris J.; Tormey, Destinee; McCreedy, Lynn; and Browne, Allen C.
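The five steps can be sketched in memory as below. This is a toy illustration, not the production pipeline: the function name `build_ngram_set`, the per-section `Counter` inputs, and the two example ranges are assumptions, and the ranges are assumed to cover every leading character.

```python
from collections import Counter

def build_ngram_set(section_counts, ranges, min_wc=30):
    """section_counts: one Counter (n-gram -> WC) per section (Split step).
    ranges: (low, high) first-character ranges used for grouping."""
    groups = [Counter() for _ in ranges]
    for counts in section_counts:
        for gram, wc in counts.items():
            first = gram[:1].lower()
            for group, (low, high) in zip(groups, ranges):
                if low <= first <= high:          # Group: merge counts by range
                    group[gram] += wc
                    break
    combined = {}
    for group in groups:
        # Filter after grouping, so WC is the total across all sections
        combined.update({g: wc for g, wc in group.items() if wc > min_wc})  # Combine
    return sorted(combined.items())               # Sort

sections = [Counter({"in the present study": 20, "zinc finger protein was": 5}),
            Counter({"in the present study": 15, "zinc finger protein was": 10})]
ngram_set = build_ngram_set(sections, ranges=[("a", "m"), ("n", "z")])
print(ngram_set)
```

Grouping before filtering matters: an n-gram with WC 20 in one section and 15 in another would be dropped by a per-section threshold even though its total of 35 exceeds 30.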


Filter

[Diagram: MEDLINE N-Gram Set → Filter → Enhanced MEDLINE N-Gram Set]

Filter efficiency = trapped terms / total terms

Filter passing rate = pass-through terms / total terms

Good filters have high efficiency and high accuracy

Accuracy test: apply the filters to the Lexicon (a valid word set)

• Accuracy = (TP + TN) / (TP + TN + FP + FN)
  = TP / (TP + FN) ….. TN & FP are 0, because every Lexicon term is valid
  = pass-through terms / total terms
  = pass rate

| | Trap (not retrieved) | Pass (retrieved) |
|---|---|---|
| Valid (relevant) | FN | TP |
| Invalid (not relevant) | TN | FP |
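These ratios can be computed for any filter given as a predicate. The sketch below is illustrative only: `filter_stats`, the stand-in filter `is_digit_only`, and both sample lists are hypothetical.

```python
def filter_stats(is_trapped, terms):
    """Efficiency = trapped / total; pass rate = passed / total."""
    total = len(terms)
    trapped = sum(1 for t in terms if is_trapped(t))
    return {"efficiency": trapped / total,
            "pass_rate": (total - trapped) / total}

def is_digit_only(term):
    """Hypothetical stand-in filter: trap terms made only of digits."""
    return term.replace(" ", "").isdigit()

ngram_sample = ["95", "hodgkin disease", "2", "in house"]          # mixed n-grams
lexicon_sample = ["saw", "in house", "hodgkin disease", "clean room"]  # valid words only

ngram_stats = filter_stats(is_digit_only, ngram_sample)
# On a valid word set, accuracy equals the pass rate (TN and FP are 0)
accuracy = filter_stats(is_digit_only, lexicon_sample)["pass_rate"]
print(ngram_stats, accuracy)
```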


A distilled N-Gram set is obtained by filtering out invalid words.

Apply high-accuracy filters in series (V0 = V1 = … = Vn; I0 > I1 > … > In)

Higher precision with the same recall rate (if the filters have a high accuracy rate)

N-Gram Precision(n) = Vn / (Vn + In)
  = V0 / (V0 + In) ….. Vn is the same as V0 (high accuracy)
  > V0 / (V0 + I0) ….. I0 is bigger than In (high efficiency)

N-Gram Recall(n) = Vn / (Vn + FNn)
  = Vn / (Vn + FN0) ….. FNn is a constant (0), same as FN0
  = V0 / (V0 + FN0) ….. Vn is the same as V0 (high accuracy)

[Diagram: MEDLINE N-Gram Set → serial filters → Enhanced MEDLINE N-Gram Set]

Serial Filters

| | N-Gram | Filter-1 | Filter-2 | … | Filter-N | Distilled |
|---|---|---|---|---|---|---|
| Valid (TP) | V0 | V1 | V2 | … | Vn | Vn |
| Invalid (FP) | I0 | I1 | I2 | … | In | In |
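The effect of serial filtering on precision can be seen on a toy example. Everything here is hypothetical: the sample n-grams, the set of valid terms, and the two filters are invented for illustration, and validity is supplied by hand rather than by expert review.

```python
def run_serial_filters(ngrams, filters, is_valid):
    """Apply filters in order; return (valid, invalid) counts before any
    filter and after each one."""
    stages = []
    current = list(ngrams)
    for trap in [None] + list(filters):
        if trap is not None:
            current = [g for g in current if not trap(g)]
        valid = sum(1 for g in current if is_valid(g))
        stages.append((valid, len(current) - valid))
    return stages

def precision(valid, invalid):
    return valid / (valid + invalid) if valid + invalid else 0.0

VALID = {"hodgkin disease", "in house"}           # hand-labeled (assumption)

def has_digit(gram):
    return any(c.isdigit() for c in gram)

def ends_with_function_word(gram):
    return gram.split()[-1] in {"a", "with", "the"}

sample = ["hodgkin disease", "of a", "95%", "in house", "patients with"]
stages = run_serial_filters(sample, [has_digit, ends_with_function_word],
                            lambda g: g in VALID)
for v, i in stages:
    print(v, i, precision(v, i))
```

The valid count stays at 2 at every stage while the invalid count drops from 3 to 2 to 0, so precision rises from 0.4 to 0.5 to 1.0 with no loss of valid terms.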


General Exclusive Filters

| Filter | Accuracy (875,890) | Pass Rate (N-Gram set) | Accumulated Pass Rate |
|---|---|---|---|
| Pipe | 100.0000% (0) | 100.0000% (6) | 100.0000% |
| Punctuation or space | 100.0000% (0) | 99.9977% (386) | 99.9977% |
| Digit | 99.9999% (1) | 99.3141% (116,772) | 99.3118% |
| Number | 99.9953% (41) | 99.9760% (4,056) | 99.2879% |
| Digit and Stopword | 99.9993% (6) | 99.1595% (142,067) | 98.4534% |

Trapped examples (DC|WC|n-gram):
• Pipe: 38|44|(|r| ; 33|37|Ag|AgCl
• Punctuation or space: 1259147|3690494|= ; 604567|2377864|+/-
• Digit: 1404799|2062240|2 ; 239725|499064|95%
• Number: 2463066|3359594|two ; 18246|20674|first and second
• Digit and Stopword: 3155416|4125616|on the ; 11180|12722|1, 2, and
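The accumulated pass rate is consistent with a running product of the individual pass rates, since each filter sees only what the previous filters passed. A minimal check using the pass rates listed for the general exclusive filters (the function name `accumulated_pass_rates` is illustrative):

```python
from itertools import accumulate
from operator import mul

def accumulated_pass_rates(pass_rates):
    """Running product of per-filter pass rates."""
    return list(accumulate(pass_rates, mul))

# Pipe, Punctuation or space, Digit, Number, Digit and Stopword
pass_rates = [1.000000, 0.999977, 0.993141, 0.999760, 0.991595]
accumulated = accumulated_pass_rates(pass_rates)
for rate in accumulated:
    print(f"{rate:.4%}")
```

Small differences in the last decimal place can appear because the inputs are already rounded.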


Pattern Exclusive Filters

| Filter | Accuracy (875,890) | Pass Rate (N-Gram set) | Accumulated Pass Rate |
|---|---|---|---|
| Parenthetic Acronym - (ACR) | 100.0000% (0) | 99.0232% (163,714) | 97.4917% |
| Indefinite article | 99.9985% (13) | 98.1703% (303,679) | 95.7079% |
| UPPERCASE Colon | 99.9999% (1) | 99.4302% (92,841) | 95.1625% |
| Disallowed punctuation | 99.9978% (19) | 99.3020% (113,073) | 94.4983% |
| Measurement | 99.9967% (29) | 98.1947% (290,421) | 92.7924% |
| Incomplete | 99.9999% (1) | 97.8470% (340,109) | 90.7945% |

Trapped examples (DC|WC|n-gram):
• Parenthetic Acronym - (ACR): 33117|33381|chain reaction (PCR) ; 30095|30315|polymerase chain reaction (PCR)
• Indefinite article: 270384|292590|a case ; 40271|40512|A series
• UPPERCASE Colon: 2069343|2070116|RESULTS: ; 18015|18016|AIM: The
• Disallowed punctuation: 324405|719011|(n = ; 86525|133350|(P < 0.05)
• Measurement: 154905|181001|two groups ; 12160|15197|10 mg/kg
• Incomplete: 482021|1107869|(P ; 25347|25992|years) with


Lead-End-Terms Exclusive Filters

| Filter | Accuracy (875,890) | Pass Rate (N-Gram set) | Accumulated Pass Rate |
|---|---|---|---|
| Absolute Invalid Lead-Term | 99.9947% (46) | 73.0945% (4,158,702) | 66.3658% |
| Absolute Invalid End-Term | 99.9997% (3) | 78.8984% (2,384,059) | 52.3615% |
| Lead-End-Term | 99.9992% (7) | 99.9741% (2,312) | 52.3480% |
| Lead-Term no SpVar | 99.9887% (99) | 85.6678% (1,277,229) | 44.8454% |
| End-Term no SpVar | 99.9975% (22) | 83.1945% (1,283,001) | 37.3089% |

Trapped examples (DC|WC|n-gram):
• Absolute Invalid Lead-Term: 2780043|3451203|of a ; 432921|434591|this study was
• Absolute Invalid End-Term: 1878109|3534031|patients with ; 1062545|1261445|between the
• Lead-End-Term: 2578756|3106139|in a ; 1733|1744|For one
• Lead-Term no SpVar: 658430|708246|to determine ; 533913|554628|In addition,
• End-Term no SpVar: 1009451|1295670|number of ; 726|734|(HPV) in
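A lead-term or end-term filter can be sketched as a lookup of the first and last word against lists of words that cannot start or end a valid multiword. The word lists below are hypothetical, chosen only to cover the trapped examples above; the actual lists behind these filters are not given in the slides.

```python
# Hypothetical word lists for illustration only
INVALID_LEAD_TERMS = {"of", "this", "between", "to"}
INVALID_END_TERMS = {"with", "the", "of", "a"}

def is_trapped_by_lead_term(ngram):
    words = ngram.lower().split()
    return bool(words) and words[0] in INVALID_LEAD_TERMS

def is_trapped_by_end_term(ngram):
    words = ngram.lower().split()
    return bool(words) and words[-1] in INVALID_END_TERMS

def passes_lead_end_filters(ngram):
    return not (is_trapped_by_lead_term(ngram) or is_trapped_by_end_term(ngram))

candidates = ["of a", "this study was", "patients with", "between the",
              "hodgkin disease", "clean room"]
passed = [c for c in candidates if passes_lead_end_filters(c)]
print(passed)
```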


Core-term

Strip initial and/or final punctuation from n-grams by core-term normalization
• Strip initial chars if they are punctuation except for closed brackets
• Strip final chars if they are punctuation except for closed brackets
• Recursively strip closed brackets of (), [], {}, <> at both ends
• Trim

| Input nGram | Core-term |
|---|---|
| -in details | in details |
| in details: | in details |
| (in details:) | in details |
| (in details:)) | in details:) |
| (-(in details)%^) | in details |
| {in (5) days}, | in (5) days |
| ((clean room(s))) | clean room(s) |
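The stripping rules can be sketched as a loop that repeats until nothing changes. This is one reading of the rules that reproduces the table above, not the Lexical Systems Group implementation: the name `core_term` is illustrative, and a bracket pair is removed whenever the first character is an opening bracket and the last is its closing counterpart, without checking that the two actually match each other.

```python
import string

BRACKET_PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}
BRACKET_CHARS = set("()[]{}<>")

def is_strippable(ch):
    """Punctuation other than bracket characters."""
    return ch in string.punctuation and ch not in BRACKET_CHARS

def core_term(ngram):
    s = ngram.strip()
    changed = True
    while changed and s:
        before = s
        while s and is_strippable(s[0]):      # strip leading punctuation
            s = s[1:]
        while s and is_strippable(s[-1]):     # strip trailing punctuation
            s = s[:-1]
        s = s.strip()                         # trim
        # Strip one enclosing bracket pair (simple first/last check)
        if len(s) >= 2 and BRACKET_PAIRS.get(s[0]) == s[-1]:
            s = s[1:-1].strip()
        changed = s != before
    return s

examples = ["-in details", "in details:", "(in details:)", "(in details:))",
            "(-(in details)%^)", "{in (5) days},", "((clean room(s)))"]
for e in examples:
    print(repr(e), "->", repr(core_term(e)))
```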


Future Work

Inclusive Filters
• Parenthetic Acronym Pattern
o computed tomography (CT)
o magnetic resonance imaging (MRI)
o polymerase chain reaction (PCR)
o …
• EndWord Patterns
o Syndrome: migraine syndrome, contiguous gene syndrome, ...
o Center: Heart Information Center, Veteran’s Affairs Medical Center, …
o Disease: Fabry disease, Devic disease, …
o …
• Spelling Variant Patterns (use distilled n-gram set)
o SpVar normalization
o MES (Metaphone, Edit Distance, Sorted Distance)
o ES (Edit Distance and Sorted Distance)

