Remove Parenthesis Plural Forms of (s), (es), and (ies)
By: Chris Lu, Guy Divita, Allen Browne
Date: 12.13.2004
Table of Contents
• Background
• Problems
• Objective
• Methods
• Results
• Future work
Background
Norm:
• is the most commonly used program in lvg
• is used to create the normalized string and word indexes to the UMLS Metathesaurus
• is used to access those indexes in the UMLS Metathesaurus
• includes 10 lvg flows (2004)
Background – Cont.
Norm:
1. Remove genitives
2. Replace punctuation with spaces
3. Remove stop words
4. Strip diacritics
5. Split ligatures
6. Lowercase
7. Uninflect each word
8. Retrieve citation
9. Word sort
10. Retrieve Unicode symbol
Background – Cont.
Plural forms with parentheses
• (s):
  • Accessory finger(s)
  • Addiction, drug(s)
  • Burn of wrist(s) and hand(s)
• (es):
  • Abdomen CT Adrenal Mass(es) Bilateral
  • Provide picture of fetus(es), as appropriate
  • sequelae of; injury, nerve, roots and plexus(es), spinal
• (ies):
  • Donor pneumonectomy(ies) with preparation and maintenance of allograft (cadaver)
  • Orthotic(s) fitting and training, upper extremity(ies), lower extremity(ies), and/or trunk, each 15 minutes
Problems
• No flow in lvg handles this issue
• Can we simply remove (s), (es), (ies) to get the uninflected form without changing the word?
  • (es), (ies): no problem
  • (s): ?
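To make the question concrete, the naive approach can be sketched with one regular expression. This is an illustration only, not lvg code; the name `naive_strip` is made up for this sketch.

```python
import re

# Naive approach: delete every "(s)", "(es)", or "(ies)" wherever it occurs.
PLURAL_MARKER = re.compile(r"\((?:s|es|ies)\)")

def naive_strip(term: str) -> str:
    """Remove all parenthesized plural markers without looking at context."""
    return PLURAL_MARKER.sub("", term)

# Works as intended on ordinary plural forms.
print(naive_strip("Accessory finger(s)"))
print(naive_strip("Abdomen CT Adrenal Mass(es) Bilateral"))
# Damages a term in which (s) is part of the name.
print(naive_strip("Histone H1(s)"))
```

The last call returns a different term, which is why (s) needs context-sensitive handling while (es) and (ies) do not.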
Challenge
How about:
• 1-N-(s)-4-amino-2-hydroxybutyryl-3'4'-deoxyneamine
• 9(s)-erythromycylamine
• anatoxin-b(s)
• Ap(s)pCHClpp(s)A
• Bacillus phage rho11(s)
• Cbz-AAPhepsi((s)-CH(OH)CH2)GlyVV-OMe
• EAV G(s) glycoprotein
• G(s), alpha Subunit
• Histone H1(s)
• J(s)(b) ANTIBODY
• N(alpha)-benzoylarginineamide monohydrochloride, (s)-isomer
• natoxin-a(s)
• Salmonella II 6,7:(g),m,(s),t:1,5
• (s)-(+)-citreofuran
• su(s) protein, Drosophila
• XLalpha(s) protein
• [X]O spontn disrptn/lig(s)knee
• O spontn disrptn/lig(s)knee
Challenge – Cont.
• Do not remove (s) in chemical, protein, gene, mathematical terms, etc.
• Sometimes (s) should be replaced by a space instead of being removed
Objective
• Remove parenthesis plural forms of (s), (es), (ies)
• Do not remove (s) in chemical, protein, gene, etc.
• Replace (s) with a space where appropriate
• Fast performance
• High precision
Scope
• UMLS Metathesaurus: 2.8 M terms
• Lexicon: 0.8 M inflected terms
• Total: 3.6 M terms
• Terms with (s), (es), (ies) patterns: ~ 2800
Methods - Pattern Observation
• XLalpha(s) protein
• su(s) protein, Drosophila
• (s)-(+)-citreofuran
• Salmonella II 6,7:(g),m,(s),t:1,5
• natoxin-a(s)
• N(alpha)-benzoylarginineamide monohydrochloride, (s)-isomer
• J(s)(b) ANTIBODY
• Histone H1(s)
• G(s), alpha Subunit
• EAV G(s) glycoprotein
• Cbz-AAPhepsi((s)-CH(OH)CH2)GlyVV-OMe
• Bacillus phage rho11(s)
• Ap(s)pCHClpp(s)A
• anatoxin-b(s)
• 9(s)-erythromycylamine
• 1-N-(s)-4-amino-2-hydroxybutyryl-3'4'-deoxyneamine
Pattern Observation – (1)
| Sample Term | Word Size | Distance |
|---|---|---|
| 9(s)-erythromycylamine | 1 | 1 |
| Ap(s)pCHClpp(s)A | 2 | 1 |
| EAV G(s) glycoprotein | 1 | 1 |
| G(s), alpha Subunit | 1 | 1 |
| Histone H1(s) | 2 | 1 |
| J(s)(b) ANTIBODY | 1 | 1 |
| N(alpha)-benzoylarginineamide monohydrochloride, (s)-isomer | 0 | 1 |
| (s)-(+)-citreofuran | 0 | 1 |
| su(s) protein, Drosophila | 2 | 1 |

• The word in front of (s) is at most 2 characters long
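The word sizes in this observation can be reproduced with a short sketch. It assumes that a "word" is the run of non-space characters immediately before the first (s); the helper name `word_size_before` is hypothetical.

```python
def word_size_before(term: str, marker: str = "(s)") -> int:
    """Length of the space-delimited word that directly precedes the first marker."""
    idx = term.find(marker)
    if idx < 0:
        raise ValueError(f"{marker!r} not found in {term!r}")
    prefix = term[:idx]
    last_space = prefix.rfind(" ")          # -1 when the prefix has no space
    return len(prefix) - (last_space + 1)

samples = [
    "9(s)-erythromycylamine",
    "Histone H1(s)",
    "(s)-(+)-citreofuran",
    "Accessory finger(s)",                  # ordinary plural, for contrast
]
for term in samples:
    print(term, word_size_before(term))
```

Under this assumption the protected samples give sizes of 0 to 2, while an ordinary plural such as "Accessory finger(s)" gives 6.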
Pattern Observation – (2)
| Sample Term | Character | Distance |
|---|---|---|
| 9(s)-erythromycylamine | Arabic numeral 9 | 1 |
| Bacillus phage rho11(s) | Arabic numeral 1 | 1 |
| Histone H1(s) | Arabic numeral 1 | 1 |

• The character in front of (s) is an Arabic numeral
Pattern Observation – (3)
| Sample Term | Character | Distance |
|---|---|---|
| 1-N-(s)-4-amino-2-hydroxybutyryl-3'4'-deoxyneamine | Punctuation - | 1 |
| anatoxin-b(s) | Punctuation - | 2 |
| Cbz-AAPhepsi((s)-CH(OH)CH2)GlyVV-OMe | Punctuation ( | 1 |
| natoxin-a(s) | Punctuation - | 2 |
| Salmonella II 6,7:(g),m,(s),t:1,5 | Punctuation , | 1 |

• A punctuation character appears 1 or 2 characters in front of (s)
Pattern Observation – (4)
| Sample Term | Pattern | Distance |
|---|---|---|
| Ap(s)pCHClpp(s)A | pp | 1 |
| XLalpha(s) protein | alpha | 1 |

• The word in front of (s) ends with: pp, alpha
Pattern Observation – (5)

| Sample Term | Pattern | Distance |
|---|---|---|
| [X]O spontn disrptn/lig(s)knee | Followed by a word | 1 |
| O spontn disrptn/lig(s)knee | Followed by a word | 1 |

• (s) is followed by an English word
• An English word begins with a letter
• If (s) is followed by a letter, replace (s) with a space
• Exceptions: Ap(s)pCHClpp(s)A, G(s)alpha
Implementation – Wild Cards
Wild Card Definition:
• ^: start, starting mark of the term
• $: end, ending mark of the term right before (s)
• C: any character
• D: any digit, [0-9]
• L: any letter, [a-z]
• P: punctuation, [- ( ,]
• S: space, [ ]
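One way to read these wild cards is as regular-expression character classes applied to the text in front of (s). The sketch below is an assumption-laden illustration, not the lvg implementation: it treats the uppercase symbols as wild cards and any other character in a rule (such as the letters of pp or alpha) as a literal, takes P to mean hyphen, open parenthesis, and comma, and only looks at the first (s) in a term. The names `rule_to_regex` and `rule_matches` are hypothetical.

```python
import re

# Wild cards mapped to regex fragments (assumed reading of the definitions).
WILD_CARDS = {
    "^": r"\A",       # start of the term
    "$": r"\Z",       # end of the text right before (s)
    "C": r".",        # any character
    "D": r"[0-9]",    # any digit
    "L": r"[a-z]",    # any letter
    "P": r"[-(,]",    # punctuation: hyphen, open parenthesis, comma
    "S": r" ",        # space
}

def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a rule such as 'SCC$' into a compiled regular expression."""
    return re.compile("".join(WILD_CARDS.get(ch, re.escape(ch)) for ch in rule))

def rule_matches(rule: str, term: str, marker: str = "(s)") -> bool:
    """True if the rule matches the text in front of the first marker in term."""
    prefix = term[: term.index(marker)]
    return rule_to_regex(rule).search(prefix) is not None

print(rule_matches("SCC$", "Histone H1(s)"))
print(rule_matches("D$", "Accessory finger(s)"))
```

Because every rule ends in $, each match is anchored to the position right before (s), which is what makes a backward (reversed) lookup natural.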
Implementation – Rule Representations

| Pattern | Sample Term | Rule |
|---|---|---|
| 1 | (s)-(+)-citreofuran | ^$ |
| 1 | J(s)(b) ANTIBODY | ^C$ |
| 1 | EAV G(s) glycoprotein | SC$ |
| 1 | su(s) protein, Drosophila | ^CC$ |
| 1 | Histone H1(s) | SCC$ |
| 2 | 9(s)-erythromycylamine | D$ |
| 3 | Salmonella II 6,7:(g),m,(s),t:1,5 | P$ |
| 3 | natoxin-a(s) | PC$ |
| 4 | Ap(s)pCHClpp(s)A | pp$ |
| 4 | XLalpha(s) protein | alpha$ |
| … | … | … |
Implementation – Reversed Trie Tree
[Figure: reversed trie built from the rules. The root is $, the position right before (s); its branches are the wild cards D, P, and C and the literal suffixes pp, alpha, etc., each read backward until ^, S, or the end of the rule.]
• Example: anatoxin-b(s)
[Figure sequence: the same trie, traversed backward from (s) one character at a time: (s), then b(s), then -b(s).]
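The reversed trie can be sketched as nested dictionaries in which every rule is inserted from right to left, so that a lookup walks backward from the character just before (s). This is a minimal sketch under stated assumptions, not the lvg source: it uses only the ten rules listed in the slides, treats uppercase C, D, L, P, S as wild cards and other characters as literals, and the names `build_trie`, `walk`, and `keeps_s` are hypothetical.

```python
RULES = ["^$", "^C$", "SC$", "^CC$", "SCC$", "D$", "P$", "PC$", "pp$", "alpha$"]

# Wild-card tests for a single character (assumed reading of the definitions).
WILD = {
    "C": lambda ch: True,
    "D": lambda ch: ch in "0123456789",
    "L": lambda ch: "a" <= ch <= "z",
    "P": lambda ch: ch in "-(,",
    "S": lambda ch: ch == " ",
}
END = "<end>"  # marks a node where a complete rule ends

def build_trie(rules):
    root = {}  # the root stands for "$", the position right before (s)
    for rule in rules:
        node = root
        for symbol in reversed(rule.rstrip("$")):  # insert the rule backward
            node = node.setdefault(symbol, {})
        node[END] = True
    return root

def walk(node, prefix, pos):
    """Walk backward through prefix from index pos; True if some rule matches."""
    if END in node:
        return True
    if pos < 0:  # ran past the first character: only "^" can match here
        return "^" in node and END in node["^"]
    ch = prefix[pos]
    for symbol, child in node.items():
        if symbol == "^":
            continue
        test = WILD.get(symbol)
        matched = test(ch) if test else symbol == ch
        if matched and walk(child, prefix, pos - 1):
            return True
    return False

TRIE = build_trie(RULES)

def keeps_s(term: str, s_index: int) -> bool:
    """True if the (s) starting at s_index is protected by a rule."""
    prefix = term[:s_index]
    return walk(TRIE, prefix, len(prefix) - 1)

print(keeps_s("anatoxin-b(s)", len("anatoxin-b")))
print(keeps_s("Accessory finger(s)", len("Accessory finger")))
```

For anatoxin-b(s) the walk reads b (matching C) and then - (matching P), which completes the rule PC$. The recursion backtracks because one character can satisfy several children at once, for example a literal p and the wild card C.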
Implementation – Algorithm Flow
1. Start: find (s), (es), and (ies) in the term
2. If the match is not (s), that is, it is (es) or (ies): remove it. End
3. If the match is (s): go through the reversed trie
   • If a pattern matches: keep (s). End
   • If no pattern matches and the following character is a letter: replace (s) with a space. End
   • Otherwise: remove (s). End
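The whole flow can be sketched end to end in a few lines. To stay short, this sketch replaces the reversed-trie lookup with a single regular expression that encodes the ten rules shown in the slides; that substitution, the rule coverage, and the name `strip_plural_markers` are assumptions of this example rather than the actual lvg code.

```python
import re

MARKER = re.compile(r"\((s|es|ies)\)")

# Stand-in for the trie lookup, applied to the text in front of (s):
# ^$, ^C$, ^CC$ | SC$, SCC$ | D$ | P$, PC$ | pp$ | alpha$
KEEP_S = re.compile(r"(?:\A.{0,2}| ..?|[0-9]|[-(,].?|pp|alpha)\Z")

def strip_plural_markers(term: str) -> str:
    def handle(match: re.Match) -> str:
        if match.group(1) != "s":
            return ""                       # (es) and (ies): always remove
        prefix = term[: match.start()]
        if KEEP_S.search(prefix):
            return match.group(0)           # a rule matches: keep (s)
        following = term[match.end(): match.end() + 1]
        if following.isalpha():
            return " "                      # followed by a letter: use a space
        return ""                           # otherwise remove (s)
    return MARKER.sub(handle, term)

for term in [
    "Burn of wrist(s) and hand(s)",
    "Provide picture of fetus(es), as appropriate",
    "O spontn disrptn/lig(s)knee",
    "Ap(s)pCHClpp(s)A",
]:
    print(strip_plural_markers(term))
```

Each decision uses the original term on both sides of the marker, so an earlier replacement in the same term does not change how a later (s) is judged.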
Results
• Remove (s) properly
• Remove (es) properly
• Remove (ies) properly
• Replace (s) with space properly
• A fast, precise, and expandable system
Future Work
• More test cases, update more rules
• Implement this feature in both Norm and LuiNorm
• Apply to (ing), (ed), (en)