INSTITUTE OF INFORMATION AND
COMMUNICATION TECHNOLOGIES
BULGARIAN ACADEMY OF SCIENCES
Automatic Information Extraction from Patient Records
in Bulgarian Language
http://www.iict.bas.bg/acomin
6/27/2013
Galia Angelova
Institute of Information and Communication Technologies
Bulgarian Academy of Sciences
AComIn: Advanced Computing for Innovation
Outline
• First: thanks a lot for the invitation!
• Automatic analysis of medical texts,
extraction of patient-related data - WHY?
• Specific features of Bulgarian clinical
narratives and hospital discharge letters
• Achievements
• Current work
• Conclusion
• Acknowledgements
Biomedical NLP (Natural Language Processing)
• Started in the 1980s in the USA
• Task 1: automatic encoding of diagnoses,
procedures, …, described in clinical texts
• A hot topic in the “secondary use” of EHR data
Bulgarian clinical texts
• The Latin terminology is a real challenge
• Mixture of medical terminology in Latin and
Bulgarian. Example:
Angiosclerosis vas. retinae hypertonica. Начални
промени по типа на диабетна ретинопатия.
(The Bulgarian part reads: “Initial changes of the diabetic retinopathy type.”)
• Latin terms transliterated with Cyrillic letters
Диагноза: Хипотиреоидизмус постоператива
компенсата.
(In Latin script: “Diagnosis: Hypothyreoidismus postoperativa compensata.”)
Bulgarian clinical texts
• About 1/4 non-Bulgarian wordforms and
strings (without counting misspellings)
Bulgarian clinical texts
• Corpus of 6200 anonymised hospital discharge
letters of diabetic patients
Terms              Wordforms   Basic words    Abbreviations
Bulgarian          601 233     12 009 (63%)   1 471
Latin              18 926      560 (3%)       1 189
Transliterations   179 589     6 465 (34%)    982
Total              799 748     19 034         3 642
Note on the Bulgarian row: >50% “unknown”
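The percentages on these slides follow from the counts in the table. A minimal Python sketch that recomputes them; the counts are copied from the table, while the variable and function names are chosen here purely for illustration:

```python
# Counts copied from the corpus table (6200 discharge letters).
wordforms = {"Bulgarian": 601_233, "Latin": 18_926, "Transliterations": 179_589}
basic_words = {"Bulgarian": 12_009, "Latin": 560, "Transliterations": 6_465}


def shares(counts):
    """Return each category's share of the column total, in percent."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


basic_word_shares = shares(basic_words)
wordform_shares = shares(wordforms)
# Everything that is not a Bulgarian wordform: Latin plus transliterations.
non_bulgarian_wordforms = 100 - wordform_shares["Bulgarian"]

for name, pct in basic_word_shares.items():
    print(f"{name}: {pct:.0f}% of basic words")
print(f"Non-Bulgarian wordforms: {non_bulgarian_wordforms:.1f}%")
```

The basic-word shares round to the 63%, 3% and 34% shown in the table, and the non-Bulgarian share of wordforms comes out close to the "about 1/4" quoted on the previous slide.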
Bulgarian clinical texts
Common features of a medical sublanguage:
• Phrases instead of complete sentences
• A lot of implicit or tacit knowledge needed for proper understanding
• Only a few types of negation
• Mostly the facts relevant to the focus are documented – e.g., in a specialised diabetic hospital, facts related to other diseases might be ignored
• Many results of clinical tests are entered as free text when the tests are done outside the hospital
BG hospital discharge letters
• 2-3 pages, structured into sections (by law)
BG hospital discharge letters
• 77 y. - F
• Sofia (city)
• Diagnosis: Diabetes mellitus type 2, with secondary resistance to sulfonylurea drugs (СУП). Polyneuropathia diabetica. Nephropathia diabetica incipiens. Thyreoiditis Hashimoto – hypothyroid stage. Anaemia perniciosa. Bilateral deafness.
• Anamnesis: Admitted to the clinic once again for monitoring of her condition. Diabetes mellitus type II of 20 years' duration, discovered incidentally during tests done for another reason. On insulin treatment for 11 years, …. Complaints at admission entirely from the limbs, as listed above.
• Past diseases: Nephrolithiasis bilateralis.
• Family history: denies.
• Risk factors – allergy to penicillins and Analgin.
• Status: A woman who looks about her actual age, in satisfactory general condition, oriented, …
• Tests: ESR – 22, Hb – 133, RBC – 4,6, Hct – 0,42, WBC – 4,8, MCV – 91,4; PLT – 258, HDL-chol – 1.28, total chol. – 4,8, TG – 1,07; …..
• Discussion: …..
PSIP: an Integrated Project (IP) in FP7 ICT eHealth
• Patient Safety through Intelligent Procedures in medication
• Extension of a running project with 14 core partners
• On average, 1.8 drugs per patient are recorded in the HIS (hospital information system), against 5.6 mentioned in the free text
BG-Resources – ICD & Tabular Index
• ICD-10: 14 439 lines of the form ‘code’ - ‘text description’,
e.g. E66.8 'Other obesity'
• ‘Instructions’ for using ICD
– 19 161 BG-words
– 291 116 BG wordforms
– 2 221 Latin terms (11.59%)
– 83 713 occurrences of Latin terms (28.76%)
– 76 939 descriptions of 9 044 ICD codes
BG-Resources – Drug names
• 1500 drug names manually translated to Bulgarian, to fill
in the ATC classification with Bulgarian drug names
• Training to acquire grammatical patterns
Results for 6200 discharge letters
• Precision: % of correct items among all items found
• Recall: % of correct items found among all items present in the texts
            Occurrences   Precision   Recall   F
Diagnoses   26 826        97.30%      74.69%   84.50%
Drugs       160 892       97.28%      99.59%   98.42%
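The F column is the harmonic mean of precision and recall. A minimal Python sketch of the three measures; the function names and the true-positive / false-positive / false-negative argument names are illustrative and do not come from the slides, while the percentages are copied from the table:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall (as fractions) from raw counts.

    tp: correct items found, fp: wrong items found, fn: items missed.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (result in the same units)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall percentages copied from the results table.
results = {"Diagnoses": (97.30, 74.69), "Drugs": (97.28, 99.59)}
for name, (p, r) in results.items():
    print(f"{name}: F = {f_measure(p, r):.2f}%")
```

The values computed this way agree with the F column of the table to within rounding.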
Drugs at hospitalisation day 0
• Contextualisation: timing of drug events
• Using the Anamnesis (Case History) section only
• 500 drugs in 6200 discharge letters
• Careful training on suitable phrases
• Precision: 88%
• Recall: 92.45%
• F (harmonic mean): 90.17%
• Award for best paper on EHR at EFMI 2011
EVTIMA: Building event timelines
• Events are important (with their time and modality)
• A primitive event (in our context) is:
• (1) a diagnosis,
• (2) a drug,
• (3) a condition: a complaint, a symptom, or a change in the status that signals abnormality
– high BP
– decompensation of diabetes mellitus
– increased levels of serum creatinine
• Complex event: an aggregation of primitive events, e.g. all drugs
Temporal expressions
• Dates day/month/year
• Year or month only
• Prepositional phrases containing temporal
information
• Classified into
– Absolute
– Relative: anchored to the hospitalisation date, the
birthdate, or another event, e.g. a previously mentioned
moment (“since then”) or some other reference (“since puberty”)
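The classification above can be illustrated with a small rule-based sketch in Python. All patterns below are hypothetical and written for English-language examples; the system described in these slides works on Bulgarian text and its actual rules are not shown here.

```python
import re

# Hypothetical patterns for illustration only (not the rules of the real system).
FULL_DATE = re.compile(r"\b\d{1,2}[./]\d{1,2}[./]\d{4}\b")   # day/month/year
MONTH_YEAR = re.compile(r"\b\d{1,2}[./]\d{4}\b")             # month/year
YEAR_ONLY = re.compile(r"\b(?:19|20)\d{2}\b")                # year only
RELATIVE = re.compile(
    r"\b(?:since then|since puberty|for \d+ years|\d+ years ago)\b",
    re.IGNORECASE,
)


def classify(expression):
    """Return (scale, kind) for a temporal expression, or ("unknown", None)."""
    # Full dates are tested first because a month/year pattern would also
    # match the tail of a full date.
    if FULL_DATE.search(expression):
        return ("absolute", "date")
    if MONTH_YEAR.search(expression) or YEAR_ONLY.search(expression):
        return ("absolute", "year or month only")
    if RELATIVE.search(expression):
        return ("relative", "phrase")
    return ("unknown", None)


for text in ["admitted on 12.03.2009", "diagnosed in 03.2009",
             "operated in 2005", "on insulin for 11 years", "since puberty"]:
    print(text, "->", classify(text))
```

A real system would additionally have to resolve each relative expression against its anchor (hospitalisation date, birthdate or a previous event), which this sketch does not attempt.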
Ordering events on time lines
• Algorithm based on directed multi-graph
representation
• Time markers are nodes (states)
• Edges represent primitive events; each edge joins
the event's start-time node to its end-time node
• Two graphs are generated – one for relative
and one for absolute time scales
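A minimal Python sketch of such a representation, not the EVTIMA implementation: time markers are nodes, each primitive event is an edge from its start node to its end node, and parallel edges are allowed. The class name, the use of year strings for absolute nodes, day offsets from admission for relative nodes, and the sample time values are all assumptions made for illustration.

```python
from collections import defaultdict


class TimelineGraph:
    """Directed multigraph: nodes are time markers, edges are primitive events."""

    def __init__(self):
        # (start_node, end_node) -> list of events; a list per node pair
        # is what makes this a multigraph (parallel edges are kept).
        self._edges = defaultdict(list)

    def add_event(self, start, end, kind, label):
        self._edges[(start, end)].append((kind, label))

    def nodes(self):
        """All time markers that occur as the start or end of some event."""
        return sorted({t for pair in self._edges for t in pair})

    def events_between(self, start, end):
        return list(self._edges.get((start, end), []))


# One graph per time scale. The time values are hypothetical.
absolute = TimelineGraph()
absolute.add_event("2002", "2013", "drug", "insulin")
absolute.add_event("2002", "2013", "condition", "high BP")

relative = TimelineGraph()  # nodes: days relative to admission (assumption)
relative.add_event(-7, 0, "condition", "decompensation of diabetes mellitus")

print(absolute.nodes(), absolute.events_between("2002", "2013"))
print(relative.nodes())
```

Sorting the nodes of each graph gives the order of time markers on that scale; relating the two scales to each other would need the hospitalisation date as a shared anchor.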
Evaluation: Training/test sets 1300/6200
• Average number of primitive events per discharge letter: 20.69
• In the training/test set: 371/565 different diagnoses (patients have similar diagnoses and treatments)
• In the test set:
– 1,349 dates (day/month/year),
– 2,698 markers (year and/or month only),
– 2,362 markers for relative time periods
– 2,351 concerning the admission date
• Distribution of temporal markers:
– 38% to events presenting diagnoses
– 47% to events expressing drug admission / change
– 15% to complaints and conditions
Accuracy
                 Precision %   Recall %   F %
Events
  Drugs          97.28         99.59      98.42
  Diagnoses      97.30         74.68      84.50
  Complaints     97.98         96.82      97.40
Dates            98.86         98.21      98.53
Time Duration    99.14         98.26      98.70
Frequency        92.25         95.51      93.85
Current work
• Reimbursement for diabetic patients (ICD E11) is a
major budget item of the Health Insurance Fund (HIF):
– 2011: > 61 million lv
– 2012: > 77 million lv
– 2013/Jan-March: > 20 million lv
– tendency to use more expensive drugs
• Principal application goal: to do something useful
with the millions of records stored in the Health
Insurance Fund (they contain much free text)
• Ambition of the very active medical partners
Files submitted for reimbursement
<Pay>1</Pay>
<Patient>
  <EGN>29d53d021a8ea04f8a58b0b7b17ca901d471c111</EGN> <!-- pseudonym -->
  <RZOK>22</RZOK>
  <ZdrRajon>01</ZdrRajon>
  ……
  <age>68</age>
  <gender>2</gender>
</Patient>
<MainDiag>
  <imeMD>Non-insulin-dependent diabetes mellitus with neurological complications</imeMD>
  <MKB>E11.4</MKB>
</MainDiag>
<Diag>
  <imeD>Diabetic polyneuropathy (E10-E14 with common fourth character .4)</imeD>
  <MKB>G63.2</MKB>
</Diag>
<Diag>
  <imeD>Thyroiditis, unspecified</imeD>
  <MKB>E06.9</MKB>
</Diag>
<Diag>
  <imeD>Hypertensive heart disease</imeD>
  <MKB>I11</MKB>
</Diag>
<Diag>
  <imeD>Angina pectoris</imeD>
  <MKB>I20</MKB>
</Diag>
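The structured part of such a record can be read with a standard XML parser before any text analysis starts. A minimal Python sketch using the standard library; the element names are those of the record above, while the `<Record>` root element and the `summarise` function are assumptions added here, since the slide shows only a fragment and not the real root element.

```python
import xml.etree.ElementTree as ET

# Abridged from the record above. The <Record> wrapper is hypothetical;
# it is added only so that the fragment is a well-formed document.
SAMPLE = """<Record>
  <Pay>1</Pay>
  <Patient><age>68</age><gender>2</gender></Patient>
  <MainDiag><MKB>E11.4</MKB></MainDiag>
  <Diag><MKB>G63.2</MKB></Diag>
  <Diag><MKB>E06.9</MKB></Diag>
  <Diag><MKB>I11</MKB></Diag>
  <Diag><MKB>I20</MKB></Diag>
</Record>"""


def summarise(xml_text):
    """Collect the coded fields of one reimbursement record."""
    root = ET.fromstring(xml_text)
    return {
        "age": int(root.findtext("Patient/age")),
        "gender": root.findtext("Patient/gender"),
        "main_diagnosis": root.findtext("MainDiag/MKB"),
        "other_diagnoses": [d.findtext("MKB") for d in root.findall("Diag")],
    }


summary = summarise(SAMPLE)
print(summary)
```

Only the coded fields are recovered this way; the free-text elements of the same record still need the text analysis described in this talk.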
Files submitted for reimbursement
• <Anamnesa>Diabetes mellitus for 9 years, diagnosed on the occasion of moderate polydipsia-polyuria, with overweight. Takes Metfogamma 3 x 1000 mg daily. Neuropathy established; treated with Thioctacid infusions. Has fatigue, restless sleep. Has hypertension, IHD. Takes Enap and Verapamil. Blood sugar most often around 10. Family history: her father had hypertension.</Anamnesa>
• <HState>Height 164 cm, weight 81 kg, BMI 30.6 kg/m2. Skin: slightly dry. Thyroid gland: suspected nodule in the right lobe /on ultrasound a nodule in the right lobe and in the left one, and a hypoechoic zone in the left lobe/. Respiratory system: no abnormalities. Cardiac activity regular, rhythmic, 80 per min, clear heart sounds, blood pressure 150/90. Liver and spleen: not palpable. Sensation: preserved.</HState>
• <Examine>Blood glucose profile 9.3-8.13, 8.96, HbA1c 6.9%, uric acid 277.8; thyroid ultrasound: right lobe of enlarged size, inhomogeneous structure, hypoechoic nodule sized 21/18 mm; left lobe of normal size, in the middle third a hypoechoic nodule 8/5 mm, at the base a non-encapsulated hypoechoic zone 20/11 mm; conclusion: Struma nodosa, differential diagnosis: Hashimoto's thyroiditis, nodular form [ 10.09 - sum: 10,34 ] TAT, MAT</Examine>
• <Therapy> <Nonreimburce>To follow a hypocaloric diet; instructions given. To take Metfogamma 3 x 1000 mg. Thyroid function to be assessed after the tests. Refuses fine-needle aspiration biopsy. Under dispensary follow-up by the GP.</Nonreimburce>
</Therapy>
Possible findings
• When diabetic patients come a second time for
control examinations, what is the reason for the
worsened lab test results?
• Does it depend on the drugs? (Giving more
expensive drugs does not always mean better
compensation.)
• Grouping patients by: gender, age, region, drugs,
accompanying diseases … but only after
automatic analysis of the free text presenting
clinical tests
Conclusion
• Medicine is a very large domain, so progress may be only incremental
• Diabetes is a relatively narrow “genre”
• Medical experts are learning the potential of text analytics and planning how to use it in practice
• From an application perspective, what we want to do is a typical example of secondary use of EHR data
• Principal theoretical goal: … to help computers understand biomedical language and natural language in general ☺
Acknowledgements
• Dr Dimitar Tcharaktchiev, University Specialised Hospital for Endocrinology, Medical Univ. Sofia
• Dr Svetla Boytcheva, AUBG
• Ivelina Nikolova, PhD student IICT-BAS
• Hristo Dimitrov, PhD student MU-Sofia
• Dr Zhivko Angelov, Adiss Lab Ltd.
• All starring in the movie:
Информатиката в полза на здравеопазването (“Informatics for the benefit of healthcare”)
http://www.youtube.com/watch?v=K7m3JY9ekHA&feature=youtu.be
Acknowledgements
• AComIn (Advanced Computing for Innovation),
FP7-REGPOT-2012-2013-1 grant 316087
• PSIP (Patient Safety through Intelligent
Procedures in medication), FP7 ICT eHealth grant
216130
• EVTIMA (Effective search of conceptual
information with applications in medical
informatics), Bulgarian National Science Fund DO
02-292