Post on 14-Jul-2020
Parsing MetaMap Files in Hadoop
Amy L. Olex, M.S., Alberto Cano, Ph.D., Bridget McInnes, Ph.D.
Department of Computer Science
The deluge of data in today's information-centric world requires bigger and better computing resources for processing. This can be a limiting factor in how much data labs with limited computing resources are able to handle. This project explores the pitfalls of a serial program that parses MetaMap files to identify UMLS CUI bigrams in a large set of scientific literature. The algorithm is re-implemented in Hadoop MapReduce to overcome resource bottlenecks.
Hadoop for the Desktop
MapReduce in 30 Seconds
Conclusion and Future Work
References
[1] Bodenreider O. "The unified medical language system (UMLS): integrating biomedical terminology." Nucleic Acids Research, 2004, 32, D267-D270.
[2] Aronson AR and Lang F. "An overview of MetaMap: historical perspective and recent advances." JAMIA, 2010, 17:3, 229-236.
Contact alolex@vcu.edu or btmcinnes@vcu.edu for more information.
Results
UMLS, CUIs, and MetaMap... Oh My!
UMLS [1]
• Unified Medical Language System
• Repository of biomedical vocabularies
• Facilitates automated information retrieval (e.g., linking health information and billing codes across systems).
• Provides language normalization by linking similar terms to the same concept/meaning.
CUI
• Concept Unique Identifier
• ID assigned to each unique concept in the UMLS.
• E.g., Headache and Cranial Pain are both assigned to the CUI C0018681.
MetaMap [2]
• Tool that identifies UMLS concepts in biomedical texts.
• Outputs mapped text to compressed MetaMap Machine Output (MMO) files.
• Parsed 23,343,329 citations to create the 2015 MEDLINE/PubMed Baseline dataset: 779 MMO files compressed to 132 GB.
UMLS::Association
UMLS CUIs can be used to normalize biomedical and clinical text for use in natural language processing applications. By counting CUI bigram frequency (the number of times two CUIs appear close to each other in text), the UMLS::Association package can identify related concepts, e.g., headache and aspirin.
MetaMap File Anatomy
Adjacent CUI Bigrams
C1710187 - C0868928
C0868928 - C0018081
C0018081 - C0259800
C1710187 - C1533148
C1533148 - C0018081
Serial Limitations
• Processes one file and one utterance at a time with nested for loops.
• Regularly writes the nested bigram hash table to a MySQL database due to memory limitations, introducing DB communication latency.
• Perl code is not parallelizable due to limitations in sharing nested hashes across threads.
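The serial approach listed above can be sketched roughly as follows. This is a minimal Python illustration, not the authors' Perl code: the nested dictionary stands in for the nested bigram hash table, the function name `count_bigrams_serial` and the toy input are assumptions made for the example, and the periodic MySQL write is left out.

```python
from collections import defaultdict

def count_bigrams_serial(utterances):
    """Count adjacent CUI bigrams one utterance at a time (hypothetical helper)."""
    # Nested table: counts[first_cui][second_cui] -> frequency
    counts = defaultdict(lambda: defaultdict(int))
    for cuis in utterances:             # outer loop: one utterance at a time
        for i in range(len(cuis) - 1):  # inner loop: adjacent CUI pairs
            counts[cuis[i]][cuis[i + 1]] += 1
    return counts

# Toy input (made up for illustration): each utterance is a list of CUIs in text order.
utterances = [
    ["C1710187", "C0868928", "C0018081"],
    ["C0868928", "C0018081"],
]
counts = count_bigrams_serial(utterances)
print(counts["C0868928"]["C0018081"])  # prints 2
```

Because the whole table lives in one process, it grows with the number of distinct bigrams, which is the memory pressure that the periodic database writes were meant to relieve.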
[Figure: MapReduce data flow]
Input (four utterances):
  Utterance…CUI1…CUI2…EOU
  Utterance…CUI1…CUI2…EOU
  Utterance…CUI1…CUI2…CUI3…CUI4…EOU
  Utterance…CUI1…CUI2…EOU
Map (<K,V> pairs emitted per utterance):
  CUI1-CUI2,1
  CUI1-CUI2,1
  CUI1-CUI2,1  CUI2-CUI3,1  CUI3-CUI4,1
  CUI1-CUI2,1
Reduce (sum):
  CUI1-CUI2,4
  CUI2-CUI3,1
  CUI3-CUI4,1
Split input by record --> identify <Key, Value> pairs in parallel --> sum values with same key
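The data flow in the diagram can be imitated in a few lines of plain Python. This is only a sketch of the map and reduce steps under simplifying assumptions: it runs sequentially in one process rather than on Hadoop, and the function names `map_utterance` and `reduce_counts` are invented for the example.

```python
from collections import defaultdict

def map_utterance(cuis):
    """Map step: emit a (bigram, 1) pair for each adjacent CUI pair in one utterance."""
    return [(f"{a}-{b}", 1) for a, b in zip(cuis, cuis[1:])]

def reduce_counts(pairs):
    """Reduce step: sum the values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# The four utterances from the diagram, as lists of placeholder CUIs.
utterances = [
    ["CUI1", "CUI2"],
    ["CUI1", "CUI2"],
    ["CUI1", "CUI2", "CUI3", "CUI4"],
    ["CUI1", "CUI2"],
]

# In Hadoop the map calls run in parallel and the framework groups pairs by key;
# here both are simulated sequentially.
emitted = [pair for cuis in utterances for pair in map_utterance(cuis)]
result = reduce_counts(emitted)
print(result)  # {'CUI1-CUI2': 4, 'CUI2-CUI3': 1, 'CUI3-CUI4': 1}
```

Each map call depends only on its own utterance, which is what lets the framework distribute the work without shared state.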
CuiCollectorMapReduce
Extracts CUI bigrams using the Hadoop MapReduce framework.
Desktop implementation (a single-node Hadoop system).
MapReduce Advantages
• Inherent and scalable parallelization.
• Writes all results to disk, both intermediate and final.
• No MySQL communication.
Dataset: MetaMap 2015 MEDLINE Baseline

Implementation      Run time   MapReduce speedup
Serial (MySQL)      229 hrs    28.6x
Serial (no MySQL)   22 hrs     2.8x
MapReduce           8 hrs      (baseline)
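The two speedup figures follow directly from the reported run times. The short calculation below is included only to make that arithmetic explicit; the variable names are chosen for the example.

```python
# Run times in hours, as reported on the poster.
serial_mysql_hrs = 229
serial_no_mysql_hrs = 22
mapreduce_hrs = 8

# Speedup = serial run time / MapReduce run time
speedup_vs_mysql = serial_mysql_hrs / mapreduce_hrs        # 28.625, reported as 28.6x
speedup_vs_no_mysql = serial_no_mysql_hrs / mapreduce_hrs  # 2.75, reported as 2.8x

print(speedup_vs_mysql, speedup_vs_no_mysql)
```

Note that removing the MySQL writes alone takes the serial run from 229 to 22 hours, so most of the 28.6x figure reflects avoided database latency rather than parallelism.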
Parsing CUI bigrams in Hadoop on a desktop computer resulted in significant speedup, and enables parsing larger datasets that were not previously feasible. Algorithm improvements include a window size to collect distant CUI bigrams and crossing utterances to process full PubMed citations. Future work includes testing the scalability on a Hadoop cluster, resolving an issue with the compressed input file format to improve mapping efficiency, identifying optimal Hadoop settings for a desktop implementation, and re-implementing in Spark to take advantage of its in-memory storage of intermediate results.
In conclusion, desktop implementations of Hadoop can resolve computing resource problems and process data faster, opening up more research areas in big data processing for smaller labs.
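The window-size improvement mentioned above can be illustrated with a small sketch. The poster does not define the window precisely, so the semantics here are an assumption: a window of size w pairs each CUI with up to w CUIs that follow it, and a window of 1 yields only adjacent bigrams. The function name `windowed_bigrams` is hypothetical.

```python
def windowed_bigrams(cuis, window=1):
    """Pair each CUI with up to `window` CUIs that follow it (assumed semantics)."""
    pairs = []
    for i, first in enumerate(cuis):
        # Slice the next `window` CUIs; slicing past the end is safe in Python.
        for second in cuis[i + 1 : i + 1 + window]:
            pairs.append((first, second))
    return pairs

# Placeholder CUIs in text order.
citation = ["CUI1", "CUI2", "CUI3", "CUI4"]
adjacent = windowed_bigrams(citation, window=1)  # 3 pairs: adjacent CUIs only
distant = windowed_bigrams(citation, window=2)   # 5 pairs: adds CUI1-CUI3 and CUI2-CUI4
print(adjacent)
print(distant)
```

Under the same assumption, crossing utterances would amount to concatenating the CUI lists of a citation's utterances before calling the function, so that pairs can span an utterance boundary.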