Parsing MetaMap Files in Hadoop

Post on 14-Jul-2020


Amy L. Olex, M.S., Alberto Cano, Ph.D., Bridget McInnes, Ph.D.

Department of Computer Science

The deluge of data in today’s information-centric world requires bigger and better computing resources for processing. This can be a limiting factor in how much data labs with limited computing resources are able to handle. This project explores the pitfalls of a serial program that parses MetaMap files to identify UMLS CUI bigrams in a large set of scientific literature. The algorithm is re-implemented in Hadoop MapReduce to overcome resource bottlenecks.

Hadoop for the Desktop

MapReduce in 30 Seconds

Conclusion and Future Work

References
[1] Bodenreider O. “The unified medical language system (UMLS): integrating biomedical terminology.” Nucleic Acids Research, 2004, 32, D267-D270.
[2] Aronson AR and Lang F. “An overview of MetaMap: historical perspective and recent advances.” JAMIA, 2010, 17:3, 229-236.

Contact alolex@vcu.edu or btmcinnes@vcu.edu for more information.

Results

UMLS, CUIs, and MetaMap… Oh My!

UMLS [1]
• Unified Medical Language System
• Repository of biomedical vocabularies
• Facilitates automated information retrieval (e.g. linking health information and billing codes across systems).
• Provides language normalization by linking similar terms to the same concept/meaning.

CUI
• Concept Unique Identifier
• ID assigned to each unique concept in the UMLS.
• E.g. Headache and Cranial Pain are both assigned to the CUI C0018681.

MetaMap [2]
• Tool that identifies UMLS concepts in biomedical texts.
• Outputs mapped text to compressed MetaMap Machine Output (MMO) files.
• Parsed 23,343,329 citations to create the 2015 MedLine/PubMed Baseline dataset: 779 MMO files compressed to 132 GB.

UMLS::Association
UMLS CUIs can be used to normalize biomedical and clinical text for use in natural language processing applications. By counting CUI bigram frequency (the number of times two CUIs appear close to each other in text), the UMLS::Association package can identify related concepts, e.g. headache and aspirin.

MetaMap File Anatomy

Adjacent CUI Bigrams
C1710187 - C0868928
C0868928 - C0018081
C0018081 - C0259800
C1710187 - C1533148
C1533148 - C0018081
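To make the idea of adjacent CUI bigrams and their frequencies concrete, here is a minimal Python sketch. The two CUI sequences are hypothetical: they were chosen only so that their neighbouring pairs match the bigrams listed above, and the helper name `adjacent_bigrams` is an assumption for illustration, not part of UMLS::Association or MetaMap.

```python
from collections import Counter

def adjacent_bigrams(cuis):
    """Return the (first, second) pairs of neighbouring CUIs in one sequence."""
    return list(zip(cuis, cuis[1:]))

# Hypothetical CUI sequences (assumption): picked so that their adjacent
# pairs reproduce the five bigrams listed on the poster.
sequences = [
    ["C1710187", "C0868928", "C0018081", "C0259800"],
    ["C1710187", "C1533148", "C0018081"],
]

# Bigram frequency: how often each ordered pair of CUIs occurs side by side.
counts = Counter()
for seq in sequences:
    counts.update(adjacent_bigrams(seq))

for (first, second), n in counts.items():
    print(f"{first} - {second}: {n}")
```

Each of the five pairs occurs once in this toy input; on a real corpus the counts are what an association measure would be computed from.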

Serial Limitations
• Processes one file/one utterance at a time with nested for loops.
• Regularly writes nested bigram hash table to MySQL database due to memory limitations, introducing DB communication latency.
• Perl code is not parallelizable due to limitations in sharing nested hashes across threads.

[Figure: MapReduce data flow for CUI bigrams]

Input (one record per utterance):
Utterance … CUI1 … CUI2 … EOU
Utterance … CUI1 … CUI2 … EOU
Utterance … CUI1 … CUI2 … CUI3 … CUI4 … EOU
Utterance … CUI1 … CUI2 … EOU

Map (<K,V> pairs emitted per record):
CUI1-CUI2, 1
CUI1-CUI2, 1
CUI1-CUI2, 1; CUI2-CUI3, 1; CUI3-CUI4, 1
CUI1-CUI2, 1

Reduce (Sum) and Output:
CUI1-CUI2, 4
CUI2-CUI3, 1
CUI3-CUI4, 1

Split input by record --> identify <Key, Value> pairs in parallel --> sum values with same key
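The split, map, and sum steps can be imitated in a few lines of plain Python. This is only a sequential sketch of the idea, not the CuiCollectorMapReduce code: the function names `map_record` and `reduce_counts` are hypothetical, and each record is assumed to be a whitespace-separated list of CUIs rather than a real MMO utterance.

```python
from collections import defaultdict

def map_record(record):
    """Map step: emit a ("CUIa-CUIb", 1) pair for each adjacent CUI pair in one record."""
    cuis = record.split()
    return [(f"{a}-{b}", 1) for a, b in zip(cuis, cuis[1:])]

def reduce_counts(pairs):
    """Reduce step: sum the values of all pairs that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Simplified stand-ins for the four utterance records (assumption: the real
# input is MetaMap Machine Output, which needs proper parsing).
records = [
    "CUI1 CUI2",
    "CUI1 CUI2",
    "CUI1 CUI2 CUI3 CUI4",
    "CUI1 CUI2",
]

# Hadoop would run the map calls in parallel and group the pairs by key;
# here both happen sequentially in one process.
emitted = [pair for record in records for pair in map_record(record)]
totals = reduce_counts(emitted)
print(totals)  # → {'CUI1-CUI2': 4, 'CUI2-CUI3': 1, 'CUI3-CUI4': 1}
```

Because each record is mapped independently and the reduce step only needs pairs with the same key, neither step requires a shared nested hash table, which is the property the serial Perl version lacked.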

CuiCollectorMapReduce
Extracts CUI bigrams using the Hadoop MapReduce framework.

Desktop implementation (a single-node Hadoop system).

MapReduce Advantages
• Inherent and scalable parallelization.
• Writes all results to disk, both intermediate and final.
• No MySQL communication.

MetaMap 2015 MEDLINE Baseline Dataset (run times)
• Serial (MySQL): 229 hrs
• Serial (no MySQL): 22 hrs
• MapReduce: 8 hrs

MapReduce speedup: 28.6x over Serial (MySQL) and 2.8x over Serial (no MySQL).

Parsing CUI bigrams in Hadoop on a desktop computer resulted in significant speedup and enables parsing larger datasets that were not previously feasible. Algorithm improvements include a window size to collect distant CUI bigrams and crossing utterances to process full PubMed citations. Future work includes testing the scalability on a Hadoop cluster, resolving an issue with the compressed input file format to improve mapping efficiency, identifying optimal Hadoop settings for a desktop implementation, and re-implementing in Spark to take advantage of its in-memory storage of intermediate results.

In conclusion, desktop implementations of Hadoop can resolve computing resource problems and process data faster, opening up more research areas in big data processing for smaller labs.
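The two algorithm improvements named above, a window size and crossing utterances, can be illustrated with a short Python sketch. The poster does not define either precisely, so this rests on assumptions: a window of w pairs each CUI with the next w CUIs, and crossing utterances means concatenating all utterances of a citation before pairing. The function names are hypothetical.

```python
def windowed_bigrams(cuis, window=1):
    """Pair each CUI with up to `window` CUIs that follow it (window=1 gives adjacent bigrams)."""
    pairs = []
    for i, first in enumerate(cuis):
        for second in cuis[i + 1 : i + 1 + window]:
            pairs.append((first, second))
    return pairs

def citation_bigrams(utterances, window=1, cross_utterances=False):
    """Collect bigrams for one citation, optionally letting pairs span utterance boundaries."""
    if cross_utterances:
        # Assumption: treat the whole citation as one long CUI sequence.
        merged = [cui for utterance in utterances for cui in utterance]
        return windowed_bigrams(merged, window)
    # Otherwise pair CUIs only within each utterance.
    return [pair for utterance in utterances
            for pair in windowed_bigrams(utterance, window)]

# Hypothetical citation with two utterances.
citation = [["CUI1", "CUI2"], ["CUI3", "CUI4"]]
print(citation_bigrams(citation, window=2, cross_utterances=False))
print(citation_bigrams(citation, window=2, cross_utterances=True))
```

With the boundary respected, only pairs inside each utterance are produced; with crossing enabled, pairs such as CUI2 and CUI3 are also collected, which is what allows a full citation to be treated as one context.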