Transcript
Page 1: Key medical applications beyond classification (on-demand.gputechconf.com/gtc/2018/presentation/s8295-frontiers-…)

4/18/2018

Copyright © Daniel Rubin and Imon Banerjee 2018

Frontiers of AI in Medical Imaging:

Daniel L. Rubin and Imon Banerjee

Department of Biomedical Data Science

Department of Radiology

Stanford School of Medicine

Overcoming Current Challenges and Moving Beyond Classification

Outline

1. Need for image interpretation beyond image classification

2. Overcoming the challenge of insufficient training data

3. Integrating multiple data types with images

4. Making AI clinical predictions and providing visualizations for explanation


Benign, Benign, Cancer, Cancer (example lesion image labels)

Image classification: Medical imaging

“Benign or cancer lesion?”

Copyright © Stanford University 2018

There are other important medical needs beyond image classification…

Key medical applications beyond classification

1. Disease detection

2. Lesion segmentation

3. Treatment selection

4. Response assessment

5. Clinical prediction (of response or future disease)

Page 2

People (and their diseases) differ…


Disease varies in different people

Molecular diversity: heterogeneous genomic aberration landscape of individual tumors*

Phenotypic diversity: variable appearance of lesions on images

Clinical diversity: patients have different responses to treatment

(Figure: 116 GBM patients grouped into Proneural, Neural, Classical, and Mesenchymal subtypes. The TCGA Research Network, Cancer Cell, 2010)


“Precision Medicine”

• Patient care often lacks specificity (“One size fits all” does not usually apply in medicine)

• There are “subtypes” of disease (e.g., many types of “breast cancer” needing specific therapy for each type)

• Precise diagnoses based on “electronic phenotyping” and molecular profiling enable treatments that are tailored to the unique characteristics of each patient

• Requires accurate methods of prediction based on disease phenotypes

› Key opportunity for Big Data and AI methods


“Precision Health”

• A paradigm shift, focusing on prediction and prevention, rather than relying exclusively on diagnosis and treatment of existing disease

• Prevents or forestalls the development of disease

• Reduces costs and morbidity and improves patient care

• Requires accurate methods of prediction based on monitoring people’s health status

› Key opportunity for Big Data and AI methods


These are prediction problems…

Explosion in electronically-accessible medical records data provides opportunity to learn models to help with these prediction problems

Growth in electronic patient data

Source: https://www.healthit.gov/sites/default/files/data-brief/2014HospitalAdoptionDataBrief.pdf


Page 3

Data-driven, precision medicine

Mine medical data to create AI models

Precision medicine: match patients to treatments

Precision health: predict future disease

Deploy AI models to aid medical decision making in people


Computerized model:

Integration and phenotype extraction

Integrating various types of data (e.g., images + clinical notes) is needed

Disease detection · Diagnosis · Treatment response evaluation · Clinical prediction

Molecular Characterization

Laboratory and Clinical Testing

Radiology Imaging and Reports

Pathology Imaging/Reports

Clinical Decision Support

Radiomics and Deep Learning


Deep learning: Image classification

Input image → Output label

Modified after slide by Jeff Dean, Google

Low-level representation: edges, shapes, object parts

High-level representation: recognizable objects

Unsupervised, learned quantitative features

• High-level abstractions of image features via hierarchical, non-linear transformations

• Higher-level features (layers) are defined from lower-level ones, and represent higher levels of abstraction

• Most suitable for classification problems


Beyond classification: Annotation of images

• In order to build classifiers for images, we need to collect lots of annotated images

• Creating large annotated image datasets is costly

• Scalable NLP methods applied to radiology reports associated with images could generate image annotations in large volume

• Word embeddings are a promising NLP strategy for this purpose

• Disease in patients evolves over time (longitudinally)

• Patient data (images and text reports/notes) are acquired longitudinally

• Prediction models need to account for longitudinal data inputs (e.g., RNN, LSTM)

Beyond classification: Prediction

Deep learning with text: Word embeddings

CBOW Neural Network Model

• Simple neural network of a single hidden layer with a linear activation function.

Skip-gram: the input to the model is wi, and the output is its context, e.g., wi−1, wi−2, wi+1, wi+2 (“predicting the context given a word”)

Continuous bag-of-words (CBOW): the input to the model is the context, e.g., wi−2, wi−1, wi+1, wi+2, and the output is wi (“predicting the word given its context”)

• Unsupervised learning from large corpora; word vectors (embedding) learned by stochastic gradient descent
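The two training setups above differ only in which side of the (word, context) relation is predicted. A minimal sketch in plain Python of how skip-gram and CBOW training pairs are generated from a token stream with a context window of 2 (the sentence is a made-up toy example):

```python
def skipgram_pairs(tokens, window=2):
    """(input word w_i, context word) pairs: predict the context given a word."""
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((w, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """(context words, target word w_i) pairs: predict the word given its context."""
    pairs = []
    for i, w in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, w))
    return pairs

tokens = "no acute intracranial hemorrhage seen".split()
print(skipgram_pairs(tokens)[:3])
# [('no', 'acute'), ('no', 'intracranial'), ('acute', 'no')]
print(cbow_pairs(tokens)[2])
# (['no', 'acute', 'hemorrhage', 'seen'], 'intracranial')
```

In an actual implementation (e.g., word2vec), these pairs feed the single-hidden-layer network whose learned weights become the word vectors.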


Page 4

Identifying core terms from unstructured narrative text

Unsupervised deep learning algorithms can discover annotations from text without the need to supply domain-specific knowledge

Word embedding using deep learning (4,442 words) projected in two dimensions

Imon Banerjee, JDI 30:506-518, 2017

Outline

1. Need for image interpretation beyond image classification

2. Overcoming the challenge of insufficient training data

3. Integrating multiple data types with images

4. Making AI clinical predictions and providing visualizations for explanation

• Each PACS database contains millions of images “labeled” in the form of unstructured notes.

• Why not use the notes to annotate the images?

• Unstructured free text cannot be directly interpreted by a machine due to the ambiguity and subtlety of natural language.

• How to extract the semantic information from the clinical notes?

Radiological image annotation: leveraging clinical notes

Radiologist’s note · CT image


Representing radiology notes for classification

f(x) = y?

What is the best representation of the document x to be classified by the machine?


Dictionary-based approach

Construct a pattern dictionary (template)

Extracts and structures clinically relevant information from textual radiology reports.

Translates the information to terms in a controlled vocabulary.

MedLEE – Friedman et al. (1995)


Rule-based systems

Creates annotations for text that matches pre-defined regular-expression patterns. Unlike the dictionary-based method, the rule-based method uses a set of general rules instead of a dictionary to extract information from text.

The regular expressions are usually organized into rule files.

Bozkurt et al. Using automatically extracted information from mammography reports for decision-support. 2016


Page 5

Unsupervised feature extraction (BoW and tf-idf)

Consider that the document collection contains n distinct words.

Each document is characterized by an n-dimensional vector whose i-th component is the frequency (or tf-idf weight) of the i-th word.

Example:

Report 1: [No tumor seen]

Report 2: [Large tumor detected]

W (vocabulary) = [No, tumor, seen, Large, detected] (n = 5)

Report_V1 = [1,1,1,0,0]

Report_V2 = [0,1,0,1,1]
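The two-report example above can be reproduced in a few lines of plain Python (a sketch of term-frequency BoW only; tf-idf would additionally down-weight words that appear in many documents):

```python
def bow_vectors(reports):
    # Build the vocabulary in order of first appearance across all reports.
    tokenized = [r.split() for r in reports]
    vocab = []
    for tokens in tokenized:
        for w in tokens:
            if w not in vocab:
                vocab.append(w)
    # The i-th component of each document vector is the frequency of the i-th word.
    vectors = [[tokens.count(w) for w in vocab] for tokens in tokenized]
    return vocab, vectors

vocab, vectors = bow_vectors(["No tumor seen", "Large tumor detected"])
print(vocab)    # ['No', 'tumor', 'seen', 'Large', 'detected']
print(vectors)  # [[1, 1, 1, 0, 0], [0, 1, 0, 1, 1]]
```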


Word embedding + classification model

• Stores each word as a point in space, where it is represented by a vector with a fixed number of dimensions

• Unsupervised: built just by reading a huge corpus

• Can be used as features to train a supervised model with only a small subset of annotations

Pipeline: corpus → word embedding → document embedding → classifier → positive/negative (document classification)

Mikolov, Distributed representations of words and phrases and their compositionality


Word2Vec


Limitation of existing NLP methods

Dictionary-based and rule-based NLP systems:

limited coverage and generalizability;

require tremendous manual labor to construct the dictionary or rule files.

Unsupervised feature extraction (BoW and tf-idf)

encodes every word in the vocabulary as a one-hot vector, but the clinical vocabulary may run into millions of words;

vectors corresponding to contextually similar words are orthogonal;

does not consider the order of words in a phrase.

Word embedding techniques

assign random vectors to out-of-vocabulary (OOV) words and morphologically similar words;

not directly suitable for radiology, since synonyms and related words are used widely and many words occur infrequently even in a large corpus.


Intelligent Word Embedding: pipeline


Ontocrawler: generation of domain dictionary

Created an ontology crawler using SPARQL that retrieves the sub-classes and synonyms of domain-specific terms from the NCBO BioPortal.

Generates a focused dictionary for each domain of radiology.

• {‘apoplexy’, ‘contusion’, ‘hematoma’, ...} → ‘hemorrhage’
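A minimal sketch of how such a focused dictionary can be used to normalize report text to canonical domain terms. The dictionary entries below are illustrative, taken from the slide's 'hemorrhage' example; the actual SPARQL crawl of NCBO BioPortal is not shown:

```python
# Illustrative focused dictionary: surface term -> canonical concept.
# (Entries are examples only; the real dictionary is generated by the ontology crawler.)
domain_dict = {
    "apoplexy": "hemorrhage",
    "contusion": "hemorrhage",
    "hematoma": "hemorrhage",
}

def normalize(tokens, dictionary):
    """Replace dictionary synonyms with their canonical term; keep other tokens."""
    return [dictionary.get(t.lower(), t) for t in tokens]

print(normalize("small subdural hematoma noted".split(), domain_dict))
# ['small', 'subdural', 'hemorrhage', 'noted']
```

Mapping synonyms onto one canonical token before training the embedding is what concentrates the corpus statistics for rare clinical terms.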


Page 6

Context-dependent document vector creation

For document vector creation, three pooling strategies were used:

I. Averaging  II. Max pooling  III. Min pooling
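The three pooling strategies each collapse a report's word vectors into one fixed-length document vector. A plain-Python sketch (toy 3-dimensional word vectors, for illustration):

```python
def avg_pool(word_vecs):
    """Component-wise mean of the word vectors."""
    n = len(word_vecs)
    return [sum(dims) / n for dims in zip(*word_vecs)]

def max_pool(word_vecs):
    """Component-wise maximum."""
    return [max(dims) for dims in zip(*word_vecs)]

def min_pool(word_vecs):
    """Component-wise minimum."""
    return [min(dims) for dims in zip(*word_vecs)]

# Toy word vectors for a three-word report.
vecs = [[0.2, -1.0, 0.5],
        [0.4,  0.0, 0.1],
        [0.0,  1.0, 0.9]]

print(avg_pool(vecs))  # ≈ [0.2, 0.0, 0.5]
print(max_pool(vecs))  # [0.4, 1.0, 0.9]
print(min_pool(vecs))  # [0.0, -1.0, 0.1]
```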


IWE for classifying hemorrhage risk: Goals

1. Extract 10,000 CT head reports from Stanford hospital repository.

2. 1,188 reports annotated with hemorrhage risk.

3. Tailor the popular word2vec method to be applicable in medical domain, particularly radiology.

Can be used at scale in the EMR to reduce the requirement for manual annotation by using only a small subset of annotated reports.

Study 1: CT head exam reports

Banerjee I, Madhavan S, Goldman RE, Rubin DL. Intelligent Word Embeddings of Free-Text Radiology Reports. AMIA Annual Symposium 2017, arXiv preprint arXiv:1711.06968. 2017 Nov 19.


Case study

Task: Classification of radiology reports by confidence in the diagnosis of intracranial hemorrhage by the interpreting radiologist.

Dataset:

Unannotated corpus: Captured 10,000 radiology reports of CT Head, CT Angiogram Head, and CT Head Perfusion study.

Gold-standard annotation: a subset of 1,188 of the radiology reports was labeled independently by two radiologists, with high agreement (kappa ≈ 0.98).

Classification labels:

1) No intracranial hemorrhage;

2) Diagnosis of intracranial hemorrhage unlikely, though cannot be completely excluded;

3) Diagnosis of intracranial hemorrhage possible;

4) Diagnosis of intracranial hemorrhage probable, but not definitive;

5) Definite intracranial hemorrhage.


Classification

The aim is to assign a ‘risk’ label to the free-text CT radiology reports while training on the subset of reports with ground-truth labels created by the experts.

Performed experiments using three well-known classifiers (Random Forests, Support Vector Machines, and K-Nearest Neighbors (KNN)) in their default configurations.
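The study uses off-the-shelf implementations of these classifiers. As a self-contained illustration of the classification step, here is a minimal k-nearest-neighbour vote over document vectors (the 2-D vectors and labels are toy placeholders, not the study's data):

```python
import math
from collections import Counter

def knn_predict(train_vecs, train_labels, query, k=3):
    """Label a document vector by majority vote among its k nearest neighbours."""
    dists = sorted(
        (math.dist(v, query), label) for v, label in zip(train_vecs, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D document vectors with hemorrhage-risk labels.
train_vecs = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.15, 0.15)]
train_labels = ["no risk", "no risk", "high risk", "high risk", "no risk"]

print(knn_predict(train_vecs, train_labels, (0.0, 0.0)))  # no risk
print(knn_predict(train_vecs, train_labels, (1.0, 1.0)))  # high risk
```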

AMIA 2017 | amia.org

No risk: 946 reports
Medium risk: 43 reports
High risk: 199 reports

Hyperparameter tuning

We used a grid-search approach to tune the two main hyperparameters of our embedding for the targeted annotation: window size and vector dimension.
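The grid search over the two hyperparameters is just an exhaustive loop; `evaluate` below is a hypothetical stand-in (with a dummy scoring surface) for training the embedding with the given settings and scoring the downstream classifier on a validation set:

```python
import itertools

window_sizes = [2, 4, 6, 8]
vector_dims = [50, 100, 200, 300]

def evaluate(window, dim):
    """Hypothetical scorer: in practice, train word2vec with these settings
    and return the validation F1 of the downstream classifier.
    Here, a dummy surface that peaks at (6, 100)."""
    return -abs(window - 6) - abs(dim - 100) / 100

# Exhaustively score every (window, dim) pair and keep the best.
best = max(itertools.product(window_sizes, vector_dims),
           key=lambda p: evaluate(*p))
print(best)  # (6, 100) for the dummy scorer
```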


Results

Word analogies can be derived by computing the cosine similarity:

cos(u, v) = (u · v) / (|u| |v|) = Σᵢ uᵢvᵢ / (√(Σᵢ uᵢ²) · √(Σᵢ vᵢ²))

Word 1        Word 2                    Similarity
new           recent                    0.941
infarction    acute infarction          0.928
hemorrhage    hemorrhage                0.964
hemorrhage    subarachnoid hemorrhage   0.968

Word 1         Word 2                     Similarity
hemorrhage     NEGEX QUAL hemorrhage      -0.074
large          NEGEX enlarged             -0.245
abnormalities  NEGEX QUAL abnormalities   -0.283
mass effect    NEGEX QUAL mass effect     -0.170
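The similarities in the tables above are plain cosine similarities between word vectors; a minimal implementation (the 3-dimensional vectors below are toy values for illustration, not the learned embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy vectors: related words point the same way, negated terms point away.
hemorrhage       = [0.9, 0.1, 0.3]
subarachnoid_hem = [0.8, 0.2, 0.35]
negex_hemorrhage = [-0.9, -0.1, -0.2]

print(round(cosine(hemorrhage, subarachnoid_hem), 3))  # close to 1
print(round(cosine(hemorrhage, negex_hemorrhage), 3))  # negative
```

The negative similarities in the second table reflect exactly this geometry: NEGEX-tagged (negated) terms land far from their affirmed counterparts.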


Page 7

Word embeddings in two dimensions

Constructed using the t-SNE approach, where each data point represents a word. A total of 4,442 words are visualized in the figure.


Document embeddings in two dimensions

1,188 expert-annotated CT radiology report vectors visualized in two dimensions


Baseline performance with unigrams

Baseline model: bag-of-words with >10,000 words in the vocabulary

Classifier                Precision   Recall    F1-score
Random Forest             87.5%       66.03%    75.26%
KNN (n = 10)              64.79%      80.49%    71.8%
KNN (n = 5)               82.62%      82.36%    75.9%
SVM (radial kernel)       60.52%      77.80%    68.08%
SVM (polynomial kernel)   69.52%      77.80%    68.08%


Comparative performance

1. Out-of-box word2vec (without semantic mapping)
2. Proposed model (with semantic mapping)

                          Out-of-box word2vec               Proposed model
Classifier                Precision   Recall    F1-score    Precision   Recall    F1-score
Random Forest             87.59%      89.17%    87.78%      88.64%      90.42%    89.08%
KNN (n = 10)              86.73%      88.90%    87.47%      88.60%      89.91%    88.88%
KNN (n = 5)               87.52%      88.65%    87.74%      88.54%      89.62%    88.76%
SVM (radial kernel)       63.98%      79.96%    71.07%      64.19%      80.09%    71.25%
SVM (polynomial kernel)   62.40%      78.97%    69.70%      63.25%      79.49%    70.43%


IWE for annotating pulmonary embolism: Goals

1. Extract 100k+ de-identified Thoracic CT free text reports from Stanford hospital repository

2. Used ~900 CT free text reports from University of Pittsburgh Medical Center.

3. Design machine learning algorithms to retrospectively classify PE-CTA imaging reports

4. Compare performance to published state-of-the-art rules based information extraction for PE

Can be used at scale in EMR for: cohort analysis, machine vision, cost-effectiveness / utilization analysis

Study 2: Chest CT image (Stanford & UPMC)


PE study results

Banerjee I, Chen MC, Lungren MP, Rubin DL. Radiology Report Annotation using Intelligent Word Embeddings. Journal of Biomedical Informatics November 2017

Proposed models


Page 8

Results

Benchmarked the performance against the PeFinder and word2vec models.

The IWE model had the lowest generalization error, with the highest F1 scores.

Of particular interest, the IWE model (trained on the Stanford dataset) outperformed PeFinder on the UPMC dataset (which was used to create the PeFinder model).

PeFinder: B.E. Chapman, S. Lee, H.P. Kang, W.W. Chapman, Document-level classification of CT pulmonary angiography reports based on an extension of the ConText algorithm. J. Biomed. Inform., 44 (5) (2011), pp. 728-737


Clustering of the word embedding space using K-means++

Clustered word vector space

Clustered elements
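A compact sketch of the clustering step on 2-D points. For reproducibility this uses a deterministic farthest-point variant of the k-means++ seeding (the real algorithm samples new centers with probability proportional to squared distance), followed by standard Lloyd iterations; the points are toy "word vectors":

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def seed_centers(points, k):
    """Deterministic farthest-point variant of k-means++ seeding."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=10):
    centers = seed_centers(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: sq_dist(p, centers[i]))].append(p)
        # Recompute each center as the mean of its cluster.
        centers = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two well-separated toy blobs.
pts = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (5.0, 5.1), (5.1, 5.0), (4.9, 5.2)]
centers, clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```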


Unsupervised IWE report embeddings – holdout test set

PE positive - Stanford test set (on left) and UPMC dataset (on right)


Unsupervised IWE report embeddings – holdout test set

PE acute - Stanford test set (on left) and UPMC dataset (on right)


ROC curve measures

Stanford dataset UPMC dataset


Results

On Stanford dataset


Page 9

Results

On UPMC dataset


IWE for inferring LI-RADS scores: Goals

1. Extract 200K de-identified ultrasound reports from Stanford hospital repository

2. Used 2000 free text reports from UT Southwestern.

3. Design a scalable computerized approach for large-scale inference of Liver Imaging Reporting and Data System (LI-RADS) final assessment categories in narrative ultrasound (US) reports

4. Infer LI-RADS scoring for unstructured reports that were created before the LI-RADS guidelines were established

Can be used at scale in EMR for: large scale text mining and data gathering opportunities from standard hospital clinical data repositories

Study 3: Liver Ultrasound HCC screening



Liver Ultrasound Cohorts

Stanford data
  Used for testing:
    2007 – 2016, without LI-RADS template: 11,154
    2017, without LI-RADS template: 962
  Used for training and validation:
    2017, with LI-RADS template: LI-RADS 1: 1,589; LI-RADS 2: 93; LI-RADS 3: 62

UT Southwestern data
  Used for testing:
    2017, with LI-RADS template: LI-RADS 1: 1,867; LI-RADS 2: 162; LI-RADS 3: 118


Proposed models

(Pipeline figure: US reports [2007 – 2017] → learning word semantics → LI-RADS vocabulary; annotated US reports formatted with the LI-RADS template → learning LI-RADS coding → trained classifier → infer LI-RADS coding for reports not formatted with the LI-RADS template [2007 – 2016])


Synonyms of LI-RADS terminology derived by the model

Category              LI-RADS lexicon   Synonyms generated
Echogenicity          hyperechoic       hyperechogenic, hyperecho
                      isoechoic         isoecho
                      hypoechoic        hypoechogenicity, hypoechogen, hypoecho
                      cystic            anecho, anechoic
                      nonshadowing      non_shadowing
Doppler vascularity   hypovascular      nonenhancing
                      avascular         nonvascular
                      hypervascular     hypervascularity
Architecture          septation         septat, septations, multicystic, septa, complex_cyst, intern_septation, thin_septation, multispet, reticul, fishnet, multilocul
                      complex           complicated, solid_and_cystic
Morphology            lobulated         bilobe, macrolobulated, microlobulated
                      round             oval, rounded, ovoid, oblong
                      ill-defined       vague, indistinct
                      exophytic         bulge
                      well_defined      well_circumcribed, marginated


Ensemble classifier

The ensemble classifier is a weighted combination of:

1. Section embedding classifier: takes as input the vector representation of the liver section recorded in the US exam report

2. Lesion measure classifier: takes as input two quantitative lesion measures:
   1. Number of lesions present in the liver
   2. Long axis length of the largest lesion

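A minimal sketch of such a weighted combination in plain Python. The weights and the per-class probabilities below are illustrative placeholders, not the study's values:

```python
def ensemble(p_section, p_lesion, w_section=0.7, w_lesion=0.3):
    """Weighted average of the two classifiers' per-class probabilities,
    renormalized so the result is again a distribution."""
    combined = [w_section * a + w_lesion * b for a, b in zip(p_section, p_lesion)]
    total = sum(combined)
    return [p / total for p in combined]

# Per-class probabilities for LI-RADS 1 / 2 / 3 (illustrative values).
p_section = [0.60, 0.25, 0.15]   # from the section-embedding classifier
p_lesion  = [0.20, 0.30, 0.50]   # from the lesion-measure classifier

probs = ensemble(p_section, p_lesion)
predicted = max(range(3), key=lambda i: probs[i]) + 1
print(predicted)  # 1 (LI-RADS category with the highest combined probability)
```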


Page 10

Validation on LI-RADS-formatted reports

                    Machine-inferred annotation        Human annotation
                    BOW classifier   Proposed model    Rater 1   Rater 2
Average precision   0.59             0.93              0.92      0.93
Average recall      0.49             0.88              0.92      0.92
Average F1 score    0.52             0.90              0.92      0.92

The LI-RADS category reported by the original interpreting radiologist served as the true label. Performance of the human raters and the proposed model on the validation set (147 reports).


Disagreement between the original US image reader and the raters

(Each case: report text; original coding; machine-derived probability for LI-RADS 1 / 2 / 3; imputed label; Rater 1; Rater 2; reason)

1. “liver length: 14.2 cm. liver appearance: mild steatosis. segment 5 lesion, previously characterized as a hemangioma on february 2014 mr now measures 1.0 x 1.0 x 0.9 cm, previously 0.7 x 0.6. previously seen optn class 5a lesion on february 2014 mr is not well seen on ultrasound. a small right hepatic lobe cyst measures 7 x 6 x 7 mm, previously 10 x 9 x 9 mm. no new hepatic lesions.”
   Original: 3; probabilities: 0.56 / 0.06 / 0.38; imputed: 1; Rater 1: 1; Rater 2: 1. Reason: previously characterized as hemangioma, therefore should be categorized as benign.

2. “liver length: 17.6 cm. liver appearance: normal. liver observations: 0.6 x 1.1 x 1.3 cm hyperechoic focus in the right liver with minimal flow likely representing a hemangioma or focal fat. liver doppler: hepatic veins: patent with normal triphasic waveforms.”
   Original: 1; probabilities: 0.55 / 0.06 / 0.4; imputed: 1; Rater 1: 2; Rater 2: 3. Reason: includes hemangioma or fat (both benign), but this is not definite and needs characterization.

3. “liver length: 12.1 cm. liver appearance: mild steatosis. hypoechoic left hepatic lobe lesion measures 1.2 x 0.5 x 0.7 cm, decreased from 3/8/2017 ct and not significantly changed from more recent pet/ct.”
   Original: 1; probabilities: 0.45 / 0.10 / 0.44; imputed: 1; Rater 1: 3; Rater 2: 3. Reason: observation is stable from prior imaging, but not definitely benign.

4. “liver length: 16 cm. liver appearance: severe steatosis. no surface nodularity. liver observations: 1.7 x 1.4 x 1.0 cm hypoechoic focus in the gallbladder fossa likely reflects focal fatty sparing. liver doppler: hepatic veins: patent with normal triphasic waveforms.”
   Original: 3; probabilities: 0.43 / 0.17 / 0.40; imputed: 3; Rater 1: 1; Rater 2: 1. Reason: fatty sparing is benign.

The class probability values computed by the model show that the model is deciding between the reader’s label and the raters’ observation.


Test on non-LIRADS formatted reports

HCC screening ultrasound reports formatted without the LI-RADS template (2007 – 2016) [11,154 exams]

No LI-RADS scoring available from the US image reader

Asked raters to annotate 216 reports where the model’s highest predicted probability was either <0.5 (152 reports) or >0.9 (64 reports).

(Figures: confusion matrix on the 216 reports; prediction confidence)


Test on UT Southwestern data

• Applied to data from a different institution without retraining

• Tested on 2,381 reports coded with the LI-RADS template

Machine-inferred annotation (proposed model):
Average precision: 0.89
Average recall: 0.84
Average F1 score: 0.85

LI-RADS 2 is predicted as LI-RADS 3


Longitudinal patient tracking: HCC screening

(Figure: longitudinal LI-RADS scores, 1/1/09 – 1/1/15; Patient 1 [2007 – 2016] and Patient 2 [2009 – 2015])


Outline

1. Need for image interpretation beyond image classification

2. Overcoming the challenge of insufficient training data

3. Integrating multiple data types with images

4. Making AI clinical predictions and providing visualizations for explanation

Page 11

Probabilistic Prognostic Estimates of Survival in Metastatic Cancer Patients Utilizing Free-Text Clinical Narratives

In the United States alone, around 500,000 patients develop metastatic cancer every year.

Several studies have shown overutilization of aggressive medical interventions and protracted radiation treatment in patients close to the end of life.

Inability to accurately estimate patient life expectancy likely explains why physicians tend to choose overly aggressive treatments for some patients.

This leads to increased morbidity and healthcare costs, while other patients may be under-treated and denied access to effective treatments that could reduce symptoms or even extend survival.

A robust ML model that predicts patient survival would have major impact on the quality of care and quality of life in metastatic cancer patients.

Banerjee I, Gensheimer MF, Wood DJ, Henry S, Chang D, Rubin DL. Probabilistic Prognostic Estimates of Survival in Metastatic Cancer Patients (PPES-Met) Utilizing Free-Text Clinical Narratives. AMIA Informatics 2018, arXiv preprint arXiv:1801.03058. 2018 Jan 9.


The number of natural language processing (NLP)-related articles compared to the number of electronic health record (EHR) articles from 2002 through 2015

Yanshan Wang et al., Clinical information extraction applications: A literature review, JBI 2018

Under-utilization of NLP in EHR-based research


Objective

Created a dynamic model that takes as input a sequence of free-text clinical visit narratives ordered according to the date of visits.

Computes as output a probability of survival beyond 3 months for each visit, considering the current and all historic time points.

Input: unstructured visit notes (visit note t1 … visit note tn), ordered by time stamp

Model: analyzes current and historic visit data

Output: probability of survival (score t1 … score tn)


Major challenges

How to create machine-intelligible dense vector representations of unstructured clinical notes?

How to model irregular time gaps between the longitudinal clinical events?

How to infer human interpretable justification of prediction while using longitudinal data?


Dataset used in the study

Metastatic cancer database (MetDB):
  No. of patients: 13,523
  Age: 61.5 (IQR 51.2 – 70.5)
  Sex: M 6,621 (49%); F 6,902 (51%)
  Primary site: Breast 1,493 (11.0%); Endocrine 211 (1.6%); Gastrointestinal 3,575 (26.4%); Genitourinary 1,504 (11.1%); Gynecologic 849 (6.3%); Head and neck 506 (3.7%); Skin 453 (3.3%); Thorax 2,178 (16.1%); Other/Multiple/Unknown 2,754 (20.4%)

Palliative radiation dataset (PrDB):
  No. of patients: 899
  Age: 65.0 (IQR 55.8 – 72.2)
  Sex: M 460 (51.1%); F 439 (48.9%)
  Primary site: Breast 141 (15.7%); Endocrine 0 (0%); Gastrointestinal 145 (16.1%); Genitourinary 112 (12.5%); Gynecologic 50 (5.6%); Head and neck 57 (6.3%); Skin 122 (13.6%); Thorax 252 (28.0%); Other/Multiple/Unknown 20 (2.2%)

Note types: oncology notes, progress notes, radiology reports, discharge summaries, nursing notes, critical care notes

(Figure: distribution of visits)


Survival data - challenges

Patient 1: dense follow-up (multiple visits on the same day)

Patient 2: minimal information (only 3 days)

Patient 3: sparse follow-up (long and variable gaps between visits)

Patient 4: no death info, long follow-up


Page 12

PPES-Met model

Unsupervised embedding of free-text notes → many-to-many RNN model


RNN model with time-distributed weighted loss

Training and Evaluation

Category 1: “Survival – positive” stands for survival up to 3 months starting from the current visit date;

Category 2: “Survival – negative” flagged non-survival;

Category 3: “Zero padding” padded each input sequence when it is shorter than 1000, and the historic visits were truncated when the sequence is longer than 1000
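The padding/truncation step can be sketched in plain Python: sequences shorter than the maximum length (1000 in the study) are padded with zero vectors at the front (a common convention for RNN inputs; the deck does not specify front vs. back), and longer sequences drop the oldest visits. The vector dimension here is an illustrative placeholder:

```python
def pad_or_truncate(visits, max_len=1000, dim=3):
    """visits: list of per-visit note vectors, ordered oldest first."""
    if len(visits) > max_len:
        return visits[-max_len:]           # keep only the most recent visits
    pad = [[0.0] * dim] * (max_len - len(visits))
    return pad + visits                    # front-pad with zero vectors

seq = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
padded = pad_or_truncate(seq, max_len=4, dim=3)
print(len(padded))   # 4
print(padded[0])     # [0.0, 0.0, 0.0]  (padding)
print(padded[-1])    # [0.4, 0.5, 0.6]  (most recent visit)
```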

Model training and validation on MetDB: training, 10,293 patients; validation, 1,938 patients

Test: 1,818 patients (899 from PrDB + 919 randomly selected from MetDB)

Model evaluation: dual strategy

1. Quantitative: measure the overall prognosis estimation accuracy using standard statistical metrics

2. Qualitative: evaluate patient-level performance and perform error analysis with an intelligible longitudinal graph summary for understanding the basis of prediction


Results: Quantitative Evaluation on PrDB

Overall ROC AUC for predicting 3-month survival: 0.89 (confidence interval [0.884 – 0.897])

Tested on 1,818 patients with multiple visits


Results: Quantitative Evaluation on PrDB

ROC curves by primary site

Comparison with systemic therapy: shows the model’s prediction outperformed the oncologist’s expectation of survival and can contribute to treatment planning

Tested on 1,818 patients with different primary sites


Results: Qualitative Evaluation on PrDB


Patient 1 Patient 2

Patient-level performance analysis


Page 13

Results: Qualitative Evaluation on PrDB


Patient 3 Patient 4

Patient-level performance analysis


Hover & discover


Intelligible longitudinal survival curve of a patient


Prediction test with 30% polluted visit notes at the end of the sequence


Patient 5 Patient 6


Prediction test with 30% polluted visit notes at the end of the sequence


Patient 9 Patient 10


Future work

Extend our AI framework by integrating imaging and non-imaging multi-source data for predictions of future hospitalization and ER visits.

Integrate semantic data mining and deep learning analysis for combining structured and unstructured clinical data


Future vision


Conclusions

• There are important medical tasks for deep learning beyond classification

• Much informative medical data is longitudinal electronic patient record data

• Word embeddings are a powerful technique for:
  • Information extraction
  • Image annotation
  • Generating features for prediction models

• Text data can be integrated with image data for prediction models

• Prediction models leveraging longitudinal patient data are promising

Page 14

Thank you.

Contact info: [email protected]

[email protected]

