Automatic Classification of Pathology Reports into SNOMED Codes

June 2008

By Weihang ZHANG (MIT)

Supervisors: Prof. Jon PATRICK, Dr. Irena KOPRINSKA

INTRODUCTION

Motivation
• SWAPS clinical notes: 400K pathology texts

Context
• Text Categorization (TC)
• SNOMED
• Project TTSCT

INTRODUCTION – Motivation

• SWAPS: South West Area Pathology Service
• Natural-language medical records: 400K pathology texts
• The texts contain a great deal of formal terminology, but it is used in an informal and haphazard way.
• Medical records need to be converted to the formal terminology:
  • to enable accurate retrieval
  • to compile aggregated statistics of medical care

INTRODUCTION – Context

Text Categorization, definition:

Given a collection of documents D = {d1, d2, ..., dn} and a pre-defined category set C = {c1, c2, ..., cm}, assign a True or False value to each pair <di, cj> ∈ D × C [Sebastiani F., 2002].

That is, Text Categorization (TC) assigns meaningful categories to text:
• Topics (politics, sports, entertainment, etc.)
• Opinions (negative, neutral, positive)
• Spam, child safety, scams

A successful project: ScamSeek Project
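The definition can be read as filling in a Boolean decision matrix over D × C. The sketch below is a toy illustration only: the documents, the categories, and the keyword rule `toy_decide` are hypothetical stand-ins for a trained classifier and do not come from the project.

```python
from itertools import product

# Hypothetical toy data: three documents and two categories.
documents = {
    "d1": "the team won the match",
    "d2": "parliament passed the bill",
    "d3": "the minister attended the match",
}
categories = {
    "sports": {"match", "team"},
    "politics": {"parliament", "minister", "bill"},
}


def toy_decide(text, keywords):
    """Stand-in for a classifier: True if the text shares a word with the category."""
    return bool(set(text.split()) & keywords)


# Assign True or False to every pair <di, cj> in D x C.
decisions = {
    (doc_id, cat): toy_decide(text, keywords)
    for (doc_id, text), (cat, keywords) in product(documents.items(), categories.items())
}
```

A document may be True for several categories at once (here `d3`), which matches the multi-code setting of pathology reports.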

INTRODUCTION – Context (Cont’)

SNOMED: Systematized Nomenclature of Medicine

Concepts: the basic unit of meaning, designated by a unique numeric code, a unique name (Fully Specified Name), and descriptions, including a preferred term and one or more synonyms.

INTRODUCTION – Context (Cont’)

TTSCT: Text to SNOMED CT*
• A system which automatically maps free text into a medical reference terminology
• NLP*-technique enhanced lexical token matcher
• Qualifier identifier
• Negation identifier

[J. Patrick et al., 2006]

*CT – Clinical Terms

*NLP – Natural Language Processing

OBJECTIVES

• Explore an effective information retrieval mechanism for medical notes classification

• Evaluate the performance of classifiers with TTSCT support

• Develop a SNOMED auto-coding system which helps clinicians with research and decision making

RESEARCH METHOD

• Data Inspection
• System Design
• Classification Evaluation
• Machine Learners Comparison
• Feature Selection Methods Comparison
• Indexing Methods
• Stemming Strategy
• Dimension Reduction
• N-Gram
• Text Subsection

Data Inspection – The Pathology Text

• 400K pathology texts from the SWAPS Anatomical Pathology Database

• A set of diagnoses for each report (pathology text), presented as SNOMED codes

<Title>CLINICAL HISTORY</Title>

Biopsy of discoid erythematosus like lesion from right cheek ? DLE.

<Title>MACROSCOPIC</Title>

LABELLED `RIGHT CHEEK LESION'. An ellipse 12 x 3mm with subcutis to 3mm. A poorly defined pale nodular lesion 3 x 3mm. It appears to abut the surgical margin. Representative sections embeded, A tips face on, B lesion and surgical margin. (MR 17/4)<DOT>TA</DOT>

<Title>MICROSCOPIC</Title>

Section shows hyperkeratosis with occasional follicular plugging, epidermal atrophy and severe sundamage to dermal collagen. A dense chronic inflammatory cell infiltrate, both superficial and deep is present, mainly in a perivascular and periadnexal distribution. No liquefaction degeneration of the basal layer, no dermal oedema and no interface dermatitis are seen. PAS stain reveals no thickening of the epidermal basement membrane and only an occasional fungal spore on the skin surface.

Immunofluorescence for immunoglobulins and complement fractions are negative.

The differential diagnosis rests between chronic discoid erythematosus, lymphocytic infiltration of skin of Jessner and the plaque type of polymorphous light eruption. The presence of marked solar damage to collagen, the absence of basal liquefaction degeneration and the negative immunofluorescence favours polymorphous light eruption. A reaction to drugs or an insect bite is also a possibility. No evidence of malignancy.

Reported 24/4/98

Data Inspection – SNOMED Codes Distribution

• 10K pathology texts were selected uniformly at random from the 400K texts in the database

• 867 distinct codes occurred, and 30K codes were assigned to the 10K texts

• The 9 codes with the highest frequency were selected for the experiments

• All the remaining codes are treated as “others”
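The selection of the most frequent codes, with the rest grouped as “others”, can be sketched as follows. This is a minimal illustration, assuming each report carries a list of code strings; the function names and the placeholder codes in the usage note are hypothetical, not the project's own.

```python
from collections import Counter


def select_top_codes(report_codes, k=9):
    """Return the k codes that occur most often across all reports."""
    counts = Counter(code for codes in report_codes for code in codes)
    return {code for code, _ in counts.most_common(k)}


def relabel(codes, top_codes, other_label="others"):
    """Keep codes that are in the top set; collapse every other code into one label."""
    return sorted({c if c in top_codes else other_label for c in codes})
```

For example, with the placeholder reports `[["A", "B"], ["A", "C"], ["A", "B", "D"]]` and `k=2`, the top set is `{"A", "B"}`, and a report coded `["A", "C"]` is relabelled to one kept code plus "others".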

System Design – TC Work Flow

Read Document

Text Tokenization

Lexical Verification

Remove Stopwords

Stemming

Vector Representation of Text (Indexing)

Feature Filtering to Reduce Dimensionality

Machine Learning (Classifiers)

SNOMED Code
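The preprocessing half of this work flow can be sketched with the standard library alone. Everything here is an assumption made for illustration: the stopword list is a tiny sample, `toy_stem` is a crude suffix stripper rather than a real stemmer, lexical verification is reduced to keeping alphabetic tokens, and feature filtering uses a simple document-frequency threshold.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption, not the project's list).
STOPWORDS = {"a", "an", "and", "of", "the", "is", "to", "with", "in"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only (a crude lexical check)."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(token):
    """Very rough suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def to_vector(text, stem=True):
    """Tokenize, drop stopwords, optionally stem, and count term frequencies."""
    tokens = [t for t in tokenize(text) if t not in STOPWORDS]
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    return Counter(tokens)


def filter_features(vectors, min_doc_freq=2):
    """Keep only terms that appear in at least min_doc_freq documents."""
    doc_freq = Counter(term for vec in vectors for term in vec)
    keep = {t for t, df in doc_freq.items() if df >= min_doc_freq}
    return [{t: n for t, n in vec.items() if t in keep} for vec in vectors]
```

The filtered vectors would then be handed to the machine learners; the `stem` flag makes it easy to compare runs with and without stemming.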

System Design – Product Codes Assembling

• Assembly line: text-to-vector conversion, then delivery of the text vector
• Classifiers, the workers: all classifiers work together, each assigning its classification result to the vector
• Product: a collection of codes

New text → classify:

Classifier_1 → CODE_1: Positive

Classifier_2 → CODE_2: Negative

Classifier_3 → CODE_3: Negative

Classifier_K → CODE_K: Positive
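One way to picture this assembly is one binary classifier per code, all applied to the same vector, with the positive codes collected as the product. The sketch below assumes that arrangement; `keyword_classifier` is a hypothetical stand-in for a trained model, and the keywords are invented for the example.

```python
def keyword_classifier(keyword):
    """Hypothetical stand-in for a trained binary model (e.g. an SVM)."""
    return lambda vector: vector.get(keyword, 0) > 0


def assign_codes(vector, classifiers):
    """Run every binary classifier on the same vector and collect the positive codes."""
    return [code for code, clf in classifiers.items() if clf(vector)]


# One classifier per code; the keywords are illustrative only.
classifiers = {
    "CODE_1": keyword_classifier("lesion"),
    "CODE_2": keyword_classifier("carcinoma"),
    "CODE_3": keyword_classifier("fracture"),
    "CODE_K": keyword_classifier("skin"),
}

new_text_vector = {"lesion": 2, "skin": 1}
codes = assign_codes(new_text_vector, classifiers)
```

Here `codes` holds CODE_1 and CODE_K, mirroring the positive and negative outcomes shown on the slide.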

EXPERIMENTS

• Machine Learners: SVM-Light, MaxEnt, J48 (Weka-DT)
• Indexing Methods: Boolean Weight, Word Frequency, Entropy Weight
• Stemming: all words, none of the words
• N-Gram: Unigram, Bigram, Trigram
• Dimension Reduction:
  • Frequency threshold: >= 1, >= 4
  • Information Gain: top 100, 200, 500, 1000, 2000, 4000
• TTSCT ConceptID integration (Negation):
  • Keep text and add Concepts
  • Replace concept words with Concepts
  • Only Concepts, no text at all
• Text Subsection Hiding: <Clinical History>, <Microscopic>, <Macroscopic>
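Two of these settings, N-gram extraction and Information Gain ranking, are easy to illustrate. The sketch below assumes Boolean term features and a binary label (one code versus the rest) and uses the textbook definition of information gain; it is not taken from the project's code, and all function names are hypothetical.

```python
import math


def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def entropy(pos, neg):
    """Binary entropy (in bits) of a split with pos positive and neg negative items."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(docs, labels, term):
    """IG of the Boolean feature 'term occurs in doc'.

    docs is a list of term sets; labels is a parallel list of booleans.
    """
    n = len(docs)
    pos = sum(labels)
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without = [lab for doc, lab in zip(docs, labels) if term not in doc]
    gain = entropy(pos, n - pos)
    for subset in (with_term, without):
        if subset:
            sp = sum(subset)
            gain -= len(subset) / n * entropy(sp, len(subset) - sp)
    return gain


def top_terms(docs, labels, k):
    """Rank the vocabulary by information gain (ties broken alphabetically)."""
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))[:k]
```

Calling `top_terms(docs, labels, k)` with k set to 100, 200, and so on would reproduce the "top k" cut-offs listed above, one binary label set per code.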

RESULTS & DISCUSSION

Measurement for classification:

Recall:

Precision:

F-measure (F-value):

Standard Deviation:

Measurement for system performance:

Micro F:
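For reference, these measures can be computed from true-positive, false-positive, and false-negative counts. The sketch below assumes the standard definitions, with the balanced F-measure (F1) and a micro-averaged F that pools the counts of all codes before computing F. The function names are hypothetical, and the slide may have used a different weighting.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts; 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f = 2 * precision * recall / (precision + recall)
    else:
        f = 0.0
    return precision, recall, f


def micro_f(counts):
    """Micro-averaged F1: counts is a list of (tp, fp, fn) tuples, one per code."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return precision_recall_f(tp, fp, fn)[2]
```

Because the counts are pooled, the micro-averaged score is dominated by the frequent codes, which suits a summary of overall system performance.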

EXPERIMENT

FUTURE WORK

CONCLUSIONS

• Classification without word stemming performed better than classification with stemming

• N-grams raised the rate of correct classification as N increased (Trigram > Bigram > Unigram)

• TTSCT did enhance classification performance

• Hiding misleading parts of the text did raise the F-score