1
Automatic Classification of Pathology Reports into SNOMED Codes
June 2008
By Weihang ZHANG (MIT)
Supervisors: Prof. Jon PATRICK
Dr. Irena KOPRINSKA
2
INTRODUCTION
• Motivation
  • SWAPS
  • Clinical Notes - 400K Pathology Texts
• Context
  • Text Categorization (TC)
  • SNOMED
  • Project TTSCT
3
INTRODUCTION – Motivation
• SWAPS - South West Area Pathology Service
• Natural language medical records – 400K pathology texts
  • contain a great deal of formal terminology, but used in an informal and haphazard way
• Medical records need to be converted to the formal terminology:
  • to enable accurate retrieval
  • to compile aggregated statistics of the medical care
4
INTRODUCTION – Context
Text Categorization
• Definition: Given a collection of documents D = {d1, d2, ..., dn} and a pre-defined category set C = {c1, c2, ..., cm}, assign a True or False value to each pair <di, cj> ∈ D × C [Sebastiani F., 2002]
• i.e. Text Categorization (TC) assigns meaningful categories to text
  • Topics (politics, sports, entertainment, etc.)
  • Opinions (negative, neutral, positive)
  • Spam, child safety, scams
• A successful project: ScamSeek Project
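The definition can be read as filling in a Boolean table over D × C. A minimal sketch, assuming toy documents and a keyword rule in place of a trained classifier (the data, the `KEYWORDS` table and the `belongs` helper are all hypothetical and only illustrate the shape of the task):

```python
from itertools import product

# Hypothetical toy data: a document collection D and a category set C.
documents = {
    "d1": "the team won the football match",
    "d2": "parliament passed the new budget",
}
categories = ["sports", "politics"]

# Hypothetical keyword rules standing in for a trained classifier.
KEYWORDS = {
    "sports": {"football", "match", "team"},
    "politics": {"parliament", "budget", "election"},
}


def belongs(text: str, category: str) -> bool:
    """Return True if any keyword of the category occurs in the text."""
    return bool(KEYWORDS[category] & set(text.split()))


# Assign True or False to every pair <di, cj> in D x C.
assignment = {
    (doc_id, category): belongs(text, category)
    for (doc_id, text), category in product(documents.items(), categories)
}

for pair, value in assignment.items():
    print(pair, value)
```

Note that a document may receive True for several categories, or for none, which is the situation for pathology reports carrying several SNOMED codes.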
5
INTRODUCTION – Context (Cont'd)
SNOMED - Systematized Nomenclature of Medicine
• Concepts: the basic unit of meaning, designated by a unique numeric code, a unique name (Fully Specified Name), and descriptions, including a preferred term and one or more synonyms.
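As a data structure, a concept could be held in a small record like the one below. This is only a sketch: the class name `Concept`, its field names, and the example values (including the code `0000000`) are hypothetical placeholders, not real SNOMED content.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Concept:
    """One SNOMED concept: a code, a unique name, and its descriptions."""

    code: str                      # unique numeric code, kept as a string
    fully_specified_name: str      # unique name
    preferred_term: str            # one preferred description
    synonyms: List[str] = field(default_factory=list)

    def descriptions(self) -> List[str]:
        """Return the preferred term followed by the synonyms."""
        return [self.preferred_term, *self.synonyms]


# Placeholder values for illustration only (not a real SNOMED entry).
example = Concept(
    code="0000000",
    fully_specified_name="Example finding (finding)",
    preferred_term="Example finding",
    synonyms=["Sample finding"],
)
print(example.descriptions())
```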
6
INTRODUCTION – Context (Cont'd)
TTSCT - Text to SNOMED CT*
• A system which automatically maps free text into a medical reference terminology
• NLP*-technique enhanced lexical token matcher
• Qualifier identifier
• Negation identifier
[J. Patrick et al., 2006]
*CT – Clinical Terms
*NLP - Natural Language Processing
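To give a feel for what a lexical matcher with a negation identifier does, here is a toy sketch. It is not TTSCT's implementation: the `LEXICON` entries, the placeholder identifiers, the cue list and the three-token window are all assumptions chosen for illustration.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical mini-lexicon: term -> placeholder concept identifier.
LEXICON: Dict[str, str] = {
    "malignancy": "CONCEPT_A",
    "hyperkeratosis": "CONCEPT_B",
}
# Assumed negation cues and look-back window (in tokens).
NEGATION_CUES = {"no", "not", "without"}
WINDOW = 3


def match_concepts(text: str) -> List[Tuple[str, bool]]:
    """Return (concept id, negated?) for each lexicon term found in the text."""
    results = []
    # Split into sentences so a negation cue never crosses a sentence boundary.
    for sentence in re.split(r"[.!?]", text.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        for i, token in enumerate(tokens):
            if token in LEXICON:
                preceding = tokens[max(0, i - WINDOW):i]
                negated = any(t in NEGATION_CUES for t in preceding)
                results.append((LEXICON[token], negated))
    return results


matches = match_concepts("Section shows hyperkeratosis. No evidence of malignancy.")
print(matches)
```

In this toy version "No evidence of malignancy" yields a negated match, so a downstream classifier can treat the concept differently from an affirmed one.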
7
OBJECTIVES
• Explore an effective information retrieval mechanism for medical notes classification
• Evaluate the performance of classifiers with TTSCT support
• Develop a SNOMED auto-coding system which helps clinicians in research and decision making
8
RESEARCH METHOD
• Data Inspection
• System Design
• Classification Evaluation
  • Machine Learners Comparison
  • Feature Selection Methods Comparison
    • Indexing Methods
    • Stemming Strategy
    • Dimension Reduction
    • N-Gram
    • Text Subsection
9
Data Inspection – The Pathology Text
• 400K pathology texts from the SWAPS Anatomical Pathology Database
• A set of diagnoses for each report (pathology text), presented as SNOMED codes
<Title>CLINICAL HISTORY</Title>
Biopsy of discoid erythematosus like lesion from right cheek ? DLE.
<Title>MACROSCOPIC</Title>
LABELLED `RIGHT CHEEK LESION'. An ellipse 12 x 3mm with subcutis to 3mm. A poorly defined pale nodular lesion 3 x 3mm. It appears to abut the surgical margin. Representative sections embeded, A tips face on, B lesion and surgical margin. (MR 17/4)<DOT>TA</DOT>
<Title>MICROSCOPIC</Title>
Section shows hyperkeratosis with occasional follicular plugging, epidermal atrophy and severe sundamage to dermal collagen. A dense chronic inflammatory cell infiltrate, both superficial and deep is present, mainly in a perivascular and periadnexal distribution. No liquefaction degeneration of the basal layer, no dermal oedema and no interface dermatitis are seen. PAS stain reveals no thickening of the epidermal basement membrane and only an occasional fungal spore on the skin surface.
Immunofluorescence for immunoglobulins and complement fractions are negative.
The differential diagnosis rests between chronic discoid erythematosus, lymphocytic infiltration of skin of Jessner and the plaque type of polymorphous light eruption. The presence of marked solar damage to collagen, the absence of basal liquefaction degeneration and the negative immunofluorescence favours polymorphous light eruption. A reaction to drugs or an insect bite is also a possibility. No evidence of malignancy.
Reported 24/4/98
10
Data Inspection – SNOMED Codes Distribution
• 10K pathology texts were selected uniformly at random from the 400K texts in the database
• 867 distinct codes occurred, and 30K codes were assigned across the 10K texts
• The 9 codes with the highest frequency are selected for the experiments
• All remaining codes are treated as "others"
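The sampling and relabelling steps could look roughly like this. A sketch only: the function names, the fixed random seed and the in-memory dictionary of codes per report are assumptions, not the project's actual tooling.

```python
import random
from collections import Counter
from typing import Dict, List


def sample_reports(report_ids: List[str], k: int, seed: int = 0) -> List[str]:
    """Draw k report ids uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(report_ids, k)


def relabel_top_codes(
    codes_per_report: Dict[str, List[str]], top_n: int = 9
) -> Dict[str, List[str]]:
    """Keep the top_n most frequent codes; map every other code to 'others'."""
    counts = Counter(
        code for codes in codes_per_report.values() for code in codes
    )
    kept = {code for code, _ in counts.most_common(top_n)}
    relabelled = {}
    for report_id, codes in codes_per_report.items():
        labels: List[str] = []
        for code in codes:
            label = code if code in kept else "others"
            if label not in labels:  # avoid repeating "others" per report
                labels.append(label)
        relabelled[report_id] = labels
    return relabelled


# Toy usage with made-up code names and top_n=2 instead of 9.
toy = {"r1": ["A", "B"], "r2": ["A", "C"], "r3": ["A", "B", "D"]}
print(relabel_top_codes(toy, top_n=2))
```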
11
System Design – TC Work Flow
Read Document
Text Tokenization
Lexical Verification
Remove Stopwords
Stemming
Vector Representation of Text (Indexing)
Feature Filtering to Reduce Dimensionality
Machine Learning (classifiers)
SNOMED Code
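The front half of this work flow (tokenization through vector representation) can be sketched in a few lines. The stopword list and the suffix-stripping `stem` function below are deliberately crude stand-ins chosen for this example; the lexical verification step is skipped, and the stemmer actually used in the project is not specified here.

```python
import re
from collections import Counter
from typing import Dict, List

# Assumed, tiny stopword list for illustration.
STOPWORDS = {"a", "an", "and", "the", "of", "is", "to", "with", "in"}
# Crude suffix list standing in for a real stemmer (an assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens: List[str]) -> List[str]:
    return [t for t in tokens if t not in STOPWORDS]


def stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def to_vector(text: str) -> Dict[str, int]:
    """Map a text to a sparse term-frequency vector."""
    tokens = remove_stopwords(tokenize(text))
    return dict(Counter(stem(t) for t in tokens))


print(to_vector("The lesions and the lesion"))
```

The resulting sparse vector is what the feature filtering and machine learning stages would consume.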
12
System Design – Product Codes Assembling
• Assembly line
  • Text-to-Vector conversion
  • Text Vector delivery
• Classifiers – the workers
  • All classifiers work together, each assigning its classification result to the vector
• Product – a collection of codes
[Diagram: New Text → classify → Classifier_1, Classifier_2, Object_3, ..., Object_K → CODE_1: Positive, CODE_2: Negative, CODE_3: Negative, ..., CODE_K: Positive]
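The assembly-line idea amounts to running one binary classifier per code over the same vector and collecting the codes that come back positive. A minimal sketch, assuming each classifier is a plain callable; the lambda rules below are hypothetical stand-ins for trained models.

```python
from typing import Callable, Dict, List

Vector = Dict[str, int]
Classifier = Callable[[Vector], bool]


def assemble_codes(vector: Vector, classifiers: Dict[str, Classifier]) -> List[str]:
    """Ask every per-code classifier and keep the codes judged positive."""
    return [code for code, classify in classifiers.items() if classify(vector)]


# Hypothetical rule-based stand-ins for trained binary classifiers.
classifiers: Dict[str, Classifier] = {
    "CODE_1": lambda v: v.get("lesion", 0) > 0,
    "CODE_2": lambda v: v.get("fracture", 0) > 0,
    "CODE_3": lambda v: v.get("inflammatory", 0) > 0,
}

codes = assemble_codes({"lesion": 2, "inflammatory": 1}, classifiers)
print(codes)
```

Because each code has its own classifier, a report can end up with any number of codes, including none.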
13
EXPERIMENTS
• Machine Learners: SVM-Light, MaxEnt, J48 (Weka-DT)
• Indexing Methods: Boolean Weight, Word Frequency, Entropy Weight
• Stemming: all words, no words
• N-Gram: Unigram, Bigram, Trigram
• Dimension Reduction:
  • Frequency threshold: >= 1, >= 4
  • Information Gain: top 100, 200, 500, 1000, 2000, 4000
• TTSCT ConceptID integration (Negation):
  • Keep text and add Concepts
  • Replace concept words with Concepts
  • Only Concepts, no text at all
• Text Subsection Hiding: <Clinical History>, <Microscopic>, <Macroscopic>
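Several of these settings are easy to illustrate. The sketch below covers N-gram extraction, Boolean and word-frequency weights, and information-gain ranking for a single binary code. It assumes the textbook definition of information gain over term presence; entropy weighting is left out because its exact formula is not given here. All function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams (n=1 unigram, 2 bigram, 3 trigram)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequency_weights(tokens: Sequence[str]) -> Dict[str, int]:
    """Word-frequency indexing: weight = number of occurrences."""
    return dict(Counter(tokens))


def boolean_weights(tokens: Sequence[str]) -> Dict[str, int]:
    """Boolean indexing: weight = 1 if the term occurs at all."""
    return {t: 1 for t in set(tokens)}


def entropy(pos: int, neg: int) -> float:
    """Binary entropy in bits of a group with pos/neg members."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(docs: List[Set[str]], labels: List[bool], term: str) -> float:
    """Entropy of the labels minus the entropy after splitting on the term."""
    with_term = [lab for d, lab in zip(docs, labels) if term in d]
    without = [lab for d, lab in zip(docs, labels) if term not in d]

    def h(group: List[bool]) -> float:
        pos = sum(group)
        return entropy(pos, len(group) - pos)

    n = len(labels)
    return h(labels) - len(with_term) / n * h(with_term) - len(without) / n * h(without)


def top_k_terms(docs: List[Set[str]], labels: List[bool], k: int) -> List[str]:
    """Keep the k terms with the highest information gain (ties by name)."""
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))[:k]


# Toy usage: four documents as term sets, with labels for one code.
docs = [{"tumour", "cell"}, {"tumour"}, {"cell"}, {"skin"}]
labels = [True, True, False, False]
print(ngrams(["no", "evidence", "of", "malignancy"], 2))
print(top_k_terms(docs, labels, 2))
```

With one classifier per code, the ranking would be computed separately for each code's positive and negative labels.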
14
RESULTS & DISCUSSION
Measurement for classification:
• Recall
• Precision
• F-measure (F-value)
• Standard Deviation
Measurement for system performance:
• Micro F
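A minimal sketch of these measures, assuming the standard definitions: the F-measure is taken to be the balanced F1, Micro F pools the true positive, false positive and false negative counts over all codes before computing F, and the standard deviation is the sample standard deviation over a list of F-values. Which exact variants were used in the experiments is not stated here, and the counts in the usage lines are made up.

```python
from statistics import stdev
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def precision(tp: int, fp: int) -> float:
    """Share of assigned codes that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of correct codes that were assigned."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(p: float, r: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def micro_f(per_code: Dict[str, Counts]) -> float:
    """Pool the counts of all codes, then compute a single F-measure."""
    tp = sum(c[0] for c in per_code.values())
    fp = sum(c[1] for c in per_code.values())
    fn = sum(c[2] for c in per_code.values())
    return f_measure(precision(tp, fp), recall(tp, fn))


# Toy usage with made-up counts for two codes.
counts = {"CODE_1": (8, 2, 8), "CODE_2": (2, 3, 2)}
print(micro_f(counts))
print(stdev([0.80, 0.82, 0.78]))  # spread of F-values over repeated runs
```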
15
EXPERIMENT
16
FUTURE WORK
17
CONCLUSIONS
• No stemming performed better than stemming
• Classification performance improved as N increased in N-grams (Trigram > Bigram > Unigram)
• TTSCT enhanced classification performance
• Hiding misleading parts of the text raised the F-score