Crystallization Image Analysis on the World Community Grid
Christian A. Cumbaa and Igor Jurisica
Jurisica Lab, Division of Signaling Biology
Ontario Cancer Institute, Toronto, Ontario
Why automate classification of protein crystallization trial images?
• Hauptman-Woodward has 65,000,000 images.
– They want 65,000,000 outcomes.
[Figure: example images labeled clear, phase separation, precipitate, skin, crystal, garbage, unsure]
Why automate classification of protein crystallization trial images?
• Assist or replace human screening
• Speed the search phase in protein crystallization
• Improve throughput, consistency, objectivity
• Enable data mining and statistical optimization of the crystallization process
[Figure: example images labeled clear, precipitate, crystal]
Image classification
[Diagram: image (100,000s of numbers) → feature extraction → feature 1, feature 2, …, feature k (10s of numbers) → classification → 7 numbers, one per outcome: clear, phase separation, precipitate, skin, crystal, garbage, unsure]
Truth data
• 96-study
– 96 proteins × 1536 images, hand-scored by 3 experts
– Presence/absence of 7 independent outcomes
• NESG & SGPP
– 15,000 images
– Hand-scored by 1 expert, same scoring system
• 50% unanimously-scored images
– 10 most interesting compound categories
[Figure: panels labeled 96-study, SGPP (crystals), NESG (crystals)]
Feature set
12,375 features computed per image
– A few basic statistics
– 50 microcrystal features
– Euler number features, two variations
  1. 11 blur levels
  2. 11 blur levels × 4 thresholds
– Image “energy”
  • 11 blur levels
– 2925 Grey-Level Co-occurrence Matrix (GLCM) features
  • 3 different grey-level quantizations
  • 13 basic functions
  • 25 sample distances
  • ~100 directions
    – Computable from every point in the image
    – Distilled to max range, max mean, min mean
– ~9500 image-blob features
  • Radon & edge-detection
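As a rough illustration of what a single GLCM feature involves, the sketch below quantizes a tiny image, counts grey-level pairs at one offset, and evaluates two texture functions. It is a minimal sketch, not the pipeline used in the talk: the names `quantize`, `glcm`, `contrast`, and `energy` are hypothetical, and contrast and energy are standard GLCM texture functions chosen as an assumption, since the slides do not list the 13 basic functions. The final line shows one reading of the counts on the slide that reproduces the stated total of 2925.

```python
from collections import Counter


def quantize(image, levels, max_value=255):
    """Map grey values in [0, max_value] onto bins 0 .. levels-1."""
    return [[min(v * levels // (max_value + 1), levels - 1) for v in row]
            for row in image]


def glcm(image, dx, dy):
    """Normalized co-occurrence frequencies of grey-level pairs at offset (dx, dy)."""
    counts = Counter()
    height, width = len(image), len(image[0])
    for y in range(height):
        for x in range(width):
            x2, y2 = x + dx, y + dy
            if 0 <= x2 < width and 0 <= y2 < height:
                counts[(image[y][x], image[y2][x2])] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def contrast(p):
    """Weights each pair by its squared grey-level difference."""
    return sum(prob * (i - j) ** 2 for (i, j), prob in p.items())


def energy(p):
    """Angular second moment: sum of squared pair frequencies."""
    return sum(prob ** 2 for prob in p.values())


# Toy 2 x 4 image (hypothetical data): dark left half, bright right half.
toy = quantize([[0, 0, 255, 255], [0, 0, 255, 255]], levels=2)
p = glcm(toy, dx=1, dy=0)  # one sample distance, one direction

# Assumed reading of the slide: quantizations x functions x distances x
# three summaries over directions (max range, max mean, min mean).
N_GLCM_FEATURES = 3 * 13 * 25 * 3  # 2925
```

In this toy case the three observed pairs (0,0), (0,1), and (1,1) are equally frequent, so both `contrast(p)` and `energy(p)` come out at one third.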
Our image analysis problem
• Computing all 12,375 features takes >5 hours for a single image
• We have 165,000 images in our training set
• Features must be evaluated for quality
• The best features (10s or low 100s) must be computed for the remaining 65,000,000 images
Massive computing resources required!
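The scale of the problem follows from simple arithmetic on the figures above. The sketch below treats 5 hours per image as a lower bound (the slide says ">5 hours") and assumes a CPU-year of 365 days; the function name `cpu_years` is hypothetical.

```python
HOURS_PER_IMAGE = 5          # slide states ">5 hours", so results are lower bounds
HOURS_PER_YEAR = 24 * 365    # assumption: one CPU running for a 365-day year


def cpu_years(n_images, hours_per_image=HOURS_PER_IMAGE):
    """CPU-years needed to compute the full feature set on n_images."""
    return n_images * hours_per_image / HOURS_PER_YEAR


training_set = cpu_years(165_000)       # roughly 94 CPU-years
full_archive = cpu_years(65_000_000)    # roughly 37,100 CPU-years
```

The second figure is what the full 12,375-feature set would cost on the whole archive, which is why only the best features are meant to be computed for the remaining images.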
Image analysis on the World Community Grid
• http://www.worldcommunitygrid.org
– a global, distributed-computing platform for solving large scientific computing problems with human impact
– 377,627 volunteers contribute the idle CPU time of 960,346 devices.
• Our project: Help Conquer Cancer*
– launched November 2007.
• HCC has two goals:
  1. To survey a wide tract of image-feature space and identify image analysis algorithms and parameters (features) that best determine crystallization outcome.
  2. To perform the necessary image analysis on Hauptman-Woodward’s archive of 65,000,000 crystallization trial images.
* fundraising slogan of the Ontario Cancer Institute and its parent organization.
Image analysis on the World Community Grid
• HCC has two phases
– Phase I: calculate 12,375 features per image on high-priority images, including 165,441 hand-scored images.
  • November 2007-May 2008
  • analysis on hand-scored images completed January 2008
– Phase II: calculate the best features from Phase I on the backlog of HWI images
• Grid members have contributed 8,919 CPU-years so far to HCC, an average of 55 CPU-years per day.
Phase I: feature assessment
Measuring feature quality
• Treat as random variables:
– Image class
– Feature value
• Measure the mutual information between them (unit: bits)
= entropy(class) + entropy(feature) – entropy(class, feature)
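The identity above can be sketched directly from empirical frequencies. This is a minimal sketch under one stated assumption: feature values are discrete (for example, already binned), since the slides do not say how continuous features were handled. The function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Shannon entropy in bits of the empirical distribution of samples."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def mutual_information(classes, features):
    """entropy(class) + entropy(feature) - entropy(class, feature), in bits."""
    joint = list(zip(classes, features))
    return entropy(classes) + entropy(features) - entropy(joint)


# Toy data (hypothetical): a binned feature that tracks the class exactly,
# and one that is independent of it.
labels = ["crystal", "crystal", "clear", "clear"]
informative = ["hi", "hi", "lo", "lo"]
uninformative = ["hi", "lo", "hi", "lo"]

mi_informative = mutual_information(labels, informative)      # 1.0 bit
mi_uninformative = mutual_information(labels, uninformative)  # 0.0 bits
```

A feature can carry at most entropy(class) bits about the class, so with two equally likely classes the informative feature reaches the 1-bit ceiling.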
[Figure: bar chart of entropy (bits, axis from 0 to 1) for each outcome (clear, phase separation, precipitate, skin, crystal, garbage, unsure), comparing feature entropy and class entropy]
Measuring feature quality
[Figure: panels labeled clear, precipitate (no crystal), other]
Information density: microcrystal counts parameter space
[Figure panels: Clear, Precipitate, Crystal]
Information density: GLCM maximum range parameter space
[Figure panels: Clear, Precipitate, Crystal]
Information density: Radon-Sobel soft sum parameter space
[Figure panels: Clear, Precipitate, Crystal]
Information density: Radon-Sobel blob metrics (means) parameter space
[Figure panels: Clear, Precipitate, Crystal]
Towards Phase II: image classification
Building classifiers
• Handpicked 74 features from peaks in the clear, precipitate and other mutual information plots
• Two classification schemes
– three-way: clear, non-crystal precipitate, other
– ten-way: clear, phase separation, phase + precipitate, skin, phase + crystal, precip, precip + skin, precip + crystal, crystal, garbage
• Naïve Bayes model
• Leave-one-out cross-validation
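The combination of a naïve Bayes model with leave-one-out cross-validation can be sketched as follows. This is a minimal illustration, not the classifier from the talk: it assumes Gaussian per-feature likelihoods (the slides do not say which naïve Bayes variant was used), the function names are hypothetical, and the six two-feature samples are made up.

```python
from collections import defaultdict
from math import log, pi


def fit(rows, labels):
    """Per class: log prior, per-feature means, per-feature variances."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # Small floor keeps a variance from collapsing to zero.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-6
                     for col, m in zip(columns, means)]
        model[label] = (log(n / len(rows)), means, variances)
    return model


def predict(model, row):
    """Class with the highest log posterior, features treated as independent."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * log(2 * pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances))
    return max(model, key=log_posterior)


def leave_one_out_accuracy(rows, labels):
    """Train on all samples but one, test on the held-out one, repeat for each."""
    hits = 0
    for i in range(len(rows)):
        model = fit(rows[:i] + rows[i + 1:], labels[:i] + labels[i + 1:])
        hits += predict(model, rows[i]) == labels[i]
    return hits / len(rows)


# Hypothetical, well-separated toy data with two features per image.
toy_rows = [[0.10, 1.0], [0.20, 1.1], [0.15, 0.9],
            [0.90, 0.1], [1.00, 0.2], [0.95, 0.0]]
toy_labels = ["clear"] * 3 + ["precipitate"] * 3
accuracy = leave_one_out_accuracy(toy_rows, toy_labels)
```

Leave-one-out refits the model once per sample, which stays affordable for naïve Bayes because fitting only requires per-class means and variances.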
Measuring classifier accuracy: precision and recall
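[Diagram: the set of actual crystals overlaps the set the classifier labels “I think these are crystals”. The overlap holds the true positives; crystals outside it are false negatives; labeled non-crystals are false positives. Precision is the true-positive share of the labeled set, and recall is the true-positive share of the actual crystals.]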
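Both measures can be read off a confusion matrix one class at a time. The sketch below assumes a nested dictionary indexed as `confusion[true_class][predicted_class]`; the function name and the counts are hypothetical and are not taken from the slides.

```python
def precision_recall(confusion, label):
    """Precision and recall for one class; confusion[true][predicted] = count."""
    tp = confusion[label][label]
    # Members of the class that the machine assigned elsewhere.
    fn = sum(n for predicted, n in confusion[label].items() if predicted != label)
    # Members of other classes that the machine assigned to this class.
    fp = sum(row[label] for true, row in confusion.items() if true != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts for a two-class example.
toy_confusion = {
    "crystal": {"crystal": 8, "other": 2},
    "other": {"crystal": 4, "other": 86},
}
crystal_precision, crystal_recall = precision_recall(toy_confusion, "crystal")
```

Here 8 of the 12 images labeled as crystals are real crystals (precision of two thirds), and 8 of the 10 real crystals are found (recall of 0.8).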
Three-class distribution
Clear 24.3%
Precipitate AND NOT crystal 52.7%
Other 23.0%
Confusion matrix
[3 × 3 matrix of true class against “machine says”; classes in order: clear, non-crystal precipitate, other. Cell values as extracted, one run of digits per line, with the cell boundaries lost:]
1709552585109
15928451121819
61781727615
Recall & precision
10-class distribution
Clear 33.83%
Phase separation 7.00%
Phase separation + precipitate 0.50%
Skin 0.79%
Phase separation + crystal 2.32%
Precipitate 34.25%
Precipitate + skin 4.95%
Precipitate + crystal 7.53%
Crystal 8.34%
Garbage 0.55%
Confusion matrix
[10 × 10 matrix of true class against “machine says”; classes in order: clear, phase separation, phase and precipitate, skin, phase and crystal, precipitate, precipitate and skin, precipitate and crystal, crystal, garbage. Cell values as extracted, one run of digits per line, with the cell boundaries lost:]
3132002521490428
129312910729021964958656345888
8914285261110635621118522235
2930539520086923282433320512
38551240883440169075536174941972441
105512928875511853726874
2010505136372029126
331107819751632241
91503139752986682814024331446
1193920181501135122725585
Recall & precision
Acknowledgements
Hauptman-Woodward Medical Research Institute: George DeTitta, Joe Luft, Eddie Snell, Mike Malkowski, Angela Lauricella, Max Thayer, Raymond Nagel, Steve Potter, and the 96-study reviewers.
World Community Grid: Bill Bovermann, Viktors Berstis, Jonathan D. Armstrong, Tedi Hahn, Kevin Reed, Keith J. Uplinger, Nels Wadycki
IBM Deep Computing: Jerry Heyman
Jurisica Lab: Richard Lu
All crystallization images were generated at the High-Throughput Screening lab at The Hauptman-Woodward Institute.
Funding from: NIH U54 GM074899, Genome Canada, IBM, NSERC
(and earlier work from): NIH P50 GM62413, NSERC, CITO