Romualdi Chiara
Improved detection of differentially expressed genes in microarray experiments through multiple scanning and image integration
NETTAB 2003 Workshop
Bioinformatics for the management, analysis and interpretation of microarray data
CRIBI Biotechnology Centre
Università di Padova, Italy
Genomic Research University of Padova
Microarray variability
1. Inter-experiment variability
Gene probes deposited in replicates
Replicates are deposited in different regions of the chip
2. Intra-experiment variability
Swap of Cy3 and Cy5
Replicate of the experiment
3. Hybridisation, labelling, amplification … variability
Global, local and surface normalization
… and image variability ?
Each microarray is scanned with a single laser run for each fluorochrome …
… intensity values of spots are calculated.
… if a single microarray undergoes multiple scanning runs, the DNA spot images obtained are not exactly superimposable…
Figure: laser scan, 16-bit TIFF images, log2(ch1/ch2).
DNA spot images obtained from multiple scanning runs are not exactly superimposable
Figure: serial scans I to X of three spots (A = weakly expressed, B = moderately expressed, C = highly expressed).
Differences in pixel intensities
Probably only a portion of the fluorochromes is excited by the laser beam and measured by the photomultiplier in each run, while the confocal scanning system detects the fluorescence of only a subregion of the spot.
Image variability
Quantification output variability
Different microarray results
4% FP (false positives)
Novel software for image integration: ΣPOT
http://muscle.cribi.unipd.it/microarrays/spot/
1) ΣPOT superimposes n TIFF images (the input microarray images); each pixel position gives a pixel vector across the n images, e.g. for pixel 1:
VP = (pixel_1,1, pixel_1,2, … , pixel_1,n)
2) For each pixel vector of the n images, it calculates:
- Pixel intensity mean (mean of VP)
- Pixel intensity maximum, excluding saturated pixels (max of VP)
3) It builds a virtual TIFF image that summarizes the n input ones
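The three steps above can be sketched in a few lines of Python. This is only an illustration of the pixel-wise mean and maximum, not the actual C/libtiff implementation: the function name `integrate_scans`, the nested-list image representation and the saturation value 65535 (the largest value a 16-bit pixel can hold) are assumptions. The slides do not say how a pixel that is saturated in every scan is handled; here it is simply left saturated.

```python
from statistics import mean

SATURATION = 65535  # assumption: 16-bit pixels saturate at 2**16 - 1


def integrate_scans(images, saturation=SATURATION):
    """Summarize n equally sized scans into a (mean_image, max_image) pair.

    images: list of n images, each a list of rows of integer pixel values.
    """
    n_rows, n_cols = len(images[0]), len(images[0][0])
    mean_img = [[0.0] * n_cols for _ in range(n_rows)]
    max_img = [[0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            vp = [img[r][c] for img in images]  # pixel vector VP
            mean_img[r][c] = mean(vp)
            unsaturated = [p for p in vp if p < saturation]
            # Assumption: if every scan is saturated, keep the saturated value.
            max_img[r][c] = max(unsaturated) if unsaturated else saturation
    return mean_img, max_img
```

Calling `integrate_scans` on the first 2, 4, 6, ... scans of the same array would give one pair of virtual images per integration level.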
Figure: pixel vectors from input images I1 to I4 summarized as Max and Mean.
Resulting virtual image after ten serial scans
Figure: serial scans I to X and the resulting virtual image for three spots (A = weakly expressed, B = moderately expressed, C = highly expressed).
Resulting virtual image after ten serial scans: entire microarray
Serial scans and image integration improve spot (A) and background (B) uniformity
Figure: uniformity index (computed from I_max, I_min and the intensity range) versus number of scans (1, 2, 4, 6, 8, 10); panel A: spot, panel B: background.
Image uniformity improves spot detection and quantification
Serial scans and image integration improve reliability of microarray results
4% False Positives
< 1 % False Positives
Competitive hybridisation with the same mRNA
Two experiments where two equal aliquots of skeletal muscle RNA (A) and heart muscle RNA (B) were labelled with Cy3 and Cy5 and challenged in competitive hybridisation.
Figure: percentage of decrease of outlier spots (0 to 100) versus number of scans (up to 10); panels A and B.
In this case, all the Cy3/Cy5 ratios of spot intensities should lie around 1.
Due to experimental variability, a portion of the spot intensity ratios is far from 1.
The number of outliers decreases with image integration (mean and max).
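A minimal sketch of how outlier spots could be counted in such a self-versus-self hybridisation is given below. The slides do not state the criterion used to call a spot an outlier, so the 2-fold threshold (`max_fold`) and both function names are assumptions made for illustration only.

```python
from math import log2


def count_outliers(cy3, cy5, max_fold=2.0):
    """Count spots whose Cy3/Cy5 ratio deviates from 1 by more than max_fold.

    max_fold is a hypothetical threshold, not a value taken from the slides.
    """
    limit = log2(max_fold)
    outliers = 0
    for g, r in zip(cy3, cy5):
        if g <= 0 or r <= 0:
            continue  # skip spots without a usable signal in both channels
        if abs(log2(g / r)) > limit:
            outliers += 1
    return outliers


def percent_decrease(outliers_single, outliers_integrated):
    """Percentage decrease of outliers relative to the single-scan count."""
    if outliers_single == 0:
        return 0.0
    return 100.0 * (outliers_single - outliers_integrated) / outliers_single
```

Working on the log2 scale treats a 2-fold excess of Cy3 and a 2-fold excess of Cy5 symmetrically.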
Variation of spot signal intensity with incremental number of scans
Labelling methods compared (curves 1 to 5): RT-labelling of total RNA, amino-allyl, RT-labelling of aRNA, DNA dendrimer probe, TSA
Figure: spot signal (y-axis 0.4 to 1.6) versus number of scans (2 to 14); panels A and B, for spots of intensity ~40,000 units and ~500 units.
Quantification of the efficacy of the multi-scan approach in detecting differentially expressed genes
We performed and analysed two microarray experiments, hybridised with targets prepared by RT labelling and by the TSA methodology:
1. In the first experiment, we challenged RNAs of skeletal and heart muscle in competitive hybridisation.
2. In the second one, we compared RNAs of dystrophic (facioscapulohumeral muscular dystrophy) and normal muscle.
2 replicates for each experiment with dye swapping (4 spot replicates)
SNOMAD web tool (global and local options) for data normalization
SAM for identification of differentially expressed genes
Evaluating the efficacy of the approach
1. Identify differentially expressed genes in the 1-scan experiments
2. Integrate the first 2, 4, 6, 8 and 10 serial scans and, for each integration, find the differentially expressed genes
3. CFP: genes found to be differentially expressed with 1 scan but not with all the serial integrated images
CFN: genes found to be differentially expressed with the serial integrated images but not with 1 scan
4. NFP: genes found to be differentially expressed with the i-th integration but not with all the subsequent ones
NFN: genes found to be differentially expressed with the i-th integration but not with the previous ones
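The bookkeeping behind these categories is plain set arithmetic on gene lists. The sketch below is one possible reading of the definitions, not the authors' procedure: FP and FN compare each integration with the 1-scan list, the "consistent" sets keep only genes that are FP (or FN) at every integration level up to n, and the "novel" sets keep genes that first appear as FP (or FN) at level n. The function name `evaluate` and the dictionary layout are hypothetical.

```python
def evaluate(single, integrations):
    """Classify genes for each integration level.

    single: set of genes called differentially expressed with 1 scan.
    integrations: dict mapping number of integrated scans (2, 4, ...)
                  to the set of genes called with that integration.
    """
    results = {}
    cfp = cfn = None
    seen_fp, seen_fn = set(), set()
    for n in sorted(integrations):
        fp = single - integrations[n]  # called with 1 scan, not confirmed
        fn = integrations[n] - single  # called only after integration
        # Assumed reading: consistent = FP/FN at every level up to n.
        cfp = fp if cfp is None else cfp & fp
        cfn = fn if cfn is None else cfn & fn
        results[n] = {
            "FP": fp, "FN": fn,
            "CFP": cfp, "CFN": cfn,
            # Assumed reading: novel = not seen at any earlier level.
            "NFP": fp - seen_fp, "NFN": fn - seen_fn,
        }
        seen_fp |= fp
        seen_fn |= fn
    return results
```

With gene identifiers from the significance analysis as set members, `len(results[n]["CFN"])` would give a count of the kind reported per integration level.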
Increase in the number of differentially expressed genes: skeletal muscle vs. heart (RT labelling)
With 1 scan: 200 transcripts overexpressed and 31 underexpressed in the muscle
CFP and CFN (consistent false positives and negatives) = genes found to be differentially expressed with the integration of n scans and confirmed by all the n-i ones
Figure: percentage increase of CFP and CFN versus number of scans (2, 4, 6, 8); panels: overexpressed genes, underexpressed genes.
Increase in the number of differentially expressed genes: skeletal muscle vs. heart (RT labelling)
With 1 scan: 200 transcripts overexpressed and 31 underexpressed in the muscle
NFP and NFN (novel false positives and negatives) = real improvement achieved by the inclusion of each additional serial microarray image
Figure: percentage increase of NFP and NFN versus number of scans (2, 4, 6, 8, 10); panels: overexpressed genes, underexpressed genes.
Increase in the number of differentially expressed genes: FSHD vs. normal (TSA)
With 1 scan: 149 overexpressed and 107 underexpressed in normal muscle
CFP and CFN (consistent false positives and negatives) = genes found to be differentially expressed with the integration of n scans and confirmed by all the n-1 ones
Figure: percentage increase of CFP and CFN (y-axis 0 to 300) versus serial scans (2, 4, 6, 8); panels: overexpressed genes, underexpressed genes.
Increase in the number of differentially expressed genes: FSHD vs. normal (TSA)
With 1 scan: 149 overexpressed and 107 underexpressed in normal muscle
NFP and NFN (novel false positives and negatives) = real improvement achieved by the inclusion of each additional serial microarray image
Figure: percentage increase of NFP and NFN versus serial scans (2, 4, 6, 8, 10); panels: overexpressed genes, underexpressed genes.
Relationship between CFN and their spot intensities
Figure: frequency of CFN versus spot intensity; panels: dystrophic vs. normal muscle, skeletal muscle vs. heart.
The greatest improvement in the detection of differentially expressed genes achieved by the multi-scan approach concerns weakly expressed genes.
ΣPOT results validation with semi-quantitative RT-PCR
Figure labels: Cy5, Cy3; sk. muscle, heart.
CFN, over expressed in sk. muscle
(1) myosin-binding protein C, fast type
(2) titin
(3) human DNA sequence
(4) human DNA sequence
(5) H.sapiens mRNA for striate muscle-specific hypothetical protein (ORF1), clone 00275
(6) human DNA sequence
(7) H.sapiens acetyl-coenzyme A transporter
(8) human autoantigen small nuclear ribonucleoprotein Sm-D
CFN, underexpressed in sk. muscle
(9) troponin T2, cardiac
(10) alpha-actin, cardiac muscle
(11) myosin-binding protein C, cardiac
(12) H.sapiens heat shock 90 kDa protein 1, alpha
(13) H.sapiens haplotype M*2 mitochondrion
(14) H.sapiens chromosome 5, BAC
(15) H.sapiens macrophage migration inhibitory factor (glycosylation-inhibiting factor)
(16) H.sapiens ring finger protein 28
CFP
(17) H.sapiens clone alpha_est218/52C1
(18) H.sapiens CD27-binding (Siva) protein transcript variant 1
(19) human skeletal muscle 1.3 kb mRNA for tropomyosin;
(20) H.sapiens cathepsin H
Conclusions
RT-labelling:
Many FP (~10% of the differentially expressed genes found with 1 scan)
Many FN (~+50% of the differentially expressed genes found with 1 scan)
TSA-labelling:
Small number of FP
Large increase in FN (~+200%)
4-6 scans seem to be the best number of scans for a satisfactory improvement in detecting differentially expressed genes
Maximum and mean results overlap for 80% of the FP and FN transcripts
Future work
Integration of ΣPOT into scanner software
Technical details
ΣPOT is written in C with the libtiff library and runs on UNIX systems
SAM http://www-stat.stanford.edu/~tibs/SAM/index.html
SNOMAD http://pevsnerlab.kennedykrieger.org/snomadinput.html
Spotting device: GenePackArray 21 with 16 stealth micro pins
Scanner: Perkin Elmer LITE dual confocal laser scanner with software Scan Array
Image analysis software: QuantArray
HumanMuscleArray: http://muscle.cribi.unipd.it/microarrays/human.html
Acknowledgements
Gerolamo Lanfranchi project supervisor
Microarray Team
Silvia Trevisan, Barbara Celegato,
Bioinformatics Team
Germano Costa, Micky Del Favero
Reference
Romualdi Chiara et al. (2003) Nucl. Acids. Res. 31: e149.
Web sites
http://muscle.cribi.unipd.it/microarrays/
http://muscle.cribi.unipd.it/microarrays/spot/
Genomic Research University of Padova
http://grup.cribi.unipd.it/
Increase in the number of differentially expressed genes: skeletal muscle vs. heart (RT labelling)
With 1 scan: 200 transcripts overexpressed and 31 underexpressed in the muscle
Counts, with percentages in parentheses:
                2 scans               4 scans               6 scans               8 scans               10 scans
                Mean       Max        Mean       Max        Mean       Max        Mean       Max        Mean       Max
Overexpressed
  FP            26 (13)    19 (10)    21 (11)    21 (11)    24 (12)    19 (10)    20 (10)    30 (15)    24 (12)    24 (12)
  FN            18 (9)     41 (21)    37 (19)    36 (18)    36 (18)    53 (27)    50 (25)    18 (9)     34 (17)    29 (15)
Underexpressed
  FP            6 (19)     7 (22)     2 (6)      5 (16)     2 (6)      3 (9)      4 (13)     3 (9)      4 (13)     1 (3)
  FN            7 (22)     12 (38)    13 (41)    15 (47)    14 (44)    18 (56)    23 (72)    15 (47)    20 (63)    20 (63)
FP (false positives) = genes found to be differentially expressed with 1 scan but not confirmed with the integration of the others
FN (false negatives) = genes found to be differentially expressed with the integration of additional scans but not with 1 scan
Increase in the number of differentially expressed genes: skeletal muscle vs. heart (RT labelling)
With 1 scan: 200 transcripts overexpressed and 31 underexpressed in the muscle
Figure: percentage increase of FP and FN (y-axis 0 to 80) versus number of scans (2, 4, 6, 8, 10); panels: overexpressed genes, underexpressed genes.
Increase in the number of differentially expressed genes: skeletal muscle vs. heart (RT labelling)
With 1 scan: 200 transcripts overexpressed and 31 underexpressed in the muscle
Counts, with percentages in parentheses:
                2 scans               4 scans               6 scans               8 scans               10 scans
                Mean       Max        Mean       Max        Mean       Max        Mean       Max        Mean       Max
Overexpressed
  FP            26 (13)    19 (10)    21 (11)    21 (11)    24 (12)    19 (10)    20 (10)    30 (15)    24 (12)    24 (12)
  CFP           -          -          19         15         18         13         15         13         16         14
  FN            18 (9)     41 (21)    37 (19)    36 (18)    36 (18)    53 (27)    50 (25)    18 (9)     34 (17)    29 (15)
  CFN           -          -          15         20         22         20         24         17         34         15
Underexpressed
  FP            6 (19)     7 (22)     2 (6)      5 (16)     2 (6)      3 (9)      4 (13)     3 (9)      4 (13)     1 (3)
  CFP           -          -          2          4          2          3          2          2          2          1
  FN            7 (22)     12 (38)    13 (41)    15 (47)    14 (44)    18 (56)    23 (72)    15 (47)    20 (63)    20 (63)
  CFN           -          -          4          7          10         9          13         11         17         14
CFP and CFN (consistent false positives and negatives) = genes found to be differentially expressed with the integration of n scans and confirmed by all the n-1 ones
Increase in the number of differentially expressed genes: skeletal muscle vs. heart (RT labelling)
With 1 scan: 200 transcripts overexpressed and 31 underexpressed in the muscle
                2 scans               4 scans               6 scans               8 scans               10 scans
                Mean       Max        Mean       Max        Mean       Max        Mean       Max        Mean       Max
Overexpressed
  NFP           26         19         2          6          3          2          1          7          2          2
  NFN           18         41         10         5          4          10         4          0          0          3
Underexpressed
  NFP           6          7          0          1          1          0          1          0          1          1
  NFN           7          12         9          8          3          8          7          3          2          3
NFP and NFN (novel false positives and negatives) = real improvement achieved by the inclusion of each additional serial microarray image
Increase in the number of differentially expressed genes: FSHD vs. normal (TSA)
With 1 scan: 149 overexpressed and 107 underexpressed in normal muscle
Counts, with percentages in parentheses:
                2 scans               4 scans               6 scans               8 scans               10 scans
                Mean       Max        Mean       Max        Mean       Max        Mean       Max        Mean       Max
Overexpressed
  FP            0 (0)      0 (0)      0 (0)      0 (0)      1 (1)      1 (1)      0 (0)      1 (1)      0 (0)      0 (0)
  FN            110 (74)   90 (61)    154 (104)  131 (89)   184 (124)  137 (93)   198 (134)  158 (107)  207 (140)  169 (114)
Underexpressed
  FP            2 (2)      2 (2)      2 (2)      2 (2)      1 (1)      2 (2)      2 (2)      2 (2)      2 (2)      2 (2)
  FN            175 (164)  157 (147)  214 (200)  198 (185)  229 (214)  191 (179)  263 (246)  212 (198)  255 (238)  244 (228)
FP (false positives) = genes found to be differentially expressed with 1 scan but not confirmed with the integration of the others
FN (false negatives) = genes found to be differentially expressed with the integration of additional scans but not with 1 scan
Increase in the number of differentially expressed genes: FSHD vs. normal (TSA)
With 1 scan: 149 overexpressed and 107 underexpressed in normal muscle
FP (false positives) = genes found to be differentially expressed with 1 scan but not confirmed with the integration of the others
FN (false negatives) = genes found to be differentially expressed with the integration of additional scans but not with 1 scan
Figure: percentage increase of FP and FN (y-axis 0 to 300) versus serial scans (2, 4, 6, 8, 10); panels: overexpressed genes, underexpressed genes.
Increase in the number of differentially expressed genes: FSHD vs. normal (TSA)
With 1 scan: 149 overexpressed and 107 underexpressed in normal muscle
Counts, with percentages in parentheses:
                2 scans               4 scans               6 scans               8 scans               10 scans
                Mean       Max        Mean       Max        Mean       Max        Mean       Max        Mean       Max
Overexpressed
  FP            0 (0)      0 (0)      0 (0)      0 (0)      1 (1)      1 (1)      0 (0)      1 (1)      0 (0)      0 (0)
  CFP           -          -          0          0          0          0          0          1          0          0
  FN            110 (74)   90 (61)    154 (104)  131 (89)   184 (124)  137 (93)   198 (134)  158 (107)  207 (140)  169 (114)
  CFN           -          -          107        85         152        117        175        131        189        150
Underexpressed
  FP            2 (2)      2 (2)      2 (2)      2 (2)      1 (1)      2 (2)      2 (2)      2 (2)      2 (2)      2 (2)
  CFP           -          -          2          2          0          2          1          2          1          2
  FN            175 (164)  157 (147)  214 (200)  198 (185)  229 (214)  191 (179)  263 (246)  212 (198)  255 (238)  244 (228)
  CFN           -          -          170        149        203        173        223        179        241        197
CFP and CFN (consistent false positives and negatives) = genes found to be differentially expressed with the integration of n scans and confirmed by all the n-1 ones
Increase in the number of differentially expressed genes: FSHD vs. normal (TSA)
With 1 scan: 149 overexpressed and 107 underexpressed in normal muscle
                2 scans               4 scans               6 scans               8 scans               10 scans
                Mean       Max        Mean       Max        Mean       Max        Mean       Max        Mean       Max
Overexpressed
  NFP           0          0          0          0          1          1          0          0          0          0
  NFN           110        90         47         46         31         17         21         16         14         14
Underexpressed
  NFP           2          2          0          0          0          0          0          0          0          0
  NFN           175        157        44         49         22         17         32         23         6          21
NFP and NFN (novel false positives and negatives) = real improvement achieved by the inclusion of each additional serial microarray image