CH23 09/26/2012 12:12:4 Page 203
23 Transcriptome Analysis: Microarrays
Charles Hindmarch
Clinical Sciences, University of Bristol, UK
23.1 Basic ‘how-to-do’ and ‘why-do’ section

Investigations into the genomic response to environmental fluctuation or pathology
have previously been limited to the weight of supporting literature, researcher
intuition and the availability of appropriate probes. Microarray allows a non-biased
approach to identifying which genes change their expression in a particular tissue or
cell type in response to a given physiological challenge.
A microarray is a technology that allows simultaneous measurement of the
expression of hundreds, thousands or tens of thousands of genes. In its simplest
form, a microarray is a library of cDNA or oligonucleotide probes that have been
immobilized onto a substrate such as nylon, glass or quartz. Each of these probes
is a sequence that will hybridize to a specific and known messenger ribonucleic
acid (mRNA) sequence according to Watson-Crick base pairing rules – one probe,
one gene.
Because microarray interrogates the transcriptome¹, the first step is to extract and
purify RNA from each sample. Total RNA is composed of different populations, of
which only ~5 per cent is considered to be coding (mRNA), with the remaining
fractions being ribosomal and non-coding small RNA species such as micro-RNAs.
Given that the total RNA concentration of a single cell is in the low picogram range,
obtaining sufficient mRNA to perform transcriptomic analysis can be a challenge.
However, amplification protocols address this challenge. Selective amplification of
mRNA during reverse transcription is ensured through the use of specialized
Essential Guide to Reading Biomedical Papers: Recognising and Interpreting Best Practice, First Edition.
Edited by Phil Langton.
© 2013 by John Wiley & Sons, Ltd. Published 2013 by John Wiley & Sons, Ltd.
¹ Transcriptome – all mRNA transcripts from the sample.
primers bearing a T7 promoter region that bind to the poly(A) tail of the
transcript. During the second round of amplification and subsequent in vitro
transcription, two goals are achieved:
1. Fluorescent label can be incorporated into the synthesized material.
2. A sufficient amount of this material is made available for hybridization to the microarray.
Finally, the product of these reactions must be fragmented to allow efficient and
reproducible hybridization to the probes on the array. You may wonder why
fragmentation is necessary: the probes on the microarray are of limited length,
whereas second-round synthesis faithfully reproduces the cDNA regardless of its
length.
The earliest chips were ‘spotted’ arrays, which consisted of a library of probes
that were printed onto a glass slide using a print head of fine pins. For these experiments,
the control and the treated sample needed to be labelled differently (e.g. Cy3 or Cy5)
and the samples combined prior to hybridization. Using different dyes carries the
potential for bias (that the dye might affect hybridization efficiency), and this was
controlled using a ‘dye-swap’ control that repeated the experiment, but with each
sample incorporating the ‘other’ dye.
These early array experiments were plagued with technical difficulties, such
as spotting irregularities (often called ‘doughnuts’, on account of their ring-shaped
misprint), and analytical challenges such as normalization strategies
(see below). However, these pioneering chips became the foundation upon which all
current whole-genome analysis experiments are built. Modern microarrays
eliminate many of the experimental problems that their spotted ancestors encountered,
mainly because of the high-throughput manufacturing with which the
technology is produced. These developments bring the following advantages:
• Chip-to-chip variation is so small that control and treated samples can be
hybridized to different microarrays.
• Because separate chips are used for separate samples, a single label (biotin) can
be used and the issues that surrounded dye bias are eliminated.
• Probes are often synthesized directly from sequences that have been uploaded
to public databases, so probe error is minimized and changes in gene
annotation/function can be easily updated.
• Individual probes are often made up of multiple overlapping smaller probes,
so that some statistical inference of the level of non-specific hybridization can
be drawn.
Regardless of the technology employed, microarray data relies on the relative
hybridization ratio between the control sample and a treated sample to each single
probe in the library on the chip. The result of a microarray experiment comparing
two (or more) conditions is a list of probe identifiers that relate to genes, together
with expression values, in which the end user can be confident that the genes are:

• expressed in the tissue/sample;
• differentially expressed beyond some prescribed fold-change cut-off criterion;
• significantly different between the treatments.
Having satisfied these criteria, this list of genes represents the minimum usage of
the available data. More advanced bioinformatic analysis of such lists can establish
gene function, build functional networks and drive hypothesis generation. It is
important to note that while microarray is a hypothesis machine, it does not stand
alone in experimental biology. Validation of some of the identified genes using
an independent technique (e.g. qPCR – see Primer 20) will confirm that the ‘false
discovery rate’ (see below) is low and will give confidence in the other
significantly regulated genes revealed in the experiment.
23.2 Required controls

23.2.1 Sample collection
Special care is required when collecting biological samples, extracting RNA and
preparing samples, because RNA is highly susceptible to enzymatic degradation by both
endogenous and introduced ribonucleases (RNases). Fortunately, several commer-
cially available chemicals ensure that RNases can be controlled and, together with
experimental diligence, degradation can be kept at bay. All tools required for the
dissection or culture of biological material under study must be free of RNase
contamination, and any reagents required must be made using RNase-free solutions
and chemicals and must be handled in an aseptic manner at all times. It is also
appropriate to claim an area of the laboratory within which only RNA work will be
performed (Figure 23.1a).
Once collected, tissue or cells should be stored in an appropriate manner to
protect against endogenous RNase degradation. Commercially available reagents
that protect samples against RNases exist, but ultra-low temperatures also help to
ensure that samples are kept ‘safe’.
Successful microarray experiments depend on the quality and quantity of the
biological samples used. Experiments based on tissue dissections are subject to
large sources of error, such as the inclusion of RNA from neighbouring tissues
that either contaminates or dilutes the biological signal being studied. It is
therefore important for dissections to be performed by a single competent
operator, to be consistent between samples and to be based on an appropriate
anatomical atlas for that species.
Figure 23.1 Overview of the features of microarray experiments. A: Meticulous attention to
RNase control is required in microarray experiments. It is a good idea to establish a clean space
within which RNase contamination can be controlled. B: The number of publications in the
literature that use the term ‘microarray’ in the text has increased dramatically in the past 10 years.
With over 47,000 publications, the quantity of data available is immense. C: Before and after
normalisation of microarray data. In this box and whisker plot, each sample represents the spread of
31,099 data points and the bold bar represents the median expression value. Normalisation of
microarrays ensures that variations in the data that are due to some technical aspect or some
uncontrolled biological aspect are ‘ironed out’ between samples. D: Principal components analysis
(PCA) establishes the degree of variability between samples; each point on this PCA represents
a single sample and the expression of over 30,000 probes. Four groups of samples emerge from this
analysis that correspond to four different treatments. E: The Venn diagram, useful for comparing
different lists of regulated genes so that those commonly regulated elements may be distinguished
from those that are unique to a particular condition. F: Clustering of gene expression data can act as
a quality control check and identify gene expression patterns across multiple datasets; here the five
groups are from the same brain region, but the top two on the right cluster differently because they
are from a different strain of animal than the other three. Courtesy of Dr. Charles Hindmarch.
RNA quality should be assessed following extraction to ensure that the material is
of sufficient quantity and quality for use on the microarray. Spectrophotometric
analysis of RNA can establish the concentration, based on the absorbance at
260 nm (A260 = 1 = 40 µg/ml), and the purity, based on the ratio between the
absorbance at 260 nm and 280 nm (which indicates contamination with proteins).
RNA integrity (whether the sample is degraded) can be assessed by running the
sample on a denaturing gel or a microcapillary system that will show bands (or
spikes) for the 18S and 28S ribosomal RNA in the sample, whose integrity is a good
proxy for mRNA quality.
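The spectrophotometric arithmetic above (A260 = 1 = 40 µg/ml; 260/280 ratio for purity) is simple enough to sketch. The function names and readings below are invented for illustration and are not part of any instrument's software:

```python
def rna_concentration(a260, dilution_factor=1.0):
    """RNA concentration in micrograms per millilitre: one A260 unit
    corresponds to roughly 40 ug/ml of single-stranded RNA."""
    return a260 * 40.0 * dilution_factor

def purity_ratio(a260, a280):
    """260/280 absorbance ratio; values near 2.0 indicate clean RNA, while
    substantially lower values suggest protein contamination."""
    return a260 / a280

# An undiluted sample reading A260 = 0.5 and A280 = 0.25:
conc = rna_concentration(0.5)      # 20.0 ug/ml
ratio = purity_ratio(0.5, 0.25)    # 2.0, consistent with clean RNA
```

A sample measured after a 10-fold dilution would simply pass `dilution_factor=10.0`.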
Usually, the handling of RNA in an array experiment is boiled down to just a
simple statement such as ‘all dissections were performed in an RNase-free manner’
or ‘quality control was assured in each sample using spectrophotometric analysis’.
It is not always the case that samples were collected and stored in laboratory
conditions (collection of RNA in the field or biopsy following surgery, for example).
In these cases, the integrity of the RNA might warrant specific reference to quality
and consistency between samples.
23.2.2 Sample replication and sample pooling
The microarray experiment is not unlike any other; each measurement (each
genome expression microarray) is a snapshot of a continuous biological process
that varies according to treatment. To ensure that this snapshot is an appropriate
one for the biological process, it is necessary to introduce both biological and
technical replicates into a microarray experiment (see also Primer 2). These
replicates are required to account for variations in the biology² and technical
aspects³ of the experiment that cannot adequately be controlled for within the
experiment.
While both technical and biological replication can be achieved by increasing
the number of arrays used in the experiment, pooling of samples from different
animals onto a single microarray is a good way to reduce population-level
effects on the experiment. When using an outbred animal population, for
example, each array can be turned into a microcosm of the animal population.
When pooling, each array should represent samples that are independent from
one another. The methodology of the paper should explicitly state the number of
microarrays that have been used for each condition and the exact nature of the
biological tissue that has been hybridized onto each array. Such information is a
requirement on the public databases to which most journals require microarray
data to be submitted.
² ‘Variations in the biology’ include circadian rhythm, oestrous cycling, outbred strain, etc.
³ ‘Technical aspects of the experiment’ include experimenter variation, surgical precision and other such
errors.
23.2.3 Normalization
Microarray data walks a fine line between the false positive⁴ result and the false
negative⁵ result. Because microarrays represent so many probes, it can often be
difficult to work out how to handle the data so as to avoid false positive results
without being so stringent that you introduce false negative results. Careful
normalization and statistical testing with appropriate multiple test correction
(see below) is critical for minimizing false results in array experiments.
The process of normalization ensures that any technical (e.g. background
signal between two arrays) or uncontrolled (e.g. different experimental batch)
aspect of the experiment is removed from the data, so that the differences that
remain result from the treatment under study. Figure 23.1C shows the expression
values of nine microarrays: four independent control replicates and five
independent treated replicates. The plot for each sample shows the range of
expression values and the bold horizontal line shows the median value of
each array. Clearly, the median value is more variable in the control state,
and a difference between the medians of the control and the treated arrays
appears to exist.
Two assumptions about microarray experiments need to be made in order to
understand the reasons for normalization:
• First, it is expected that all arrays within the same condition should be broadly
similar, because they are replicates;
• Second, we expect the majority of the 31,099 probes on the array not to change,
even in response to a treatment.
With these assumptions made, it is generally accepted that variations in the
total expression range must come from non-controlled factors. Normalization
brings all the microarrays onto a level playing field, so that these experimental
artefacts do not overemphasize the expression of a gene (false positive result) or,
indeed, mask it (false negative result). Several different normalization strategies
are available, and the choice between them depends on various factors, including
sample size and data quality; the chosen strategy should be outlined in the
methodology of the microarray paper.
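One of the simplest strategies, global median centring, can be sketched in a few lines. This is a deliberately minimal illustration with invented numbers, not the method any particular paper used; production pipelines rely on richer approaches (e.g. quantile normalization or RMA, as implemented in Bioconductor):

```python
import numpy as np

def median_normalize(expr):
    """Rescale each array (column) so that all arrays share a common median,
    reflecting the assumption that most probes do not change between samples."""
    medians = np.median(expr, axis=0)   # one median per array
    target = medians.mean()             # common value to centre on
    return expr * (target / medians)    # rescale each column

# Toy data: 5 probes x 3 arrays, with array 2 globally brighter
# (a purely technical artefact such as labelling efficiency):
expr = np.array([[10.0,  20.0, 10.0],
                 [12.0,  24.0, 11.0],
                 [ 8.0,  16.0,  9.0],
                 [50.0, 100.0, 48.0],
                 [ 9.0,  18.0, 10.0]])
norm = median_normalize(expr)           # medians now identical across arrays
```

After normalization, the twofold global brightness of the second array has been removed, so the remaining probe-level differences can be attributed to biology rather than technique.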
23.2.4 Multiple testing correction
In any experiment in which differences between one or more treatments and a
control are sought, it is necessary to decide whether observed differences are
⁴ False positive: the data says the mRNA is up-regulated when it is not.
⁵ False negative: the data says the mRNA is not regulated when it is.
likely to be due to real differences among the groups (see both Primers 2 and 3).
Statistical testing allows the following question to be answered: is the difference
between two sets of data a product of chance, or is it significant? If the
probability that the result occurred by chance is less than five per cent (p < 0.05),
then we can be 95 per cent confident that the difference is real. This five per cent
can be considered a false discovery rate (FDR). A problem exists, however, when
more than one test is being performed, because the false discovery rate changes
with the number of tests, according to the equation:

FDR = p-value cut-off × number of tests
The result is that when tens of thousands of tests are performed as with an array
experiment, the overall FDR will be near 100 per cent and there will be no
confidence in any result. Multiple test correction is a statistical technique that
accounts for the changing FDR by modifying the p-value threshold in proportion to
the number of tests being performed, with the result that the corrected false
discovery rate is always below 5 per cent. Most of the multiple test corrections
rank the observed p-values of each test performed and correct as a function of the
total number of tests. The main difference between the various correction protocols
available is their stringency; given the high number of tests performed in an array
experiment, even a seemingly small p-value can be rendered insignificant,
producing a false negative result.
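The ranked corrections described above can be illustrated with the widely used Benjamini-Hochberg procedure. This is a sketch with invented p-values, not the specific correction any given paper applied:

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values: rank the raw p-values, scale each
    by (number of tests / rank), then enforce monotonicity from the largest
    rank downwards so an adjusted p is never larger than the one above it."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):        # walk from the largest p to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
adj = benjamini_hochberg(raw)
# At a 5 per cent FDR, two probes survive correction here; an uncorrected
# 0.05 threshold would have passed five, and a Bonferroni threshold
# (0.05 / 8 = 0.00625) would have kept only one.
```

The example makes the stringency trade-off concrete: the correction discards most of the nominally significant probes, which is exactly the false-negative risk the text warns about.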
23.2.5 Microarray data
Each probe on a microarray is assigned a unique identifier to distinguish it from all
other probes and to provide a reference, so that probes can be updated with changing
descriptions and annotations. The basic information usually desired includes, but is
not restricted to:
• Internal ID – unique identifier of the probe, assigned by the manufacturer
• Gene symbol
• Gene description
• GenBank ID
• UniGene cluster ID
This is complemented by information gained from the experiment, such as:
• average raw expression value in a particular condition;
• p-values of the probe;
• fold-change regulation between particular conditions;
• statistics involved in selecting differentially regulated genes.
23.2.6 Visualizing microarray data
Reading gene lists in journal articles is very boring – fact! It is a real challenge for
authors to find interesting ways both to share the findings of their data and to engage
with the scientific community about their meaning; figures in journal articles should
be designed to convey the patterns evident in the data. Microarray data is usually
presented in a few simple formats that have become almost as recognizable and easy
to interpret for the lay scientist as a Western blot or a histogram.
The principal component analysis (PCA) allows the expression profile of all
probes from each sample to be used to plot the relationship between the samples
in a three-dimensional space. Each ‘component’ allows the degree of variation in
a particular direction to be plotted, with the first component being the largest
degree of separation, and each subsequent component representing smaller
separations. Figure 23.1d represents 20 microarrays with four different treatments
under investigation, and the PCA is performed on all 31,099 probes on each array.
It is satisfying to see that the gene expression profiles correlate to treatment.
The PCA is also an excellent method to identify outlier microarrays whose
primary quality control should be checked to ensure that the array experiment was
a success.
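The PCA described above can be sketched with a singular value decomposition of the mean-centred expression matrix. The data below are simulated and the helper name `pca_scores` is my own; this illustrates the idea behind plots like Figure 23.1d rather than reproducing them:

```python
import numpy as np

def pca_scores(expr):
    """Project samples (rows) into principal-component space via SVD of the
    mean-centred matrix; also return the fraction of variance per component."""
    centred = expr - expr.mean(axis=0)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = u * s                      # sample coordinates on each component
    variance = s**2 / (s**2).sum()      # PC1 always explains the most variance
    return scores, variance

# Toy experiment: 6 samples x 4 probes, where the last three samples
# ('treated') are shifted on the first two probes relative to 'control'.
rng = np.random.default_rng(0)
expr = rng.normal(size=(6, 4))
expr[3:, :2] += 5.0                     # simulated treatment effect
scores, variance = pca_scores(expr)     # PC1 separates the two groups
```

Plotting the first two or three columns of `scores` gives the familiar cloud of points per sample; an array that falls far from its replicates is the outlier the text suggests re-checking.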
The heat map is a way to show the expression change of individual genes
between two conditions using blocks of colour, but it is still just a gene list,
albeit nicely coloured, so it has a limit on the number of genes that the reader can
digest. A heat map becomes more useful when combined with a second analysis,
such as clustering (Figure 23.1f). Such analysis will allow the author to show
you that some conditions are more similar to each other than are others. For
example, Figure 23.1f shows five conditions that have clustered according to the
strain of animal used, suggesting that gene expression is highly strain-dependent
(Hindmarch et al., 2007).
Conventional graphs can also be useful, but again only for relatively small
numbers of genes, otherwise they lose their usefulness. The Venn diagram
(Figure 23.1e) allows different lists of significantly regulated genes to be compared,
so that both the number and proportion of genes that overlap between the two
experiments can be visualized. In Figure 23.1e, a list of genes that are regulated by
treatment A is compared to a list of genes that is affected by treatment B. The
intersection between these two represents a list of genes that are regulated by
both treatments. The sections that do not intersect, therefore, list genes that are
affected by either treatment A or B. These lists become more useful in
downstream and more advanced analyses that, for example, may seek to establish the
function of those genes affected only by treatment B.
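The set logic behind a Venn diagram is worth making explicit. The gene symbols here are invented for illustration; any resemblance to real regulation data is coincidental:

```python
# Two hypothetical lists of significantly regulated gene symbols:
regulated_by_a = {"Avp", "Oxt", "Fos", "Gal", "Crem"}
regulated_by_b = {"Fos", "Gal", "Nts", "Pdyn"}

both = regulated_by_a & regulated_by_b    # intersection: regulated by both treatments
only_a = regulated_by_a - regulated_by_b  # genes affected only by treatment A
only_b = regulated_by_b - regulated_by_a  # genes affected only by treatment B
```

The three sets correspond exactly to the three regions of a two-circle Venn diagram, and the unique lists (`only_a`, `only_b`) are the natural input to the downstream functional analyses mentioned above.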
23.2.7 Advanced analysis
The advances in microarray experiments have certainly increased the amount of
expression information available, but these data are being generated at a faster
rate than they can be understood. The number of experiments that include microarrays
has been increasing sharply year-on-year since 2000. At the time this primer was
written (April 2012), the term ‘microarray’ returned over 47,000 hits on the website
Pubmed (Figure 23.1B; www.ncbi.nlm.nih.gov/sites/entrez), with each publication
potentially reporting the expression of hundreds or thousands of probes in a
particular paradigm.
The non-biased approach to data collection is essentially undermined by an
inadequate and heavily biased approach to investigation and analysis; scientists are
still resorting to studying the genes on the microarray that they know! In order to
investigate further the physiological functions of the genes regulated in an experi-
ment, a Gene Ontology analysis can be performed. Each gene is annotated with
various functional ontologies called GO terms, and such terms are split into three
broad domains:
• Cellular component;
• Molecular function;
• Biological process.
Within these domains the vocabulary is tightly controlled, although it is subject to
constant revision in the light of novel research. GO analysis relies on testing
whether a particular term appears in a given gene list more often than would be
expected by pure chance across the entire data set (see Hindmarch et al., 2011).
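Over-representation of a GO term is typically assessed with a hypergeometric (or equivalently one-tailed Fisher's exact) test. A stdlib-only sketch follows, with invented genome and list sizes; this illustrates the statistic rather than reproducing the exact calculation of any particular GO tool:

```python
from math import comb

def enrichment_pvalue(total_genes, genes_with_term, list_size, term_in_list):
    """Hypergeometric upper tail: the probability of seeing at least
    `term_in_list` genes annotated with a GO term in a list of `list_size`
    genes drawn from `total_genes`, of which `genes_with_term` carry the term."""
    p = 0.0
    for k in range(term_in_list, min(genes_with_term, list_size) + 1):
        p += (comb(genes_with_term, k)
              * comb(total_genes - genes_with_term, list_size - k)
              / comb(total_genes, list_size))
    return p

# 10 of 50 regulated genes carry a term annotated on 500 of 20,000 probes;
# only ~1.25 would be expected by chance, so the term is heavily enriched.
p = enrichment_pvalue(20_000, 500, 50, 10)
```

Because each gene list is tested against many GO terms, these p-values themselves need the multiple test correction discussed earlier.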
Searching for patterns in vast fields of data that may convey meaning is a
substantial challenge, and it has led to the development of novel approaches that
have become associated with the term bioinformatics. The analytical approaches
that characterize bioinformatics are still evolving and require the collaboration of
biologists, mathematicians and computer scientists. Using these new methods,
the complexity of life is being modelled, tested and re-modelled. Network
reconstruction strategies are starting to find patterns in transcriptomic (and
proteomic) data, attempting to identify those genes in a list that are most important
for the stability of the network and thus best to target in order to modify
and manipulate the network and, perhaps, the disease state that the network
represents.
23.2.8 Experimental pipeline
The pipeline presented here is highly simplified, and every stage requires careful
optimization before the ‘final experiment’ is performed. Microarray experiments
can be expensive, so care and attention should be lavished on each of the following
steps:
1. Establish a biological question and design a robust experiment that will
incorporate both biological and technical replicates.
2. Dissect/obtain the biological material from the experimental groups in the
study and store in a manner that will reduce the risk of RNase degradation. Be
prepared to validate RNA stability.
3. Extract and purify the RNA from these samples in a single batch (if possible) to
reduce unnecessary experimentally derived variability.
4. Selectively amplify the mRNA population from the total RNA population and
use this as a template for both amplification and label incorporation. Frag-
mentation of this material should precede hybridization to the microarray.
5. Perform post-hybridization washing to ensure a high signal-to-background ratio.
Scanning images of the array will allow feature extraction and production of raw data.
6. Inspect data quality and conduct normalization of the arrays in the experiment,
so that artefacts such as background signal differences between arrays do not
affect the false discovery rate.
7. Apply a statistical test to establish whether, for each probe, the expression
levels are different between the control and the treated microarrays. Ensure
that a multiple test correction has been applied that will modify the p-value
threshold in proportion to the number of tests being performed, so as to ensure
that the corrected false discovery rate is always below five per cent.
8. Consider how to visualize the result. Most often, this will involve using the
fold-change between expression values, so that the data represent both the
likelihood and the magnitude of mRNA expression change as a consequence of
the experiment.
9. Apply advanced bioinformatic techniques, including gene set enrichment
analysis (GSEA), biological pathway analysis and gene network
reconstruction.
10. Validate the findings of the microarray in the context of the biological
question, and try to alter the expression of a gene in an attempt to establish the
physiological function of that gene in the system of interest.
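The fold-change arithmetic mentioned in step 8 is usually reported on a log2 scale so that up- and down-regulation have symmetric magnitudes. A minimal sketch with invented mean expression values:

```python
import math

def log2_fold_changes(control_means, treated_means):
    """Per-probe log2 ratio of treated over control mean expression; positive
    values indicate up-regulation and negative values down-regulation, with
    symmetric magnitudes (a doubling is +1, a halving is -1)."""
    return [math.log2(t / c) for c, t in zip(control_means, treated_means)]

# Invented mean expression values for four probes:
control = [100.0, 50.0, 200.0, 80.0]
treated = [400.0, 50.0, 100.0, 160.0]
fc = log2_fold_changes(control, treated)   # [2.0, 0.0, -1.0, 1.0]
```

Combined with per-probe p-values, these log2 values are the coordinates of a volcano plot: magnitude on one axis, statistical confidence on the other.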
23.3 Common problems or errors in literature

• As with any experiment, a good number of replicates is critical to the inferences
drawn from the data! A minimum for a microarray experiment is n = 3 per condition.
• Ensure that microarray experiments have been normalized. Without appropriate
normalization, the data may represent experimental noise.
• If a multiple test correction has not been applied, then scepticism should be
applied with prejudice (Why has no multiple test correction been applied? Is
there a problem with the experiment?). Multiple test correction can be very
stringent and can result in a failure to identify significant genes⁶, and authors
might have good reasons why they did not/could not perform the test. These
reasons should be stated in the text; if they are not, then ask why.
• Raw and processed data should always be published on a public database such
as the Gene Expression Omnibus (referenced below). This database collects all
information about how the experiment was performed and what normalization/analysis
strategy was adopted.
• Sometimes, a microarray will not include a ‘favourite’⁷ gene (or the favourite
gene of the boss) in the library. This is a disadvantage of microarrays, rather
than a failing of the experiment – the microarray library is fixed prior to
hybridization, unlike next generation sequencing (NGS: see below). Besides,
finding a gene already known to be important in a biological system is just
validation for a microarray; the real magic of transcriptomics is the
identification of genes that no one would have thought to look for.
23.4 Complementary and/or adjunct techniques

Recently, next generation sequencing (NGS) has become the new star on the
transcriptomics stage. Rather than making a library on the chip (as with a
microarray), the biologist makes the biological sample into the library and then
literally grows this library on the chip. This approach has several advantages over
microarrays: the results are not limited by which probes were chosen for the
arrays, and novel splice variants and single nucleotide polymorphisms (SNPs;
pronounced ‘snips’) can be detected, as can insertions and deletions. These
benefits must be weighed against the cost and expertise required to analyze such
complex data.
Further reading, resources and references

RNase control: http://www.invitrogen.com/site/us/en/home/References/Ambion-Tech-Support/nuclease-enzymes/general-articles/the-basics-rnase-control.html
Gene Expression Omnibus (public database for array data): http://www.ncbi.nlm.nih.gov/geo/
⁶ Significant genes: defined in this sense as genes whose expression is changed more than the acceptance
criteria.
⁷ ‘Favourite’ in this sense is taken to mean a ‘likely candidate’ – a gene that may be suspected a priori for a
variety of reasons.
Affymetrix rat 230 2.0 Genechip: http://media.affymetrix.com/support/technical/datasheets/
rat230_2_datasheet.pdf
Free analysis packages for microarray using the programming language ‘R’: http://www.
bioconductor.org/
Hindmarch, C.C., Fry, M., Smith, P.M., Yao, S.T., Hazell, G.G., Lolait, S.J., Paton, J.F.,
Ferguson, A.V. & Murphy, D. (2011). The transcriptome of the medullary area postrema:
the thirsty rat, the hungry rat and the hypertensive rat. Experimental Physiology 96(5),
495–504.
Hindmarch, C., Yao, S., Hesketh, S., Jessop, D., Harbuz, M., Paton, J. & Murphy, D. (2007).
The transcriptome of the rat hypothalamo-neurohypophyseal system is highly strain-
dependent. Journal of Neuroendocrinology 19(12), 1009–1012.