Date post: 20-Dec-2015
Page 1: Eigensolvers for analysis of microarray gene expression data Andrew Knyazev (speaker) and Donald McCuan Image from

Eigensolvers for analysis of microarray gene expression data

Andrew Knyazev (speaker) and Donald McCuan

Image from http://www.biosci.utexas.edu/mgm/people/faculty/profiles/VIresearch.jpg

Supported by NSF DMS 0728941. In collaboration with CU MCD Biology.

Page 2

Eigensolvers for DNA microarrays:

• Crash course on gene expression
• Microarrays: a massively parallel experiment
• Clustering: why?
• Clustering: how?
• Spectral clustering
• Connection to image segmentation
• Eigensolvers for spectral clustering

Page 3

Genes in DNA code for proteins. Protein formation in a cell involves:

• Transcription of DNA to mRNA (messenger RNA)
• Translation of mRNA to a protein

When proteins are being formed from a gene, this is called gene expression.

DNA sense strand      .... ATA CGT ...
DNA antisense strand  .... TAT GCA ...
      (transcription)
mRNA                  .... AUA CGU ...
      (translation)
Protein               .... Ile Arg ...

Crash course on gene expression 1/3
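
The two steps in the diagram can be traced in a few lines of code. This is a minimal sketch that only covers the six bases shown on the slide: the codon table holds just the two codons from the example (a real table has 64 entries), and the function names are illustrative.

```python
# Complement table for DNA bases.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Only the two codons used in the slide example; a full table has 64 entries.
CODON_TABLE = {"AUA": "Ile", "CGU": "Arg"}


def antisense(sense):
    """Base-by-base complement, written in the same left-to-right order as the slide."""
    return "".join(DNA_COMPLEMENT[base] for base in sense)


def transcribe(sense):
    """The mRNA carries the sense-strand sequence with U in place of T."""
    return sense.replace("T", "U")


def translate(mrna):
    """Read the mRNA three bases at a time and look up each codon."""
    return [CODON_TABLE[mrna[i:i + 3]] for i in range(0, len(mrna), 3)]


sense = "ATACGT"
anti = antisense(sense)
mrna = transcribe(sense)
protein = translate(mrna)
print(anti, mrna, protein)
```

Running it reproduces the three rows of the diagram from the sense strand alone.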

Page 4

Crash course on gene expression 2/3

Image Courtesy: cnx.org

Page 5

• Gene expression in a cell depends on many factors, e.g., developmental stage, nutrition, environment, and diseases, so the level of gene expression may vary.
• Knowing how genes are expressed helps to understand cellular processes and diagnose diseases.
• Measurement of the concentration of proteins in a cell is complicated, so the concentration of mRNA is used instead, assuming that most mRNA created is actually translated to a protein.
• DNA microarrays (e.g., Affymetrix GeneChip arrays) measure the level of mRNA in a sample.

Crash course on gene expression 3/3

Page 6

Affymetrix GeneChip DNA Microarrays

Image Courtesy: Affymetrix

Page 7

GeneChip: oligonucleotide sequences are photo-lithographed on a quartz wafer in a pattern of dots of about 10 micrometers.

Oligonucleotide (oligo) probes: 25-nucleotide chains, complementary to mRNA, for selected parts of a gene.

GeneChips are manufactured to include all currently known and predicted genes of a particular organism, e.g., H. sapiens. The information about the physical locations of the oligo probes for each gene on the chip is contained in the *.cdf file.

A sample of mRNA extracted from the cells of an organism is, after pre-processing, hybridized with the GeneChip, giving PM and MM values that characterize gene expression in the cells.

Microarrays-massively parallel experiment 2/5

For every gene there are 11-20 (depending on chip design) different oligo probes called perfect matches (PM). In addition, there is a mismatch oligo (MM) corresponding to each PM, differing from it in the middle base.
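
The PM/MM pairing can be illustrated with a small sketch. It assumes that the mismatch probe is obtained by replacing the middle base of the perfect-match probe (position 13 of 25) with its complement; the 25-mer sequence and the helper name `mismatch_probe` are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm_probe):
    """Return a copy of the PM probe with its middle base complemented (assumed MM rule)."""
    middle = len(pm_probe) // 2  # index 12, i.e. the 13th base of a 25-mer
    return pm_probe[:middle] + COMPLEMENT[pm_probe[middle]] + pm_probe[middle + 1:]


pm = "ACGTACGTACGTACGTACGTACGTA"  # hypothetical 25-nucleotide PM probe
mm = mismatch_probe(pm)
differences = [i for i, (a, b) in enumerate(zip(pm, mm)) if a != b]
print(pm)
print(mm)
print("positions that differ (0-based):", differences)
```

Because the two probes differ in a single base, a much weaker MM signal than PM signal suggests specific binding, while comparable signals suggest non-specific hybridization.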

Page 8

Microarrays-massively parallel experiment 3/5

Labelled cRNA targets derived from the mRNA of an experimental sample are hybridized to oligo probes. During hybridization, complementary nucleotides line up and bind together via hydrogen bonds, in the same way as the two strands of DNA are bound together. The chip is then scanned with a laser, giving the amount of each mRNA species represented.

Image Courtesy: cnx.org

Page 9

A pool of mRNA is extracted from the cells of an organism and converted to a Biotin labelled strand (cRNA) that binds to the oligo probes on the GeneChip during hybridization.

The higher the concentration of a particular mRNA in the testing pool, the greater the hybridization level of the PM probes, and thus the greater the amount of hybridized material on the processed GeneChip.

Then a fluorescent stain is applied that binds to the Biotin and the GeneChip is processed through a scanner that illuminates each dot of the GeneChip with a laser, causing dots to fluoresce.

The image data of the scanned probe array is stored in a *.dat file. The Affymetrix GCOS software processes the *.dat file and generates a *.cel file, containing all numerical data of the GeneChip experiment, e.g., probe locations and PM and MM intensities. The processing involves computing a square grid locating the dots for probes, intensity normalization, using internal controls, and detecting the outliers.

More sophisticated *.dat-->*.cel algorithms, e.g., taking into account the cRNA saturation, are being developed elsewhere.

Microarrays-massively parallel experiment 4/5

Page 10

The PM and MM values are not normally used directly for high-level statistical analysis; instead, they are first converted into gene expression values, which involves:

• Detecting unreliable data by comparing PM and MM
• Adjustment for background and noise
• Calculating the single-array gene expression intensities, basically by averaging adjusted PM values for each probe set
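
The conversion described above can be sketched in a deliberately simplified form. This is not the Affymetrix algorithm: the rule for discarding a probe pair (PM not brighter than MM), the use of MM plus a constant `background` as the adjustment, and the function name `probe_set_signal` are all assumptions made for illustration, and the intensities are made up.

```python
def probe_set_signal(pm_values, mm_values, background=0.0):
    """Average the adjusted PM intensities of one probe set (simplified sketch)."""
    adjusted = []
    for pm, mm in zip(pm_values, mm_values):
        if pm <= mm:
            # Treat the pair as unreliable: the mismatch is at least as bright.
            continue
        # Adjust the PM value for non-specific binding (MM) and a constant background.
        adjusted.append(max(pm - mm - background, 0.0))
    if not adjusted:
        return None  # no reliable probe pair in this probe set
    return sum(adjusted) / len(adjusted)


# Made-up intensities for a probe set with four probe pairs.
pm_values = [1200.0, 900.0, 1500.0, 300.0]
mm_values = [200.0, 100.0, 500.0, 400.0]
signal = probe_set_signal(pm_values, mm_values, background=50.0)
print(signal)
```

Here the last pair is discarded because its MM value exceeds its PM value, and the signal is the mean of the three remaining adjusted values.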

Alternatively, the Comparison Analysis (Experiment versus Baseline arrays) detects and quantifies changes in gene expression between two arrays, applying normalization of the data and using the Signal Log Ratio algorithms.
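
The idea behind a comparison of two arrays can be sketched with a plain base-2 log ratio. This is an assumption-laden simplification rather than the Signal Log Ratio algorithm itself: normalization is reduced to matching the mean intensities of the two arrays, and the expression values are made up.

```python
import math


def scale_to_mean(values, target_mean):
    """Rescale intensities so that their mean equals target_mean (crude normalization)."""
    factor = target_mean / (sum(values) / len(values))
    return [v * factor for v in values]


def log_ratios(experiment, baseline):
    """Per-gene log2(experiment / baseline) after matching the array means."""
    baseline_mean = sum(baseline) / len(baseline)
    scaled = scale_to_mean(experiment, baseline_mean)
    return [math.log2(e / b) for e, b in zip(scaled, baseline)]


# Made-up expression values for four genes on two arrays.
baseline = [100.0, 100.0, 100.0, 100.0]
experiment = [400.0, 100.0, 200.0, 100.0]
ratios = log_ratios(experiment, baseline)
print(ratios)
```

A value of 1 means a two-fold increase relative to the baseline, -1 a two-fold decrease, and 0 no change.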

Either way, the absolute or comparison gene expression values are stored in a *.chp file, which serves as the input for high-level statistical analysis. Typically, multiple GeneChip tests are performed giving multiple *.chp files with gene expression values.

Microarrays-massively parallel experiment 5/5

Page 11

When conducting microarray experiments, multiple microarrays are typically involved:

• Studying a process over time, e.g., to measure the response to a drug or food.
• Looking for differences between states, e.g., normal cells versus cancer cells.

A typical goal is finding gene networks, i.e., groups of genes that change expression inter-dependently across samples. Given a sufficiently large number of microarrays, we want to reverse engineer the regulatory network that controls gene expression. We need computer clustering of the microarray data to select an ideally small number of co-expressed genes of a gene network. Separate experiments using gene knockout on the selected genes can then be performed to confirm the discovered regulatory network biologically.

Clustering: why?

Page 12

Clustering: how?

There is no good, widely accepted definition of clustering. The traditional graph-theoretical definition is combinatorial in nature and computationally infeasible. Heuristics rule!

Many clustering techniques and methods are known, e.g.:

• Hierarchical clustering/partitioning
• K-means (centroids)
• Self-organizing maps (partitioning vectors)
• Force-directed placement
• Principal Components Analysis (PCA)
• Spectral clustering/partitioning using Fiedler vectors

Some good and popular free open-source software exists, e.g., METIS and CLUTO (Karypis Lab).

We focus on PCA and spectral clustering.

Page 13

• A = adjacency matrix

• D = degree matrix

• Laplacian matrix L = D - A

• Fiedler eigenvectors: Lx = λx

• N-cut eigenvectors of Lx = λDx (smallest λ) are the eigenvectors with the largest µ for PCA Markov walks, Ax = µDx with µ = 1 - λ. D^{-1}A is row-stochastic and describes the walk probabilities.
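
The relation µ = 1 - λ between the two eigenproblems follows in a few lines from L = D - A. A short derivation, assuming D is invertible (every vertex has at least one edge):

```latex
\begin{align*}
Lx = \lambda Dx
  &\iff (D - A)\,x = \lambda Dx \\
  &\iff Ax = (1 - \lambda)\,Dx \\
  &\iff D^{-1}A\,x = \mu x, \qquad \mu = 1 - \lambda .
\end{align*}
```

So the smallest λ correspond to the largest µ. Each row of D^{-1}A sums to one because row i of A sums to the degree of vertex i, which is why D^{-1}A can be read as a matrix of walk probabilities.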

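\[
L = \begin{pmatrix}
 3 & -1 &  0 & -1 & -1 \\
-1 &  3 & -1 &  0 & -1 \\
 0 & -1 &  2 &  0 & -1 \\
-1 &  0 &  0 &  1 &  0 \\
-1 & -1 & -1 &  0 &  3
\end{pmatrix}
\]

Rows sum to zero.

[Figure: example graph with vertices 1 to 5]

\[
x_2 = \begin{pmatrix} 0.13 \\ -0.26 \\ -0.44 \\ 0.81 \\ -0.26 \end{pmatrix},
\qquad L x_2 = 0.83\, x_2
\]

The eigenvector x_2 is defined up to an overall sign.

Spectral clustering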

A 4-degree-of-freedom system has 4 modes of vibration and 4 natural frequencies: partition into 2 clusters using the second eigenvector
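
The partition by the second eigenvector can be checked numerically on the 5-vertex example Laplacian (off-diagonal entries -1 for each edge, so that rows sum to zero). The sketch below uses a plain shifted power iteration with the constant vector projected out. It is a stand-in chosen for brevity, not the LOBPCG method; the function name `fiedler_pair` and the iteration count are arbitrary choices.

```python
import math

# Laplacian of the 5-vertex example graph (rows sum to zero).
L = [
    [3, -1, 0, -1, -1],
    [-1, 3, -1, 0, -1],
    [0, -1, 2, 0, -1],
    [-1, 0, 0, 1, 0],
    [-1, -1, -1, 0, 3],
]


def matvec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def normalize(x):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def fiedler_pair(L, iterations=500):
    """Power iteration on (shift*I - L), restricted to zero-mean vectors."""
    n = len(L)
    shift = 2.0 * max(L[i][i] for i in range(n))  # upper bound on the largest eigenvalue
    x = normalize([i - (n - 1) / 2.0 for i in range(n)])  # zero-mean starting vector
    for _ in range(iterations):
        Lx = matvec(L, x)
        y = [shift * xi - lxi for xi, lxi in zip(x, Lx)]
        mean = sum(y) / n
        x = normalize([yi - mean for yi in y])  # project out the constant eigenvector
    lam = sum(xi * lxi for xi, lxi in zip(x, matvec(L, x)))  # Rayleigh quotient
    return lam, x


lam2, x2 = fiedler_pair(L)
# Split the vertices (numbered from 1) by the sign of their entry, relative to vertex 4.
cluster_a = {i + 1 for i, v in enumerate(x2) if v * x2[3] > 0}
cluster_b = set(range(1, 6)) - cluster_a
print(round(lam2, 2), sorted(cluster_a), sorted(cluster_b))
```

The sign of each entry of the second eigenvector decides the cluster, so the overall sign of the computed vector does not matter.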

Example Courtesy: Blelloch CMU

Images Courtesy: Russell, Kettering U.

www.cs.cas.cz/fiedler80/

Page 14

Connection to image segmentation

Image pixels serve as graph vertices. Weighted graph edges are computed by comparing pixel colours.

Here is an example displaying 4 Fiedler vectors of an image:

Here we generate a sparse Laplacian by comparing only neighbouring pixels when computing the weights for the edges. In microarrays, genes correspond to vertices, but we have to compare all genes, possibly getting a Laplacian with a large fill-in.
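
The fill-in issue can be seen in a small sketch that compares every pair of genes. The edge weight used here, the absolute Pearson correlation of two expression profiles, is an assumption made for illustration, as are the function names and the made-up profiles; a gene with a constant profile would need special handling because its correlation is undefined.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equally long expression profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def dense_laplacian(profiles):
    """Build L = D - A, where A holds |correlation| for every pair of genes."""
    n = len(profiles)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            A[i][j] = A[j][i] = abs(pearson(profiles[i], profiles[j]))
    return [[sum(A[i]) if i == j else -A[i][j] for j in range(n)] for i in range(n)]


# Made-up expression profiles: 4 genes measured on 4 microarrays.
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 3.0, 2.0, 4.0],
]
lap = dense_laplacian(profiles)
nonzeros = sum(1 for row in lap for value in row if value != 0.0)
print(nonzeros, "nonzero entries out of", len(lap) ** 2)
```

Every pair of genes gets an edge, so the matrix is completely full, in contrast to the image case where each pixel is connected only to its neighbours.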

Page 15

Eigensolvers for spectral clustering

• Our BLOPEX-LOBPCG software has proved to be efficient for large-scale eigenproblems for Laplacians from PDEs and for image segmentation, using the multiscale preconditioning of hypre.
• The LOBPCG for massively parallel computers is available in our Block Locally Optimal Preconditioned Eigenvalue Xolvers (BLOPEX) package.
• BLOPEX is built into hypre, http://www.llnl.gov/CASC/hypre/, and is included as an external package in PETSc, see http://www-unix.mcs.anl.gov/petsc/
• On 1024 CPUs of BlueGene/L, we can compute the Fiedler vector of a 24 megapixel image in seconds (including the hypre algebraic multigrid setup).

Page 16

Work in progress/future work

Our current test cases are to analyze:

• Affymetrix GeneChip data from Marina Kniazeva et al., PLoS Biology, 2004.
• Microarray data from Liang Zhang et al., Molecular Cell, 2007.

Our future work will involve developing prototype spectral clustering software for microarrays in the Bioinformatics Toolbox in MATLAB, writing a microarray analysis driver for our BLOPEX library, and testing on large-scale publicly available microarray data.
