

OMICS: A Journal of Integrative Biology, Volume 7, Number 3, 2003. © Mary Ann Liebert, Inc.

PAINT: A Promoter Analysis and Interaction Network Generation Tool for Gene Regulatory Network Identification

RAJANIKANTH VADIGEPALLI,1 PRAVEEN CHAKRAVARTHULA,1 DANIEL E. ZAK,2 JAMES S. SCHWABER,1 and GREGORY E. GONYE1

ABSTRACT

We have developed a bioinformatics tool named PAINT that automates the promoter analysis of a given set of genes for the presence of transcription factor binding sites. Based on coincidence of regulatory sites, this tool produces an interaction matrix that represents a candidate transcriptional regulatory network. This tool currently consists of (1) a database of promoter sequences of known or predicted genes in the Ensembl annotated mouse genome database, (2) various modules that can retrieve and process the promoter sequences for binding sites of known transcription factors, and (3) modules for visualization and analysis of the resulting set of candidate network connections. This information provides a substantially pruned list of genes and transcription factors that can be examined in detail in further experimental studies on gene regulation. Also, the candidate network can be incorporated into network identification methods in the form of constraints on feasible structures in order to render the algorithms tractable for large-scale systems. The tool can also produce output in various formats suitable for use in external visualization and analysis software. In this manuscript, PAINT is demonstrated in two case studies involving analysis of differentially regulated genes chosen from two microarray data sets. The first set is from a neuroblastoma N1E-115 cell differentiation experiment, and the second set is from neuroblastoma N1E-115 cells at different time intervals following exposure to the neuropeptide angiotensin II. PAINT is available for use as an agent in the BioSPICE simulation and analysis framework (www.biospice.org), and can also be accessed via a WWW interface at www.dbi.tju.edu/dbi/tools/paint.

INTRODUCTION

THE PRIMARY OBJECTIVE OF THIS PAPER is to present an automated and scalable bioinformatics approach to the identification and analysis of candidate regulatory interactions in a specific experimental context. Complex cellular processes such as adaptation, development, differentiation, and cell cycle in biological systems involve multiple genes functioning in a hierarchical and interconnected network. Present technological developments have resulted in rapidly growing public resources containing systematic data sets of various types: gene expression changes from microarrays; protein-DNA interaction and transcription factor (TF) activity data from protein binding assays, chromatin immunoprecipitation (ChIP) experiments (Wells and Farnham, 2002), and DNA footprinting (Kang et al., 2002; Ricci and El-Deiry, 2003); protein-protein interactions from two-hybrid experiments and coimmunoprecipitation; and genomic sequence and ontology information in public databases (www.geneontology.org). The analysis of these large data sets holds the promise of identification of the nonlinear dynamic systems function of the interconnected gene and biochemical regulatory networks.

1Daniel Baugh Institute for Functional Genomics and Computational Biology, Department of Pathology, Thomas Jefferson University, Philadelphia, Pennsylvania.

2Department of Chemical Engineering, University of Delaware, Newark, Delaware.

Attempts at reverse engineering the gene regulatory networks from microarray data alone have met with varied success (Holstege et al., 1998; D'Haeseleer et al., 1999; Wessels et al., 2001; Ronen et al., 2002). Typically, all the genes are considered as potentially regulating all the other genes, and the suboptimal and non-unique results are subsequently pruned either by setting thresholds on the quantitative parameters representing interaction strength or via constrained optimization (Yeung et al., 2002). Combining the available heterogeneous data types significantly improves the ability to unravel the regulatory networks (Tavazoie et al., 1999; Hughes et al., 2000; Zak et al., 2001; Hartemink et al., 2002; Ideker et al., 2002). The principal effect of incorporating additional data types apart from microarrays is to constrain the number of regulatory interactions per gene. Based on the known protein-DNA and protein-protein interactions, many interactions can be required to be present or specified to be nonexistent in the identification algorithm. This limits the number of interaction parameters to search for and renders the network identification algorithms tractable for a large number of genes (Zak et al., 2001; Yeung et al., 2002).

Transcriptional regulation occurs through specific transcription factors (TFs) binding to the transcriptional regulatory elements (TREs) present in the cis-regulatory region (promoter) of the corresponding genes. The binding is sequence specific, and the binding sites are present on multiple genes. This results in an interconnected transcriptional regulatory network. Hence, the analysis of the promoters for the genes of interest for known and predicted TF binding sites will directly provide a good candidate set of network interactions (Hughes et al., 2000; Hartemink et al., 2002; Ideker et al., 2002). This approach has been most successful in developing a detailed understanding of gene regulatory networks in yeast (Tavazoie et al., 1999; Ideker et al., 2001). The availability of genomic sequence, combined with extensive information about TF binding site motifs, has enabled system-wide analyses to unravel the gene regulatory networks that govern the response of yeast to a multitude of environmental perturbations (Tavazoie et al., 1999; Ideker et al., 2001). Similar efforts are in progress in Drosophila (Berman et al., 2002), sea urchin (Davidson et al., 2002), and human systems (Elkon et al., 2003).

In this context, the objective of the research efforts described in this paper is to develop an automated and scalable bioinformatics approach to the identification and analysis of candidate regulatory interactions in a specific experimental setting. We have developed PAINT (Promoter Analysis and Interaction Network Tool) for this purpose. Briefly, PAINT processes a list of unique identifiers representing the genes of interest and produces an interaction matrix that represents a candidate set of interactions between the transcription factors and the genes. This information can be subsequently employed in various network identification, analysis, and visualization software. The objective is not to develop another tool for sequence analysis, but to construct a modular and extensible platform into which various sequence analysis tools, network analysis and visualization software, and model identification tools can be "plugged in." In addition to the command line and the web-based interfaces (available at www.dbi.tju.edu/dbi/tools/paint), PAINT modules can be accessed as "agents" that communicate via Open Agent Architecture in the BioSPICE platform (www.biospice.org).

The present manuscript is organized as follows. The details of the various modules that constitute PAINT are presented in the Methods section. In the Results section, two case studies are presented to demonstrate the tool. The first study involves a set of 155 differentially regulated genes in differentiated neuroblastoma cells. The second study is concerned with a set of 578 differentially regulated genes identified from expression time series data derived from neuroblastoma cells exposed to the neuropeptide angiotensin II.

VADIGEPALLI ET AL


MATERIALS AND METHODS

PAINT

The modular architecture of PAINT is not organism-specific. The key requirements are the availability of annotated genome sequence and information on transcription factor binding site motifs. PAINT 2.3 can conduct analysis specific to the mouse genome. The tool contains six components:

1. UpstreamDB: A MySQL database containing a predicted promoter sequence for each gene that can be queried using the Ensembl GeneID, LocusLink, Unigene ClusterID, or Clone ID (GenBank accession number).

2. PromoterIDDBsync: A Perl module that periodically updates the promoter sequence database from the Ensembl genome assembly database. Included in this module is the functionality of identifying the Transcription Start Site (TSS) using multiple methods described in the following section.

3. Upstreamer: A Perl module that provides the functionality of sequence retrieval from the UpstreamDB database, given a list of unique identifiers for the genes of interest.

4. TFRetriever: A Perl module that processes the retrieved sequences through the transcription factor inspection/discovery programs. The dynamic nature of the databases containing transcription factor information and the user-specified parameter options require online retrieval rather than offline processing for all the promoters in UpstreamDB.

5. FeasNetBuilder: A Perl module that processes the output of the transcription factor inspection/discovery programs and produces a Candidate Interaction Matrix (CIM) for the genes of interest.

6. FeasNetViewer: A Perl and R module that contains functions for analysis and visualization of the CIM. A matrix image with optional clustering of data and a network layout diagram are available. Also produced are various file formats (e.g., SBML and GraphML) for use in JDesigner (Hucka et al., 2001), Cluster/TreeView (Eisen et al., 1998), Pajek (Batagelj and Mrvar, 1998), and Cytoscape (Ideker et al., 2002). These can be used in the subsequent network analysis and visualization.

The modular architecture of PAINT is depicted in Figure 1. A detailed description of each of the modules and the input-output relationships is presented next. A discussion of the issues involved and the specific choices made in the tool development is presented in the Discussion.

PAINT PROMOTER ANALYSIS AND INTERACTION NETWORK TOOL


FIG. 1. Modular architecture of PAINT. The input and output information for each module is indicated on the left. The external databases and programs that PAINT utilizes are indicated by dash-outlined boxes.

PAINT modules

The UpstreamDB module. For an organism of interest, the principal requirement for constructing the promoter database is an annotated genome sequence assembly. Several genome assemblies are available for mammalian systems, for example, Ensembl (Hubbard et al., 2002), Santa Cruz (http://genome.ucsc.edu), and Celera (www.celera.com). The UpstreamDB database was constructed for all the annotated genes (known and putative) in the Ensembl genome database for Mus musculus. For each gene, 5000 base pairs (bp) upstream (5′ to the gene), the first exon of the open reading frame (ORF), and 100 base pairs downstream (3′) of the end of the first exon were retrieved from the genome and placed in a temporary database (TempDB) prior to the identification of the Transcription Start Sites (TSS). In the case of genes for which upstream sequence of length 5000 bp is not available in the genome database (due to the assembly being incomplete), the maximum available sequence was retrieved. The retrieved sequence was placed in the database only if at least 300 bp of sequence immediately 5′ to the gene was available. The genome database contains sequences in 5′ to 3′ orientation on a single strand (conventionally denoted as +1) of DNA. For the genes that are located on strand -1, the sequence from the genome database was reversed and the complementary base pairs were computed to produce the upstream sequences.
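The strand handling described above can be illustrated with a minimal Python sketch. The function names are hypothetical and are not part of PAINT, which the text describes as implemented in Perl; the sketch only shows the reverse-complement step for genes on strand -1, assuming the input is the sequence as read from the +1 strand.

```python
# Illustrative sketch only; names are hypothetical, not PAINT's Perl code.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def oriented_upstream(plus_strand_seq: str, strand: int) -> str:
    """Return the sequence in 5' to 3' orientation relative to the gene.

    plus_strand_seq is the genomic sequence as stored on the +1 strand.
    For a gene on strand -1 the sequence is reversed and complemented.
    """
    if strand == 1:
        return plus_strand_seq
    if strand == -1:
        return reverse_complement(plus_strand_seq)
    raise ValueError("strand must be +1 or -1")
```

For example, `oriented_upstream("AAGT", -1)` yields `"ACTT"`, while the same call with strand `1` returns the input unchanged.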

The key aspect of the analysis is using the correct sequence to represent the cis-regulatory control regions. Note that this requires information about the 5′ untranslated region (UTR) of each gene in order to correctly identify the TSS, and hence the corresponding cis-regulatory control region, for each gene. The first exon of a gene as annotated in the Ensembl database does not necessarily correspond to the TSS (Davuluri et al., 2001). This creates difficulty in identifying and retrieving the appropriate sequence data corresponding to the cis-regulatory region for the genes of interest. A sequence-driven approach can be employed for computing the TSS of a given gene by alignment with corresponding expressed sequence data, for example, EST sequence (dbEST) or cDNA sequence (from GenBank). Of particular interest to the mouse model system is the effort designed to provide 5′ end data for mRNAs (RIKEN clone sequences; Kawai et al., 2001). These clone sequences were aligned to the gene upstream sequence in the TempDB to estimate the TSS for each gene. The sequence alignment program Megablast was used with option "W48" to use a word size of 48 in alignment. For each gene, the corresponding alignments were filtered by the following criteria:

1. The alignment should have fewer than three mismatches.

2. The alignment should be closest to the 5′ end of the aligned clone sequences.

3. If 5′ ends of multiple clones align well, then the alignment that is 5′-most on the gene upstream sequence is selected.
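The three filtering criteria can be read as a small selection procedure. The following Python sketch is one possible interpretation, not the authors' implementation: the `Alignment` record and its field names are hypothetical, coordinates are assumed to be 0-based offsets from the 5′ end, and criterion 2 is interpreted as keeping, for each clone, the alignment that starts nearest that clone's 5′ end.

```python
# Illustrative sketch only; the record layout and the reading of
# criterion 2 are assumptions, not taken from PAINT.
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Alignment:
    clone_id: str        # identifier of the aligned clone
    mismatches: int      # number of mismatches in the alignment
    clone_start: int     # offset of the alignment from the clone's 5' end
    upstream_start: int  # offset of the alignment on the upstream sequence


def estimate_tss(alignments: Iterable[Alignment]) -> Optional[int]:
    """Return the estimated TSS offset, or None if no alignment qualifies."""
    best_per_clone: Dict[str, Alignment] = {}
    for aln in alignments:
        if aln.mismatches >= 3:          # criterion 1: fewer than three
            continue
        current = best_per_clone.get(aln.clone_id)
        # criterion 2: keep the alignment nearest the clone's 5' end
        if current is None or aln.clone_start < current.clone_start:
            best_per_clone[aln.clone_id] = aln
    if not best_per_clone:
        return None
    # criterion 3: among clones, take the 5'-most position on the upstream
    return min(a.upstream_start for a in best_per_clone.values())
```

A gene for which this returns `None` would fall through to the next estimation method described in the text.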

The position on the gene upstream sequence that is marked by the above alignment and filtering was considered to be the estimated TSS. Using this procedure, an estimate of the TSSs for 5040 genes was obtained. For the remaining genes, a TSS prediction tool, Eponine (servlet.sanger.ac.uk:8080/eponine), was employed to estimate the TSS within the 5000 bp 5′ from the start of the ORF. The TSSs for 2278 genes were identified in this manner. For the remaining 14045 genes, the start of the ORF was considered as the TSS. After the TSSs were estimated, an "updated" upstream sequence of 2000 bp 5′ to the estimated TSS was retrieved for each gene and stored in the UpstreamDB database.

In addition to the promoter sequence for each gene, UpstreamDB also contains the cross-reference tables that enable retrieval of promoters using Unigene ClusterID, LocusLink, and the cDNA clone accession number. This cross-reference was constructed using information from the Unigene database. This allows for convenient retrieval of the promoter sequences directly from a list of genes marked as significantly varying in expression by the microarray analysis software or other gene expression analysis methods.

The PromoterIDDBsync module consists of Perl scripts that build the UpstreamDB database and periodically update it when a new version of the annotated genome database is released. This includes updated estimation of the TSS and the cross-reference tables.

The Upstreamer module contains Perl functions that can be wrapped for inclusion in UNIX shell scripts, Perl scripts, web-based scripts such as PHP, and Open Agent Architecture (for use as a BioSPICE module). The input from the user is a list of identifiers for the genes of interest and the number of base pairs of the upstream sequence needed for analysis. The length count is from the start of the gene towards the upstream (5′) end. However, the retrieved sequence is written in the 5′ to 3′ direction, as per convention. The output of the module is the upstream sequences of the specified length for the genes that are referenced in the UpstreamDB database. The output is in FASTA format for further processing by transcription factor binding motif inspection/discovery software.
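The retrieval behaviour described above (length counted from the gene start towards the 5′ end, sequence written 5′ to 3′, FASTA output, unreferenced genes skipped) can be sketched in Python as follows. This is an illustration under stated assumptions, not the Perl module itself: the `promoters` dictionary stands in for UpstreamDB, and the function name, header format, and line width are hypothetical.

```python
# Illustrative sketch only; `promoters` is a stand-in for UpstreamDB and
# the function name, FASTA header format, and line width are assumptions.
from typing import Dict, Iterable


def upstream_fasta(promoters: Dict[str, str], gene_ids: Iterable[str],
                   length: int, width: int = 60) -> str:
    """Return FASTA text with the last `length` bp of each stored promoter.

    Stored sequences are assumed to be written 5' to 3' and to end at the
    start of the gene, so the requested region is the 3'-most slice.
    """
    if length <= 0 or width <= 0:
        raise ValueError("length and width must be positive")
    records = []
    for gene_id in gene_ids:
        seq = promoters.get(gene_id)
        if seq is None:
            continue  # gene not referenced in the database: no output
        region = seq[-length:]  # shorter sequences are returned whole
        lines = [region[i:i + width] for i in range(0, len(region), width)]
        records.append(">" + gene_id + "\n" + "\n".join(lines))
    return "\n".join(records) + ("\n" if records else "")
```

Requesting more bases than are stored simply returns the full stored sequence, which mirrors the "maximum available sequence" behaviour described for database construction.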

The TFRetriever module is envisaged to contain several sub-modules that can communicate with various local and web-based motif inspection and discovery software, such as MatInspector (Quandt et al., 1995) and MEME (Bailey et al., 1994). A motif is a characteristic sequence of a binding site, and functionally similar motifs are grouped together into families. PAINT 2.3 contains only the sub-module for interacting with the MatInspector software. The set of vertebrate transcription factor families is utilized for promoter inspection. The output of the TFRetriever module is the output from the motif discovery program for each input sequence list.

The FeasNetBuilder module can process the output from MatInspector to construct an interaction matrix representing a candidate set of connections in the regulatory network, based on the promoter sequence and TF/TRE information. The complete list of TREs found in the set of promoter sequences processed is generated first. The columns of the interaction matrix correspond to the TREs, and each row corresponds to a gene from the input list. If the parameter for binary counting is set in PAINT, the regulation of a gene is represented by a 1 if the corresponding TRE is present on the promoter for that gene, and by a 0 otherwise. This matrix represents the constraints to a network identification scheme. The interaction parameters corresponding to zeros in the candidate matrix need not be computed, substantially reducing the dimensionality of the identification problem. If the parameter for binary counting is not set, each element of the CIM will be equal to the number of corresponding TREs found on the respective promoter.
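The construction of the CIM in binary and count modes can be sketched in Python. This is a minimal illustration, not the FeasNetBuilder code: the input is assumed to be a list of (gene, TRE) pairs, one per motif match, which is a hypothetical simplification of the MatInspector output, and the function name is invented for this example.

```python
# Illustrative sketch only; the (gene, TRE) pair input and the function
# name are assumptions, not the actual FeasNetBuilder interface.
from typing import List, Sequence, Tuple


def build_cim(hits: Sequence[Tuple[str, str]], genes: Sequence[str],
              binary: bool = True) -> Tuple[List[str], List[List[int]]]:
    """Return (TRE column labels, matrix) with one row per input gene."""
    row = {gene: i for i, gene in enumerate(genes)}
    # Columns: every TRE found on at least one promoter of the input genes.
    tres = sorted({tre for gene, tre in hits if gene in row})
    col = {tre: j for j, tre in enumerate(tres)}
    cim = [[0] * len(tres) for _ in genes]
    for gene, tre in hits:
        if gene not in row:
            continue  # match on a promoter outside the input list
        i, j = row[gene], col[tre]
        cim[i][j] = 1 if binary else cim[i][j] + 1
    return tres, cim
```

With `binary=True` each entry records presence or absence of a TRE; with `binary=False` it records the number of matches, as described in the text.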

The FeasNetBuilder module contains a sub-module named StatFilter that computes p-values for the over-representation of the TREs in the set of promoters considered, with respect to a background set of promoters. Specifically, the p-values give the probability that the observed counts for the TREs in the set of promoters could be explained by random occurrence in the background set of promoters. The p-values are calculated using the hypergeometric distribution (Bury, 1999; Jakt et al., 2001; Elkon et al., 2003). Typically, the reference set is that of the genes on the microarray utilized in the experiments. For each TRE V$XYZ, given (1) a reference CIM of n promoters, of which l promoters contain V$XYZ, and (2) a CIM of interest with m promoters, of which h contain V$XYZ, the associated p-value for over-representation is given as:

$$p = \sum_{i=h}^{m} \frac{\binom{l}{i}\binom{n-l}{m-i}}{\binom{n}{m}}$$

The p-value for under-representation of a TRE in the observed CIM is calculated similarly, with the summation in the above equation going from 1 to h. These estimates of significance can be utilized in filtering for those TREs that meet a threshold (say, p < 0.1) to identify the most likely regulators of the genes considered in the experimental context of interest. For the case studies presented here, the CIM corresponding to the 3200 annotated cDNA clones on the microarray utilized for experiments was considered as the reference CIM. Given no information about the source of the genes from which the input list to PAINT is generated, PAINT can optionally utilize the CIM corresponding to all the genes in the UpstreamDB database as a reference CIM.
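The hypergeometric sum for over-representation can be evaluated directly with exact integer binomial coefficients. The Python sketch below is an illustration, not the StatFilter code; the function name is hypothetical, and the symbols n, l, m, and h follow the definitions in the text.

```python
# Illustrative sketch only; the function name is an assumption.
from math import comb


def overrepresentation_p(n: int, l: int, m: int, h: int) -> float:
    """Hypergeometric tail probability of seeing at least h hits.

    n: promoters in the reference CIM
    l: reference promoters that contain the TRE
    m: promoters in the CIM of interest
    h: promoters of interest that contain the TRE
    """
    if not (0 <= l <= n and 0 <= h <= m <= n):
        raise ValueError("require 0 <= l <= n and 0 <= h <= m <= n")
    total = comb(n, m)
    # comb(l, i) is 0 when i > l, so terms beyond min(m, l) vanish.
    tail = sum(comb(l, i) * comb(n - l, m - i) for i in range(h, m + 1))
    return tail / total
```

A TRE would then pass a filter such as p < 0.1 when `overrepresentation_p(n, l, m, h) < 0.1`. Using exact integers avoids the loss of precision that can occur when binomial coefficients for thousands of promoters are computed in floating point.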

The FeasNetViewer module contains various functions for the visualization and analysis of the CIM. An image of the interaction matrix is produced in which the individual elements of the matrix are represented by a color based on the significance values for that particular TRE (p-values for over-representation in the observed CIM). This module also contains functionality for hierarchical clustering using the "R" software for statistical analysis (www.r-project.org). For clustering, the pair-wise distance that is most appropriate for the CIM data is the binary distance, also known as "Jaccard's Coefficient" (JC). The JC between two genes (or TFs) can be computed as the ratio of the number of elements for which the two rows (or columns) are dissimilar to the total number of elements for which either of the rows contains a 1. For the genes, JC is the "dissimilarity" between the regulatory patterns of two genes as related to the total number of distinct binding sites present on either of them. For the TFs, JC is the "dissimilarity" between the regulatory patterns of two TFs as related to the total number of genes either of the TFs can regulate.
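The binary distance defined above is simple to compute for a pair of CIM rows (or columns). The following Python sketch implements that definition directly; the function name is hypothetical, and returning 0.0 when neither vector contains a 1 is an assumption made here to avoid division by zero, not something the text specifies.

```python
# Illustrative sketch only; the handling of two all-zero vectors is an
# assumption, not taken from the text.
from typing import Sequence


def binary_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Positions where a and b differ, divided by positions where either is nonzero."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    differ = sum(1 for x, y in zip(a, b) if bool(x) != bool(y))
    either = sum(1 for x, y in zip(a, b) if x or y)
    return differ / either if either else 0.0
```

For rows `[1, 1, 0, 0]` and `[1, 0, 1, 0]`, two positions differ and three positions carry a 1 in at least one row, giving a distance of 2/3. Count-valued CIM entries are treated as present or absent.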

In PAINT, the clustered data can be visualized as a matrix layout with the hierarchical tree structure aligned to the rows and the columns of the CIM. The zeros in the matrix are shown in black, and the non-zero entries in the CIM are colored based on the p-value of the corresponding TRE. The brightest shade of red represents a low p-value (most significantly over-represented in the CIM). Conversely, the brightest shades of cyan represent smaller p-values for under-representation in the observed CIM, indicating more significantly under-represented TREs. This image can optionally represent the cluster index of each gene, where such cluster indices are generated from other sources such as expression or annotation-based clustering. With such visualization, it is straightforward to explore the relationship between expression/annotation-based clusters and those based on cis-regulatory pattern (i.e., CIM).

The FeasNetViewer module can also generate histograms of the network connectivity from the CIM to provide additional insights into the regulatory network of interest. A histogram of the sum over all columns in the CIM (one total per gene) provides the distribution of the number (or fraction) of TREs that can regulate a given gene. This distribution is typically uni-modal with long tails, indicating that very few genes are regulated by very few or very many TREs. Similarly, a histogram of the sum over all rows in the CIM (one total per TRE) provides the distribution of the number (or fraction) of genes that are regulated by a TRE. Typically, this distribution is monotonically decreasing, indicating that most of the TREs are present on few genes each (fine-tuned regulation), and relatively few TREs are present on a large number of genes (system-wide effects).
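The two connectivity distributions are the per-gene and per-TRE totals of the CIM. A minimal Python sketch is given below, assuming the CIM is a list of rows (genes) by columns (TREs) and that any non-zero entry counts as one connection; the function name is hypothetical.

```python
# Illustrative sketch only; assumes rows are genes, columns are TREs,
# and non-zero entries count as a single connection.
from collections import Counter
from typing import List, Sequence, Tuple


def connectivity(cim: Sequence[Sequence[int]]) -> Tuple[List[int], List[int]]:
    """Return (TREs per gene, genes per TRE) for a candidate interaction matrix."""
    n_cols = len(cim[0]) if cim else 0
    tres_per_gene = [sum(1 for v in row if v) for row in cim]
    genes_per_tre = [sum(1 for row in cim if row[j]) for j in range(n_cols)]
    return tres_per_gene, genes_per_tre


# Tiny made-up matrix: three genes (rows) and three TREs (columns).
example = [[1, 0, 1],
           [1, 0, 0],
           [1, 1, 0]]
per_gene, per_tre = connectivity(example)
gene_histogram = Counter(per_gene)  # how many genes have k TREs
tre_histogram = Counter(per_tre)    # how many TREs occur on k genes
```

The `Counter` objects hold the histogram bin counts; plotting them is left to whichever graphics package is in use.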

The FeasNetViewer module can also generate a network layout diagram using the GraphViz libraries (available at www.research.att.com/sw/tools/graphviz). Additionally, various output formats can be produced for input into analysis and visualization software such as Cluster/TreeView, Pajek, Cytoscape, and JDesigner.

Experimental data

The model system considered presently is the N1E-115 neuroblastoma cell line (Amano et al., 1972). After differentiation, N1E-115 cells synthesize several neurotransmitters and express functionally coupled neurotransmitter/neuropeptide receptors (Richelson, 1990). For use with N1E-115 cells, we are currently printing 8600 mouse cDNAs, which are greater than 95% nonredundant, onto 1" × 3" glass microarrays. This collection of cDNAs was derived solely from CNS tissues as part of the Brain Molecular Anatomy Project at the University of Iowa (http://brainest.eng.uiowa.edu/index.html). The set represents 3200 annotated genes and 5400 unannotated genes, using conservative definitions of annotation.

The data for case study 1 were obtained by comparing gene expression in undifferentiated, growing N1E-115 cells to that in differentiated N1E-115 cells. A set of 193 annotated genes that are differentially expressed by at least a factor of 2 was considered for further analysis.

For case study 2, the data set was obtained from microarray experiments on differentiated N1E-115 cells exposed to the neuropeptide angiotensin II (AngII). AngII is a multifunctional hormone that influences the function of cardiovascular cells through a complex set of intracellular signaling pathways initiated by the interaction of AngII with the AT1 and AT2 receptors (Berry, 2001; Touyz, 2002). AT1 receptor activation leads to cell growth, vascular contraction, inflammatory responses, and salt and water retention, whereas AT2 receptors induce apoptosis, vasodilation, and natriuresis. In an effort to isolate the transcriptional response to AngII to the AT1 receptor (AT1R), the AT2 subtype was blocked with a saturating dose of antagonist. Cultures of differentiated N1E-115 cells were pretreated with 10 μM of PD123319 for 30 min by addition of a 1000× stock solution directly to the culture media without removal of the dish from the incubator. After 30 min, AngII (100 nM final, 100 μM stock) was rapidly added to all the parallel cultures required for the time course. Pretreated cultures with no AngII added were considered as time = 0 samples. A time series of gene expression data was obtained from microarray experiments with RNA isolated at 0, 5, 15, 30, and 60 min after exposure to AngII. A total of 1338 genes with at least a twofold change at any of the time points were considered responsive and were included in further analysis with PAINT.


RESULTS

PAINT 2.3 contains sequences from version 7.3a of the Ensembl annotated mouse genome database. This mouse draft sequence is based principally on whole genome shotgun sequencing of around 7× coverage. It was frozen in February 2002 and incorporates finished clone information where available. The sequence is estimated to cover 96% of mouse euchromatic DNA. A total of 22444 annotated genes were processed, of which 21363 promoter sequences were retrieved based on the TSS identification and filtering criteria specified in Materials and Methods.

Case study 1

As described above (Methods section), an example microarray data set was obtained by comparing growing and differentiated neuroblastoma N1E-115 cells in culture, providing 193 annotated genes that were differentially regulated by at least a factor of two (130 up-regulated and 63 down-regulated). The GenBank accession numbers of these 193 clones were provided as the input to PAINT.

The Upstreamer module returned a list of 155 promoters, i.e., 38 of the 193 genes of interest did not have a promoter sequence that met the filtering criteria (specified in Methods) for constructing the UpstreamDB database. This data set is referred to as DIFF155 for the rest of the document. Of the 155 promoters retrieved, 107 correspond to the up-regulated genes (referred to as DIFFUP107) and 48 to the down-regulated genes (referred to as DIFFDOWN48). These 155 promoter sequences were processed using the TFRetriever module communicating with the MatInspector software. Within MatInspector, the vertebrate database with 128 TRE families and 313 position weight matrices was utilized in motif inspection. The TFRetriever module retrieved motif matches for a total of 273 distinct TREs. FeasNetBuilder constructed a candidate interaction matrix between the 155 genes and 273 TREs. The CIMs for the DIFFUP107 and DIFFDOWN48 subsets were obtained from the DIFF155 CIM.

The distributions of interactions for the 155 genes and the 273 TREs are depicted in Figure 2. The distribution for the genes is uni-modal with long tails (low percentage of genes at the extreme ends of the distribution). This indicates that relatively few genes are regulated by very many or very few TFs (extreme ends of the distribution shown in Figure 2a). The distribution shown in Figure 2b indicates that a few TREs are present on a significant fraction of the genes of interest, indicating a potential role in system-wide effects (right end of the distribution). Analysis of Figure 2b also indicates that most of the TREs are present on, and hence can regulate, relatively few genes each (suggesting a function in fine-tuning local effects).



FIG. 2. The distribution of interactions for the genes (a) and the TREs (b) in the DIFF155 interaction matrix. There are relatively few genes that are regulated by either very many or very few TFs, indicating relatively rare cases of excessive or little regulation. The interaction distribution for transcription factors indicates that there are relatively few transcription factors that can regulate the majority of the genes in the input list.

A representation of the candidate interaction network is depicted in Figure 3. A subset of the CIM is shown in Figure 4. The genes and motifs were individually clustered using Jaccard's coefficient as the dissimilarity metric. The column immediately next to the CIM represents whether the corresponding gene is found to be up- or down-regulated in the expression data. Note that most of the up- or down-regulated genes do cluster together based on the regulatory pattern of their promoters. However, there are clusters containing both up- and down-regulated genes, indicating that the activity of specific transcription factors in the experiment needs to be utilized to prune the candidate interaction matrix to improve the network prediction. A network layout diagram containing the five TREs in the DIFF155 CIM with the lowest p-values for over-representation is depicted in Figure 5.

The p-value of each TRE in the CIMs for DIFF155, DIFFUP107, and DIFFDOWN48 was calculated using the StatFilter sub-module. Analysis of the TREs that are significantly over-represented in DIFF155, DIFFUP107, and DIFFDOWN48 revealed details that are very different from those of the analysis of the static cis-regulatory pattern (complete CIM). The p-values of 42 TREs that were significantly over-represented in at least one of DIFF155, DIFFUP107, and DIFFDOWN48 are shown in Table 1. A p-value threshold of 0.1 for DIFFUP107 and DIFFDOWN48, and of 0.15 for DIFF155, was employed to filter for TRE significance. It is interesting to note that only two families of TREs (MYOD and FKHD) were significantly over-represented in both the up-regulated (DIFFUP107) and down-regulated (DIFFDOWN48) genes. Such a dramatic difference was not obvious from the analysis of the feasible cis-regulatory pattern based on all the TREs found to be present (Fig. 3). Several of the TRE families are implicated in cell differentiation and maturation (e.g., AREB, Ikeda et al., 1995; CREB, Dobi et al., 1995; GATA, Nardelli et al., 1999).

Note that the results from the analysis of significantly over- or under-represented TREs may be a conservative estimate of the TREs involved in regulation in the experimental context of interest. In the present case study of N1E-115 differentiation, some TREs that are known to be involved in the differentiation of other cell types were not found to be significantly over-represented. One example is MZF1, which was not found to be significantly enriched in either the up- or down-regulated genes (p-value of 0.77 in DIFFUP107 and p-value of 0.59 in DIFFDOWN48), even though it has been shown to be involved in delaying cell differentiation in other cell types (Morris et al., 1994). MZF1 and similar TREs may not have appeared to be significant in the present case study because they are not involved in N1E-115 differentiation, or because the data currently available were not sufficient to identify every TRE that is involved in the process.

Another interesting observation is that a particular TRE may not be significantly over- or under-represented in a particular cluster or sub-group of genes, but can still be significantly over-represented in the overall set of genes considered. Examples are NFAT, LEFF, and GATA/GATA3.01. These are implicated in cellular differentiation in different neuronal cell types (NFAT, Plyte et al., 2001; GATA/GATA3.01, Nardelli et al., 1999).

Given no other information, an identification algorithm would have to compute 71 × 87 = 6177 connection parameters. However, the candidate network contains only 883 non-zero entries, i.e., only 883 (14.2%) interaction parameters need to be computed. Even this is a gross overestimate, as the dynamic activity data about specific TFs from ChIP experiments can substantially reduce the number of candidate interactions further. The localization data thus obtained can significantly improve the regulatory network identification (Zak et al., 2001; Hartemink et al., 2002).

Case study 2

As described above, a gene expression time series data set was obtained from microarray experiments involving neuroblastoma N1E-115 cells at 0, 5, 15, 30, 120 minutes after exposure to angiotensin II. Of the 8600 genes on the microarray, a total of 1338 genes with at least a twofold change at any of the time points were considered responsive and were included in the analysis. This list of 1338 genes was presented as input to PAINT for promoter analysis.

FIG. 3. A representation of the candidate interaction matrix for 155 differentially expressed genes and 273 TREs analyzed in N1E-115 cell differentiation. The data set is clustered using Jaccard's coefficient (binary distance) as the dissimilarity metric. Individual elements of the matrix are colored by the significance p-values: over-representation in the matrix is indicated in red, under-representation is indicated in cyan, and the TREs that are neither significantly over- nor under-represented in the matrix are colored in gray. The column to the left of the matrix represents gene expression in differentiated cells: red indicates up-regulated genes and green indicates down-regulated genes.

FIG. 4. A subset of the candidate interaction matrix for DIFF155.

The PAINT Upstreamer module retrieved 578 promoters (referred to as ANG578 for the rest of the document). The promoters for almost all the annotated genes in the initial set of 1338 genes were retrieved. The connectivity of the ANG578 CIM is depicted in Figure 6. The connectivity in the CIM for ANG578 was qualitatively similar to that observed in the DIFF155 data set (case study 1). As in case study 1, there were few genes that could be regulated by very many or very few TFs (Fig. 6a). Again as in case study 1, analysis of the distribution shown in Figure 6b indicates that relatively few TFs can regulate a large number of genes in ANG578 (right end of the distribution; system-wide effects), and most of the TFs regulate few genes each (left end of the distribution; fine-tuning, local effects).
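The two distributions described above can be obtained from the row and column sums of a binary CIM. The following is a minimal sketch under that assumption; the function names and the toy matrix are hypothetical and do not reproduce the ANG578 data.

```python
from collections import Counter


def interaction_counts(cim):
    """Return (number of TREs per gene, number of genes per TRE) for a binary matrix."""
    per_gene = [sum(row) for row in cim]
    per_tre = [sum(column) for column in zip(*cim)]
    return per_gene, per_tre


def histogram(counts):
    """Map each interaction count to how often it occurs, sorted by count."""
    return dict(sorted(Counter(counts).items()))


# Toy matrix: 4 genes (rows) by 3 TREs (columns), illustrative values only.
toy_cim = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
    [1, 0, 0],
]

per_gene, per_tre = interaction_counts(toy_cim)
print(histogram(per_gene))  # {1: 2, 2: 2}
print(histogram(per_tre))   # {1: 2, 4: 1}
```

In the toy example one TRE is present on every promoter while the other two each occur on a single promoter, which is the qualitative shape described for Figure 6b.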


FIG. 5. A network layout showing the top five TREs in DIFF155 CIM (p value < 0.1). The rectangular boxes represent the TREs and ellipses represent the genes.



TABLE 1. THE P VALUES FOR OVER-REPRESENTED TRES FOR DIFFERENT GROUPS OF DIFFERENTIALLY EXPRESSED GENES IN DIFFERENTIATED N1E-115 CELLS

                               p values for over-representation
TRE                            DIFF193      DIFFUP107    DIFFDOWN48

Group I: significantly over-represented in DIFFUP107 (p value < 0.1)
V$AREB/AREB6.04                0.34113      0.098615     0.91362
V$BEL1/BEL1.01                 0.18538      0.051241     0.92558
V$CREB/CREB.03                 0.097854     0.88728      -
V$EGRF/EGR1.01                 0.16851      0.026622     -
V$ETSF/ETS1.01                 0.048203     0.034419     0.50693
V$FKHD/HNF3B.01                0.2181       0.053624     0.91238
V$LTUP/TAACC.01                0.061806     0.054587     0.43855
V$MYOD/MYF5.01                 0.14774      0.030602     -
V$NFKB/NFKAPPAB50.01           0.023547     0.045569     0.20417
V$NKXH/MSX2.01                 0.10546      0.038806     0.73881
V$NRLF/NRL.01                  0.12771      0.067729     -
V$NRSF/NRSF.01                 0.20544      0.080247     0.87563
V$RCAT/CLTR_CAAT.01            0.085856     0.078617     0.45481
V$RORA/RORA2.01                0.054084     0.0094089    0.8739
V$RORA/T3R.01                  0.084161     0.04346      -
V$RREB/RREB1.01                0.037354     0.017974     0.56438
V$TCFF/TCF11.01                0.15363      0.022425     0.94173
V$TTFF/TTF1.01                 0.19358      0.086217     -
V$XBBF/MIF1.01                 0.041325     0.024494     0.57349

Group II: significantly over-represented in DIFFDOWN48 (p value < 0.1)
V$AP4R/AP4.03                  0.043503     0.26637      0.045447
V$AP4R/TAL1BETAITF2.01         0.058897     0.20282      0.099846
V$CLOX/CDPCR3.01               0.021855     0.33204      0.0048142
V$EKLF/EKLF.01                 0.31294      0.75633      0.074722
V$FKHD/FREAC3.01               0.32708      0.84037      0.04263
V$GATA/GATA1.03                0.17291      0.44277      0.098765
V$GATA/GATA1.04                0.061552     0.44073      0.028036
V$GATA/GATA1.05                0.37167      0.85809      0.057487
V$HOXF/HOX1-3.01               0.0088537    0.10928      0.01381
V$MEF2/MMEF2.01                0.063316     0.28358      0.07989
V$MEIS/MEIS1.01                0.24903      0.64725      0.055006
V$MYOD/LMO2COM.01              0.16547      0.57736      0.041649
V$OCT1/OCT.01                  0.46994      -            0.084003
V$PARF/DBP.01                  0.067225     0.37548      0.030859
V$TEAF/TEF1.01                 0.25529      0.61101      0.085438

Group III: significantly over-represented only in combined data DIFF155 (p value < 0.15)
V$BCL6/BCL6.01                 0.10291      0.10605      0.41402
V$GATA/GATA3.01                0.052874     0.16881      0.13296
V$HOXF/EN1.01                  0.12072      0.13319      0.39308
V$LEFF/LEF1.01                 0.083383     0.10935      0.32124
V$NEUR/NEUROD1.01              0.11672      0.13585      0.37774
V$NFAT/NFAT.01                 0.083957     0.20619      0.13916
V$SRFF/SRF.01                  0.11762      0.28177      0.14787
V$TBPF/TATA.02                 0.12603      0.1321       0.41078

The TRE families that are common between DIFFUP107 and DIFFDOWN48 are highlighted in bold (in groups I and II). The TREs that are not significantly over-represented in either DIFFUP107 or DIFFDOWN48 but are over-represented in the overall set DIFF155 are underlined (in group III).

A subset of the CIM is shown in Figure 7 with both gene and TRE labels. The genes and TREs are individually clustered using Jaccard's coefficient as the dissimilarity metric. Each row in the column immediately next to the dendrogram represents the cluster number of each gene from clustering the expression data. Note that many of the genes in an expression-based cluster also cluster together based on the regulatory pattern of their promoters. However, this analysis is based on the static structure of the CIM. The activity of specific TFs in the experiment needs to be utilized to prune the candidate interaction matrix and improve the network prediction.
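As an illustration of the dissimilarity metric named above, the following minimal sketch computes the Jaccard (binary) distance between rows of a binary matrix. The function names are hypothetical; the clustering step itself and the handling of all-zero rows (treated here as distance 0) are assumptions not specified in the text.

```python
def jaccard_distance(a, b):
    """Jaccard dissimilarity between two binary vectors:
    1 - (positions set in both) / (positions set in either)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        return 0.0  # assumption: two empty profiles are treated as identical
    return 1.0 - both / either


def distance_matrix(rows):
    """Pairwise Jaccard distances, suitable as input to hierarchical clustering."""
    return [[jaccard_distance(r, s) for s in rows] for r in rows]


# Toy promoter profiles for three genes over four TREs (illustrative only).
gene_rows = [
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
]

print(jaccard_distance(gene_rows[0], gene_rows[1]))  # 0.5
print(jaccard_distance(gene_rows[0], gene_rows[2]))  # 1.0
```

Transposing the matrix before calling `distance_matrix` would give the corresponding distances between TREs.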

The p-value of each TRE is calculated by the StatFilter module. Based on a p-value threshold of 0.1, the TREs that are over-represented in the ANG578 CIM are shown in Table 2. Several of the factors binding to these TREs are known to be involved in the cellular response to angiotensin II (AP1F and EGRF: Lebrun et al., 1995; CREB: Cammarota et al., 2001; NFKB: Wolf et al., 2002). Also, several TREs that are known to play a role in neuronal development are over-represented in the ANG578 CIM (RBPJK: de la Pompa et al., 1997; HEN1: Bao et al., 2000; LMO2COM: Yamada et al., 2002).

The transcription factor family STAT has been shown to be involved in the response to stimulation of AT1R (Mascareno and Siddiqui, 2000). However, the TREs corresponding to STAT were not found to be over-represented in ANG578 (the p-values of all TREs binding the STAT family of TFs were close to 0.5 in ANG578). Further PAINT-based analysis of individual clusters of genes in ANG578 may nevertheless reveal groups of genes with over-represented TREs for the STAT family of TFs.

DISCUSSION

In these case studies, PAINT, in combination with experimentally associated gene lists and genomic sequence data, has identified the TREs and cognate TFs likely to subserve the biological regulation studied in each case. These results are discovered in a scalable and automated manner, using a bioinformatics approach to analyze the data from global methods such as microarrays and ChIP. As with any other computational predictions, these results need to be validated by experiments. As the two problems considered in the case studies presented in this work involve well-studied systems, the supporting literature (where present) is given along with the results to validate the PAINT-generated hypotheses.

FIG. 6. The distribution of interactions for the genes (a) and the TREs (b) in ANG578 CIM data. There are relatively few genes that are regulated by either a very large or a very small number of TFs, indicating relatively rare cases of excessive or little regulation. The interaction distribution for transcription factors indicates that there are relatively few transcription factors that can regulate the majority of the genes in the input list.

FIG. 7. A subset of the candidate interaction matrix for ANG578.

The primary purpose of PAINT is to provide a scalable and extensible platform to automate the process of mining the existing databases for known regulatory information for a large number of genes of interest in a particular experiment or analysis. The interaction matrix generated represents candidate connections in the regulatory network. In a particular experiment, only a subset of transcription factors in the cell is active. The over-represented TREs identified from the CIM indicate a set of TREs that are likely to be active. Time series data of TF activity from ChIP or promoter binding assays provide a set of active TFs. By combining these two sets, the most likely regulators in that particular experimental context are obtained. Combining this data with the interaction matrix from PAINT, a smaller subset of the interaction matrix that represents the candidate network specific to that particular experimental perturbation can be constructed.
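The pruning step described above can be sketched as a column selection on the interaction matrix. This is only an illustration under stated assumptions: the two sets are combined by intersection, each TRE is mapped to a single cognate TF, and all names (`prune_cim`, `TRE_1`, `TF_A`, and so on) are hypothetical.

```python
def prune_cim(cim, tre_names, tre_to_tf, enriched_tres, active_tfs):
    """Keep only the TRE columns that are over-represented AND whose
    cognate TF is reported as active in the experiment."""
    keep = [
        j for j, tre in enumerate(tre_names)
        if tre in enriched_tres and tre_to_tf.get(tre) in active_tfs
    ]
    kept_names = [tre_names[j] for j in keep]
    pruned = [[row[j] for j in keep] for row in cim]
    return kept_names, pruned


# Toy inputs (hypothetical names and values).
tre_names = ["TRE_1", "TRE_2", "TRE_3"]
cim = [
    [1, 0, 1],
    [0, 1, 1],
]
tre_to_tf = {"TRE_1": "TF_A", "TRE_2": "TF_B", "TRE_3": "TF_C"}
enriched_tres = {"TRE_1", "TRE_3"}   # e.g. from a significance filter
active_tfs = {"TF_B", "TF_C"}        # e.g. from ChIP or binding assays

kept_names, pruned = prune_cim(cim, tre_names, tre_to_tf, enriched_tres, active_tfs)
print(kept_names)  # ['TRE_3']
print(pruned)      # [[1], [1]]
```

Other combination rules (for example, a union of the two sets) would fit the same structure by changing the condition inside the list comprehension.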

The elements regulating transcription of a gene are not always restricted to the upstream portion of the gene alone. The power of PAINT will continue to improve as the underlying public resources allowing identification of promoter regions are rapidly improving, but even at the present, relatively early stage of development of these resources, the tool is clearly valuable for use in the mouse model system. In the current version 2.3 of PAINT, only the 2000 base pairs of upstream sequence are considered to be the promoter sequence for the inspection and discovery of regulatory motifs. We consider this to be a conservative overestimate, and PAINT allows the option of inspecting a smaller portion of the upstream sequence as well.

The MatInspector output contains matches on the sequence complementary to the sequence processed. These matches are deleted prior to constructing the matrix. The convention is that the promoter region of a gene is given in the 5′ to 3′ direction, upstream of the 5′ end of the gene. Thus, any match to a transcription factor binding site on the complementary strand would be non-functional unless the motif is palindromic (same on both strands in the 5′ to 3′ direction). For palindromic motifs, the MatInspector output contains matches on the input sequence, so the match on the complementary strand represents redundant information and can be ignored.
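The strand filtering described above can be sketched as follows. The record layout (dictionaries with `gene`, `tre`, and `strand` keys) is a hypothetical stand-in chosen for illustration; it is not the actual MatInspector output format, and the function name is likewise an assumption.

```python
def build_binary_cim(matches, genes, tres):
    """Discard complementary-strand matches, then mark each (gene, TRE)
    pair with 1 if at least one input-strand match remains."""
    present = {(m["gene"], m["tre"]) for m in matches if m["strand"] == "+"}
    return [[1 if (g, t) in present else 0 for t in tres] for g in genes]


# Hypothetical match records; "+" denotes the input strand,
# "-" the complementary strand.
matches = [
    {"gene": "geneA", "tre": "TRE_1", "strand": "+"},
    {"gene": "geneA", "tre": "TRE_2", "strand": "-"},  # dropped
    {"gene": "geneB", "tre": "TRE_2", "strand": "+"},
    {"gene": "geneB", "tre": "TRE_2", "strand": "+"},  # repeat counts once (binary)
]

cim = build_binary_cim(matches, ["geneA", "geneB"], ["TRE_1", "TRE_2"])
print(cim)  # [[1, 0], [0, 1]]
```

Because a palindromic motif is also reported on the input strand, dropping its complementary-strand match does not remove the interaction from the binary matrix.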

At present, the analytical p-values calculated are for the binary-counting-based CIM only. Considering whole counts complicates the analytical calculation, as it has to include all possible combinations of the observed counts from promoters on which a TRE is present two, three, or four times. Given that the whole counts in the CIM for the reference microarray data set used in this study contained promoters with up to 40 instances of a single TRE, it is not practical to compute the p-value for whole counts analytically. We have developed an empirical approach to p-value calculation (Zak et al., in preparation), and efforts to incorporate it as part of the StatFilter module are currently in progress.
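For the binary-count case, an over-representation p-value can be sketched as a hypergeometric tail probability. This is an assumption made for illustration, since this section does not restate the formula used by StatFilter; the symbol names (`n` reference promoters, `l` of them containing the TRE, `m` promoters in the gene list, `h` of those containing the TRE) are likewise chosen here.

```python
from math import comb


def over_representation_p(n, l, m, h):
    """P(X >= h) when m promoters are drawn without replacement from n
    reference promoters, l of which contain the TRE (hypergeometric tail)."""
    upper = min(l, m)
    favourable = sum(comb(l, i) * comb(n - l, m - i) for i in range(h, upper + 1))
    return favourable / comb(n, m)


# Toy numbers: 10 reference promoters, 5 contain the TRE, 5 are in the gene list.
print(over_representation_p(10, 5, 5, 0))  # 1.0
print(over_representation_p(10, 5, 5, 5))  # 1/252, about 0.004
```

With whole counts, each promoter would contribute a count rather than a presence flag, which is why the analytical form above no longer applies directly and a resampling-style estimate becomes attractive.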

The cross-reference tables in PAINT 2.3 do not guarantee retrieval of promoters given any clone accession number or UniGene cluster ID. This is partly because of the filtering criteria, as well as incomplete cross-referencing in the Ensembl annotated database. A sequence-based approach to putatively annotate the cDNA clones using alignments with the Ensembl annotated genome is under active development and will be included in future versions of PAINT.

TABLE 2. THE LIST OF SIGNIFICANTLY OVER-REPRESENTED (P VALUE < 0.1) TRES IN ANG578 DATA SET

TRE                            p value for over-representation in ANG578
                               (filtered for p value < 0.1)

V$PERO/PPARA.01                0.011919
V$RBPF/RBPJK.01                0.012645
V$EVI1/EVI1.02                 0.012695
V$MZF1/MZF1.01                 0.018925
V$ETSF/ETS1.01                 0.023073
V$NFKB/NFKAPPAB65.01           0.028654
V$CREB/CREBP1.01               0.035346
V$AP4R/AP4.01                  0.037492
V$ARP1/ARP1.01                 0.050041
V$EBOR/DELTAEF1.01             0.051043
V$NRSF/NRSE.01                 0.053647
V$AP1F/AP1.03                  0.054389
V$HEN1/HEN1.02                 0.054832
V$SRFF/SRF.02                  0.058947
V$HOXF/EN1.01                  0.062681
V$MOKF/MOK2.01                 0.065696
V$EKLF/EKLF.01                 0.065796
V$MYOD/LMO2COM.01              0.069165
V$MEIS/MEIS1.01                0.070906
V$OCT1/OCT1.06                 0.07417
V$HNF4/HNF4.02                 0.074672
V$NFKB/NFKAPPAB.03             0.075809
V$CREB/ATF6.02                 0.078194
V$GATA/LMO2COM.02              0.07881
V$AREB/AREB6.03                0.083728
V$EGRF/NGFIC.01                0.090022
V$HOXF/CRX.01                  0.090962
V$RARF/RTR.01                  0.091037
V$OCT1/OCT1.04                 0.09845

The TREs corresponding to the TF families known to be involved in cellular response to angiotensin II are highlighted in bold. The underlined TREs correspond to TFs that have been shown to play a role in neuronal development and are hypothesized to be playing a role in response to angiotensin II.

Cluster analysis of the interaction matrix generated by PAINT can indicate groups of genes that contain similar transcription factor regulation patterns in their promoter sequences. This information can be processed in combination with the results from clustering based on expression profiles to discover consistent groups across the two data sets. Such an analysis would indicate the extent to which a group of genes with a similar expression pattern share a similar regulatory transcription factor pattern. The example presented in case study 1 demonstrates that there is a broad consistency in the example data set, with up- and down-regulated genes being grouped into different clusters based on the regulatory pattern in the promoter sequences. In the example presented, this consistency of shared regulation is more readily seen when examining only the significantly over-represented TREs.
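One way to quantify the consistency between expression-based and promoter-based clusters is a pairwise agreement score. The sketch below uses the Rand index, which is a technique chosen here for illustration and is not the measure used in the case studies; the cluster labels are toy values.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of gene pairs on which two clusterings agree, i.e. both put
    the pair in the same cluster or both put it in different clusters."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        1 for i, j in pairs
        if (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
    )
    return agree / len(pairs)


# Toy cluster labels for four genes (hypothetical).
expression_clusters = [0, 0, 1, 1]   # e.g. up- versus down-regulated
promoter_clusters = [0, 0, 1, 1]     # clusters from the interaction matrix

print(rand_index(expression_clusters, promoter_clusters))  # 1.0
```

A value near 1 would indicate that genes with similar expression patterns also share similar promoter patterns; restricting the interaction matrix to the significantly over-represented TREs before clustering changes only the promoter labels passed in.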

For Bayesian models of gene regulatory networks (Hartemink et al., 2002), a prior set of interactions is specified to incorporate known information about the network. The interaction matrix generated by PAINT can be utilized as a suitable prior in such model formulations.

PAINT 2.3 includes modules that can process genes from the mouse genome, utilizing MatInspector for motif inspection. The framework and architecture of PAINT are applicable and easily extensible to other organisms for which genome sequence and transcription factor binding information exist. Future directions for PAINT include incorporation of modules to handle other motif discovery tools such as MEME (Bailey and Elkan, 1994) and TESS (Schug and Overton, 1997). Efforts are also under way to incorporate the human genome in PAINT.

The experimental and computational methods presented here identify a set of genes and transcription factors that are significant in understanding the function of the gene regulatory network in question. This information can be directly utilized in construction of an in silico model of the regulatory network. Incorporation of this model into simulations, along with models of signaling pathways and electrophysiology, is the key to analyzing the immediate, intermediate, and long-lasting cellular response to an external signal (e.g., angiotensin II in case study 2).

We plan to employ the BioSPICE platform toward our objective of constructing a computational model that predicts cellular responses to external signals (e.g., angiotensin II). At present, the putative gene regulatory network from PAINT can be obtained in GraphML or SBML format, with mass action kinetic equations and default initial parameters. This model has been successfully loaded and simulated in JDesigner in the Systems Biology Workbench (Hucka et al., 2001), demonstrating the integration of PAINT with external simulation tools. The analysis tools in BioSPICE can be employed on this model, as it is now represented in a standardized structure (MDL/SBML).
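To give a sense of what a GraphML representation of the candidate network could look like, the following minimal sketch writes TRE-to-gene edges from a binary interaction matrix using the Python standard library. It is not PAINT's exporter: the function name, node identifiers, and graph id are hypothetical, and kinetic equations and parameters (as in the SBML output) are not modeled.

```python
import xml.etree.ElementTree as ET


def cim_to_graphml(cim, genes, tres):
    """Serialize a binary gene-by-TRE matrix as a directed GraphML graph
    with one edge from each TRE to every gene whose promoter contains it."""
    root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    graph = ET.SubElement(root, "graph", id="candidate_network", edgedefault="directed")
    for name in list(tres) + list(genes):
        ET.SubElement(graph, "node", id=name)
    for gene, row in zip(genes, cim):
        for tre, entry in zip(tres, row):
            if entry:
                ET.SubElement(graph, "edge", source=tre, target=gene)
    return ET.tostring(root, encoding="unicode")


# Toy network: two genes, two TREs (hypothetical names).
genes = ["geneA", "geneB"]
tres = ["TRE_1", "TRE_2"]
cim = [
    [1, 0],
    [0, 1],
]

graphml_text = cim_to_graphml(cim, genes, tres)
print(graphml_text.count("<node "), graphml_text.count("<edge "))  # 4 2
```

The resulting string can be written to a file and opened in tools that read GraphML; node and edge attributes (for example p-values) would require additional `key` and `data` elements.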

The methods presented here provide a connected graph of relevant genes and transcription factors, which then needs to be explored for sub-networks with known or novel network motifs and modules (Arkin, 2001; Rao and Arkin, 2001; Milo et al., 2002; Segal et al., 2003). Such information can be utilized in efficient exploration of "tuning knobs" that allow modulation or manipulation of the network function.

CONCLUSION

We have developed a bioinformatics tool named PAINT for automated analysis of a large number of promoter sequences for system-wide gene regulatory network discovery. PAINT processes a list of standard identifiers for a set of genes of interest in an experiment and generates a candidate interaction matrix, along with a list of transcription factors that could be playing a significant role in that experimental context. The candidate interaction matrix can be used for constraining the identification algorithms for gene regulatory networks. The modular architecture enables straightforward extensions to "plug in" various sequence analysis tools, transcription factor binding motif identification and discovery algorithms, and network analysis and visualization tools. Ongoing extensions include modules that incorporate gene expression data and transcription factor expression data to further improve the candidate network prediction. PAINT is freely available for academic and non-commercial use under the open source BioSPICE license at www.dbi.tju.edu/dbi/tools/paint.

ACKNOWLEDGMENTS

We gratefully acknowledge the expert technical support of Hui Liu for cDNA microarray gene expression data from N1E-115 cells, and Dr. Ronald Pearson (Thomas Jefferson University) for thoughtful discussions. This work was supported by the DARPA BioComp Initiative under the project "Multi-Timescale Complex Adaptation," contract number F30602-01-2-0578 (James Schwaber, P.I.).



REFERENCES

AMANO, T., RICHELSON, E., and NIRENBERG, M. (1972). Neurotransmitter synthesis by neuroblastoma clones. Proceedings of the National Academy of Sciences USA 69, 258–263.

ARKIN, A.P. (2001). Synthetic cell biology. Current Opinion in Biotechnology 12, 638–644.

BAILEY, T.L., and ELKAN, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Presented at the Second International Conference on Intelligent Systems for Molecular Biology, Menlo Park, California. Available: http://meme.sdsc.edu/meme/website.

BAO, J., TALMAGE, D.A., ROLE, L.W., et al. (2000). Regulation of neurogenesis by interactions between HEN1 and neuronal LMO proteins. Development 127, 425–435.

BATAGELJ, V., MRVAR, A., and ZAVERNIK, M. (1999). Partitioning approach to visualization of large networks. Lecture Notes in Computer Science 1731, 90–97.

BERMAN, B.P., NIBU, Y., PFEIFFER, B.D., et al. (2002). Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proceedings of the National Academy of Sciences USA 99, 757–762.

BERRY, C., TOUYZ, R., DOMINICZAK, A.F., et al. (2001). Angiotensin receptors: signaling, vascular pathophysiology, and interactions with ceramide. American Journal of Physiology: Heart and Circulatory Physiology 281, H2337–H2365.

BioSPICE (2003). An open source framework and toolset for modeling dynamic cellular network functions. Available: www.biospice.org.

BURY, K. (1999). Statistical Distributions in Engineering (Cambridge University Press, Cambridge, UK).

CAMMAROTA, M., BEVILAQUA, L.R., DUNKLEY, P.R., et al. (2001). Angiotensin II promotes the phosphorylation of cyclic AMP–responsive element binding protein (CREB) at Ser133 through an ERK1/2-dependent mechanism. Journal of Neurochemistry 79, 1122–1128.

D'HAESELEER, P., WEN, X., FUHRMAN, S., et al. (1999). Linear modeling of mRNA expression levels during CNS development and injury. Pacific Symposium on Biocomputing 4, 41–52.

DAVIDSON, E.H., RAST, J.P., OLIVERI, P., et al. (2002). A genomic regulatory network for development. Science 295, 1669–1678.

DAVULURI, R.V., GROSSE, I., and ZHANG, M.Q. (2001). Computational identification of promoters and first exons in the human genome. Nature Genetics 29, 412.

DE LA POMPA, J.L., WAKEHAM, A., CORREIA, K.M., et al. (1997). Conservation of the Notch signalling pathway in mammalian neurogenesis. Development 124, 1139–1148.

DOBI, A.L., PALKOVITS, M., PALKOVITS, C.G., et al. (1995). Protein–DNA interactions during phenotypic differentiation. Molecular Neurobiology 10, 185–203.

EISEN, M.B., SPELLMAN, P.T., BROWN, P.O., et al. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences USA 95, 14863.

ELKON, R., LINHART, C., SHARAN, R., et al. (2003). Genome-wide in silico identification of transcriptional regulators controlling the cell cycle in human cells. Genome Research 13, 773–780.

HARTEMINK, A.J., GIFFORD, D.K., JAAKKOLA, T.S., et al. (2002). Combining location and expression data for principled discovery of genetic regulatory network models. Pacific Symposium on Biocomputing 7, 437–449.

HOLSTEGE, F.C., JENNINGS, E.G., WYRICK, J.J., et al. (1998). Dissecting the regulatory circuitry of a eukaryotic genome. Cell 95, 717–728.

HUBBARD, T., BARKER, D., BIRNEY, E., et al. (2002). The Ensembl genome database project. Nucleic Acids Research 30, 38–41.

HUCKA, M., FINNEY, A., SAURO, H., et al. (2001). The ERATO Systems Biology Workbench: an integrated environment for multiscale and multitheoretic simulations in systems biology. In Kitano, H. (ed.), Foundations of Systems Biology (MIT Press, Cambridge, MA), pp. 125–143.

HUGHES, J.D., ESTEP, P.W., TAVAZOIE, S., et al. (2000). Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. Journal of Molecular Biology 296, 1205–1214.

IDEKER, T., OZIER, O., SCHWIKOWSKI, B., et al. (2002). Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 18, S233–S240.

IDEKER, T., THORSSON, V., RANISH, J.A., et al. (2001). Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 292, 929.

IKEDA, K., and KAWAKAMI, K. (1995). DNA binding through distinct domains of zinc-finger-homeodomain protein AREB6 has different effects on gene transcription. European Journal of Biochemistry 233, 73.


JAKT, L.M., CAO, L., CHEAH, K.S., et al. (2001). Assessing clusters and motifs from gene expression data. Genome Research 11, 112–123.

KANG, S.H., VIEIRA, K., and BUNGERT, J. (2002). Combining chromatin immunoprecipitation and DNA footprinting: a novel method to analyze protein–DNA interactions in vivo. Nucleic Acids Research 30, e44.

KAWAI, J., SHINAGAWA, A., SHIBATA, K., et al. (2001). Functional annotation of a full-length mouse cDNA collection. Nature 409, 685.

LEBRUN, C.J., BLUME, A., HERDEGEN, T., et al. (1995). Angiotensin II induces a complex activation of transcription factors in the rat brain: expression of Fos, Jun and Krox proteins. Neuroscience 65, 93–99.

LIU, R., and STATES, D.J. (2002). Consensus promoter identification in the human genome utilizing expressed gene markers and gene modeling. Genome Research 12, 462–469.

MASCARENO, E., and SIDDIQUI, M.A. (2000). The role of Jak/STAT signaling in heart tissue renin–angiotensin system. Molecular and Cellular Biochemistry 212, 171–175.

MILO, R., SHEN-ORR, S., ITZKOVITZ, S., et al. (2002). Network motifs: simple building blocks of complex networks. Science 298, 823–827.

MORRIS, J.F., HROMAS, R., and RAUSCHER, F.J. (1994). Characterization of the DNA-binding properties of the myeloid zinc finger protein MZF1: two independent DNA-binding domains recognize two DNA consensus sequences with a common G-rich core. Molecular and Cellular Biology 14, 1786.

NARDELLI, J., THIESSON, D., FUJIWARA, Y., et al. (1999). Expression and genetic interaction of transcription factors GATA-2 and GATA-3 during development of the mouse central nervous system. Developmental Biology 210, 305–321.

PEDERSEN, A.G., BALDI, P., CHAUVIN, Y., et al. (1999). The biology of eukaryotic promoter prediction: a review. Computers & Chemistry 23, 191–207.

PLYTE, S., BONCRISTIANO, M., FATTORI, E., et al. (2001). Identification and characterization of a novel nuclear factor of activated T-cells 1 isoform expressed in mouse brain. Journal of Biological Chemistry 276, 14350–14358.

QUANDT, K., FRECH, K., KARAS, H., et al. (1995). MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Research 23, 4878. Available: www.genomatix.de.

RAO, C.V., and ARKIN, A.P. (2001). Control motifs for intracellular regulatory networks. Annual Review of Biomedical Engineering 3, 391–419.

RICCI, M., and EL-DEIRY, W. (2003). DNA footprinting. Methods in Molecular Biology 223, 117–128.

RICHELSON, E. (1990). The use of cultured cells in the study of mood-normalizing drugs. Pharmacology and Toxicology 66, Suppl. 3, 69–75.

RONEN, M., ROSENBERG, R., SHRAIMAN, B.I., et al. (2002). Assigning numbers to the arrows: parameterizing a gene regulation network by using accurate expression kinetics. Proceedings of the National Academy of Sciences USA 99, 10555–10560.

SCHUG, J., and OVERTON, G.C. (1997). TESS: Transcription element search software on the WWW [Technical Report CBIL-TR-1997-1001-v0.0]. Computational Biology and Informatics Laboratory, School of Medicine, University of Pennsylvania. Available: www.cbil.upenn.edu/tess.

SEGAL, E., SHAPIRA, M., REGEV, A., et al. (2003). Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nature Genetics 34, 166–176.

TAVAZOIE, S., HUGHES, J.D., CAMPBELL, M.J., et al. (1999). Systematic determination of genetic network architecture. Nature Genetics 22, 281–285.

THE GENE ONTOLOGY CONSORTIUM (2000). Gene ontology: tool for the unification of biology. Nature Genetics 25, 25–29.

TOUYZ, R.M., and BERRY, C. (2002). Recent advances in angiotensin II signaling. Brazilian Journal of Medical and Biological Research 35, 1001–1015.

UCSC GENOME BROWSER (2003). Available: http://genome.ucsc.edu.

WELLS, J., and FARNHAM, P.J. (2002). Characterizing transcription factor binding sites using formaldehyde crosslinking and immunoprecipitation. Methods 26, 48–56.

WESSELS, L.F., SOMEREN, E.P.V., and REINDERS, M.J. (2001). A comparison of genetic network models. Pacific Symposium on Biocomputing 6, 508–519.

WINGENDER, E., CHEN, X., FRICKE, E., et al. (2001). The TRANSFAC system on gene expression regulation. Nucleic Acids Research 29, 281.

WOLF, G., WENZEL, U., BURNS, K.D., et al. (2002). Angiotensin II activates nuclear transcription factor–kappaB through AT1 and AT2 receptors. Kidney International 61, 1986–1995.



YAMADA, Y., PANNELL, R., FORSTER, A., et al. (2002). The LIM-domain protein Lmo2 is a key regulator of tumour angiogenesis: a new anti-angiogenesis drug target. Oncogene 21, 1309–1315.

YEUNG, M.K., TEGNER, J., and COLLINS, J.J. (2002). Reverse engineering gene networks using singular value decomposition and robust regression. Proceedings of the National Academy of Sciences USA 99, 6163–6168.

ZAK, D.E., DOYLE, F.J., GONYE, G.E., et al. (2001). Simulation studies for the identification of genetic networks from cDNA array and regulatory activity data. Presented at the Second International Conference on Systems Biology.

Address reprint requests to:
Dr. Gregory E. Gonye
1020 Locust St., Rm. 521, Jefferson Alumni Hall
Department of Pathology, Thomas Jefferson University
Philadelphia, PA 19107

E-mail: ggonye@mail.dbi.tju.edu


ical systems involve multiple genes functioning in a hierarchical and interconnected network. Present technological developments have resulted in rapidly growing public resources containing systematic data sets of various types: gene expression changes from microarrays; protein–DNA interaction and transcription factor (TF) activity data from protein binding assays, chromatin immunoprecipitation (ChIP) experiments (Wells and Farnham, 2002), and DNA footprinting (Kang et al., 2002; Ricci and El-Deiry, 2003); protein–protein interactions from two-hybrid experiments and coimmunoprecipitation; and genomic sequence and ontology information in public databases (www.geneontology.org). The analysis of these large datasets holds the promise of identifying the nonlinear dynamic systems function of the interconnected gene and biochemical regulatory networks.

Attempts at reverse engineering the gene regulatory networks from microarray data alone have met with varied success (Holstege et al., 1998; D'Haeseleer et al., 1999; Wessels et al., 2001; Ronen et al., 2002). Typically, all the genes are considered as potentially regulating all the other genes, and the sub-optimal and non-unique results are subsequently pruned, either by setting thresholds on the quantitative parameters representing interaction strength or via constrained optimization (Yeung et al., 2002). Combining the available heterogeneous data types significantly improves the ability to unravel the regulatory networks (Tavazoie et al., 1999; Hughes et al., 2000; Zak et al., 2001; Hartemink et al., 2002; Ideker et al., 2002). The principal effect of incorporating additional data types apart from microarrays is to constrain the number of regulatory interactions per gene. Based on the known protein–DNA and protein–protein interactions, many interactions can be required to be present or specified to be nonexistent in the identification algorithm. This limits the number of interaction parameters to search for and renders the network identification algorithms tractable for a large number of genes (Zak et al., 2001; Yeung et al., 2002).

The biological mechanism of transcriptional regulation is by specific transcription factors (TFs) binding to the transcriptional regulatory elements (TREs) present in the cis-regulatory region (promoter) of the corresponding genes. The binding is sequence specific, and the binding sites are present on multiple genes. This results in an interconnected transcriptional regulatory network. Hence, the analysis of the promoters of the genes of interest for known and predicted TF binding sites will directly provide a good candidate set of network interactions (Hughes et al., 2000; Hartemink et al., 2002; Ideker et al., 2002). This approach has been most successful in developing a detailed understanding of gene regulatory networks in yeast (Tavazoie et al., 1999; Ideker et al., 2001). The availability of genomic sequence, combined with extensive information about TF binding site motifs, has enabled system-wide analyses to unravel the gene regulatory networks that govern the response of yeast to a multitude of environmental perturbations (Tavazoie et al., 1999; Ideker et al., 2001). Similar efforts are in progress in Drosophila (Berman et al., 2002), sea urchin (Davidson et al., 2002), and human systems (Elkon et al., 2003).

In this context, the objective of the research efforts described in this paper is to develop an automated and scalable bioinformatics approach to the identification and analysis of candidate regulatory interactions in a specific experimental setting. We have developed PAINT (Promoter Analysis and Interaction Network Tool) for this purpose. Briefly, PAINT processes a list of unique identifiers representing the genes of interest and produces an interaction matrix that represents a candidate set of interactions between the transcription factors and the genes. This information can be subsequently employed in various network identification, analysis, and visualization software. The objective is not to develop another tool for sequence analysis, but to construct a modular and extensible platform into which various sequence analysis tools, network analysis and visualization software, and model identification tools can be "plugged in." In addition to the command line and the web-based interfaces (available at www.dbi.tju.edu/dbi/tools/paint), PAINT modules can be accessed as "agents" that communicate via the Open Agent Architecture in the BioSPICE platform (www.biospice.org).

The present manuscript is organized as follows. The details of the various modules that constitute PAINT are presented in the Methods section. In the Results section, two case studies are presented to demonstrate the tool. The first study involves a set of 155 differentially regulated genes in differentiated neuroblastoma cells. The second study is concerned with a set of 578 differentially regulated genes identified from expression time series data derived from neuroblastoma cells exposed to the neuropeptide angiotensin II.

\[
p = \sum_{i=h}^{m} \frac{\binom{l}{i}\,\binom{n-l}{m-i}}{\binom{n}{m}}
\]




239





Experimental data




VADIGEPALLI ET AL

240

RESULTS


Case study 1





241







Case study 2


VADIGEPALLI ET AL

242



243


FIG 3



VADIGEPALLI ET AL

244



245















DISCUSSION



VADIGEPALLI ET AL

246



247









VADIGEPALLI ET AL

248












CONCLUSION


ACKNOWLEDGMENTS



249

REFERENCES


























VADIGEPALLI ET AL

250




























251









VADIGEPALLI ET AL

252


