A pattern-oriented specification of gene network inference processes

Computers in Biology and Medicine 43 (2013) 1415–1427

journal homepage: www.elsevier.com/locate/cbm

Nestor W. Trepode, Cléver R.G. de Farias*, Junior Barrera

Department of Computer Science and Mathematics (DCM), Faculty of Philosophy, Sciences and Letters at Ribeirão Preto (FFCLRP), University of São Paulo (USP), Av. Bandeirantes, 3900, Monte Alegre, Ribeirão Preto 14040-901, SP, Brazil

Article info

Article history: Received 2 February 2011; Accepted 6 July 2013

Keywords: Genetic regulatory networks; Dynamical system identification; Gene network inference; Microarray data analysis; Process modeling; Patterns

0010-4825/$ - see front matter © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.compbiomed.2013.07.008

* Corresponding author. Tel.: +55 1636020564.
E-mail addresses: [email protected] (N.W. Trepode), farias@ffclrp.usp.br (C.R.G. de Farias), [email protected] (J. Barrera).

Abstract

Patterns have been widely used in Computer Science. A pattern describes a generic solution to an existing problem in a more readable and accessible form. A pattern-oriented process specification consists of a generic and abstract description of a process. This paper presents a pattern-oriented specification of a genetic regulatory network inference process performed from microarray data and prior biological knowledge. The proposed specification was conceived based on prior work on gene network inference. The adequacy of the proposed solution was then evaluated with respect to modern tendencies of the gene network inference literature.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Bioinformatics concerns the use of advanced computation techniques and biological knowledge, including internet distributed databases and processors, parallel processing and data mining, to promote knowledge advances in (molecular) biology. One of the greatest challenges of the field is the development of mathematical models and corresponding inference techniques for mining databases and extracting new knowledge. These models and processes are continuously created, improved and modified by experts in the area. The applied research community profits from these advances by using the corresponding software when it becomes available in specialized repositories.

Genes and their protein products, generated by gene expression through transcription and translation, form complex signaling networks, which control metabolic pathways, cellular functions and the future expression of genes themselves. A network of this kind is known as a gene regulatory network or genetic regulatory network (GRN). In a GRN, the level of expression of a gene depends on the expression values of a gene subset and on external stimuli, both at previous instants of time. This property characterizes GRNs as dynamical systems [1]. Understanding the structure and dynamics of gene regulatory networks is one of the biggest challenges that Biology faces today. For example, the ability to intervene in a GRN in order to make it reach given states and avoid others (such as those associated with disease) would have a strong impact on human therapy [2,3] as well as on animal breeding [4,5] and farming [6–8].

Many approaches have been proposed for the inference of gene networks from gene expression data, for example [9–16]. Reviews of such methods can be found in [1,17–20]. In general, these approaches consist of the execution of a sequence of procedures (i.e., database creation and querying, signal processing, system inference, etc.) until a new biological hypothesis can be inferred from available data and previous knowledge. Since these tools are usually not integrated, researchers in the field are frequently compelled to re-implement pieces of software in order to create a coherent analysis pipeline [21].

The composition of individual system parts to create an integrated software solution can be accomplished using general purpose integration environments, such as Taverna [22,23], Bio-jETI [24,25] and Magallanes [26]. These environments provide great flexibility for the integration of any given set of individual programs. Since these environments are not targeted to a specific domain, they neither provide guidelines nor define data structures and interfaces to facilitate the development of individual contributions in any given domain. We believe that this lack of facilities for the development of individual contributions in a particular domain hinders the effective use of these environments as platforms for collaborative scientific research.

We believe that general purpose integration environments would benefit greatly from the availability of abstract process models for specific domains described using a sequence of abstract activities, each one characterized by its input–output data and expected behavior. Based on the abstract specification, different computational methods could then be developed by researchers as possible realizations of these abstract activities. These contributions would easily be integrated with the contributions of other researchers to create comprehensive analysis solutions. A similar, albeit simpler, approach was adopted by Khoros [27,28], which was a well-known software environment widely used by the image processing research community.

This paper concretely applies this idea by presenting the specification of an abstract gene network inference process that can be used as a basis for collaborative research on gene regulatory network inference. We call this abstract process specification a pattern, since it describes a reusable solution to a problem in a given context. Our specification was designed based on three previous works on gene regulatory network inference processes [29–32]. The quality of the proposed pattern specification was checked by studying its adequacy to describe representative techniques of different modern tendencies of the gene network inference literature [19,33–35].

Following this introduction, Section 2 recalls the main activities involved in gene regulatory network inference. Section 3 presents a review of concepts and properties of a pattern-oriented process specification and provides an overview of a standard modeling language used in our specification. Section 4 presents an overview of the proposed gene regulatory network inference pattern specification and describes in detail its main activity, Network Construction. Section 5 discusses the genesis and robustness of the proposed specification. Finally, Section 6 presents some conclusions and outlines future work. Appendix A provides a detailed description of the activities specified in the proposed gene regulatory network inference pattern specification.

² The concept of layer presented here is inherent to the construction method. The initial gene layer depends on which set of seed genes we start with and the following layers will depend on the information obtained from the data and previous biological knowledge. Seed gene selection is carried out based on actual biological knowledge, which can be obtained from ontologies and/or functional

2. Gene regulatory network inference

A time-course microarray experiment aims at quantitatively measuring levels of gene expression for a large set of genes (usually thousands) at successive time intervals and under experimental conditions in which the biological phenomenon under study can be observed. The output of a time-course microarray experiment is the main input to the gene regulatory network identification process. It consists of an $n \times m$ matrix measuring the expression levels of the $n$ genes at $m$ consecutive time instants, usually spaced at regular intervals. Each of the $n$ rows in the matrix represents the expression values of a single gene at the $m$ time instants, and each of the $m$ columns represents the expression values of the whole gene set at a given time, i.e., the values in each column usually come from a different microarray experiment performed at a given time¹ (see Fig. 1a).

These time-course microarray data, after adequate pre-processing (i.e., normalization and quantization), are used for estimating the probabilistic dependence of a target gene on a set of predictor genes (see Fig. 1a–c). In general, these inference methods are based on the estimation of mutual information [36] or similar concepts such as CoD (Coefficient of Determination) [37], which essentially measure the degree of mass concentration of the conditional distributions due to the observation of a given set of genes. The best predictor sets are exactly the ones that produce a stronger regulation, i.e., the ones that concentrate more substantially the probability mass of the conditional distributions. When evaluating these dependencies from time-course microarray data, predictor genes are observed in samples taken before the target gene sample in order to infer dynamic dependencies and temporal evolution regulation. Nevertheless, in many cases, as in higher eukaryotes, where temporal gene expression data are difficult to obtain, samples taken from different individuals are assumed to capture steady states of the underlying dynamics, and predictor–target dependencies are evaluated inside the same sample [13,30,33,38]. These target–predictor dependencies, identified in the data, are the building blocks for inferring the GRN architecture and dynamics.

¹ Sometimes this matrix is presented in its transposed form.

Fig. 1 provides an overview of how GRN inference from microarray data is carried out. The main sources of information are periodic samples of gene expression obtained by time-course microarray experiments (Fig. 1a). Based on this information, we try to determine which (predictor) genes regulate, directly or indirectly according to the data, a given (target) gene; for simplicity, Fig. 1b assumes a predictor set of cardinality two. In order to infer these gene dependencies (also called network wiring connections), we estimate from the microarray data the probabilities of occurrence of each possible target value given each possible state of the predictor set (Fig. 1c). We use those probability distributions to estimate some dependence or predictability measure (such as CoD, entropy or mutual information) to evaluate the strength of connection $SC(g_1, g_2 \to g_3)$ from a predictor gene set $\{g_1, g_2\}$ to a target gene $g_3$ (Fig. 1b).
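As an illustration of the estimation step just described, the following minimal Python sketch counts co-occurrences of predictor states and target values in quantized time-course data and uses mutual information as the strength of connection. The function names, the dictionary-based data layout and the one-step time lag are assumptions made for this example only; the text leaves the dependence measure open (CoD or entropy would serve equally well), and with few time points such plug-in estimates are noisy.

```python
from collections import Counter
from math import log2


def joint_counts(data, predictors, target, lag=1):
    """Count (predictor state at t, target value at t + lag) pairs.

    data maps a gene identifier to its list of quantized expression values.
    """
    counts = Counter()
    for t in range(len(data[target]) - lag):
        state = tuple(data[g][t] for g in predictors)
        counts[(state, data[target][t + lag])] += 1
    return counts


def conditional_probabilities(counts):
    """Estimate P(target value | predictor state) from the joint counts."""
    state_totals = Counter()
    for (state, _), c in counts.items():
        state_totals[state] += c
    return {(s, y): c / state_totals[s] for (s, y), c in counts.items()}


def strength_of_connection(data, predictors, target, lag=1):
    """Mutual information (in bits) between predictor state and target value.

    Used here as one possible stand-in for SC(predictors -> target).
    """
    counts = joint_counts(data, predictors, target, lag)
    n = sum(counts.values())
    n_state, n_value = Counter(), Counter()
    for (s, y), c in counts.items():
        n_state[s] += c
        n_value[y] += c
    # I(X;Y) = sum p(x,y) * log2(p(x,y) / (p(x) p(y))), written with counts.
    return sum((c / n) * log2(c * n / (n_state[s] * n_value[y]))
               for (s, y), c in counts.items())
```

For instance, if a binary target at time t+1 is the exclusive-or of two binary predictors at time t and the four predictor states occur equally often, the predictor pair scores 1 bit while either predictor alone scores 0, which is why predictor sets rather than single genes are evaluated.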

The network construction method starts with a seed subset of genes, which are known to be involved in the phenomenon under study, as the initial gene layer (initial network), and adds to the growing network, at each successive step, a new gene layer formed by the genes most significantly connected to the genes in the previous layer.² At each network growing step, we compute the strength of connection “from” the previous layer $G$ to each candidate³ gene $h$, $SC(G \to h)$, and the strength of connection from each candidate gene $h$ “to” the previous layer $G$, $SC(h \to G)$ (Fig. 1d). Candidate genes are ranked by their overall strengths of connection with the previous layer $G$, $SC(h \leftrightarrow G) = \max\{SC(h \to G), SC(G \to h)\}$ (the upper part of Fig. 1e indicates this ranking). Candidate genes ranked above the predictability threshold $T_{Pl}$ are automatically included in the next layer, while candidate genes between thresholds $T_{Pl}$ and $T_{Pmin}$ are included in the next layer only if they belong to the auxiliary subset of genes $S_F$ known (or plausibly supposed) to be related to the phenomenon under study (the lower part of Fig. 1e indicates gene selection for the next layer). The most recently added gene layer is considered the previous layer $G$ in the next growing step. This process is iterated until some stop condition is reached (in Fig. 1f each color indicates a different gene layer).
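The layer-growing rule above can be sketched in a few lines of Python. This is only a reading of the description under stated assumptions: `sc_pair` is a hypothetical callable returning the pair (SC(G → h), SC(h → G)) for a layer and a candidate, thresholds are treated as inclusive, and growth stops when no gene qualifies or the maximum network size is reached (truncating the last layer to fit is also an assumption).

```python
def select_next_layer(overall_sc, t_pl, t_pmin, related):
    """Pick the next layer from candidates ranked by overall strength.

    overall_sc maps each candidate h to max(SC(h -> G), SC(G -> h)).
    Genes at or above t_pl always enter; genes between t_pmin and t_pl
    enter only if they belong to the auxiliary set `related` (S_F).
    """
    ranked = sorted(overall_sc, key=overall_sc.get, reverse=True)
    return [h for h in ranked
            if overall_sc[h] >= t_pl
            or (overall_sc[h] >= t_pmin and h in related)]


def grow_network(seeds, candidates, sc_pair, t_pl, t_pmin, related, max_size):
    """Grow the network layer by layer, starting from the seed genes."""
    layers = [list(seeds)]
    in_network = set(seeds)
    while len(in_network) < max_size:
        previous = layers[-1]
        # Candidates are all genes not already in the growing network.
        overall = {h: max(sc_pair(previous, h))
                   for h in candidates if h not in in_network}
        layer = select_next_layer(overall, t_pl, t_pmin, related)
        layer = layer[: max_size - len(in_network)]  # respect the size bound
        if not layer:
            break  # no candidate is connected strongly enough
        layers.append(layer)
        in_network.update(layer)
    return layers
```

Returning the list of layers, rather than a flat gene set, keeps the layer index of every gene available, which the later reliability measure depends on.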

3. Pattern-oriented process specification

3.1. Pattern specification

A pattern describes a generic solution to an existing problem in a more readable and accessible form [42]. Patterns capture proven solutions to real problems and generalize these solutions so that they can be reused in similar contexts. Historically, patterns emerged as a discipline in Computer Science influenced by the pioneering work of Christopher Alexander, a professor of architecture at the University of California at Berkeley, who first wrote a series of books cataloging a number of architectural patterns and describing their application [43–45]. Patterns in Computer Science have been widely adopted in the different phases of system development, e.g., [46–49].

A process can be specified using different (standard) modeling notations. In the context of this work, we have adopted the Business Process Modeling Notation (BPMN) [50]. BPMN is a process modeling notation developed and standardized by the Object Management Group (OMG). BPMN was conceived for the specification of business processes. However, its generic set of concepts and modeling capabilities can also be used in the high-level specification of processes in general. We have used BizAgi Process Modeler⁴ as our BPMN modeling tool.

² (continued) annotation databases, such as Gene Ontology (GO) [39], KEGG [40] and REACTOME [41].

³ Candidate genes are all genes not already included in the growing network.

Fig. 1. Gene regulatory network inference procedure overview. (a) Microarray signal. (b) Predictor–target dependencies. (c) Conditional probabilities estimation. (d) Strengths of connection (SC) of a candidate gene h from and to the previous gene layer G. (e) Next layer gene selection. (f) Gene network growing (each color represents a different layer). (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)

BPMN specifications are called Business Process Diagrams (BPD). A BPD is represented mainly using flow objects, connecting objects, and artifacts. Flow objects, such as events, activities and gateways, represent the main graphical elements used to define the intended process behavior. An event represents something that happens during the course of a process. An activity represents a unit of work carried out within a process. Connecting objects, such as sequence flows and associations, are used to connect flow objects to each other or to artifacts. A sequence flow is used to represent the order in which activities will be executed within the process, while an association is used to associate artifacts to flow objects. An artifact provides additional information about the execution of an activity. A data object, which represents the main type of artifact, is used to represent how data is produced or consumed by activities. A data object is connected to activities through associations indicating either data production or consumption. Finally, a gateway controls the convergence and divergence of sequence flows. Gateways include parallel gateways and exclusive gateways. A parallel gateway is used to synchronize two incoming (parallel) sequence flows into a single outgoing sequence flow, while an exclusive gateway is used to fork an incoming sequence flow into two mutually exclusive outgoing sequence flows according to a given condition or decision.

3.2. Pattern realization, adaptation and extension

A pattern-oriented process specification consists of an abstract description of a number of interrelated activities that can be used to solve a problem in a given context. Each identified activity is generically described in terms of its purpose, required input, produced output and sample implementation (behavior). The main benefit of such an abstract specification lies in its flexibility for alternative realizations. An abstract process specification must be concretely instantiated with a suitable implementation for each proposed activity. Thus, this abstract specification can be realized in many different ways. As long as, for each defined activity, the input and output data structures remain unchanged, different implementations of each activity can be used interchangeably, thus allowing the creation of different process realizations for the same specification. Fig. 2 shows an abstract process specification and three alternative concrete realizations.

⁴ BizAgi Process Modeler (http://www.bizagi.com/).

A process specification can be adapted or extended in three different ways: (i) replacing (part of) the behavior of an activity; (ii) adding new activities to the process; (iii) combining the replacement of the behavior of an activity with the addition of new activities. In the first case, both the input and output of an activity are preserved. However, additions or replacements to the activity behavior are performed by, for example, using a different (alternative) method to produce the expected output. In the second case, new activities and (possibly) data objects are introduced in order to create alternative flows for the process. However, these additional activities must, at some point, consume and/or produce a data object used in the original specification. In the third case, (possibly) new activities and data objects are introduced in order to create a new data object that can be used as input by an existing activity, thus producing a modification to this activity's behavior. Fig. 3 illustrates the different ways in which a process specification can be extended. For simplification purposes, we represent only activities, data objects and their associations.
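The idea of interchangeable realizations, and of replacing an activity's behavior while preserving its input and output (case i), can be illustrated with a small Python sketch. All names here are hypothetical, and the centering and thresholding rules are placeholder methods chosen for brevity, not the methods used in the works this specification is based on.

```python
from typing import Callable, List, Sequence

Matrix = Sequence[Sequence[float]]
Activity = Callable[[object], object]


def normalize_center(data: Matrix) -> List[List[float]]:
    """One normalization realization: subtract each gene's (row's) mean."""
    return [[v - sum(row) / len(row) for v in row] for row in data]


def quantize_sign(data: Matrix) -> List[List[int]]:
    """Quantization realization A: two levels, 1 above zero and 0 otherwise."""
    return [[1 if v > 0 else 0 for v in row] for row in data]


def quantize_ternary(data: Matrix, band: float = 0.5) -> List[List[int]]:
    """Quantization realization B: three levels, -1, 0 and 1."""
    return [[-1 if v < -band else 1 if v > band else 0 for v in row]
            for row in data]


def realize(*activities: Activity) -> Activity:
    """Compose activity implementations, in order, into one concrete process."""
    def process(data):
        for activity in activities:
            data = activity(data)
        return data
    return process


# Two realizations of the same abstract two-activity specification. Only the
# behavior of the quantization activity differs; both quantizers take a real
# matrix and return an integer matrix, so they are interchangeable.
p1 = realize(normalize_center, quantize_sign)
p2 = realize(normalize_center, quantize_ternary)
```

Because both quantizers share the same input and output structure, swapping one for the other requires no change to the surrounding process, which is the property the abstract specification relies on.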

4. Pattern specification

4.1. Modeling strategy

The following steps were used to identify the gene regulatory network inference process: (1) compare existing solutions and decompose them into a minimum set of activities that can be used to solve the problem; (2) model the identified sequences of activities and document the identified solution; and (3) evaluate the proposed solution by comparing it with tendencies of the literature.


Fig. 2. Abstract specification and alternative realizations. An abstract activity is depicted by a white rectangle with rounded corners. An activity implementation is represented by a colored rectangle with rounded corners, where each color represents a different implementation. The proposed process specification P can be realized in many different ways, e.g., P1, P2 and P3, by combining a different implementation for each activity. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)

Fig. 3. Process adaptation and extension. Original or unmodified elements are white, while new or modified elements are colored. (a) Original specification. (b) Activity replacement. (c) Activity introduction. (d) Activity replacement and activity introduction. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)


A number of papers focus on the inference of gene regulation network architectures, and most of them adopt simple discrete Markov systems to reconcile a structural dynamical description of the phenomena with the feasibility of inference processes from small samples of the expression dynamics, cf. [51,52,29,18,20]. We have considered some previous work carried out in the context of our research group as the basis for the study of existing solutions. In particular, we have considered the following gene network modeling approaches [29–32].

As a result of our study of existing solutions, we have identified a number of activities that should be carried out in order to infer a gene regulatory network description. We have identified the following set of activities: Microarray Data Normalization, Microarray Data Quantization, Functional Gene Classification, Construction Settings Definition, Network Construction, Network Definition, Graphical Representation Generation, and Network Functional Analysis (see Fig. 4).⁵

We have then identified an execution ordering for these activities and modeled this behavior according to an integrated (high-level) perspective [53]. We consider the process as a unit of behavior comprising a number of interrelated atomic activities which are performed by functional entities, i.e., entities capable of executing behavior, such as an information system or a human being. We identify neither the (functional) role these entities play in the process nor the individual contributions of these entities to the execution of each activity.

⁵ For clarity, the names of the activities and data objects shown in Fig. 4 are set in italics in the paper.

As a final step, we have compared our process specification with each of the basis solutions and with other works described in the literature, including [19,33–35].

4.2. Process specification description

The development of a gene regulatory network serves two different purposes: knowledge validation or knowledge discovery. In the former, the goal is the validation of an inference technique with a set of parameters applied to a given biological context. In this case, we select a set of genes with the same known biological function(s). A subset of the selected genes is chosen as seed genes while the others will be the test data. The inference technique and associated parameters will be adequate if most of the test genes are included in the estimated network. To increase the statistical confidence in the evaluation, the process can be evaluated under the choice of several subsets of genes. Additionally, in order to improve the inference properties, the process can also be evaluated under several sets of genes with a corresponding homogeneous functional classification.

In the latter, the goal is the acquisition of new biological knowledge. In this case, we select as seed genes a set of genes related to a biological function under study (actual knowledge) and grow the network. Then, the inferred set of genes should be validated through specific biological experiments. For example, in the study of Plasmodium falciparum [29], the gene network inference process for the malaria expression regulation system was initially validated using known glycolytic genes and later applied to acquire new knowledge about the apicoplast system by creating a gene network from a small set of known apicoplast genes.
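The knowledge validation use described above (seed genes versus held-out test genes, repeated over several seed subsets) can be sketched as follows. This is a minimal illustration under assumptions: `infer` is a hypothetical callable that maps a list of seed genes to the genes of the grown network, seed subsets are drawn uniformly at random, and the score is simply the mean fraction of test genes recovered; the text only requires that "most" test genes be included.

```python
import random


def recovered_fraction(network_genes, test_genes):
    """Fraction of the held-out test genes that appear in the network."""
    test = set(test_genes)
    return len(test & set(network_genes)) / len(test)


def validate(function_genes, infer, n_seeds, n_trials, rng):
    """Average recovery over several random seed subsets.

    function_genes: genes sharing the same known biological function(s).
    infer: hypothetical callable, seeds -> genes in the inferred network.
    n_seeds must be smaller than the number of genes so test data remains.
    """
    genes = sorted(function_genes)
    fractions = []
    for _ in range(n_trials):
        seeds = rng.sample(genes, n_seeds)
        test = [g for g in genes if g not in seeds]
        fractions.append(recovered_fraction(infer(seeds), test))
    return sum(fractions) / n_trials
```

Passing the random generator explicitly (for example `random.Random(0)`) keeps a validation run reproducible when parameters are later compared across runs.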

Fig. 4 depicts the proposed Gene Regulatory Network Inference Process BPMN specification. This proposed process specification can be used to infer a genetic regulatory network from microarray data and previous biological knowledge. Below, we provide a simplified overview of how the whole process takes place, and in the following section we describe in detail its core activity, Network Construction. Detailed specifications of the remaining activities are provided in Appendix A. These activities are described in terms of their purpose, inputs and outputs, and proposed (sample) solution.

Before Network Construction takes place, a group of activities is needed to pre-process the data into an adequate format, gather biological information to help in the inference process and define the parameters required by the construction method. After the execution of Network Construction, another set of activities refines and analyzes the results, providing feedback for new runs of the inference procedure or yielding the final results of the pipeline when they are considered satisfactory.

The Microarray Data Normalization activity takes as input gene expression Microarray Data and normalizes them in order to render values of different genes comparable, producing as output Normalized Data. These real-valued data are then discretized by the Microarray Data Quantization activity into a suitable range of integer numbers, as required by the adopted network inference procedure. The Functional Gene Classification activity aims at gathering prior biological knowledge to be used later to help improve the network estimation quality. It takes as input the microarray dataset Gene Identifiers⁶ and produces as output a list⁷ of Known Biological Functions associated to each gene identifier.

Fig. 4. Gene regulatory network inference process specification. Sequence of activities needed to perform gene regulatory network inference from expression data. First, microarray data is preprocessed into the format required by the following activities (Microarray Data Normalization and Microarray Data Quantization). Previous biological information is collected to help in the inference process (Functional Gene Classification). The set of parameters required by the inference procedure is defined, or adjusted from previous runs of the inference procedure (Construction Settings Definition). Based on those parameters, a tentative network, with uncertainty, is grown from expression data (Network Construction). Next, a network without uncertainty is defined by choosing the most relevant and reliable gene dependencies inferred from the data (Network Definition). A graphical representation is created to help in the analysis of the results (Graphical Representation Generation). Known functions of the genes in the defined network are searched (Network Functional Analysis). If the results are not satisfactory, or could potentially be improved, construction settings can be adjusted to perform the inference process again. This procedure is iterated until a satisfactory network is obtained.

The Construction Settings Definition activity aims at defining or adjusting the set of parameters needed for the Network Construction activity. It has as inputs: (1) a list of the Gene Identifiers of the genes in the time-course microarray experiment; (2) lists of Known Biological Functions (when they are available) assigned to each gene of the microarray dataset; and (optionally) (3) lists of Genes Associated to Biological Functions in a previously grown network. It produces as outputs: (1) a subset of so-called Seed Genes from the $n$ genes set (representing a meaningful subset of genes, known to be involved in the biological phenomenon under study); (2) an auxiliary subset of Genes which are known to be Related to the Phenomenon under study; and (3) a group of network Growth Bounding Conditions, including the maximum network size of the final grown network (in number of genes) and two layer growth thresholds: a layer-growth predictability threshold ($T_{Pl}$) and the minimum predictability threshold ($T_{Pmin}$).
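The outputs of Construction Settings Definition lend themselves to a small typed structure. The following Python sketch is one possible encoding, not part of the specification: the class and field names are hypothetical, and the checks that $T_{Pmin}$ does not exceed $T_{Pl}$ and that seed and related genes come from the dataset's gene identifiers are assumptions inferred from how the thresholds and gene subsets are described.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class GrowthBoundingConditions:
    max_network_size: int  # maximum size of the grown network, in genes
    t_pl: float            # layer-growth predictability threshold
    t_pmin: float          # minimum predictability threshold

    def __post_init__(self):
        if self.max_network_size < 1:
            raise ValueError("max_network_size must be positive")
        # Assumption: the minimum threshold cannot exceed the layer threshold.
        if self.t_pmin > self.t_pl:
            raise ValueError("t_pmin must not exceed t_pl")


@dataclass(frozen=True)
class ConstructionSettings:
    seed_genes: FrozenSet[str]     # Seed Genes
    related_genes: FrozenSet[str]  # Genes Related to the Phenomenon
    bounds: GrowthBoundingConditions


def define_settings(gene_ids, seed_genes, related_genes, bounds):
    """Build the settings, rejecting genes absent from the dataset."""
    unknown = (set(seed_genes) | set(related_genes)) - set(gene_ids)
    if unknown:
        raise ValueError(f"genes not in the dataset: {sorted(unknown)}")
    return ConstructionSettings(frozenset(seed_genes),
                                frozenset(related_genes), bounds)
```

Making the settings immutable means each run of the construction method can be traced back to the exact parameter set that produced it, which helps when settings are adjusted between runs.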

The Network Construction activity⁸ aims at identifying a gene regulatory network from the measured time-course microarray data. It takes as inputs: (1) the Quantized Data; (2) lists of Known Biological Functions previously assigned individually to genes in the microarray dataset; (3) a subset of Seed Genes from the microarray dataset; (4) network Growth Bounding Conditions, including predictability thresholds for next layer gene selection ($T_{Pl}$ and $T_{Pmin}$) and maximum network size (stop condition); and (5) an auxiliary subset of Genes known to be Related to the Phenomenon under study. This activity produces as output the Tentative Network Description: a directed graph $N_T$ that represents the tentative network architecture constructed with the detected predictive relations between genes, which consists of a description with uncertainty represented by predictability and reliability information of the detected gene dependencies.

The Network Definition activity performs a refinement of the results obtained in the Network Construction activity. Network Definition aims at obtaining a final network description summarizing the most relevant and reliable gene dependencies (connections) as tentative biological hypotheses produced by the inference process. This activity receives as inputs: (1) the Tentative Network Description graph produced by the Network Construction activity; (2) an optional auxiliary subset of Genes Related to the Phenomenon under study; and (3) optional lists of Known Biological Functions associated individually to genes in the time-course microarray experiment. It produces as output a Defined Network Description graph containing nodes associated to gene identifiers and directed but unweighted edges (without uncertainty, i.e., with neither predictability nor reliability information, since a decision has been taken on which connections are more meaningful).

The Graphical Representation Generation activity aims at creating a Network Graphical Representation suitable for visualization by a biologist. It has as an input a network graph description, either the Tentative Network Description graph or the Defined Network Description graph, and produces as an output a Network Graphical Representation.

After Graphical Representation Generation (“Is Network Ready?” checkpoint in Fig. 4), if considered convenient, some adjustments can be made to the parameters for a new run of the construction method, or the method can be run from a certain point (layer) of the previously grown network, by going back to the Construction Settings Definition activity. Otherwise, the procedure continues with the next activity.

The Network Functional Analysis activity aims at identifying new gene functionalities found in the final Defined Network Description, which may be related to the ones under study (those in the initial seed gene set). This activity has as inputs: (1) the Defined Network Description graph; (2) lists of Known Biological Functions assigned individually to genes in the microarray dataset; and optionally (3) the Network Graphical Representation as a visual aid for the biologist. This activity produces as output a list of Genes Associated to Biological Functions found in the grown network.

⁶ By “gene identifier” we mean a standard or systematic name or an alias, one for each gene, used to unequivocally identify a single gene in a knowledge base. In the context of this work, when we refer to a gene we mean its respective identifier, unless stated otherwise.

⁷ Possibly empty for some genes.

⁸ Described in detail in the next subsection.

As previously discussed, the development of a gene regulatory network serves two different purposes. If the analysis results are considered adequate according to the associated gene network development criteria, the process ends. Otherwise, it goes back to the Construction Settings Definition activity to adjust or redefine parameters for a new run of the inference procedure.
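The control flow of the whole process, including the two decision points, can be summarized in a short Python sketch. It is only an illustration of the ordering described in this section: the dictionary keys, the callable signatures and the `max_runs` bound are assumptions introduced here (the specification itself sets no limit on the number of runs), and each entry of `act` stands for whatever realization of the corresponding activity is plugged in.

```python
def run_inference(microarray, gene_ids, act, max_runs=10):
    """Run the inference process until the results are satisfactory.

    act maps hypothetical activity names to their implementations.
    Returns (defined network, number of runs), or (None, max_runs) if no
    run produced satisfactory results within the assumed run limit.
    """
    # Pre-processing and knowledge gathering happen once.
    normalized = act["normalize"](microarray)
    functions = act["classify"](gene_ids)
    quantized = act["quantize"](normalized)

    feedback = None  # Genes Associated to Biological Functions, if any
    for run in range(1, max_runs + 1):
        settings = act["define_settings"](gene_ids, functions, feedback)
        tentative = act["construct"](quantized, functions, settings)
        defined = act["define_network"](tentative, functions, settings)
        picture = act["draw"](defined)
        if not act["is_network_ready"](picture):
            continue  # "Is Network Ready?" = No: adjust settings, rerun
        feedback = act["analyze"](defined, functions, picture)
        if act["are_results_satisfactory"](feedback):
            return defined, run  # "Are Results Satisfactory?" = Yes
    return None, max_runs
```

In practice the two decisions are taken by a biologist inspecting the results, so the corresponding callables would typically wrap an interactive step rather than an automatic test.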

4.3. Detailed activity description: Network Construction

The Network Construction activity aims at identifying a gene regulatory network from the measured time-course microarray data. The constructive method starts with an initial layer formed by the subset $S$ of Seed Genes and successively adds new gene layers. As discussed previously, seed genes can represent either a small set of genes chosen randomly from a set of genes that have the same function or a set of known genes associated to the functionality under study. Each new layer is constructed both with the genes best predicted by genes in the previous layer and with the genes that best predict the genes in the previous layer, after ranking all the candidates together (considered as predictors or as targets of genes in the previous layer) by their predictability values.

Network Construction has as inputs⁹: (1) Quantized Data: an n×m matrix MQ of integer numbers with the normalized and discretized gene expression values; (2) lists LFi of Known Biological Functions previously assigned individually to genes in the microarray dataset; (3) a subset S of Seed Genes from the n genes (from the list GID of Gene Identifiers) in the microarray dataset; (4) network Growth Bounding Conditions, including the maximum network size NM of the final grown network, in number of genes, and two layer growth thresholds, TPl, which represents the layer-growth predictability threshold, and TPmin, which represents the minimum predictability threshold; and (5) an auxiliary subset SF of Genes known to be Related to the Phenomenon under study.

Network Construction produces as output the Tentative Network Description: a directed graph NT that represents the tentative network architecture constructed with the detected predictive relations between genes. A measure of predictability and reliability is associated to each identified relation in the network (edge in the graph). Each node in this graph represents a single gene and is defined by a gene identifier included in GID. Each directed edge goes from a predictor gene to its target gene and is identified by three elements: (1) predictability p, which measures how exactly the target value can be estimated by observing the predictor values (the higher p, the lower the expected estimation error); (2) reliability r, where 1 ≤ r ≤ l and l is the total number of grown layers (the higher r, the lower the reliability); and (3) co-predictor set c, which lists the genes that, together with the edge predictor, are used to predict the edge target (co-predictors). The set c is empty when the edge predictor has no co-predictors, i.e., when the edge predictor alone is used to predict the edge target.

9 See Appendix A for more detailed specification of inputs generated by previous activities.

The Seed Genes set is the starting point (initial layer and initial network nodes) of the network construction activity. Additional gene layers can be added to the network according to the procedure below, until a stop condition is reached. Layer i+1 is grown considering only connections among the candidate genes and genes in layer i (previous layer)¹⁰:

• for each gene s in layer i, first compute predictability values for all configurations where s is a predictor having as target one of the candidate genes or another gene in layer i, and then compute predictability values for all configurations where s is a target with at least one predictor gene which is a candidate or is also included in layer i¹¹;

• rank the candidate genes participating in those gene dependencies together in decreasing order of their predictability values (preserving the wiring connection information corresponding to each predictability value);

• include in layer i+1 all candidate genes whose predictability values are not below TPl, store the corresponding wiring connection information (from or to layer i), and also store wiring connection information between genes in layer i satisfying this condition when detected;

• also include in layer i+1 all candidate genes in SF having predictability values below TPl but not below TPmin, and store the related wiring connection information, including detected wiring connections between genes in the same layer when they satisfy this condition;

• if, after including layer i+1, the network size (in number of genes) is greater than or equal to NM, then a stop condition has been reached;

• on the other hand, if none of the candidate genes satisfied the predictability threshold TPl (≥) and none of the candidate genes in SF satisfied TPmin in the current iteration, then a stop condition has been reached. Also in this case, wiring connection information between genes inside layer i (previous layer) satisfying any of those thresholds should be stored, if detected.
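The layer-growing steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Edge` and `grow_layer`, and the representation of already computed predictability values as (predictor, target, p) triples, are hypothetical. Co-predictor sets are omitted, and wiring between two genes of layer i is kept only when it reaches TPl, which simplifies the rule in the text.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    predictor: str
    target: str
    p: float  # predictability of the target given the predictor


def grow_layer(scored, layer, network, related, t_pl, t_pmin, n_max):
    """Grow layer i+1 from layer i (hypothetical helper).

    scored  -- Edge triples whose predictability has already been computed
    layer   -- set of genes in layer i
    network -- set of all genes adjoined so far (including layer i)
    related -- auxiliary set SF of genes related to the phenomenon
    Returns (new_layer, kept_edges, stop).
    """
    def accepted(gene, p):
        # TPl applies to every candidate; genes in SF may pass with TPmin.
        return p >= t_pl or (gene in related and p >= t_pmin)

    new_layer, kept = set(), []
    # Rank all dependencies together, in decreasing order of predictability.
    for e in sorted(scored, key=lambda e: e.p, reverse=True):
        ends = {e.predictor, e.target}
        if not ends & layer:
            continue  # only connections touching layer i are considered
        others = ends - layer
        if not others:
            # Both ends lie in layer i: keep intra-layer wiring (simplified).
            if e.p >= t_pl:
                kept.append(e)
            continue
        (gene,) = others
        if gene in network:
            continue  # gene of an earlier layer, handled in a previous iteration
        if accepted(gene, e.p):
            new_layer.add(gene)
            kept.append(e)
    # Stop when nothing was adjoined or the maximum network size is reached.
    stop = not new_layer or len(network) + len(new_layer) >= n_max
    return new_layer, kept, stop
```

A caller would repeat this step, passing the returned layer as the next `layer` and adding it to `network`, until `stop` is true.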

Genes in the initial seed set have the greatest level of reliability because their selection has been based on biological knowledge indicating their involvement in the phenomenon under study. For that reason, they were chosen as the starting point around which the network was grown. Successive gene layers, as they get farther from the initial seed set, have decreasing reliability because the probability that a given gene indeed belongs to the network we are trying to identify is conditioned on the probability that genes in the previous layers also belong to the network. Each new layer growth has an associated estimation error and its veracity (belonging to the network or not) is conditioned by the estimation errors inherited from the previously grown layers. This implies that the higher the layer, the lower its reliability. To preserve this reliability information, edges in the output graph should be labeled (reliability r) with the number of the highest layer to which any of the genes connected by the edge belongs.
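The labeling rule stated in the last sentence is simple enough to write down directly. In this sketch the function name and the `layer_of` mapping (gene identifier to the number of the layer in which the gene was adjoined) are assumptions made for illustration; how the seed layer itself is numbered is left to the caller.

```python
def label_reliability(edges, layer_of):
    """Label each (predictor, target) edge with the highest layer of its two genes."""
    return {(u, v): max(layer_of[u], layer_of[v]) for u, v in edges}
```

Under this rule a larger label means a connection farther from the seed set, and therefore a less reliable one.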

10 All relevant connections involving genes in layers previous to layer i (if there are any) have been included in previous iterations of the procedure.

11 Note that at the first step layer i is the initial network formed by the seed set; in the following iterations of this procedure, layer i is formed only by the genes included in the previous iteration. Considering both predictor(s) and target inside layer i is used to detect wiring connections between genes in the same layer (the initial seed set or the last previously added layer). This is required because, when adding a new layer, only connections with the previous layer are considered and the initial seed set has no wiring connection information (between its genes). To avoid losing relevant connections between genes in the same layer, connections between genes in layer i are also searched while growing layer i+1.

5. Genesis and robustness of the proposed pattern specification

5.1. Pattern genesis

We have used the works of Hashimoto et al. [30], Barrera et al. [29] and Ris et al. [31,32] as the basis for our pattern-oriented gene network inference process specification. Since we have generalized some of the identified activities in order to improve the proposed solution, some activities present in these works do not fit our specification exactly. In what follows we describe how these works fit our specification.¹² In the next subsection, we describe how other works from the literature relate to our proposal and how these contributions could be integrated into our pattern methodology.

5.1.1. Hashimoto et al. [30]

In this work, the microarray data had already been preprocessed in a previous work [38]. Thus, the Microarray Data Normalization and Quantization activities were absent because Normalized and Quantized Data were available. Since in cancer it is not feasible to obtain closely spaced time samples as the disease originates and disseminates, the microarray data were not of time-course type; instead, they were samples taken from tissues at different stages during cancer development. Lacking a time relationship between samples, statistical relationships were estimated between genes in the same sample. The dataset consisted of 25 samples that had already been normalized and binarized (two-level quantization). The parameters needed for network construction were the seed gene subset S and the maximum network size NM. In this case, the Construction Settings Definition activity was performed manually by the biologist based on previous biological knowledge and practical considerations.

12 The reader may refer to Appendix A for more detailed specifications of individual activities in our pattern to compare them with the ones in the works discussed here.

Regarding the Network Construction activity, the initial network was the seed gene set, with the rest of the genes in the gene set considered as candidate genes. At each growing step only one gene was adjoined to the network. The seed subset for each step was the whole network grown until the previous step. The chosen gene was the candidate gene that maximized an adjoin function involving the strength of connection from and to the network and, possibly, the sensitivity of the candidate gene from genes outside the network. As measures of strength of connection they used the coefficient of determination CoD [37] or influence [10] (not simultaneously). When evaluating the strength of connection from the network, the candidate gene was the target and the whole predictor set was inside the network. When evaluating the strength of connection to the network, the candidate gene acted as a predictor (with co-predictors, if any, inside the network) and the target gene was part of the network. The third measure of strength of connection proposed in this work was the sensitivity of the candidate gene from genes outside the network, which may be desirable to reduce in order to enhance network autonomy. The stop condition can be either the maximum network size attained, the strength of connection falling below some minimum threshold, or a combination of both. This procedure differs from our specification in three aspects: (1) only one gene is added at each iteration; (2) in the considered relationships (strengths of connection), all genes but the single candidate to be adjoined are already in the growing network; and (3) the seed subset for each growing step is the whole network grown until the previous step (there was no progressive layer structure). Due to this last difference, there is no measure of reliability to be associated to edges in the output graph as defined in our proposal, because they lose track of the path length (in number of growing steps) from the original seed set to the newly adjoined gene. The reliability measure in our constructive method is an improvement since it extends previous approaches and can be used as additional information when evaluating the quality of the obtained gene relationships.
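To make the strength-of-connection measure concrete, the coefficient of determination is commonly defined as CoD = (ε0 − εopt)/ε0, where ε0 is the error of the best guess of the target without observing any predictor and εopt is the error of the optimal predictor. The sketch below computes a plug-in (resubstitution) estimate from quantized samples. The function name and the data layout are assumptions made here, and this is not claimed to be the estimator used in [30].

```python
from collections import Counter, defaultdict


def cod(predictors, target):
    """Resubstitution estimate of the coefficient of determination.

    predictors -- one tuple of quantized predictor values per sample
    target     -- one quantized target value per sample
    """
    n = len(target)
    # Error of the best constant guess (no predictor observed).
    eps0 = 1 - max(Counter(target).values()) / n
    if eps0 == 0:
        return 0.0  # constant target: nothing left to explain (convention)
    # For each observed predictor configuration, guess its most frequent target.
    by_config = defaultdict(Counter)
    for x, y in zip(predictors, target):
        by_config[x][y] += 1
    errors = sum(sum(c.values()) - max(c.values()) for c in by_config.values())
    return (eps0 - errors / n) / eps0
```

A value of 1 means the predictors determine the target on the observed samples; a value of 0 means they do no better than the constant guess.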

The Network Definition activity was not needed because the network construction procedure already chooses the most relevant connection and adds it, one at a time, to the network. The Graphical Representation Generation activity was carried out using Graphviz,¹³ a publicly available open source graph visualization software. Finally, the Functional Gene Classification and Network Functional Analysis activities were carried out manually by the biologist by defining or re-defining the seed gene subset for the first or for new runs of the pipeline until a satisfactory network is obtained, taking into account previous biological knowledge and analyzing the results of previous pipeline runs.

5.1.2. Barrera et al. [29]

In this work, the input matrix M consisted of 48 time-course microarray samples¹⁴ [54]. The elements in the normalized matrix MN were calculated by

$$M_N[i,j] = \frac{M[i,j] - E_i}{\sigma_i}, \quad i = 1, \ldots, n,$$

where the expectation $E_i$ and the standard deviation $\sigma_i$ were estimated by

$$\hat{E}_i = \frac{1}{m} \sum_{j=1}^{m} M[i,j] \quad \text{and} \quad \hat{\sigma}_i = \sqrt{\frac{\sum_{j=1}^{m} \left(M[i,j] - \hat{E}_i\right)^2}{m-1}},$$

respectively.

After normalizing the data, the expression data were quantized into three levels, −1, 0 and +1. For each gene in the dataset: (i) its expression values were divided into two groups: negative and positive values; (ii) for each group, the mean value was calculated; and (iii) finally, values greater than the positive mean were mapped to +1, values smaller than the negative mean were mapped to −1, and other values were mapped to 0. The Functional Gene Classification and Network Functional Analysis activities were performed “manually” by the biologist with the aim of defining or re-defining the seed subset for new runs of the construction procedure or analyzing its results.
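The three-level quantization rule described above can be sketched per gene as follows. The function name is hypothetical, and the handling of a gene with no positive (or no negative) values, for which the corresponding level is simply never assigned, is an assumption of this example rather than something stated in [29].

```python
from statistics import mean


def quantize_three_levels(row):
    """Quantize one gene's normalized expression values into -1, 0 and +1."""
    pos = [v for v in row if v > 0]
    neg = [v for v in row if v < 0]
    # Thresholds are the means of the positive and of the negative values.
    hi = mean(pos) if pos else float("inf")
    lo = mean(neg) if neg else float("-inf")
    return [1 if v > hi else -1 if v < lo else 0 for v in row]
```

For the values −3, −1, 0, 1, 3 the thresholds are −2 and 2, so only the two extreme values leave level 0.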

In the Construction Settings Definition activity, the seed gene subset S was specified manually by the biologist. Two other parameters were also specified manually: (1) the cardinality of the predictor set (cardinalities 1 or 2 were used) and (2) the number p of predictor sets to be associated to each target. In the Network Construction activity, genes in the initial seed subset S are considered only as targets and the predictability measure used was the expectation of the mutual information [36]. Predictors were considered in the sample previous in time to the target sample. For each gene in the seed subset, the p best predictor sets were estimated (p = 5 was used in some of the experiments). These predictor–target configurations together with their associated predictability measures formed the output network description graph (Tentative Network Description NT). This is a one-step network construction procedure (it is not iterated further), so it does not have a multi-layer structure. For this reason, there is no reliability measure r associated to the graph edges and the Network Definition activity was not performed in this case. For the Graphical Representation Generation activity, the software Graphviz was used (as in the previous case [30]) to generate the network graphical representation from the network description graph NT.

5.1.3. Ris et al. [31,32]

In this work, the Microarray Data Normalization and Quantization pre-processing activities were performed by the same methods used in the previous case [29]. In this case the dataset consisted of 16 time-course microarray samples [55].

13 Graphviz – Graph Visualization Software (http://www.graphviz.org/).

14 CAMDA 2004 Conference Contest Datasets (http://www.camda.duke.edu/camda04/datasets/).

The Functional Gene Classification activity was performed manually by accessing the Gene Ontology (GO) [39] and KEGG [40] databases. In the Construction Settings Definition activity, the selection of the seed genes was performed manually by the biologist, with possible restrictions from GO and KEGG functional information. The number k of time instants (samples) between the predictors and the target, which represents the time that a change in the expression of one gene requires to affect the expression of another gene, was adjusted manually (k = 1 was used).

For this construction method, in the Network Construction activity, seed genes were considered only as predictors. The adopted connectivity or predictability measure was conditional entropy [36]. Minimum conditional entropy represents maximum predictability of the target by the predictors. The following iterative procedure was used to grow the network. For each gene G in the n-gene dataset, the subset (or subsets) of the seed gene set S that best predicts gene G is searched for within the whole seed subset S. This feature selection problem is equivalent to finding a subset (or subsets) with minimum conditional entropy [56] and it was solved using the U-curve algorithm [57]. The algorithm finds the optimal number of predictors for best prediction. The n genes in the dataset are ranked by their conditional entropy values (best predicted are ranked first). A new set of seed genes is chosen from the top-ranked genes with possible restrictions applied from GO and KEGG functional information. This process can be iterated i times. Each new seed set at the end of an iteration represents a new gene layer adjoined to the growing network. This multi-layer structure allows the application of a reliability identifier r to each network connection (edge), and the output network graph NT can exactly meet our pattern definition.

The Network Definition activity was not performed separately (restrictions were applied while growing each layer), but as the network description graph (Tentative Network Description NT) can have the same complete structure as in our pattern, it could be carried out by keeping more information during network construction and performing a final refinement of the complete results with this activity. The software Graphviz was also used in this case to generate the Network Graphical Representation from the Defined Network Description graph (Graphical Representation Generation activity) and the Network Functional Analysis was also performed manually by the biologist.
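The conditional-entropy criterion used in the Network Construction step of this method, with predictors observed k samples before the target, can be illustrated with a plain plug-in estimate. The function name and the data layout (one list of quantized values per gene) are assumptions, and the sketch does not reproduce the estimator or the U-curve search of [56,57].

```python
from collections import Counter
from math import log2


def conditional_entropy(predictors, target, k=1):
    """Plug-in estimate of H(target[t] | predictors[t - k]) in bits.

    predictors -- list of quantized time series, one per predictor gene
    target     -- quantized time series of the target gene
    k          -- number of time instants between predictors and target
    """
    m = len(target)
    # Pair each target value with the predictor configuration k samples earlier.
    pairs = [(tuple(x[t - k] for x in predictors), target[t]) for t in range(k, m)]
    joint = Counter(pairs)
    marginal = Counter(x for x, _ in pairs)
    n = len(pairs)
    # H(Y|X) = -sum over (x, y) of p(x, y) * log2 p(y | x)
    return -sum(c / n * log2(c / marginal[x]) for (x, _), c in joint.items())
```

A result of 0 bits means the lagged predictors determine the target on the observed samples, which corresponds to maximum predictability; ranking candidate genes by increasing conditional entropy then places the best predicted genes first.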

5.2. Pattern robustness

Other works (like [13,16,34,33,58] for example) may be integrated into our pattern by implementing their own Network Construction and corresponding Construction Settings Definition activities according to their method. They may use all, part or none of the defined pre-processing activities (Microarray Data Normalization and Quantization, Functional Gene Classification) and post-processing activities (Network Definition, Graphical Representation Generation, Network Functional Analysis) and/or provide their own implementation for some of them. If required by their approach, they can also define and implement some additional activity (or a sequence of them) that outputs data of the types defined in our pattern, coupling to the processing pipeline of Fig. 4 at some point.

GRN inference from microarray data is a complex task due to its combinatorial nature and limitations on available data samples imposed by technical, economical and biological factors [19]. Current tendencies in the literature aim to overcome these difficulties by improving inference techniques [35], introducing prior information from the literature [34], constraining the network search space [33], and increasing the available data through the integration of heterogeneous experimental data [19]. In the following paragraphs, we discuss how some examples from these approaches can be integrated into our pattern specification.

The work by Djebbari and Quackenbush [34] proposes a way of taking advantage of prior information about gene–gene interactions to improve the quality of a Bayesian network inference process. They construct seed networks derived from the literature and/or from available protein–protein interaction data as a starting point for the inference of Bayesian networks from microarray data. Then, they perform bootstrap averaging to find a high-confidence network and use the KEGG database information to validate the inferred gene relationships [40]. Prior knowledge is used in two ways in this work: to construct the initial seed network and to validate the results. Their seed network construction from the literature or protein-interaction data and KEGG validation could be integrated into our pattern by defining and implementing the corresponding activities. The Network Construction procedure could be run using one of those seed networks as the initial layer (our Network Construction activity can be easily adapted to start from one of those pre-built networks) and their KEGG validation procedure could be implemented as part of the Network Functional Analysis activity. Alternatively, their Bayesian network inference procedure could be used by implementing the Construction Settings Definition and Network Construction activities according to their method. Also in this case, our preprocessing (normalization and quantization) and post-processing (visualization) activities could be used.

In the work by Vahedi et al. [33], the authors propose a CoD inference algorithm that avoids the creation of spurious non-singleton attractors (attractors with more than one state) when inferring GRNs from time-independent data (in this case, the data states – samples – are assumed to represent stationary states of the system dynamics, i.e., singleton attractors). Without time-course data, the CoD cannot provide information on the direction of prediction, detecting many bidirectional gene relationships in cases when actually only one gene should be a predictor of the other gene. These spurious connections in the network regulatory graph give rise to many spurious attractor cycles. Their algorithm constrains the number of bidirectional gene relationships and infers networks without spurious non-singleton attractors while keeping connectivity based on strong gene prediction. The algorithm also guarantees that a minimum number of the data samples represent singleton attractors. This is an example of constraining the sub-classes of networks to be inferred, avoiding, in this case, undesired effects of the inference method. Their work could be integrated into our pattern by implementing the Construction Settings Definition and Network Construction activities with their algorithm. As they use binary data, the Microarray Data Quantization activity should be implemented for two-level discretization. If needed, they could use the Microarray Data Normalization, Graphical Representation Generation and Network Functional Analysis activities.

The review by Anastassiou [35] discusses the new information-theoretic measure of synergy applied to multiple interacting genes. As synergy is a measure of the additional contribution of information provided by the “whole” compared with the sum of the contributions of the “parts”, it is suited to detect cooperative interaction among the variables. It permits the hierarchical decomposition of modules into submodules of highly interacting genes in a tree of synergy. It is a new and significant measure of dependencies with direct application to the improvement of network inference methods [16,59]. In our pattern, synergy measures could then be part of the Network Construction activity. If synergy is computed with continuous values, avoiding data discretization to preserve full information (as in [16]), the Microarray Data Quantization activity may not be needed.
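As an illustration of the “whole versus sum of the parts” idea, the synergy of two genes with respect to a target variable is commonly written in terms of mutual information. The symbols $G_1$, $G_2$ (the two genes), $T$ (the target or phenotype) and $I$ (mutual information) are notation chosen here, not taken from the text above:

$$\mathrm{Syn}(G_1, G_2; T) = I(G_1, G_2; T) - \left[ I(G_1; T) + I(G_2; T) \right].$$

A positive value indicates that the pair jointly carries more information about $T$ than the two genes contribute individually, which is the cooperative effect the measure is meant to detect.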

The review by Hecker et al. [19] discusses how the incorporation of additional information from heterogeneous data sources can support the network inference process. Many techniques have been proposed to identify GRNs from transcriptome data (mainly obtained by DNA microarray experiments), and they have been widely used in the field of network inference, but they are inherently bounded by limitations in the information content of such data. It is beneficial to integrate system-wide genomic, transcriptomic, proteomic and metabolomic measurements as well as prior biological knowledge to improve the reliability and quality of the network structure and dynamics inference. A template of the network can be built using various levels of additional information, e.g., from databases and the literature, and then an inference strategy can be applied to fit the model to the data while taking the template into account. In our pattern, additional information from other sources, like protein–protein interactions or TF–DNA interactions, can be translated into a priori connections or sub-networks from which the network construction procedure could start, in a similar way to what was explained when discussing how to integrate the type of seed networks obtained in [34]. If a different network construction procedure were used, additional information from sources other than microarray experiments would also be translated into an input to the corresponding implementation of the Network Construction activity.

6. Conclusion

This paper describes a pattern for gene network inference from microarray data and some prior biological knowledge. The proposed pattern consists of a number of interrelated activities whose composed behavior was specified using a standard modeling notation, BPMN. For each identified activity we have defined its purpose, inputs, outputs and a sample solution. This abstract specification can be used as a basis for different concrete realizations, thus allowing the execution of several known methods for gene network inference.

The proposed specification also supports the composition of known methods and the creation of new ones due to three basic mechanisms: (i) the combination of modules of several known methods; (ii) the implementation of new solutions for the activities defined; and (iii) the extension of the proposed pattern by the definition of new activities that are compatible with the pattern. In order to analyze the potential of the proposed pattern to represent new methods, it was designed taking into consideration just a few methods and was afterwards checked for its capacity to adapt to new methods not considered in its design. These considerations were presented in the previous section and revealed the strong adequacy of the pattern, since a number of methods from modern research tendencies were considered in this analysis.

We are currently working on an implementation of the proposed pattern specification using a general purpose integration and execution environment. We are also working on the definition of a set of requirements for the development of a new integration environment that includes facilities for collaborative research as proposed in this work. We believe that an environment with such characteristics may become a standard for gene network inference research and a paradigm for collaborative research in Bioinformatics, since patterns for other fields of bioinformatics research (e.g., protein structure, DNA sequences, metabolic pathways, etc.) could be developed and integrated with our gene network inference pattern and with other available bioinformatics patterns.

7. Summary

Patterns have been widely used in Computer Science. A pattern describes a generic solution to an existing problem in a more readable and accessible form. Patterns capture proven solutions to real problems and generalize these solutions so they can be reused in similar contexts. A pattern-oriented process specification consists of a generic and abstract description of a process that can be used to solve a problem in a given context. This paper presents a pattern-oriented specification of a genetic regulatory network inference process performed from microarray data and prior biological knowledge. The proposed process specification was conceived based on prior work on gene inference networks. This process is described in terms of a number of interrelated activities. We have used the Business Process Modeling Notation (BPMN), standardized by the Object Management Group, as our process modeling notation.

The proposed specification consists of a number of activities needed to perform gene regulatory network inference from expression data. First, microarray data is preprocessed into the required format (activities Microarray Data Normalization and Microarray Data Quantization). Previous biological information is collected to help in the inference process (activity Functional Gene Classification). The set of parameters required by the inference procedure is defined or adjusted from previous runs of the inference procedure (activity Construction Settings Definition). Based on those parameters, a tentative network, with uncertainty, is grown from expression data (activity Network Construction). Next, a network without uncertainty is defined by choosing the most relevant and reliable gene dependencies inferred from the data (activity Network Definition). A graphical representation is created to help in the analysis of the results (activity Graphical Representation Generation). Known functions of the genes in the defined network are searched (activity Network Functional Analysis). If the results are not satisfactory or could potentially be improved, construction settings can be adjusted to perform the inference process again. This procedure is iterated until a satisfactory network is obtained.

Each identified activity was then generically described in terms of its purpose, required input, produced output and sample implementation (behavior). Such an abstract specification can be used as a basis for different implementation solutions and can contribute to fostering collaborative research on gene regulatory network inference. The proposed specification also supports the composition of known methods and the creation of new ones due to three basic mechanisms: the combination of modules of several known methods; the implementation of new solutions for the activities defined; and the extension of the proposed pattern by the definition of new activities that are compatible with the pattern. In order to analyze the potential of the proposed pattern to represent new methods, it was designed taking into consideration just a few methods and was afterwards checked for its capacity to adapt to new methods not considered in its design. This assessment revealed the strong adequacy of the pattern, since a number of methods from modern research tendencies were considered in this analysis.

Conflict of interest statement

None declared.

Acknowledgments

The authors are grateful to FAPESP (99/12765-2, 01/09401-0, 04/03967-0 and 05/00587-5), CNPq (300722/98-2, 468413/00-6, 521097/01-0, 474596/04-4 and 491323/05-0) and CAPES (23038.042092/2008-39) for financial support. This work was partially supported by Grant 1D43 TW07015-01 from the National Institutes of Health, USA.

Appendix A. Detailed activity description

A.1. Microarray data normalization

Levels at which genes are over-, under- or not differentially expressed may vary greatly from one gene to another. The Microarray Data Normalization activity aims at rendering gene expression values comparable (i.e., varying over the same range of values under the same expression conditions) in order to apply a uniform criterion to all genes in the dataset when defining – later – their quantized discrete levels.

This activity has as input the Microarray Data: an n×m matrix M of real numbers representing relative gene expression values – as log2 of the signal/reference ratio – of n genes at m consecutive time instants, and produces as output Normalized Data: an n×m matrix MN of real numbers with the normalized gene expression values.

There are different approaches for normalizing the data. For example, a convenient way is to normalize the microarray data in the n×m matrix M leaving each gene signal with a normal distribution having zero mean value and standard deviation equal to 1, so we can easily compare their expression values.
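One way to realize the zero-mean, unit-deviation normalization mentioned above is a per-gene standard score. In this sketch the function name is hypothetical, the matrix is a plain list of rows (one per gene), and each gene is assumed to have at least two samples and a non-constant signal, since the standard deviation would otherwise be undefined or zero.

```python
from statistics import mean, stdev


def normalize_rows(m):
    """Shift each gene (row) to zero mean and scale it to unit standard deviation."""
    normalized = []
    for row in m:
        mu = mean(row)
        sd = stdev(row)  # sample standard deviation (m - 1 in the denominator)
        normalized.append([(v - mu) / sd for v in row])
    return normalized
```

After this step every gene varies on the same scale, so a single thresholding criterion can be applied to all genes during quantization.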

A.2. Microarray data quantization

The Microarray Data Quantization activity aims at discretizing the real-valued expression data into a given range of integer numbers in order to facilitate further processing.

This activity has as input the Normalized Data: an n×m matrix MN of real numbers with the normalized gene expression values. This activity produces as output the Quantized Data: an n×m matrix MQ of integer numbers – in a predefined range – with the normalized and discretized gene expression values.

In order to map the continuous real values from the normalized matrix MN into the discrete integer numbers of the quantized matrix MQ, adequate thresholds (following some plausible criterion) need to be determined in order to define the interval of continuous values that will be mapped into each discrete value.

A.3. Functional gene classification

It is well established that biological knowledge can be used as an additional input to improve the quality of the estimation process. In this sense, the Functional Gene Classification activity aims at associating known biological functions (when available) to the genes in the time-course microarray data. Depending on the context or the phenomenon being studied, these functions may refer to different levels of physiologic activity (cell, tissue, organ, organism). Examples of biological functions include DNA repair, cell cycle control, glycolysis, synthesis of lipids, tumor suppression or any normal physiological activity.

This activity has as input a list, GID, of n Gene Identifiers for the n genes whose expression levels are measured by the time-course microarray experiment. This activity produces as output a list, LFi, of Known Biological Functions assigned to each gene identifier gi in the input list GID (1 ≤ i ≤ n). Since some genes may not have any known biological function (in fact, many do not), the assigned list LFj can be empty for a particular gene identifier gj.

Many publicly available databases offer gene functionality information, including the Gene Ontology Consortium (GO) [39], which links genomic information to its biological process, and Kyoto Encyclopedia of Genes and Genomes (KEGG) [40], which links genomic information to metabolic or regulatory pathways. Thus, this activity may be carried out automatically by performing a search in such databases (or any available knowledge base), or manually with the provision of particular information by the biologist. However, the lack of complete function name standardization between different databases poses a problem when accessing several databases automatically and merging their search results. As a result, some tools have been devised to deal with this problem, e.g., SOURCE [60].
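As an illustration, the sketch below builds the lists LFi from several annotation sources that have already been retrieved as in-memory dictionaries. No real database API is used: the sources, the gene identifiers and the helper names are hypothetical, and the name normalization shown (lower-casing and collapsing whitespace) is a deliberately crude stand-in for the reconciliation performed by dedicated tools.

```python
from typing import Dict, List, Sequence


def normalize_name(name: str) -> str:
    """Crude normalization so that 'DNA Repair' and 'dna  repair' are merged."""
    return " ".join(name.lower().split())


def classify_genes(gid: Sequence[str],
                   sources: Sequence[Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Return, for each gene identifier in GID, its list LFi of known functions.

    Genes without any annotation receive an empty list.
    """
    lf: Dict[str, List[str]] = {}
    for gene in gid:
        seen = set()
        functions: List[str] = []
        for source in sources:
            for name in source.get(gene, []):
                key = normalize_name(name)
                if key not in seen:  # skip duplicates across sources
                    seen.add(key)
                    functions.append(key)
        lf[gene] = functions
    return lf


# Hypothetical identifiers and annotation sources.
GID = ["g1", "g2", "g3"]
source_a = {"g1": ["DNA repair"], "g2": ["Glycolysis"]}
source_b = {"g1": ["dna  repair", "cell cycle control"]}
LF = classify_genes(GID, [source_a, source_b])
print(LF)
```

Here g3 has no known function, so its list is empty, which matches the case of unannotated genes described above.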

A.4. Construction settings definition

The Construction Settings Definition activity aims at defining or adjusting the set of parameters needed for the network construction procedure. This activity is typically carried out by the biologist without any special automated support in order to define or adjust the parameters required to run the network construction process in a given desired manner.

This activity has as inputs: (1) a list, GID, of n Gene Identifiers of the genes in the time-course microarray experiment; (2) lists, LFi, of Known Biological Functions assigned to each gene gi in the microarray dataset (1 ≤ i ≤ n); and (3) (optional) Genes Associated to Biological Functions: lists, LGj, of genes, from a previously grown network, associated to each biological function fj found in that network.

The current activity produces as outputs: (1) a subset, S, of Seed Genes (represented by their identifiers) from the n Gene Identifiers set GID (possibly adjusted or redefined from previous runs of the network growing procedure); (2) Genes Related to the Phenomenon: an auxiliary subset of genes, SF, known to be related to the phenomenon under study; and (3) network Growth Bounding Conditions, including the maximum network size NM of the final grown network (in number of genes) and two layer growth thresholds: TPl, which represents the layer-growth predictability threshold, and TPmin, which represents the minimum predictability threshold.

The Seed Genes subset S must be a meaningful subset of genes, known to be involved in the biological phenomenon under study, selected from the whole set of genes GID. They will be used as the starting point from which the gene regulatory network will be grown. Seed gene selection is usually carried out manually by the biologist based on biological knowledge, available information or even the biologist's intuition. In such a case, the biologist chooses an arbitrary set of genes (from our point of view). Some new functions that appear to be related to the phenomenon under study, because they were revealed by previous runs of the network construction procedure, may also be used to adapt or redefine the initial subset of seed genes. In this sense, one may use the lists LGj of genes associated to each biological function fj, found in a previously grown network by the Network Functional Analysis activity. If many genes in that grown network share the same new function, this may indicate that the function is related to the phenomenon from which the previous seed set was defined. Including other genes involved in that function in the seed set may then lead to the detection of new relevant network connections in the following run of the growing procedure. The initial seed gene selection can also be performed automatically, for example, by searching in a database for genes that share some predefined feature, such as being involved in a given biological function or pathway, or being regulated by some substance or stress signal. In that sense, the lists of known biological functions LFi, assigned individually to genes in the microarray dataset, can be used to select genes known to share some common function, as a starting point to attempt identifying the genetic network that controls that function.
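Both forms of automatic support can be sketched as follows. The first function selects as seeds the genes annotated with a given function in the lists LFi; the second extends a seed set with genes annotated with functions that were frequent in a previously grown network (lists LGj). The function names and the min_genes cut-off are hypothetical: the text only says "many genes", so the cut-off is an assumption left to the biologist.

```python
from typing import Dict, Iterable, List, Set


def select_seed_genes(lf: Dict[str, List[str]], target_function: str) -> Set[str]:
    """Select the genes whose list LFi contains the target function."""
    return {gene for gene, functions in lf.items() if target_function in functions}


def extend_seeds(seeds: Iterable[str],
                 lf: Dict[str, List[str]],
                 lg: Dict[str, List[str]],
                 min_genes: int) -> Set[str]:
    """Add genes annotated with any function shared by at least `min_genes`
    genes of a previously grown network (assumed reading of 'many genes')."""
    frequent = {function for function, genes in lg.items() if len(genes) >= min_genes}
    extra = {gene for gene, functions in lf.items() if frequent.intersection(functions)}
    return set(seeds) | extra


# Hypothetical annotations (LFi) for the genes in the dataset.
LF = {
    "g1": ["dna repair"],
    "g2": ["glycolysis"],
    "g3": ["dna repair", "cell cycle control"],
    "g4": ["cell cycle control"],
    "g5": [],
}
S = select_seed_genes(LF, "dna repair")

# Hypothetical lists LGj produced by a previous run.
LG = {"cell cycle control": ["g3", "g4"], "glycolysis": ["g2"]}
S_next = extend_seeds(S, LF, LG, min_genes=2)
print(sorted(S), sorted(S_next))
```

In practice the resulting sets would still be reviewed by the biologist before being used as seeds.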

The auxiliary subset SF of Genes known to be Related to the Phenomenon under study can be defined by the biologist taking into account primarily the lists LFi of known biological functions previously assigned individually to genes in the microarray dataset. However, lists LGj of genes associated to each biological function, produced from a previously grown network by the Network Functional Analysis activity, can also be used, in addition to other biological criteria.

The maximum network size NM allowed for the final grown network is set according to practical considerations, taking into account that the reliability of the wiring connections to the newly included genes decreases as the number of adjoined gene layers increases, as we will explain later.

The layer growing thresholds TPl and TPmin are usually defined by the biologist based on biological knowledge and experience. TPl is used to indicate the minimal predictability level that a gene must have in order to be automatically included in the next network layer (group of genes adjoined at each constructive step). TPmin is used to indicate the minimal predictability value that a wiring connection must have in order to be considered able to predict the expression value of its target gene. Wiring connections with a predictability value below TPmin will be disregarded. However, genes ranked with predictability levels between TPmin and TPl will be included in the next layer if they belong to the auxiliary subset of genes SF, known to be related to the phenomenon under study. According to these considerations, TPl should always be greater than TPmin.
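The outputs of this activity and the layer-inclusion rule induced by the two thresholds can be summarized in a short sketch. The class and field names are hypothetical, and the treatment of values exactly equal to a threshold (treated here as meeting it) is an assumption, since the text does not settle the boundary cases.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ConstructionSettings:
    seed_genes: Set[str]        # S
    max_network_size: int       # NM, in number of genes
    tp_l: float                 # TPl, layer-growth predictability threshold
    tp_min: float               # TPmin, minimum predictability threshold
    related_genes: Set[str] = field(default_factory=set)  # SF

    def __post_init__(self) -> None:
        if not self.seed_genes:
            raise ValueError("the seed gene subset S must not be empty")
        if self.tp_l <= self.tp_min:
            raise ValueError("TPl should be greater than TPmin")
        if self.max_network_size < len(self.seed_genes):
            raise ValueError("NM cannot be smaller than the seed set")


def include_in_next_layer(gene: str, predictability: float,
                          settings: ConstructionSettings) -> bool:
    """Decide whether a candidate gene joins the next network layer."""
    if predictability >= settings.tp_l:
        return True                               # included automatically
    if predictability >= settings.tp_min:
        return gene in settings.related_genes     # included only if in SF
    return False                                  # below TPmin: disregarded


# Hypothetical settings and candidate genes.
settings = ConstructionSettings(seed_genes={"g1"}, max_network_size=10,
                                tp_l=0.8, tp_min=0.5, related_genes={"g7"})
print(include_in_next_layer("g2", 0.9, settings))
print(include_in_next_layer("g7", 0.6, settings))
print(include_in_next_layer("g2", 0.6, settings))
```

Validating the settings at construction time makes the constraint that TPl exceeds TPmin explicit before any growing run is started.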

A.5. Network construction

Detailed specifications for this activity were given in Section 4.3.

A.6. Network definition

The Network Definition activity aims at obtaining a network description without uncertainty, i.e., without predictability or reliability information, as a tentative biological hypothesis produced by the inference process. The produced description should represent a network consisting of the most relevant and reliable dependencies or predictive relations between genes obtained from the data with the aid of previous biological knowledge.

This activity has as inputs: (1) Tentative Network Description: a tentative network graph, NT, produced by the network construction activity; (2) an optional auxiliary subset, SF, of Genes known to be Related to the Phenomenon under study; and (3) Known Biological Functions: optional lists LFi of functions associated individually to genes in the time-course microarray experiment. This activity produces as output the Defined Network Description: a defined network graph, ND, containing nodes associated to gene identifiers (from GID) and directed and unweighted edges (without uncertainty).

Based on the Tentative Network Description NT, we need to decide which are the most relevant and reliable connections, in order to choose, from all candidate networks that fit into the constructed graph, a final defined network ND that best exploits the information extracted from the data. There are a number of issues that should be taken into account in this process. The tentative network description may include a number of predictive relations with relatively low predictability or reliability values. These relations should be included in the final network description only if they are backed up by well documented biological knowledge. On the other hand, the Tentative Network Description may include a number of new (previously unknown) predictive relations with strong support from the data (high predictability and reliability values). In this case, such relations represent valuable biological hypotheses and should be included in the final network description as well.

This activity is usually carried out manually by the biologist taking into account previous biological knowledge and the tentative network's predictability and reliability information. However, it can also be carried out automatically. In general, we first need to include into the final network description connections with high predictability and reliability values. Then, we need to include those connections with lower predictability and reliability values, provided these connections involve genes known to be related to the phenomenon under study, i.e., genes included in the auxiliary subset SF, or genes that, according to their associated lists LFi, have biological functions related to the network under study. Predictability and reliability thresholds should be used to decide whether a connection should be included in the final network.
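An automatic version of this two-step selection can be sketched as follows. The edge representation, the four threshold parameters and the reading of "involve genes known to be related" as "at least one endpoint is related" are all assumptions made for illustration; the specification leaves these choices open.

```python
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class TentativeEdge(NamedTuple):
    source: str
    target: str
    predictability: float
    reliability: float


def is_related(gene: str, sf: Set[str], lf: Dict[str, List[str]],
               related_functions: Set[str]) -> bool:
    """A gene is related if it is in SF or has a function of interest in LFi."""
    return gene in sf or bool(related_functions.intersection(lf.get(gene, [])))


def define_network(nt: Iterable[TentativeEdge], sf: Set[str],
                   lf: Dict[str, List[str]], related_functions: Set[str],
                   p_high: float, r_high: float,
                   p_low: float, r_low: float) -> Set[Tuple[str, str]]:
    """Return ND as a set of directed, unweighted edges (source, target)."""
    nd: Set[Tuple[str, str]] = set()
    for edge in nt:
        strong = edge.predictability >= p_high and edge.reliability >= r_high
        acceptable = edge.predictability >= p_low and edge.reliability >= r_low
        backed = (is_related(edge.source, sf, lf, related_functions)
                  or is_related(edge.target, sf, lf, related_functions))
        if strong or (acceptable and backed):
            nd.add((edge.source, edge.target))
    return nd


# Hypothetical tentative network and prior knowledge.
NT = [
    TentativeEdge("g1", "g2", 0.9, 0.9),  # strong support from the data
    TentativeEdge("g2", "g3", 0.6, 0.7),  # weaker, but g2 belongs to SF
    TentativeEdge("g3", "g4", 0.6, 0.7),  # weaker and not backed
    TentativeEdge("g1", "g4", 0.3, 0.9),  # below the lower threshold
]
ND = define_network(NT, sf={"g2"}, lf={}, related_functions=set(),
                    p_high=0.8, r_high=0.8, p_low=0.5, r_low=0.5)
print(sorted(ND))
```

The output keeps only the edge endpoints, so the predictability and reliability values are dropped as required for a description without uncertainty.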


A.7. Graphical representation generation

The Graphical Representation Generation activity aims at creating a network graphical representation suitable for visualization by a biologist. This activity has as input a network graph description, either the Tentative Network Description NT or the Defined Network Description ND, and produces as output a Network Graphical Representation, NG, in a standard binary image format, such as JPEG, PNG or GIF. This activity is carried out automatically using, for example, graph visualization software, which takes as input a graph description and creates a corresponding graphical representation suitable for visualization using open or proprietary software solutions.
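As one possible realization, the sketch below serializes a defined network into the Graphviz DOT language, which a graph visualization tool can then render into an image. The choice of DOT is an assumption (the specification does not prescribe a tool or a description format), and the function name to_dot is hypothetical.

```python
from typing import Iterable, Tuple


def _quote(identifier: str) -> str:
    """Quote a gene identifier for DOT, escaping backslashes and quotes."""
    escaped = identifier.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_dot(edges: Iterable[Tuple[str, str]], name: str = "ND") -> str:
    """Serialize directed, unweighted edges as a DOT digraph description."""
    lines = [f"digraph {name} {{"]
    for source, target in sorted(edges):
        lines.append(f"    {_quote(source)} -> {_quote(target)};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical defined network.
ND = {("g1", "g2"), ("g2", "g3")}
dot_text = to_dot(ND)
print(dot_text)
```

If Graphviz is installed, saving the text to a file such as nd.dot and running `dot -Tpng nd.dot -o nd.png` would produce a PNG image that can play the role of NG.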

A.8. Network functional analysis

The Network Functional Analysis activity aims at identifying gene functionalities in the grown network that can be related to the functionalities under study. Genes added to the network by the growing process hold strong regulation dependencies with the seed genes. Therefore, if many of the genes in the grown network share a function different from the ones that characterize the seed, it would be a plausible hypothesis to suppose that the new and the known functions are somehow related.

This activity has as inputs: (1) the Defined Network Description graph ND; (2) Known Biological Functions: lists LFi of known biological functions assigned individually to genes in the microarray dataset; and optionally (3) the Network Graphical Representation NG, as a visual aid for the biologist. This activity produces as output Genes Associated to Biological Functions, i.e., a list LGj of genes (in ND) associated to each biological function fj found in the grown network.

This activity can be carried out either manually or automatically. First, we need to associate to each known biological function found in the network the list of genes in the defined network graph ND that are related to that function. For this purpose, we can use the lists LFi of known biological functions and/or other knowledge sources. Next, we need to rank these functions according to the cardinality of their associated gene lists.
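The two steps (building the lists LGj and ranking the functions by cardinality) can be sketched as follows. The function names and the example data are hypothetical, and ties are broken alphabetically by function name, which is an arbitrary choice made only to keep the ranking deterministic.

```python
from typing import Dict, Iterable, List, Tuple


def genes_by_function(nd_genes: Iterable[str],
                      lf: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Build the lists LGj: genes of ND associated to each function fj."""
    lg: Dict[str, List[str]] = {}
    for gene in sorted(nd_genes):
        for function in lf.get(gene, []):
            lg.setdefault(function, []).append(gene)
    return lg


def rank_functions(lg: Dict[str, List[str]]) -> List[Tuple[str, List[str]]]:
    """Rank functions by decreasing number of associated genes."""
    return sorted(lg.items(), key=lambda item: (-len(item[1]), item[0]))


# Hypothetical genes of the defined network ND and annotations LFi.
nd_genes = {"g1", "g2", "g3", "g4"}
LF = {
    "g1": ["dna repair"],
    "g2": ["dna repair", "glycolysis"],
    "g3": ["cell cycle control"],
    "g4": ["dna repair"],
    "g9": ["glycolysis"],  # not in ND, therefore ignored
}
LG = genes_by_function(nd_genes, LF)
ranking = rank_functions(LG)
for function, genes in ranking:
    print(function, len(genes), genes)
```

Functions near the top of the ranking that are not already associated with the seed genes are the candidates for new hypotheses.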

The ranking of these biological functions may reveal the significant presence in the network of some functions not known to be related to the phenomenon being studied. This may lead to the formulation of new biological hypotheses. These hypotheses may serve to redefine the construction settings for new runs of the growing procedure, and, if persisting, they will require further experimental biological validation in order to become new documented biological knowledge.

References

[1] T. Schlitt, A. Brazma, Current approaches to gene regulatory network modelling, BMC Bioinformatics 8 (2007) S9+.

[2] E.R. Dougherty, T. Akutsu, P.D. Cristea, A.H. Tewfik, Genetic regulatory networks, EURASIP J. Bioinformatics Syst. Biol. 2007 (2007) 2, Article ID 17321.

[3] A. Datta, R. Pal, A. Choudhary, E.R. Dougherty, Control approaches for probabilistic gene regulatory networks, IEEE Signal Process. Mag. 24 (2007) 54–63.

[4] F. Barillet, Genetic improvement for dairy production in sheep and goats, Small Rumin. Res. 70 (2007) 60–75.

[5] J. Altarriba, G. Yague, C. Moreno, L. Varona, Exploring the possibilities of genetic improvement from traceability data: An example in the pirenaica beef cattle, Livest. Sci. 125 (2009) 115–120.

[6] T. Yamada, G. Spangenberg, J.H. Bouton, Molecular breeding to improve forages for use in animal and biofuel production systems, in: Molecular Breeding of Forage and Turf, Springer, New York, 2009, pp. 1–14.

[7] N. Schauer, Y. Semel, U. Roessner, A. Gur, I. Balbo, F. Carrari, T. Pleban, A. Perez-Melis, C. Bruedigam, J. Kopka, L. Willmitzer, D. Zamir, A.R. Fernie, Comprehensive metabolic profiling and phenotyping of interspecific introgression lines for tomato improvement, Nat. Biotechnol. 24 (2006) 447–454.

[8] L. Chen, Z.-X. Zhou, Y.-J. Yang, Genetic improvement and breeding of tea plant (Camellia sinensis) in china: from individual selection to hybridization and molecular breeding, Euphytica 154 (2007) 239–248.

[9] S. Liang, S. Fuhrman, R. Somogyi, Reveal, a general reverse engineering algorithm for inference of genetic network architectures, Pacific Symposium on Biocomputing, 1998, pp. 18–29.

[10] I. Shmulevich, E.R. Dougherty, S. Kim, W. Zhang, Probabilistic boolean networks: a rule-based uncertainty model for gene regulatory networks, Bioinformatics 18 (2002) 261–274.

[11] L.A. Soinov, M.A. Krestyaninova, A. Brazma, Towards reconstruction of gene networks from expression data by supervised learning, Genome Biology 4 (2003) R6.

[12] X. Zhou, X. Wang, R. Pal, I. Ivanov, M. Bittner, E.R. Dougherty, A Bayesian connectivity-based approach to constructing probabilistic gene regulatory networks, Bioinformatics 20 (2004) 2918–2927.

[13] A.A. Margolin, I. Nemenman, K. Basso, C. Wiggins, G. Stolovitzky, R. Dalla Favera, A. Califano, Aracne: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context, BMC Bioinformatics 7 (Suppl. 1) (2006) S7.

[14] P.E. Meyer, K. Kontos, F. Lafitte, G. Bontempi, Information-theoretic inference of large transcriptional regulatory networks, EURASIP J. Bioinformatics Syst. Biol. 2007 (2007) 9, Article ID 79879.

[15] W. Zhao, E. Serpedin, E.R. Dougherty, Inferring connectivity of genetic regulatory networks using information-theoretic criteria, IEEE/ACM Trans. Comput. Biol. Bioinformatics 5 (2008) 262–274.

[16] J. Watkinson, K.C. Liang, X. Wang, T. Zheng, D. Anastassiou, Inference of regulatory gene interactions from expression data using three-way mutual information, Ann. N. Y. Acad. Sci. 1158 (2009) 302–313.

[17] H. de Jong, Modeling and simulation of genetic regulatory systems: literature review, J. Comput. Biol. 9 (2002) 67–103.

[18] N. Friedman, Inferring cellular networks using probabilistic graphical models, Sci. 303 (2004) 799–805.

[19] M. Hecker, S. Lambeck, S. Toepfer, E. van Someren, R. Guthke, Gene regulatory network inference: data integration in dynamic models – a review, Biol. Syst. 96 (2009) 86–103.

[20] H. Hache, H. Lehrach, R. Herwig, Reverse engineering of gene regulatory networks: a comparative study, EURASIP J. Bioinformatics Syst. Biol. 2009 (2009) 12, Article ID 617281.

[21] S. Kumar, J. Dudley, Bioinformatics software for biologists in the genomics era, Bioinformatics 23 (2007) 1713–1717.

[22] T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M.R. Pocock, A. Wipat, P. Li, Taverna: a tool for the composition and enactment of bioinformatics workflows, Bioinformatics 20 (2004) 3045–3054.

[23] D. Hull, K. Wolstencroft, R. Stevens, C. Goble, M.R. Pocock, P. Li, T. Oinn, Taverna: a tool for building and running workflows of services, Nucleic Acids Research 34 (2006) W729–W732.

[24] T. Margaria, C. Kubczak, B. Steffen, Bio-jeti: a service integration, design, and provisioning platform for orchestrated bioinformatics processes, BMC Bioinformatics 9 (2008) S12.

[25] A.L. Lamprecht, T. Margaria, B. Steffen, Bio-jeti: a framework for semantics-based service composition, BMC Bioinformatics 10 (2009) S8.

[26] J. Ríos, J. Karlsson, O. Trelles, Magallanes: a web services discovery and automatic workflow composition tool, BMC Bioinformatics 10 (2009), Article number: 334.

[27] K. Konstantinides, J.R. Rasure, The Khoros software development environment for image and signal processing, IEEE Trans. Image Process. 3 (1994) 243–252.

[28] M. Young, D. Argiro, S. Kubica, Cantata: visual programming environment for the khoros system, SIGGRAPH Comput. Graph. 29 (1995) 22–24.

[29] J. Barrera, R.M. Cesar Jr., D.C. Martins Jr., R.Z.N. Vencio, E.F. Merino, M.M. Yamamoto, F.G. Leonardi, C.A.B. Pereira, H.A. del Portillo, Constructing probabilistic genetic networks of plasmodium falciparum from dynamical expression signals of the intraerythrocytic development cycle, in: P. McConnell, S.M. Lin, P. Hurban (Eds.), Methods of Microarray Data Analysis V, 1st edition, vol. 1, Springer, New York, 2006, pp. 11–26.

[30] R.F. Hashimoto, S. Kim, I. Shmulevich, W. Zhang, M.L. Bittner, E.R. Dougherty, Growing genetic regulatory networks from seed genes, Bioinformatics 20 (2004) 1241–1247.

[31] M. Ris, Representacão de sistemas biologicos a partir de sistemas dinamicos: Controle da transcricão a partir do estrogeno (Biological Systems Representation by Dynamical Systems: Transcription Control from Estrogen), Ph.D. Thesis, Universidade de São Paulo, 2008.

[32] L.A. Lima, M. Ris, J. Barrera, M.M. Brentani, H. Brentani, Computing a predictor set influence zone through a multi-layer genetic network to explore the role of estrogen in breast cancer, Adv. Breast Cancer Res. 1 (2012) 21–29.

[33] G. Vahedi, I. Ivanov, E. Dougherty, Inference of boolean networks under constraint on bidirectional gene relationships, IET Syst. Biol. 3 (2009) 191–202.

[34] A. Djebbari, J. Quackenbush, Seeded Bayesian networks: constructing genetic networks from microarray data, BMC Syst. Biol. 2 (2008) 13.

[35] D. Anastassiou, Computational analysis of the synergy among multiple interacting genes, Mol. Syst. Biol. 3 (2007) 83.

[36] T.M. Cover, J.A. Thomas, Elements of Information Theory, 2nd edition, Wiley-Interscience, 2006.

[37] E.R. Dougherty, S. Kim, Y. Chen, Coefficient of determination in nonlinear signal processing, Signal Process. 80 (2000) 2219–2235.

[38] S. Kim, E.R. Dougherty, I. Shmulevich, K.R. Hess, S.R. Hamilton, J.M. Trent, G.N. Fuller, W. Zhang, Identification of combination gene sets for glioma classification, Mol. Cancer Ther. 1 (2002) 1229–1236.

[39] M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, A.P. Davis, K. Dolinski, S.S. Dwight, J.T. Eppig, M.A. Harris, D.P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J.C. Matese, J.E. Richardson, M. Ringwald, G.M. Rubin, G. Sherlock, Gene ontology: tool for the unification of biology. the gene ontology consortium, Nat. Genet. 25 (2000) 25–29.

[40] H. Ogata, S. Goto, K. Sato, W. Fujibuchi, H. Bono, M. Kanehisa, Kegg: Kyoto encyclopedia of genes and genomes, Nucl. Acids Res. 27 (1999) 29–34.

[41] G. Joshi-Tope, M. Gillespie, I. Vastrik, P. D'Eustachio, E. Schmidt, B. de Bono, B. Jassal, G. Gopinath, G. Wu, L. Matthews, S. Lewis, E. Birney, L. Stein, Reactome: a knowledge-base of biological pathways, Nucl. Acids Res. 33 (2005) D428–D432.

[42] H.v. Vliet, Software Engineering – Principles and Practice, Wiley, New York, 2000.

[43] C. Alexander, The Timeless Way of Building, Oxford University Press, 1979.

[44] C. Alexander, S. Ishikawa, M. Silverstein, A Pattern Language: Towns, Buildings, Construction, Oxford University Press, 1977.

[45] C. Alexander, M. Silverstein, S. Angel, S. Ishikawa, D. Abrams, The Oregon Experiment, Oxford University Press, 1975.

[46] E. Gamma, R. Helm, R. Johnson, J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley Professional, 1995.

[47] F. Buschmann, R. Meunier, H. Rohnert, P. Sommerlad, M. Stal, Pattern-Oriented Software Architecture Volume 1: A System of Patterns, Wiley, New York, 1996.

[48] D. Schmidt, M. Stal, H. Rohnert, F. Buschmann, Pattern-Oriented Software Architecture Volume 2: Patterns for Concurrent and Networked Objects, Wiley, 2000.

[49] F. Marinescu, EJB Design Patterns: Advanced Patterns, Processes, and Idioms, Wiley, 2002.

[50] Object Management Group, Business Process Modeling Notation, V1.2, Needham, 2009. OMG Available Specification.

[51] S. Kim, H. Li, E.R. Dougherty, Can Markov chain models mimic biological regulation? J. Biol. Syst. 10 (2002) 337–357.

[52] A. Datta, A. Choudhary, M.L. Bittner, E.R. Dougherty, External control in Markovian genetic regulatory networks: the imperfect information case, Bioinformatics 20 (2004) 924–930.

[53] C.R.G. de Farias, Architectural Design of Groupware Systems: a Component-Based Approach, Ph.D. Thesis, University of Twente, 2002.

[54] Z. Bozdech, M. Llinás, B.L. Pulliam, E.D. Wong, J. Zhu, J.L. DeRisi, The transcriptome of the intraerythrocytic developmental cycle of plasmodium falciparum, PLoS Biol. 1 (2003), Article number: E5.

[55] C.Y. Lin, A. Ström, V.B. Vega, S.L. Kong, A.L. Yeo, J.S. Thomsen, W.C. Chan, B. Doray, D.K. Bangarusamy, A. Ramasamy, L.A. Vergara, S. Tang, A. Chong, V.B. Bajic, L.D. Miller, J.A. Gustafsson, E.T. Liu, Discovery of estrogen receptor alpha target genes and response elements in breast tumor cells, Genome Biol. 5 (2004), Article number: R66.

[56] D.C. Martins, R.M. Cesar, J. Barrera, W-operator window design by minimization of mean conditional entropy, Pattern Anal. Appl. 9 (2006) 139–153.

[57] M. Ris, J. Barrera, D.C. Martins Jr., U-curve: a branch-and-bound optimization algorithm for u-shaped cost functions on boolean lattices applied to the feature selection problem, Pattern Recognition (2009).

[58] K.-C. Liang, X. Wang, Gene regulatory network reconstruction using conditional mutual information, EURASIP J. Bioinformatics Syst. Biol. 2008 (2008) 14, Article ID 253894.

[59] J. Watkinson, X. Wang, T. Zheng, D. Anastassiou, Identification of gene interactions associated with disease from gene expression data using synergy networks, BMC Syst. Biol. 2 (2008), Article number: 10.

[60] M. Diehn, G. Sherlock, G. Binkley, H. Jin, J.C. Matese, T. Hernandez-Boussard, C.A. Rees, M.J. Cherry, D. Botstein, P.O. Brown, A.A. Alizadeh, Source: a unified genomic resource of functional annotations, ontologies, and gene expression data, Nucl. Acids Res. 31 (2003) 219–223.
























































