Chapter 24

Biological Network Inference and Analysis Using SEBINI and CABIN

Ronald Taylor and Mudita Singhal

Abstract

Attaining a detailed understanding of the various biological networks in an organism lies at the core of the emerging discipline of systems biology. A precise description of the relationships formed between genes, mRNA molecules, and proteins is a necessary step toward a complete description of the dynamic behavior of an organism at the cellular level, and toward intelligent, efficient, and directed modification of an organism. The importance of understanding such regulatory, signaling, and interaction networks has fueled the development of numerous in silico inference algorithms, as well as new experimental techniques and a growing collection of public databases. The Software Environment for BIological Network Inference (SEBINI) has been created to provide an interactive environment for the deployment, evaluation, and improvement of algorithms used to reconstruct the structure of biological regulatory and interaction networks. SEBINI can be used to analyze high-throughput gene expression, protein abundance, or protein activation data via a suite of state-of-the-art network inference algorithms. It also allows algorithm developers to compare and train network inference methods on artificial networks and simulated gene expression perturbation data. SEBINI can therefore be used by software developers wishing to evaluate, refine, or combine inference techniques, as well as by bioinformaticians analyzing experimental data. Networks inferred from the SEBINI software platform can be further analyzed using the Collective Analysis of Biological Interaction Networks (CABIN) tool, which is an exploratory data analysis software that enables integration and analysis of protein–protein interaction and gene-to-gene regulatory evidence obtained from multiple sources.
The collection of edges in a public database, along with the confidence held in each edge (if available), can be fed into CABIN as one “evidence network,” using the Cytoscape SIF file format. Using CABIN, one may increase the confidence in individual edges in a network inferred by an algorithm in SEBINI, as well as extend such a network by combining it with species-specific or generic information, e.g., known protein–protein interactions or target genes identified for known transcription factors. Thus, the combined SEBINI–CABIN toolkit aids in the more accurate reconstruction of biological networks, with less effort, in less time.

A demonstration web site for SEBINI can be accessed from https://www.emsl.pnl.gov/SEBINI/RootServlet. Source code and PostgreSQL database schema are available under open source license. Contact: [email protected]. For commercial use, some algorithms included in SEBINI require licensing from the original developers. CABIN can be downloaded from http://www.sysbio.org/dataresources/cabin.stm. Contact: [email protected].

Jason McDermott et al. (eds.), Computational Systems Biology, vol. 541
© Humana Press, a part of Springer Science+Business Media, LLC 2009
DOI 10.1007/978-1-59745-243-4_24

Key words: Network inference, transcriptional regulatory networks, signal transduction networks, protein–protein interaction networks, and exploratory data analysis.

1. Introduction

Reconstruction of regulatory and signaling networks is a critical task in systems biology (1–9). The “Software Environment for BIological Network Inference” (SEBINI) (10) and the “Collective Analysis of Biological Interaction Networks” (CABIN) (11) analysis frameworks, developed at the U.S. Department of Energy’s Pacific Northwest National Laboratory, aid in the reconstruction of the structure of such mRNA and protein networks.

High-throughput molecular biology experiments are now producing mRNA expression data in quantities large enough for researchers to attempt to reconstruct the structure of gene transcription networks based primarily on state correlation measurements (12–58). In its simplest form, such inference from state correlation could run as follows: if the expression level (state) of gene A is always high across a (large) set of experiments when the mRNA expression level of gene B is low, and low when the level of gene B is high, then we could conclude that there is strong evidence that gene A directly regulates gene B, i.e., there is a regulatory edge directed from A to B, and that gene A represses gene B. Of course, such inference is rarely so simple (59). In the absence of other information, we could conclude with equal probability that gene B represses gene A. Determination of the causal direction is an important subtask in the determination of regulatory network structure. A target gene may have multiple regulators, each source gene may affect a large number of targets, and the experimental data are imperfect. Computational scientists have therefore drawn on knowledge from several fields to build very sophisticated inference algorithms that detect regulatory edges from even partial correlations in state: information theory, e.g., mutual information (60, 61); classical statistics, e.g., the Pearson product-moment correlation coefficient (62); probabilistic graphical models, e.g., Bayesian network structure learning (63–68); and data mining, e.g., association rule mining (69).
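The simplest form of state-correlation inference described above can be sketched in a few lines. This is only an illustration, written in Python for brevity (SEBINI itself is Java-based); the expression values, the threshold, and the function name are hypothetical and are not taken from SEBINI. It also shows why direction cannot be read from correlation alone: the coefficient is the same in both directions.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical expression levels of genes A and B across six experiments:
# A is high whenever B is low, and low whenever B is high.
gene_a = [9.0, 8.5, 1.0, 1.5, 9.2, 0.8]
gene_b = [1.1, 1.4, 8.9, 8.0, 0.9, 9.5]

r_ab = pearson(gene_a, gene_b)
r_ba = pearson(gene_b, gene_a)

THRESHOLD = 0.9  # illustrative cutoff, not a recommended value
edge_kind = None
if abs(r_ab) >= THRESHOLD:
    # Strong negative correlation suggests repression, but the sign says
    # nothing about whether A represses B or B represses A.
    edge_kind = "repression" if r_ab < 0 else "activation"
```

Because `r_ab` equals `r_ba`, extra information (time series, perturbations, or prior knowledge) is needed to orient the edge.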

Large-scale protein activation and protein abundance measurements are soon to follow in the steps of mRNA microarray experiments, allowing similar inference based on state correlation for protein signaling networks and regulatory networks. Further, metabolite networks can be inferred by algorithms that use correlations in large-scale measurements of metabolite levels (D. Wishart, personal communication, 2006). Thus, while reconstruction of metabolic pathways and networks will not be discussed in this chapter, the algorithms in SEBINI may have value in that research area as well. Protein–protein interaction networks are also starting to be inferred from high-throughput data, such as from sets of mass spectrometry bait-prey experiments (70). Inference of protein interactions from bait-prey data sets is not based on state correlation in the same sense that correlation is used in the analysis of microarray experiments to infer transcriptional regulatory edges. But evidence from a set of mass spectrometry experiments can be tied to the set of proteins, uploaded into the SEBINI platform, and passed to an algorithm, the Bayesian Estimator of Protein–Protein Association Probabilities (BEPro) (71–73). BEPro determines the degree of association between the bait and prey proteins based on a Bayesian analysis across the entire set of experiments, and thus determines which baits and prey truly interact, eliminating false positives. In this manner we may construct a set of interaction edges that can be stored as a protein–protein interaction network, which fits comfortably within the SEBINI framework.

In a graphical representation of a transcriptional regulatory network, the genes are the nodes and the edges between the nodes are directed. A set of microarray runs measuring mRNA expression is used as the input data set to the inference algorithm that infers the set of regulatory edges. A protein–protein interaction (PPI) network has proteins as nodes with undirected edges (showing interaction, but no cause-and-effect). For example, a set of bait-prey experiments, as mentioned above, could be used as input to an appropriate analysis algorithm. Signal transduction networks also have proteins as nodes, but the edges are directed, to show causal influence on target protein activation state. The input to an algorithm inferring a signal transduction network would be a set of experiments measuring protein activation and inactivation (e.g., phosphorylation/dephosphorylation) across a set of proteins being scanned for involvement in the signaling network.

SEBINI provides an open source software platform that allows the use of many inference algorithms on many types of data sets. Experimental data in several formats can be uploaded into SEBINI for analysis by the algorithms incorporated into SEBINI’s toolkit. Also, artificial data sets of different types can be created dynamically within SEBINI and used to test a growing collection of inference algorithms. Thus, the SEBINI interactive environment helps researchers (1) easily apply state-of-the-art algorithms to infer a network from experimentally generated high-throughput data and (2) evaluate and refine new algorithms for the inference of biological regulatory and signaling network structure using common data sets.

SEBINI is a framework whose database can store networks of many types, and any algorithm that works on pieces of evidence attached to nodes in a potential network, or on evidence attached to the set of nodes as a whole, can be fitted into the SEBINI platform. The majority of algorithms in SEBINI are directed toward searching for networks of causal influence, where the state of one node affects the state of another node; the best-known example is the transcriptional regulatory network derived from microarray experiments that measure mRNA levels. However, the BEPro algorithm described above has also been added into SEBINI to look at evidence from protein bait-prey experiments. Such evidence allows BEPro to infer interactions (but not regulatory influence) between the bait and the prey proteins, and thereby derive from the experimental results a single protein–protein interaction network. In summary, the SEBINI platform is useful not only in inferring the topology of any network where the change in state of one node can affect the state of other nodes (regulatory networks) but also, more generally, in inferring the topology of any (interaction) network where the evidence for the existence of edges can be uploaded and stored in SEBINI’s central database, where the uploaded data are tied to a network object for later use by an appropriate algorithm.

Focusing on the analysis of microarray experiments, we can note that clustering or some form of statistical classification has typically been employed to organize the high-throughput mRNA expression values derived from microarray experiments (38, 74–78). The question then arises: how can the clustering or classification results be connected to the underlying biology? Such results can be useful for pattern classification, for example, to classify subtypes of cancer or to predict differential responses to a drug (pharmacogenomics). But to understand the relationships between the genes, i.e., to more precisely define the influence of each gene on the others, the scientist usually attempts to reconstruct the transcriptional regulatory network. This can be done by using background literature or information in public databases, combined with the clustering results. It can also be done by the application of an algorithm, often based on a probabilistic graphical model or on an information-theoretic metric, to try to infer the regulatory network from the raw, high-throughput data. This is where the algorithms in SEBINI come into play.

What does the graph of such an inferred regulatory network give you? The structure of a transcriptional regulatory network can be described as a “wiring diagram,” a directed graph whose edges show regulatory influences. Such a diagram describes all the direct and indirect influences on the expression of a gene, and shows what (the product of) a gene can affect (5, 79). Ideally, one would like to have not just such a diagram but also the set of equations governing the behavior (state) of all the nodes of such a network over time. This would allow predictions to be made of the precise temporal behavior of all the (dependent) variables in the system. However, the diagram itself is an extremely important starting point for network analysis. It constrains the possibilities and shows what can affect different aspects of the system. It is the blueprint, the starting point for later investigations of dynamic behavior. Also, the identification of the regulatory connections, or edges, in a network in and of itself answers important questions, and provides a guide to several areas of research. Some examples for transcriptional regulatory networks follow.

1.1. Gene Function Identification

Suppose gene A is known to be involved in a signal transduction or metabolic pathway and gene B, through the network regulatory edge diagram, is seen to directly influence gene A. Then the researcher can assign gene B a possible role: some involvement in that pathway. Also, gene B becomes a candidate for targeted experiments to study the pathway. (Conversely, if gene A is seen to be the direct source of a regulatory influence on a previously unsuspected gene C, then gene C also becomes a possibility for inclusion into the pathway.)

1.2. Identification of Upstream and Downstream Genes

The network diagram shows what genes lie upstream and downstream of each other, in terms of any possible regulatory effect. For any gene represented in the network, all possible sources of regulatory influence on that gene from other genes in the network, both direct and indirect, are explicit in the diagram. Conversely, the direct or indirect effect of a given gene can be found by tracing all possible downstream paths from the gene. Hence, when the transcriptional level of a gene is experimentally altered, the researcher will know how far the effect of such a change may propagate through the biological network under study, and where the effect may be strongest or weakest, based on the length of the path from the altered gene to the downstream gene and on the number of other direct and indirect influences on the downstream gene.
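Tracing upstream and downstream genes amounts to a graph traversal over the directed edges. The following is a minimal sketch using breadth-first search, assuming the network is held as a dictionary from each regulator to its direct targets; the gene names and helper functions are hypothetical. The hop count returned for each gene is the path length that the text suggests as a rough guide to how strongly a change may propagate.

```python
from collections import deque

def reachable(adjacency, start):
    """Return {gene: hop count} for every gene reachable from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    del dist[start]  # report only the other genes
    return dist

def reverse(adjacency):
    """Flip every edge so that targets point back to their regulators."""
    rev = {}
    for src, targets in adjacency.items():
        for tgt in targets:
            rev.setdefault(tgt, []).append(src)
    return rev

# Hypothetical regulatory edges: regulator -> direct targets.
edges = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}

downstream_of_a = reachable(edges, "A")
upstream_of_d = reachable(reverse(edges), "D")
```

Running the same search on the reversed graph gives the upstream regulators, so one traversal routine serves both questions.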

1.3. Target Identification

A researcher might want to modulate a cellular subsystem, such as a signal transduction pathway, with minimal effect on the other subsystems in the cell. Knowledge of the global transcriptional regulatory network provides guidance to the researcher in the selection of genes whose alteration of expression will have the least influence, direct or indirect, on genes lying outside of the pathway.

1.4. Elimination of Irrelevant Genes: Pruning the Putative Network

If there is prior belief that a subset of genes is possibly involved with each other, the availability of the network diagram will allow some genes to be eliminated from consideration. If a gene G cannot directly influence at least one other gene in the set and gene G is not a target of any regulatory connection from any other gene in the set, then gene G can be dropped from membership in the proposed subsystem.
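The pruning rule above can be expressed directly as a filter over the edge set. The sketch below assumes edges are given as (regulator, target) pairs; the gene names and the function name are hypothetical. A gene is kept only if it shares at least one edge, in either direction, with another gene in the candidate set.

```python
def prune(candidates, edges):
    """Keep candidate genes that regulate, or are regulated by,
    at least one other gene in the candidate set."""
    candidates = set(candidates)
    connected = set()
    for src, tgt in edges:
        # Self-loops and edges leaving the set do not count.
        if src in candidates and tgt in candidates and src != tgt:
            connected.add(src)
            connected.add(tgt)
    return connected

# Hypothetical inferred edges (regulator, target).
inferred_edges = {("A", "B"), ("B", "C"), ("X", "D"), ("E", "E")}
kept = prune({"A", "B", "C", "D", "E"}, inferred_edges)
```

Here D is dropped because its only regulator lies outside the set, and E is dropped because it connects only to itself.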

1.5. Speed of Response

The network diagram allows calculation of the average path length between nodes in the full network or in any subgraph. The length may give some indication of the relative speed of a subnetwork in its reaction to a change in the cellular environment.
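As a rough illustration of the average path length calculation, the sketch below runs a breadth-first search from every node of a directed graph and averages the hop counts over all ordered pairs where one node can reach the other. Averaging over reachable pairs only is an assumption made here; the text does not say how unreachable pairs are treated. Function names are hypothetical.

```python
from collections import deque

def hop_counts(adjacency, start):
    """Shortest hop count from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def average_path_length(adjacency):
    """Mean shortest-path length over ordered, reachable node pairs."""
    nodes = set(adjacency)
    for targets in adjacency.values():
        nodes.update(targets)
    total = 0
    pairs = 0
    for node in nodes:
        for other, hops in hop_counts(adjacency, node).items():
            if other != node:
                total += hops
                pairs += 1
    return total / pairs if pairs else 0.0

# Hypothetical three-gene chain A -> B -> C.
chain = {"A": ["B"], "B": ["C"]}
chain_length = average_path_length(chain)
```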

1.6. Identification of Control Genes

The network diagram allows answering questions such as: Are there master control genes in the network? Are there genes that serve as dominant sources of regulatory connections and, thus, act as “hubs” for regulating many other genes? If so, research can be focused on those genes whose state will directly control or affect the state of much of the network.
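One simple way to look for such hubs is to count the distinct direct targets of each gene (its out-degree) and keep the genes above a cutoff. The sketch below assumes (regulator, target) edge pairs; the cutoff, gene names, and function name are hypothetical, and out-degree is only one of several possible hub criteria.

```python
def find_hubs(edges, min_targets):
    """Genes with at least min_targets distinct direct targets,
    ordered from most to fewest targets."""
    targets_by_source = {}
    for src, tgt in edges:
        targets_by_source.setdefault(src, set()).add(tgt)
    hubs = [s for s, t in targets_by_source.items() if len(t) >= min_targets]
    return sorted(hubs, key=lambda s: (-len(targets_by_source[s]), s))

# Hypothetical inferred edges (regulator, target).
regulatory_edges = [
    ("M", "g1"), ("M", "g2"), ("M", "g3"), ("M", "g4"),
    ("N", "g1"), ("N", "g2"),
    ("g1", "g5"),
]
hubs_2 = find_hubs(regulatory_edges, 2)
hubs_3 = find_hubs(regulatory_edges, 3)
```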

The inferred networks obtained from SEBINI can be validated in the CABIN software package (11); that is, interactions or regulatory connections can be verified by combining evidence from public databases such as the Database of Interacting Proteins (DIP) (80–82) and the Biomolecular Interaction Network Database (BIND) (83), and from computational methods such as phylogenetic profiling, Rosetta Stone, gene neighborhood, and homology information (55, 84). Such verification, of course, increases our confidence in a given edge. Networks can also be annotated or extended (edges added) within CABIN. CABIN has been developed as a plug-in to Cytoscape (85), which is an open source network visualization and analysis tool. CABIN is invoked from SEBINI via a button on a web page that launches Cytoscape through Java Web Start, with the appropriate inferred network from SEBINI’s database automatically passed in for visualization and further analysis within Cytoscape and CABIN. Once the Cytoscape window comes up, CABIN can then be selected from the plug-in menu.

CABIN facilitates integrating the evidence of interaction data from multiple sources by the use of interactive visual interfaces. Multiple coordinated views within CABIN foster exploratory data analysis, allowing weighting and filtering of data sources (via slider controls and other easy-to-use controls) to create a final combined network. Use of CABIN permits high-quality human judgment on the integration of complex data sets having different levels of certainty, with limited investment of the user’s time.

2. Software Architecture

SEBINI uses a standard three-tier architecture: (1) a web-based client user interface, (2) an application logic middle tier consisting of a suite of Java servlets, and (3) a relational database storing the data required by the middle tier. Inferred networks, as well as the raw data, processed data (processing may mean binning of microarray data, or peptide-to-protein collapse for mass spec bait-prey data), and algorithm parameter selections used to generate the networks, are permanently stored in the database for visualization, topological and statistical analysis, and later export in a human-readable or program-specific format. Inference algorithms and discretization (binning) or other data-processing algorithms can be any sort of executable program; a Java handler class is added for each new algorithm to handle communication between the invocation web page, the database, and the algorithm. Security is implemented on a project basis, with one owner and possibly multiple users per project. Upon password-protected login to the SEBINI web site, the user is assigned a 32-digit hexadecimal JSessionID, which is checked before display of every web page.
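To make the processing step more concrete, here is a small sketch of what binning followed by a basic mutual information score might look like. It is written in Python for brevity and is not SEBINI's code: equal-width binning, the bin count, and both function names are assumptions chosen for illustration, and SEBINI's actual discretization options are not described in this section.

```python
import math
from collections import Counter

def discretize(values, bins=3):
    """Equal-width binning of continuous values into integer states."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete state sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Hypothetical expression values for two genes across four experiments.
states_a = discretize([1.0, 2.0, 9.0, 10.0], bins=2)
states_b = discretize([8.5, 9.5, 0.5, 1.5], bins=2)
mi_ab = mutual_information(states_a, states_b)
```

Two perfectly anti-correlated binary profiles share one full bit of information, while independent profiles score zero; unlike a correlation coefficient, the score carries no sign.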

Major design issues included (1) the interface for user navigation among possibly huge data sets, allowing easy drill down from a network set to a specific network to a specific node or edge; and (2) producing an efficient, understandable mapping from the inferred networks and inferred edges back to the corresponding original expression or abundance data. Note that we have a one-to-many relationship from a raw uploaded data set to its processed data sets, as well as a one-to-many relationship between a processed data set and the inferred networks and inferred edges created by the selected inference algorithms operating on that processed data set. Records for each of these data types are permanently stored and connected to the appropriate records of the other data types. Note that the processed data sets are permanently stored, in addition to the raw data sets and the inferred networks. This is important for efficiency (reuse of processed data in another inference run), transparency, and verification of results. As a further design decision, all Java inter-servlet communication is routed through a CentralControl servlet, which gives a clear flow of control and a single choke point for monitoring.

Each inference algorithm and each (pre)processing algorithm (preprocessing in the sense that the data are being prepared for input into an inference algorithm) is invoked in a separate Java thread that performs job posting to the database, thus allowing dynamic monitoring of job progress by the user. Jobs are timed to the millisecond, allowing comparison between algorithms of relative speed versus relative power. While an algorithm is running, any web page listing for the newly created processed data set or inferred network will say “under construction”; the processed data set or inferred network becomes available only upon completion of the algorithm run. At any time, the user may check on the progress of a run by bringing up the job page for the processed data set or inferred network and reading the job postings that have been stored so far in the database by the processing program or by the Java wrapper around the selected inference algorithm. The computational and memory requirements for any given inference run will vary dramatically based on the number of nodes involved, the algorithm used and how efficiently it was written, the parameter settings used for a particular algorithm, and the number of data points (gene expression, protein abundance, or protein activation values) uploaded for each node. The time required may be a few seconds on a desktop computer for a mutual information-based algorithm on a network of 20 nodes, given 100 gene expression values for each node, from 100 microarray experiments. Or, if we use a Bayesian network type of algorithm on a set of 1000 nodes using data from several hundred array experiments, the user may set the parameters so that the algorithm will run for days, in order to return a high-quality inferred network.
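The job-posting pattern described above, in which a worker thread records progress messages that a monitoring page can read at any time, can be sketched as follows. This is an illustration in Python rather than SEBINI's Java code: the `JobLog` class stands in for the database table of job postings, and all names are hypothetical.

```python
import threading
import time

class JobLog:
    """Thread-safe stand-in for a database table of job postings."""

    def __init__(self):
        self._lock = threading.Lock()
        self._postings = []

    def post(self, message):
        with self._lock:
            self._postings.append(message)

    def postings(self):
        with self._lock:
            return list(self._postings)

def run_inference_job(log, result):
    """Placeholder job that posts progress and records its run time."""
    start = time.perf_counter()
    log.post("started")
    for step in range(3):
        # A real job would do part of the inference work here.
        log.post(f"step {step + 1} of 3 done")
    result["elapsed_ms"] = (time.perf_counter() - start) * 1000.0
    log.post("finished")

log = JobLog()
result = {}
worker = threading.Thread(target=run_inference_job, args=(log, result))
worker.start()
# A monitoring page could call log.postings() at any point here.
worker.join()
```

Keeping the postings behind a lock lets the monitoring side read a consistent snapshot while the job is still writing.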

Each node and each edge in an inferred network can be viewed on its own web page. Each edge carries with it the raw and processed state values for its two nodes across the entire set of experiments. These values can be viewed by the user, or later output with the inferred network topology, for use in fitting the equations used for dynamic modeling to the experimental data. Each inferred network can be viewed as a table of nodes and edges, or visualized as a graph within Cytoscape and further analyzed or annotated with CABIN. The edges of an inferred network can also be exported in Cytoscape SIF file format.
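For readers unfamiliar with SIF, each line of the file names a source node, a relationship type, and a target node. The sketch below writes a list of edges in that layout using tab delimiters; the gene names, the relationship label "regulates", and the function name are illustrative assumptions, not labels that SEBINI is known to emit.

```python
def to_sif(edges):
    """Render (source, relation, target) triples as SIF text,
    one tab-delimited edge per line."""
    return "".join(
        f"{source}\t{relation}\t{target}\n"
        for source, relation, target in edges
    )

# Hypothetical inferred regulatory edges.
sif_text = to_sif([
    ("geneA", "regulates", "geneB"),
    ("geneA", "regulates", "geneC"),
])
```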

The user may delete an entire project, an uploaded (raw) data set, a processed data set, or an inferred network. Everything previously created downstream of a deleted set is also deleted. For example, all inferred networks created from a given processed data set are deleted, if the processed data set itself is deleted.
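One common way to obtain this downstream deletion behavior in a relational database is a chain of foreign keys declared with ON DELETE CASCADE. The sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL; the table and column names are hypothetical, and whether SEBINI uses cascading constraints or deletes the records in application code is not stated in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE raw_data_set (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE processed_data_set (
    id INTEGER PRIMARY KEY,
    raw_id INTEGER NOT NULL REFERENCES raw_data_set(id) ON DELETE CASCADE
);
CREATE TABLE inferred_network (
    id INTEGER PRIMARY KEY,
    processed_id INTEGER NOT NULL
        REFERENCES processed_data_set(id) ON DELETE CASCADE
);
INSERT INTO raw_data_set VALUES (1, 'microarray upload');
INSERT INTO processed_data_set VALUES (1, 1), (2, 1);
INSERT INTO inferred_network VALUES (1, 1), (2, 1), (3, 2);
""")

def count(table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

# Deleting one processed data set removes only its own inferred networks.
conn.execute("DELETE FROM processed_data_set WHERE id = 1")
networks_after_processed_delete = count("inferred_network")

# Deleting the raw data set removes everything derived from it.
conn.execute("DELETE FROM raw_data_set WHERE id = 1")
processed_after_raw_delete = count("processed_data_set")
networks_after_raw_delete = count("inferred_network")
```

The same one-to-many chain (raw data set, processed data sets, inferred networks) is what makes it possible to trace any inferred edge back to the data that produced it.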

To support response speed in handling potentially huge data sets, and thus potentially some very large tables in the PostgreSQL central database, retrieval of records is almost always based on the primary key of the relevant table. Secondary indices are sometimes used, but even those are kept to a minimum, since each index added has to be updated when new data are uploaded into the system, thus possibly slowing response on the web site. Search and retrieval on unindexed fields is extremely rare and, when done, confined to tables that will stay relatively small over time. Most importantly, each project is given its own separate PostgreSQL database. Thus, a user may at any time start a new project, with a duplicate set of empty tables, and continue from there. All (old) projects remain fully accessible, but the new project allows the user to begin completely anew.

SEBINI was initially implemented on a Dell desktop running Red Hat Linux, using Java ver. 1.5, PostgreSQL ver. 7.4 (86), and Apache Tomcat 4.1 (87). Communication between the Java programs and the PostgreSQL database is done via JDBC. The Jakarta Commons file upload and IO Java libraries are used (http://jakarta.apache.org/commons). SEBINI has also been installed on a Windows 2003 computer and on a Mac running Mac OS X ver. 10.4.8. Machine-specific parameters are stored in an easily changed text file read by use of the java.util.Properties class. MathWorks’ MATLAB (88) is required for some of the inference algorithms, while R (89) is required for others. Note that although the user interface is web-based, SEBINI can run completely self-contained (with CABIN), without any Internet connection, by setting the host machine to “localhost” in the properties text file. A block diagram of the system is shown in Fig. 24.1.

[Figure 24.1, image not reproduced. Blocks shown: Input Module (high-throughput experimental data; dynamic data extraction via the Java JDBC library, e.g., from PNNL’s mass spec PRISM database); Builder Module (simulated high-throughput expression data for artificial networks); SEBINI central relational database (PostgreSQL); user interface (web site operated by Java servlets); collection of network inference algorithms (mutual information-based and Bayesian network structure learning algorithms for regulatory networks; BEPro for protein–protein interaction networks from bait-prey mass spec data sets), with inferred networks stored back into the database; topological statistics, network annotation, post-inference processing, and scoring and error analysis on artificial data sets; graph visualization and analysis of inferred networks via Cytoscape and CABIN; text file output (human-readable reports on inferred networks and machine-readable network structure files for dynamic modeling programs).]

Fig. 24.1. A block diagram of the SEBINI–CABIN system.

Like SEBINI, CABIN is written in Java. It makes use of publicly available Java libraries such as Colt (http://dsd.lbl.gov/~hoschek/colt), JFreeChart (http://www.jfree.org/jfreechart), jMatrixView (http://jmatrixview.sourceforge.net), and BiSlider (https://bislider.dev.java.net) to provide rich visualization and an effective user interface. Once imported into CABIN, evidence networks are stored in a matrix model that keeps a list of the networks and their interactions. This model, provided by the high-performance Colt library, is backed by an optimized two-dimensional sparse matrix, which contains the confidence values of each interaction (row) of each network (column). These data values are visualized in multiple views as scatter plots (JFreeChart), as a heat map matrix representation (jMatrixView), and as Cytoscape networks. Each view references the matrix model and observes any changes in the model, allowing the views to update themselves when networks are imported, removed, or updated. Additionally, a view selection controller serves as an intermediary that notifies each registered view of any data selection events. Data values can be manipulated further using the histogram range slider interface facilitated by the BiSlider and JFreeChart libraries.
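The arrangement described above, a shared sparse matrix of confidence values that views observe for changes, is an instance of the observer pattern. The following Python sketch illustrates the idea only; CABIN itself uses the Java Colt library, and the class and method names here are hypothetical. A dictionary keyed by (interaction, network) plays the role of the sparse matrix, so absent cells cost no storage.

```python
class EvidenceMatrix:
    """Sparse store of confidence values: rows are interactions,
    columns are evidence networks. Listeners are told of each change."""

    def __init__(self):
        self._cells = {}       # (interaction, network) -> confidence
        self._listeners = []   # callbacks standing in for registered views

    def add_listener(self, callback):
        self._listeners.append(callback)

    def set_confidence(self, interaction, network, value):
        self._cells[(interaction, network)] = value
        for callback in self._listeners:
            callback(interaction, network, value)

    def get_confidence(self, interaction, network, default=None):
        return self._cells.get((interaction, network), default)

# A stand-in "view" that simply records the updates it is notified of.
model = EvidenceMatrix()
seen_by_view = []
model.add_listener(lambda i, n, v: seen_by_view.append((i, n, v)))
model.set_confidence(("p1", "p2"), "DIP", 0.9)
```

Because every view registers with the same model, importing or updating a network needs only one write for all views to refresh.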

3. Capabilities

The capabilities provided within SEBINI include:

1. Data Import – Upload of several types of experimental data for input into selected processing and network inference algorithms.

2. Network Inference – There are several choices of inference methods. Currently, SEBINI has algorithms from classical statistics (e.g., Pearson correlation), static and dynamic Bayesian network structure learning algorithms (e.g., the BANJO toolkit at Duke University (90, 91)), and information theory (mutual information-based; e.g., basic no-frills mutual information (60), the ARACNE algorithm at Columbia University (92–95), and the CLR algorithm at Boston University (96, 97)). Also, the Bayesian Estimator of Protein–Protein Association Probabilities (BEPro) algorithm (70–72) has been added for inference of protein–protein interaction networks from bait-prey experiments.

3. Network Storage and Analysis – Inferred networks can be permanently stored and further analyzed. For each network, the user can view a summary page; a topological characteristics and statistics page; a graph visualization using Cytoscape (85), invoked via Java Web Start; summary pages for each node and edge showing the raw (uploaded) and processed node states; and job pages that record how the processing and inference tasks proceeded.

4. Algorithm Comparison – Direct comparison of network inference methods on common synthetic or experimental data sets.

5. Experimental Planning Tool – Using simulated data sets of a given size, SEBINI can report how much of the topology (regulatory connections) of a network each inference method is able to reconstruct. Thus, SEBINI may be useful in making a rough estimate of the number of experiments (data points) required.

6. Network Export – Output of inferred network structures as input to other tools such as CABIN or to tools such as the Systems Biology Workbench (98) (e.g., for dynamical modeling), and export of human-readable reports on the networks, with various topological characteristics noted.
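When the true network is known, as with the artificial networks mentioned in the list above, an inferred edge set can be scored against it. The sketch below computes precision and recall over directed edges. These two measures are an assumption chosen for illustration; the text does not say which scoring measures SEBINI reports, and the edge sets and function name are hypothetical.

```python
def score_edges(true_edges, inferred_edges):
    """Precision and recall of inferred directed edges against a known network."""
    true_edges = set(true_edges)
    inferred_edges = set(inferred_edges)
    correct = len(true_edges & inferred_edges)
    precision = correct / len(inferred_edges) if inferred_edges else 0.0
    recall = correct / len(true_edges) if true_edges else 0.0
    return precision, recall

# Hypothetical artificial network and the output of one inference run.
true_network = {("A", "B"), ("B", "C"), ("C", "D")}
inferred_network = {("A", "B"), ("C", "B"), ("C", "D"), ("A", "D")}
precision, recall = score_edges(true_network, inferred_network)
```

Note that the edge C to B counts as an error here even though B to C is a true edge, because direction matters in a regulatory network. Repeating the scoring on simulated data sets of increasing size gives the kind of rough estimate of required experiments described under the planning capability.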


The functionalities provided within CABIN for visual analysis of multiple interaction networks include:

7. Network Import – Each evidence network (e.g., a network of protein–protein interactions from the BIND database) can be imported into CABIN using the Cytoscape SIF file format. In its simplest form the first and third columns in the SIF file represent proteins and the middle column represents the value for the interaction. The evidence networks can be assigned a reliability score, or weight, based on the confidence in the evidence source. CABIN has provisions to assign custom values to missing evidence for interactions. This missing value can be set to a value between 0 and 1, to the median value for that evidence network, or we can ignore that evidence source for the interaction.

8. Exploratory Analysis – Once loaded, the CABIN environment provides interactive visual interfaces to carry out exploratory analysis of the networks as shown in Fig. 24.2. The Weighted Scaling View window pane displays a point for each interaction in any of the evidence networks, with the value for that interaction represented by the average of the weighted sums of confidence values in all the networks. The Cytoscape Network View window pane provides a network/graph visualization of the networks loaded into CABIN. The Matrix/Table View provides a heat map representation of confidence values of all sources for each interaction, along with the options to sort interactions based on their values. A Scatter Plot Matrix view is also provided, which shows the scatter plots of interactions in all evidence networks with respect to the others. This view facilitates estimating the weight assignment for determining the confidence in the predictions for that source. The four views are coordinated with respect to each other, so selections made in one view are reflected as selections in the other three views.

Fig. 24.2. Use of the CABIN software to validate experimental interactions for Rhodopseudomonas palustris obtained using tandem affinity purification technique bait-prey experiments.

9. Construction of Network Subsets – Filters can be added to select an edge confidence (edge score) cutoff for the edges in an evidence network. If the value for an edge falls below the given cutoff, the edge is omitted from display and any further use. The selection of the filter cutoff value is performed by an easy-to-use slider control while viewing a histogram graph that shows the distribution of the edge values in each network. Edges displayed can also be restricted based on an OR relation or an AND relation amongst the networks; e.g., we may restrict the combined network displayed (and possibly exported later) to those edges that appear in the inferred network passed in from SEBINI and that also appear in either the evidence network from BIND or the evidence network from DIP (SEBINI AND (BIND OR DIP)). Once the filters are set and the update button is clicked, the views are updated based on interactions (edges) that pass the filters. The filtered set of edges can be saved as a new network within CABIN and assigned a confidence of its own.

10. Find/Search Functionality – CABIN has regular-expression-based find/search functionality, which allows the user to look for a specific interaction or select all interactions involving a particular molecule of interest.

11. Export – At any stage of the analysis process, the selected interactions can be saved to a local file for later use.
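The import and weighting scheme described in the Network Import and Exploratory Analysis items can be sketched as follows. The sketch assumes a whitespace-delimited three-column layout (protein, value, protein) as described above, treats interactions as undirected, and computes a weighted average of the confidence values, with the three missing-value policies mentioned in the text. The function names, the `missing` argument, and the sample weights are hypothetical; CABIN's actual formula and file handling may differ.

```python
from statistics import median


def parse_sif(text):
    """Parse 'proteinA value proteinB' lines into {(a, b): value}.

    Pairs are stored with sorted names so (a, b) and (b, a) coincide.
    """
    edges = {}
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that do not match the simple layout
        a, value, b = parts
        edges[tuple(sorted((a, b)))] = float(value)
    return edges


def combined_score(pair, networks, weights, missing="ignore"):
    """Weighted average of the confidence values for one interaction.

    missing: "ignore" drops sources lacking the pair, "median" uses
    that source's median value, and a number in [0, 1] is used as is.
    """
    total = 0.0
    weight_sum = 0.0
    for name, edges in networks.items():
        if pair in edges:
            value = edges[pair]
        elif missing == "ignore":
            continue
        elif missing == "median":
            value = median(edges.values()) if edges else 0.0
        else:
            value = float(missing)
        total += weights[name] * value
        weight_sum += weights[name]
    return total / weight_sum if weight_sum else 0.0


# Hypothetical evidence networks and reliability weights.
networks = {
    "experiment": parse_sif("P1 0.9 P2\nP2 0.4 P3\nP1 0.2 P3"),
    "interolog": parse_sif("P1 0.8 P2"),
}
weights = {"experiment": 1.0, "interolog": 0.5}
score_both = combined_score(("P1", "P2"), networks, weights)
score_ignore = combined_score(("P2", "P3"), networks, weights, missing="ignore")
score_zero = combined_score(("P2", "P3"), networks, weights, missing=0.0)
score_median = combined_score(("P2", "P3"), networks, weights, missing="median")
```

The choice of missing-value policy matters: for the pair (P2, P3), which only the experimental network reports, ignoring the silent source leaves the score at 0.4, whereas treating the absence as a value of 0 pulls the combined score down.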

Additionally, to support algorithm developers, SEBINI also allows:

12. Artificial Data Sets – Topologies (99), perturbations, and node input function definitions can be dynamically created and stored. Boolean value expression sets are currently created. The more sophisticated Java-based SynTreN network generator software is in the process of being added as an additional SEBINI module (100).

13. Step-wise Refinement of Inference Methods – Scoring measures (recall, precision, F-measure) are used to measure performance against the simulated networks with known structure. Thus, supervised training of an inference algorithm is possible on a set of known (simulated) data sets.

14. Well-defined Expansion – Addition of each new inference algorithm, (pre) processing technique, import uploader or export method is coded as a new Java module which fits easily into the already existing framework.

15. Scoring Distributions – As a guide for the interpretation of the scores produced by an inference technique, SEBINI can produce scoring distributions for a given inference method against known networks. Such distributions can then be used to determine appropriate cutoff scores for determination of the existence of a regulatory influence (an edge) to a target gene.
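One way a scoring distribution could be turned into a cutoff is sketched below. Scores produced on a known network are split into those for true edges and those for non-edges, and the lowest cutoff is chosen at which no more than a given fraction of non-edges would still be accepted. This particular selection rule, the function names, and the tolerated fraction are assumptions made for illustration; the text does not specify how SEBINI derives its cutoffs.

```python
def score_distributions(scores, true_edges):
    """Split edge scores into sorted lists for true edges and non-edges."""
    true_scores = sorted(s for e, s in scores.items() if e in true_edges)
    false_scores = sorted(s for e, s in scores.items() if e not in true_edges)
    return true_scores, false_scores


def choose_cutoff(scores, true_edges, max_false_fraction=0.1):
    """Lowest observed score c such that the fraction of non-edges with
    score >= c does not exceed max_false_fraction (hypothetical rule).

    Returns None when no observed score satisfies the constraint.
    """
    _, false_scores = score_distributions(scores, true_edges)
    for c in sorted(set(scores.values())):
        passing_false = sum(1 for s in false_scores if s >= c)
        if not false_scores or passing_false / len(false_scores) <= max_false_fraction:
            return c
    return None


# Hypothetical scores from an inference run on a simulated network
# whose true edges are known.
true_edges = {("A", "B"), ("A", "C")}
scores = {
    ("A", "B"): 0.9, ("A", "C"): 0.7,                    # true edges
    ("B", "C"): 0.3, ("A", "D"): 0.2,                    # non-edges
    ("B", "D"): 0.6, ("C", "D"): 0.1,
}
lenient_cutoff = choose_cutoff(scores, true_edges, max_false_fraction=0.25)
strict_cutoff = choose_cutoff(scores, true_edges, max_false_fraction=0.0)
```

With a tolerance of one non-edge in four the sketch settles on 0.6; with no tolerance it moves up to 0.7, the lowest score above every non-edge.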

4. Analysis Flow of Control

In this section, we walk through the steps one would use in performing an analysis in SEBINI on experimental data, and in using SEBINI to evaluate an inference algorithm on a synthetic data set. We will then examine the experimental analysis of a protein bait-prey experiment set more closely using both SEBINI and CABIN, as a representative test case.

The flow of control on the SEBINI web site for the analysis of experimental data goes as follows:

1) Log into SEBINI. If you have not used the SEBINI site before, you are directed to a registration page to enter your chosen user name and password. Once that is done, you may log in and create a project.

2) Select a project, or create a new project. Typically, a user will log in and select one of the projects to which he or she has access – a list of such is presented on a web page. All work is done within the context of a project, and a project must be selected before doing anything else. All data sets, all nodes, all edges, all networks belong to a particular project and can be accessed from that project only.

3) Create a network container for the data set you are uploading, for the experimental network you are trying to reconstruct. This is a simple task – one enters a name, a short optional description, and selects an uploader method for future use.


4) Upload the experimental mRNA expression or protein abundance data file. Using the uploader method chosen in the previous step, a web page is presented, which allows the user to upload a file or files from his/her local computer. After a file path is selected in a Browse box and the user clicks on “submit”, the data set in this file is parsed, according to the upload method selected, and permanently stored in the database.

5) Select a processing or binning algorithm and run it on the data. The uploaded raw data set can be processed or binned in many ways. For example, for a set of microarray experiments, each gene expression value could be binned (discretized) into two-state Boolean values (on/off), based on a cutoff value set at the halfway point of the min and max of the value range across the set of (already normalized) expression values, across the set of experiments. That is a simple example. Discretization can be done in very complex ways, and the results one obtains for the inferred network can depend strongly upon the binning technique. For data sets being used as input into inference algorithms that do not require binning or other processing (or for data sets that were already binned before being uploaded into SEBINI), the “pass-through” menu option can be used. This option will create a processed data set that is a duplicate of the raw data set. In any case, a processed data set must be created, because only a processed data set is allowed as input into an inference algorithm.

6) Select an inference algorithm, select the values to use for its parameters, and infer a network. This step can be considered the “heart” of SEBINI, since it is where we actually infer a network and store it back into the database. Using a set of web pages, the user selects a processed data set, selects the inference algorithm to employ on that data set, enters values into fields for all the parameters that the algorithm needs (usually, default values are given on-screen), and then clicks on a “submit” button that launches the algorithm, via its Java wrapper program, in a separate Java thread. When the algorithm finishes, the Java wrapper will parse the algorithm’s output and store the inferred edges in the database as the newly inferred network. The edge scores and any other useful information produced by that particular algorithm will also be stored, attached to the inferred edges, or to the parent inferred network record. The inferred network record is the “parent” of its child node and edge records, and contains pointers – unique ids – to them. Likewise, the node and edge records have a field containing the unique id of the parent network record. Thus the user can move from a web page showing a given edge to a page showing information about the parent inferred network by clicking on a button that invokes very fast recall of the network record from the PostgreSQL inferred-network table, based on that unique primary key value – and vice versa.

7) View the resulting inferred network in tabular format, and view details on individual nodes and edges. A web page table is produced, which lists the nodes; and for each node, the edges in which it participates, one edge per row. (Each edge thus appears twice in the table.) The edge scores are given, along with clickable links to pages that give details on the node (the gene or protein), on the edge, and on the node at the other end of the edge. One of two versions of the table is displayed, depending upon whether the inferred edges are directed or undirected. Fig. 24.3 shows a part of such a table for an undirected protein–protein interaction network found in the bacterium Rhodopseudomonas palustris (R. palustris). Also available is a table that simply lists the inferred edges, one row per edge, with the two nodes involved and the edge score. The pages that give full descriptions on the nodes may contain background sequence level information on that gene or protein, if such information has been uploaded into SEBINI. Such a page may also contain clickable links to public databases. For example, we may retrieve the appropriate page from the NCBI Entrez web site for that gene or protein in a new window. As mentioned above, the details page for an edge will show the raw and processed node states (expression levels) for the two nodes involved, across the set of experiments in the data set. These web pages may be printed out as human-readable, permanent reports. Also, some basic topological statistics are reported out for each inferred network – average degree, minimum and maximum degree (undirected networks), minimum and maximum in-degree (directed), minimum and maximum out-degree (directed). Topological calculations are being expanded to include the full node in-degree and out-degree distributions, average clustering coefficient, and possibly other characteristics (graph diameter, characteristic path length, scale free exponent) important for structural analysis (41, 79, 101–103).

Fig. 24.3. The start of a SEBINI table showing the nodes and undirected edges for an inferred protein–protein interaction network for the bacterium Rhodopseudomonas palustris.

8) Export the inferred network, if so desired. The user can generate and store a text file for the inferred network in Cytoscape SIF format on the local client computer. Also, a text file with edges and associated node state values can be produced and passed to dynamic modeling tools.

9) Visualize the inferred network using Cytoscape. A set of Cytoscape files are automatically generated for each inferred network and stored in the directory structure for a project, for that given inferred network. In addition to the primary Cytoscape SIF file, with a listing of the edges and their scores, a large number of Cytoscape node and edge attribute files are generated, to allow the user to annotate the network graph displayed in Cytoscape. The user can invoke Cytoscape to display an inferred network via a button click from a web page listing that network. Cytoscape appears on the client computer via Java Web Start, with the inferred network automatically loaded.

10) Further analyze the inferred network in CABIN. From the Cytoscape plug-in menu, the user can invoke CABIN, which will then allow the user to bring in other public data sets of known edges (interactions) and compare those networks to the inferred network found in SEBINI. Or the user can then employ those data sources to extend the inferred network by various combinations of weighting and filtering. Once an extended or combined network has been created in CABIN, the user may place it back into the SEBINI database for permanent storage as a new inferred network.
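The simple two-state binning described in step 5 can be sketched as below. The sketch assumes that the cutoff is the midpoint between the global minimum and maximum over all genes and experiments, and that a value strictly above the cutoff counts as "on"; both are one reading of the description rather than confirmed SEBINI behavior, and the function names are hypothetical. A pass-through variant that merely copies the raw data is included for comparison.

```python
def bin_boolean(raw):
    """Discretize expression values into 0/1 at the midpoint of the range.

    `raw` maps a gene name to its (already normalized) values across
    experiments. Returns the binned data set and the cutoff used.
    """
    values = [v for series in raw.values() for v in series]
    cutoff = (min(values) + max(values)) / 2.0
    binned = {
        gene: [1 if v > cutoff else 0 for v in series]
        for gene, series in raw.items()
    }
    return binned, cutoff


def pass_through(raw):
    """Create a processed data set that duplicates the raw data set."""
    return {gene: list(series) for gene, series in raw.items()}


# Hypothetical normalized expression values across three experiments.
raw = {
    "geneA": [0.0, 2.0, 8.0],
    "geneB": [10.0, 4.0, 6.0],
}
binned, cutoff = bin_boolean(raw)
copied = pass_through(raw)
```

Here the range is 0 to 10, so the cutoff is 5 and geneA is "on" only in the third experiment. A per-gene midpoint would give different states for the same data, which illustrates why the choice of binning technique can strongly affect the inferred network.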

The SEBINI flow of control for the analysis of algorithms operating on synthetic data sets is quite similar to the task sequence described above for the analysis of experimental data sets. One logs into the SEBINI web site and selects a project in the same manner. However, instead of creating an experimental network container and uploading an experimental data set into it, the user goes to a different set of web pages and creates a synthetic network container. The user then selects methods to build a network topology and a set of corresponding artificial expression data sets for that topology. Once that is done, the flow of control is the same as before – we create a processed data set from the raw (artificial) data set, select an inference algorithm to use, enter values for the algorithm parameters, and run the algorithm on the processed data set. The inferred network can then be viewed in tabular format, as before, or visualized as a graph in Cytoscape. The difference here is that the user can also bring up a web page to view precision, recall, and F-measure statistics that measure how well the inference algorithm performed against the “gold standard”, the known artificial network, in terms of the number of correct and incorrect edges found.
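The three statistics named above have standard definitions, which the following sketch applies to edge sets: precision is the fraction of inferred edges that are in the gold standard, recall is the fraction of gold-standard edges that were inferred, and the F-measure is their harmonic mean. The function name and the sample edges are hypothetical, and the balanced (F1) form of the F-measure is assumed.

```python
def score_inferred_network(inferred, gold, directed=False):
    """Compare inferred edges against a known (gold standard) network.

    Edges are (source, target) pairs; for undirected networks the
    order of the two nodes is ignored.
    """
    def normalize(edge):
        return tuple(edge) if directed else tuple(sorted(edge))

    inferred_set = {normalize(e) for e in inferred}
    gold_set = {normalize(e) for e in gold}
    correct = len(inferred_set & gold_set)    # true positives
    incorrect = len(inferred_set - gold_set)  # false positives
    missed = len(gold_set - inferred_set)     # false negatives

    precision = correct / (correct + incorrect) if inferred_set else 0.0
    recall = correct / (correct + missed) if gold_set else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


# Hypothetical known artificial network and an inferred network.
gold = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
inferred = [("B", "A"), ("B", "C"), ("A", "C")]
undirected_stats = score_inferred_network(inferred, gold)
directed_stats = score_inferred_network(inferred, gold, directed=True)
```

Treating the same edges as directed lowers every score in this example, because the inferred edge from B to A no longer matches the gold-standard edge from A to B.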

At PNNL, algorithms in SEBINI have been used to analyze mRNA microarray and proteomics data coming from the EMSL Grand Challenge in Membrane Biology project (http://mbgc.emsl.pnl.gov/) focused on the study of nitrogen fixation and photosynthesis in the bacterium Cyanothece. However, perhaps a more interesting case study is the use of SEBINI–CABIN in the U.S. Department of Energy’s Genomics:GTL Center for Molecular and Cellular Systems (CMCS) project, which is a joint Oak Ridge National Laboratory (ORNL) / PNNL multi-year collaboration to determine protein complexes and interaction networks in bacteria via mass spectrometry protein bait-prey experiments (http://mippi.ornl.gov/). SEBINI and CABIN now form the backbone of the exploratory analysis pipeline for this project. Evidence for potentially interacting prey proteins is uploaded into SEBINI for each bait experiment. Such evidence comes from the peptides that preliminary analysis via the well-known SEQUEST peptide mass spectra analysis algorithm has assigned to a particular protein. Once such a data set has been uploaded, a processing algorithm is invoked, which “collapses” the data set from possibly multiple pieces of peptide evidence for each protein in each experiment down to a single numeric value for each protein in each experiment. This collapse can be done in many ways, based on the parameters selected. The simplest means is to simply count the number of peptides that SEQUEST reported as evidence for a protein in a given experiment.
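The simplest collapse described above, a peptide count per protein per experiment, might look like the following sketch. The record layout (experiment, protein, peptide), the identifiers, and the optional `distinct` switch for counting each peptide sequence only once are assumptions made for illustration; they are not taken from the SEBINI processing module.

```python
from collections import Counter


def collapse_by_peptide_count(hits, distinct=False):
    """Collapse peptide evidence to one count per (experiment, protein).

    `hits` is an iterable of (experiment_id, protein_id, peptide)
    records. With distinct=True, repeated identifications of the same
    peptide for the same protein and experiment are counted once.
    """
    if distinct:
        hits = set(hits)
    return dict(Counter((experiment, protein) for experiment, protein, _peptide in hits))


# Hypothetical peptide assignments from two bait experiments.
hits = [
    ("bait_01", "prey_A", "AAK"),
    ("bait_01", "prey_A", "GLR"),
    ("bait_01", "prey_A", "GLR"),  # same peptide reported twice
    ("bait_01", "prey_B", "TTK"),
    ("bait_02", "prey_A", "AAK"),
]
counts = collapse_by_peptide_count(hits)
distinct_counts = collapse_by_peptide_count(hits, distinct=True)
```

The resulting single value per protein per experiment is the kind of processed data set that is then passed to an inference algorithm.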

Note that while ORNL data sets are uploaded as text files, the PNNL bait-prey experiment data sets are extracted on-the-fly from an Oracle database. Thus, an uploader program that will dynamically retrieve records (e.g., a set of microarray runs) from a remote database, based on user-selected criteria, can easily be added to the set of SEBINI upload options.

Once the processed data set is created, the user selects the BEPro algorithm mentioned above to infer the set of protein–protein interactions. Such a resulting network is shown in Fig. 24.4. We can invoke Cytoscape and CABIN for further analysis of this inferred network.


Figure 24.2 shows the use of CABIN to validate experimental interactions for R. palustris obtained using the tandem affinity purification technique mass spectrometry experiments at ORNL. For the set of proteins in these interactions, we obtained evidence networks using the phylogenetic profile, gene cluster, gene neighborhood, and Rosetta stone methods of the Prolinks database (84). In addition, we obtained evidence networks using protein information from the interolog and regulog methods from the Bioverse (55, 104). The interolog method predicts an interaction between two proteins if they are both homologs of two proteins known to interact. Known protein interactions are gathered from the databases of experimentally determined protein–protein interactions (e.g., BIND (83), DIP (80–82)) and PSI-BLAST (105) is used to determine similarity between this set and all proteins in a target organism. Regulogs are regulatory interactions inferred by homology. A regulog is predicted by determining the similarity to a known transcription factor (TF) and the TF’s target protein. Finally, the nucleotide similarity in the upstream transcriptional promoter regions is determined and used to filter the regulog predictions: if there are similar promoter sequences, then a regulog is predicted.
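The interolog rule stated above can be sketched in a few lines. The sketch takes the homology assignments as given (in practice they would come from a similarity search such as PSI-BLAST) and predicts an interaction in the target organism for every pair of target proteins whose reference homologs are known to interact. The function name, the data layout, and the exclusion of self-interactions are assumptions for illustration; this is not the Bioverse implementation.

```python
def predict_interologs(known_interactions, homologs):
    """Predict target-organism interactions by homology transfer.

    known_interactions: iterable of (a, b) pairs of interacting
        proteins from reference databases.
    homologs: dict mapping a reference protein to the set of its
        homologs in the target organism.
    """
    predicted = set()
    for a, b in known_interactions:
        for target_a in homologs.get(a, ()):
            for target_b in homologs.get(b, ()):
                if target_a != target_b:  # self-pairs skipped by assumption
                    predicted.add(tuple(sorted((target_a, target_b))))
    return predicted


# Hypothetical reference interactions and homology assignments.
known_interactions = [("ref1", "ref2"), ("ref3", "ref4")]
homologs = {
    "ref1": {"t1"},
    "ref2": {"t2", "t3"},
    "ref3": {"t4"},  # ref4 has no homolog, so no prediction results
}
predicted = predict_interologs(known_interactions, homologs)
```

The second reference interaction produces no prediction because only one of its two partners has a homolog in the target organism, which reflects the "both homologs" condition in the rule.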

Fig. 24.4. The protein–protein interaction network shown here was inferred using the BEPro algorithm, operating on a set of 854 bait-prey experiments for the bacterium Rhodopseudomonas palustris. We see a part of the network graph in the Cytoscape window, along with information on the proteins (nodes) appearing in the Cytoscape attribute browser.


In all, seven interaction networks are assigned a weight based on the confidence in the evidence source and imported into CABIN in this example. As can be seen from Fig. 24.2, the scatter plots show the correlation of the different interaction networks with respect to each other. We can clearly see the good agreement of the interolog predictions with the experimental interactions (points along the diagonal) and the low overlap of the regulog predictions with the other networks (points along the axis). This can be attributed to the fact that protein–protein interactions (and/or interologs) are not expected to overlap with regulatory interactions (and/or regulogs) since the types of interactions are very different. Regulatory interactions act through an intermediate (the promoter region) so the TF and TF target do not need to physically make contact. Only in (probably rare) cases where the protein produced from the TF gene target binds to its TF, generally to inhibit its activity (an auto-regulatory loop), would you see both a protein–protein and regulatory interaction between the same pair of proteins.

Figure 24.2 shows the exploratory analysis process in which we select the interactions (shown in black) that have high evidence in the pulldown experiments as well as the interolog predictions. These interactions are automatically selected in all the other views, showing their corresponding values in those views. Although the different views of the data give a deeper understanding of the multi-source data, the interpretation of an interaction network with more than a few hundred edges becomes difficult in a traditional network/graph-like view. The use of filters helps in sub-setting the data by changing the cutoff for the evidence networks dynamically. As shown in Figure 24.2, the interactions are filtered based on a value greater than 0.2 for the experimental observations. We can see a clear separation of the interactions based on the combined confidence from all the evidence sources in the Weighted Scaling View. Using the functionalities within CABIN, further exploratory data analysis can be carried out to validate the experimental interactions and, on conclusion of the analysis process, the high-confidence interactions can be saved in a local file or back into SEBINI.
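The filtering described here, together with the AND/OR restriction among networks described earlier, can be sketched with plain dictionaries and sets. The sketch keeps an edge when its value is at or above the cutoff, following the earlier statement that edges falling below the cutoff are omitted, and then forms the combination SEBINI AND (BIND OR DIP). The function names and edge values are hypothetical and do not reflect CABIN's internal data structures.

```python
def apply_cutoff(edges, cutoff):
    """Keep only the edges whose value does not fall below the cutoff."""
    return {pair: value for pair, value in edges.items() if value >= cutoff}


def inferred_and_any_evidence(inferred, *evidence_networks):
    """Edges in the inferred network that also appear in at least one
    evidence network, i.e. INFERRED AND (EVIDENCE_1 OR EVIDENCE_2 ...)."""
    supported = set()
    for network in evidence_networks:
        supported |= set(network)
    return set(inferred) & supported


# Hypothetical networks keyed by sorted protein pairs.
sebini = {("P1", "P2"): 0.9, ("P2", "P3"): 0.15, ("P1", "P4"): 0.5}
bind = {("P1", "P2"): 1.0}
dip = {("P1", "P4"): 1.0, ("P2", "P3"): 1.0}

filtered = apply_cutoff(sebini, 0.2)
combined = inferred_and_any_evidence(filtered, bind, dip)
```

The interaction between P2 and P3 is supported by DIP but is dropped because its experimental value of 0.15 falls below the 0.2 cutoff, so the order in which the cutoff and the boolean restriction are applied does not change the result here, but the cutoff itself does.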

5. Summary

The SEBINI–CABIN system offers an open source software platform for biological network inference and analysis. In addition to the large collection of inference algorithms the system makes available, permanent storage of the inferred networks in the network-centric database allows for (1) post-processing and further inference and annotation via CABIN, and (2) further annotation and topological analysis within SEBINI (e.g., topology-based edge additions to an existing core subnet using less stringent cutoffs).

We will continue to add to the capabilities of SEBINI and CABIN: additional inference and processing algorithms, better methods of generating artificial data sets, additional import/export techniques, and additional statistical algorithms for combining evidence from multiple sources. With new algorithms in this field being published on a regular basis, there remain a large number of promising inference algorithms (e.g., (106–112)) to add into SEBINI’s toolkit. (An extensive list of relevant articles, current through mid-2006, many of which discuss new algorithms, can be found in (113)). Also, we are exploring refining or combining algorithms already present in SEBINI for improved results, using the SEBINI–CABIN system ourselves to explore improvement of the algorithms; quickly adding, testing, and comparing variants. Further, we will use our platform to develop expertise on how much data is needed, what are the appropriate parameter settings and cutoffs for each algorithm, what are the weaknesses of a given method compared to others on a common data set, what background information on a genome is most useful to supplement the primary gene expression data for a given algorithm, and so on.

We realize that adding general and genome-specific annotation from the public databases, and using such annotation as possible constraints on the edges that are inferred, is extremely important. (For example, if a gene’s product is a protein that is known to be located in the membrane, that knowledge considerably lessens the probability that any inferred transcriptional regulatory edge coming out of that gene is correct.) Hence we will also be working to add to SEBINI’s capabilities in this area.

It has not escaped our attention that a SEBINI–CABIN site adopted by a large community would provide a network-centric database whose edge records would constitute a resource exceptionally well-suited for investigating edge motifs in signaling, regulatory, and interaction networks (1, 114–116).

“Network biology is only in its infancy” (3). We do not yet know what inference algorithm(s) will perform best for what data sets. Theoretical guidance is lacking. But SEBINI–CABIN positions us to empirically test new algorithms, and easily modify or combine algorithms, while providing biologists much easier access to a growing collection of state-of-the-art algorithms. Moreover, as the high-throughput data sets continue to grow, the SEBINI–CABIN platform will aid in making the inference of network topologies a common starting point for further work in systems biology, such as dynamic modeling, rather than a seldom-reached end point, as is now the case.


Acknowledgments

The research described in this paper was conducted under the Laboratory Directed Research and Development Program at the Pacific Northwest National Laboratory (PNNL), a multiprogram national laboratory operated by Battelle for the U.S. Department of Energy, under Contract DE-AC06-76RL01830. Also, work for SEBINI has been supported by PNNL’s William R. Wiley Environmental Molecular Science Laboratory (EMSL) and the EMSL Grand Challenge in Membrane Biology project, and by the joint ORNL / PNNL collaboration for the Genomes to Life Center for Molecular and Cellular Biology, project # 43930, US Department of Energy.

References

1. U. Alon, An Introduction to Systems Biology – Design Principles of Biological Circuits. Boca Raton: Chapman & Hall/CRC, 2007.

2. E. H. Davidson, The Regulatory Genome – Gene Regulatory Networks in Development and Evolution. Burlington: Elsevier, 2006.

3. A.-L. Barabasi and Z. N. Oltvai, “Network biology: understanding the cell’s functional organization,” Nat. Rev. Genet., vol. 5, pp. 101–13, 2004.

4. A. S. N. Seshasayee, P. Bertone, G. M. Fraser, and N. M. Luscombe, “Transcriptional regulatory networks in bacteria: from input signals to output responses,” Curr. Opin. Microbiol., vol. 9, pp. 511–9, 2006.

5. H. de Jong, “Modeling and simulation of genetic regulatory systems: a literature review,” J. Comput. Biol., vol. 9, pp. 67–103, 2002.

6. J. A. Papin, T. Hunter, B. O. Palsson, and S. Subramaniam, “Reconstruction of cellular signalling networks and analysis of their properties,” Nat. Rev. Mol. Cell Biol., vol. 6, pp. 99–111, 2005.

7. E. P. van Someren, L. F. Wessels, E. Backer, and M. J. Reinders, “Genetic network modeling,” Pharmacogenomics, vol. 3, pp. 507–25, 2002.

8. L. F. Wessels, E. P. van Someren, and M. J. Reinders, “A comparison of genetic network models,” Pac. Symp. Biocomput., vol. 6, pp. 508–519, 2001.

9. C.-H. Yuh, H. Bolouri, and E. H. Davidson, “Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene,” Science, vol. 279, pp. 1896–902, 1998.

10. R. C. Taylor, A. Shah, C. Treatman, and M. Blevins, “SEBINI: software environment for bIological network inference,” Bioinformatics, vol. 21, pp. 2706–8, 2006.

11. M. Singhal and K. Domico, “Collective analysis of biological interaction networks (CABIN),” www.sysbio.org/dataresources/cabin.stm, 2006.

12. M. Bansal, V. Belcastro, A. Ambesi-Impiombato, and D. di Bernardo, “How to infer gene networks from expression profiles,” Mol. Syst. Biol., vol. 3, 2007.

13. V. Filkov, “Identifying gene regulatory networks from gene expression data (Chapter 27),” in Handbook of Computational Molecular Biology. Boca Raton: Chapman & Hall/CRC, 2005.

14. C. L. Barrett and B. O. Palsson, “Iterative reconstruction of transcriptional regulatory networks: an algorithmic approach,” PLoS Comput. Biol., vol. 2, p. e52, 2006.

15. A. J. Butte and I. S. Kohane, “Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements,” Pac. Symp. Biocomput., vol. 5, pp. 418–429, 2000.

16. A. de la Fuente, P. Brazhnik, and P. Mendes, “Linking the genes: inferring quantitative gene networks from microarray data,” Trends Genet., vol. 18, pp. 395–8, 2002.

17. P. D’haeseleer, X. Wen, S. Fuhrman, and R. Somogyi, “Linear modeling of mRNA expression levels during CNS development and injury,” Pac. Symp. Biocomput., vol. 4, pp. 41–52, 1999.

18. N. Friedman, M. Linial, J. Nachman, and D. Pe’er, “Using Bayesian networks to analyze expression data,” J. Comput. Biol., vol. 7, pp. 601–20, 2000.

19. N. Friedman, “Inferring cellular networks using probabilistic graphical models,” Science, vol. 303, pp. 799–805, 2004.

20. F. Gao, B. Foat, and H. Bussemaker, “Defining transcriptional networks through integrative modeling of mRNA expression and transcription factor binding data,” BMC Bioinformatics, vol. 5, p. 31, 2004.

21. T. S. Gardner, D. di Bernardo, D. Lorenz, and J. J. Collins, “Inferring genetic networks and identifying compound mode of action via expression profiling,” Science, vol. 301, pp. 102–5, 2003.

22. F. Geier, J. Timmer, and C. Fleck, “Reconstructing gene-regulatory networks from time series, knock-out data, and prior knowledge,” BMC Syst. Biol., vol. 1, 2007.

23. R. Guthke, U. Moller, M. Hoffmann, F. Thies, and S. Topfer, “Dynamic network reconstruction from gene expression data applied to immune response during bacterial infection,” Bioinformatics, vol. 21, pp. 1626–34, 2005.

24. A. J. Hartemink, D. K. Gifford, T. S. Jaakkola, and R. A. Young, “Combining location and expression data for principled discovery of genetic regulatory network models,” Pac. Symp. Biocomput., vol. 7, pp. 37–43, 2002.

25. D. Husmeier, “Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks,” Bioinformatics, vol. 19, pp. 2271–82, 2003.

26. T. Ideker, V. Thorsson, and R. M. Karp, “Discovery of regulatory interactions through perturbation: inference and experimental design,” in Pacific Symposium on Biocomputing, Hawaii, 2000, pp. 305–16.

27. S. Liang, S. Fuhrman, and R. Somogyi, “REVEAL: a general reverse engineering algorithm for inference of genetic network architectures,” Pac. Symp. Biocomput., vol. 3, pp. 18–29, 1998.

28. L. Mao and H. Resat, “Probabilistic representation of gene regulatory networks,” Bioinformatics, vol. 10, pp. 2258–69, 2004.

29. C. L. Myers, D. Robson, A. Wible, M. A. Hibbs, C. Chiriac, C. L. Theesfeld, K. Dolinski, and O. G. Troyanskaya, “Discovery of biological networks from diverse functional genomic data,” Genome Biol., vol. 6, p. r114, 2005.

30. N. Nariai, Y. Tamada, S. Imoto, and S. Miyano, “Estimating gene regulatory networks and protein-protein interactions of Saccharomyces cerevisiae from multiple genome-wide data,” Bioinformatics, vol. 21, pp. ii206–12, 2005.

31. D. Pe’er, A. Regev, and A. Tanay, “Minreg: inferring an active regulator set,” Bioinformatics, vol. 18, pp. 258S–67S, 2002.

32. D. Pe’er, A. Regev, G. Elidan, and N. Friedman, “Inferring subnetworks from perturbed expression profiles,” Bioinformatics, vol. 17, pp. 215S–24S, 2001.

33. B. E. Perrin, L. Ralaivola, A. Mazurie, S. Bottani, J. Mallet, and F. d’Alche-Buc, “Gene networks inference using dynamic Bayesian networks,” Bioinformatics, vol. 19, pp. ii138–48, 2003.

34. J. Qian, J. Lin, N. M. Luscombe, H. Yu, and M. Gerstein, “Prediction of regulatory networks: genome-wide identification of transcription factor targets from gene expression data,” Bioinformatics, vol. 19, pp. 1917–26, 2003.

35. D. J. Reiss, N. S. Baliga, and R. Bonneau, “Integrated bi-clustering of heterogeneous genome-wide datasets for the inference of global regulatory networks,” BMC Bioinformatics, vol. 7, 10.1186/1471-2105-7-280, 2006.

36. K. Sachs, O. Perez, D. Pe’er, D. A. Lauffenburger, and G. P. Nolan, “Causal protein-signaling networks derived from multiparameter single-cell data,” Science, vol. 308, pp. 523–9, 2005.

37. E. E. Schadt, J. Lamb, X. Yang, J. Zhu, S. Edwards, D. Guhathakurta, S. K. Sieberts, S. Monks, M. Reitman, C. Zhang, P. Y. Lum, A. Leonardson, R. Thieringer, J. M. Metzger, L. Yang, J. Castle, H. Zhu, S. F. Kash, T. A. Drake, A. Sachs, and A. J. Lusis, “An integrative genomics approach to infer causal associations between gene expression and disease,” Nat. Genet., vol. 37, pp. 710–7, 2005.

38. M. Schena, D. Shalon, R. W. Davis, and P. O. Brown, “Quantitative monitoring of gene expression patterns with a complementary DNA microarray,” Science, vol. 270, pp. 467–70, 1995.

39. E. Segal, R. Yelensky, and D. Koller, “Genome-wide discovery of transcriptional modules from DNA sequence and gene expression,” Bioinformatics, vol. 19, pp. i264–72, 2003.


40. E. Segal, B. Taskar, A. Gasch, N. Friedman, and D. Koller, “Rich probabilistic models for gene expression,” Bioinformatics, vol. 17, pp. S243–52, 2001.

41. V. A. Smith, E. D. Jarvis, and A. J. Hartemink, “Evaluating functional network inference using simulations of complex biological systems,” Bioinformatics, vol. 18, pp. S216–24, 2002.

42. L. Soinov, M. A. Krestyaninova, and A. Brazma, “Toward reconstruction of gene networks from expression data by supervised learning,” Genome Biol., vol. 4, p. R6, 2003.

43. Y. Tamada, S. Kim, H. Bannai, S. Imoto, K. Tashiro, S. Kuhara, and S. Miyano, “Estimating gene networks from gene expression data by combining Bayesian network model with promoter element detection,” Bioinformatics, vol. 19, pp. ii227–36, 2003.

44. R. C. Taylor, “Reconstruction of metabolic and genetic networks from gene expression perturbation data using a Boolean model: construction of a simulation testbed and an empirical exploration of some of the limits,” Doctoral Dissertation, George Mason University, Fairfax, VA, 2003.

45. O. G. Troyanskaya, K. Dolinski, A. B. Owen, R. B. Altman, and D. Botstein, “A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae),” Proc. Natl. Acad. Sci. U.S.A., vol. 100, pp. 8348–53, 2003.

46. A. Wagner, “Estimating coarse gene network structure from large-scale gene perturbation data,” Genome Res., vol. 12, pp. 309–15, 2002.

47. A. Wagner, “How to reconstruct a large genetic network from n gene perturbations in fewer than n(2) easy steps,” Bioinformatics, vol. 17, pp. 1183–97, 2001.

48. A. Wagner, “Reconstructing pathways in large genetic networks from genetic perturbations,” J. Comput. Biol., vol. 11, pp. 53–60, 2004.

49. Y. Wang, T. Joshi, X. Zhang, D. Xu, and L. Chen, “Inferring gene regulatory networks from multiple microarray datasets,” Bioinformatics, vol. 22, pp. 2413–20, 2006.

50. D. C. Weaver, C. T. Workman, and G. D. Stormo, “Modeling regulatory networks with weight matrices,” Pac. Symp. Biocomput., vol. 4, pp. 112–123, 1999.

51. A. V. Werhli, M. Grzegorczyk, and D. Husmeier, “Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical Gaussian models and Bayesian networks,” Bioinformatics, vol. 22, pp. 2523–31, 2006.

52. C. J. Wolfe, I. S. Kohane, and A. J. Butte, “Systematic survey reveals general applicability of ‘guilt-by-association’ within gene coexpression networks,” BMC Bioinformatics, vol. 6, p. 227, 2005.

53. C.-C. Wu, H.-C. Huang, H.-F. Juan, and S.-T. Chen, “GeneNetwork: an interactive tool for reconstruction of genetic networks using microarray data,” Bioinformatics, vol. 18, pp. 3691–3, 2004.

54. B. Xing and M. J. Van der Laan, “A statistical method for constructing transcriptional regulatory networks using gene expression and sequence data,” J. Comput. Biol., vol. 12, pp. 229–46, 2005.

55. H. Yu, V. A. Smith, P. P. Wang, A. J. Hartemink, and E. D. Jarvis, “Advances to Bayesian network inference for generating causal networks from observational biological data,” Bioinformatics, vol. 20, pp. 3594–603, 2004.

56. W. Zhao, E. Serpedin, and E. R. Dougherty, “Inferring gene regulatory networks from time series data using the minimum description length principle,” Bioinformatics, vol. 22, pp. 2129–35, 2006.

57. X. Zhou, X. Wang, R. Pal, I. Ivanov, M. Bittner, and E. R. Dougherty, “A Bayesian connectivity-based approach to constructing probabilistic gene regulatory networks,” Bioinformatics, vol. 20, pp. 2918–27, 2004.

58. M. Stetter, G. Deco, and M. Dejori, “Large-scale computational modeling of genetic regulatory networks,” Artif. Intell. Rev., vol. 20, pp. 75–93, 2003.

59. T. Chu, C. Glymour, R. Scheines, and P. Spirtes, “A statistical problem for inference to regulatory structure from associations of gene expression measurements with microarrays,” Bioinformatics, vol. 19, pp. 1147–52, 2003.

60. T. M. Cover and J. A. Thomas, Elements of Information Theory, 1st ed. New York: John Wiley & Sons, 1991.

61. C. O. Daub, R. Steuer, J. Selbig, and S. Kloska, “Estimating mutual information using B-spline functions – an improved similarity measure for analysing gene expression data,” BMC Bioinformatics, vol. 5, 2004.

62. Wikipedia, “Pearson product-moment correlation coefficient,” Wikipedia entry, 2006.

63. K. Murphy, “Bayes net toolbox for MATLAB (open source project on SourceForge),” 2006.

64. F. V. Jensen, An Introduction to Bayesian Networks. New York: UCL Press Limited, 1996.

65. M. I. Jordan, Learning in Graphical Models, 1st ed. Cambridge: MIT Press, 1998.

66. R. E. Neapolitan, Learning Bayesian Networks. Upper Saddle River: Pearson Education Inc., 2004.

67. D. Pe’er, “Bayesian network analysis of signaling networks: a primer,” Science STKE, on-line primer, 2005.

68. S. G. Bottcher and C. Dethlefsen, “Deal: a package for learning Bayesian networks,” J. Stat. Softw., vol. 8, p. i20, 2003.

69. J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Diego: Morgan Kaufmann Publishers, 2001.

70. Oak Ridge National Laboratory and Pacific Northwest National Laboratory, “Microbial Protein-Protein Interactions (MiPPI) project,” project web site: http://mippi.ornl.gov, 2007.

71. J. L. Sharp, K. K. Anderson, D. S. Daly, D. L. Auberry, W. R. Cannon, A. M. White, and V. Kery, “Inferring protein-protein associations with affinity isolation LC-MS/MS assays,” J. Proteome Res., vol. 6(9), pp. 3788–95, 2007.

72. M. A. Gilchrist, L. A. Salter, and A. Wagner, “A statistical framework for combining and interpreting proteomic datasets,” Bioinformatics, vol. 20, pp. 689–700, 2004.

73. J. Gilmore, D. L. Auberry, A. M. White, J. L. Sharp, K. K. Anderson, and D. S. Daly, Bayesian Estimator of Protein-Protein Association Probabilities (BEPro) web site: http://www.pnl.gov/statistics/bepro3/index.htm, 2006.

74. P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. Eisen, P. O. Brown, D. Botstein, and B. Futcher, “Comprehensive identification of cell cycle regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Mol. Biol. Cell, vol. 9, pp. 3273–97, 1998.

75. M. Eisen, P. Spellman, P. O. Brown, and D. Botstein, “Cluster analysis and display of genome-wide expression patterns,” Proc. Natl. Acad. Sci. U.S.A., vol. 95, pp. 14863–8, 1998.

76. B. S. Everitt, Cluster Analysis, 3rd ed. New York: Arnold, 1993.

77. P. Baldi and G. W. Hatfield, DNA Microarrays and Gene Expression, 1st ed. Cambridge: Cambridge University Press, 2002.

78. T. Speed, Statistical Analysis of Gene Expression Microarray Data. Boca Raton: Chapman & Hall, 2003.

79. A. V. Lukashin, M. E. Lukashev, and R. Fuchs, “Topology of gene expression networks as revealed by data mining and modeling,” Bioinformatics, vol. 19, pp. 1909–16, 2003.

80. I. Xenarios, E. Fernandez, L. Salwinski, X. J. Duan, M. J. Thompson, E. M. Marcotte, and D. Eisenberg, “DIP: the database of interacting proteins: 2001 update,” Nucleic Acids Res., vol. 29, pp. 239–41, Jan. 1, 2001.

81. I. Xenarios, D. W. Rice, L. Salwinski, M. K. Baron, E. M. Marcotte, and D. Eisenberg, “DIP: the database of interacting proteins,” Nucleic Acids Res., vol. 28, pp. 289–91, Jan. 1, 2000.

82. I. Xenarios, L. Salwinski, X. J. Duan, P. Higney, S. M. Kim, and D. Eisenberg, “DIP: the database of interacting proteins: a research tool for studying cellular networks of protein interactions,” Nucleic Acids Res., vol. 30, pp. 303–5, Jan. 1, 2002.

83. G. Bader and C. W. Hogue, “BIND: a data specification for storing and describing biomolecular interactions, molecular complexes and pathways,” Bioinformatics, vol. 16, pp. 465–77, 2000.

84. P. Bowers, M. Pellegrini, M. Thompson, J. Fierro, T. Yeates, and D. Eisenberg, “Prolinks: a database of protein functional linkages derived from coevolution,” Genome Biol., vol. 5, p. R35, 2004.

85. P. Shannon, A. Markiel, O. Ozier, N. S. Baliga, J. T. Wang, D. Ramage, N. Amin, B. Schwikowski, and T. Ideker, “Cytoscape: a software environment for integrated models of biomolecular interaction networks,” Genome Res., vol. 13, pp. 2498–504, 2003.

86. PostgreSQL Global Development Group, PostgreSQL web site: http://www.postgresql.org

87. Apache Software Foundation, Apache Tomcat web site: http://tomcat.apache.org, 2007.

88. MathWorks, MATLAB home web site, 2007.

89. R. Gentleman and R. Ihaka, “The R project for statistical computing,” R home web site: http://www.r-project.org, 1997.

90. A. J. Hartemink, “Bayesian network inference with Java Objects (Banjo),” Banjo web site at Duke, 2005.

91. A. Hartemink, “Banjo: Bayesian network inference with Java Objects,” web site: http://www.cs.duke.edu/~amink/software/banjo/, 2005.

92. A. A. Margolin, K. Wang, W. K. Lim, M. Kustagi, I. Nemenman, and A. Califano, “Reverse engineering cellular networks,” Nat. Protoc., vol. 1, pp. 663–72, 2006.

93. A. A. Margolin, I. Nemenman, K. Basso, C. Wiggins, G. Stolovitzky, R. Dalla-Favera, and A. Califano, “ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context,” BMC Bioinformatics, vol. 7, pp. S1–7, 2006.

94. A. J. Hartemink, “Reverse engineering gene regulatory networks,” Nat. Biotech., vol. 23, pp. 554–5, 2005.

95. MDeC Bioinformatics core facility at the Columbia Genome Center, “ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks),” ARACNE Algorithm Download Site: http://amdec-bioinfo.cu-genome.org/html/ARACNE.htm, 2006.

96. J. J. Faith, B. Hayete, J. T. Thaden, I. Mogno, J. Wierzbowski, G. Cottarel, S. Kasif, J. J. Collins, and T. S. Gardner, “Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles,” PLoS Biol., vol. 5, pp. 54–66, 2007.

97. Gardner lab, “Context likelihood of relatedness (CLR) algorithm,” CLR Algorithm Download Site: http://gardnerlab.bu.edu, 2006.

98. H. Sauro, “Systems biology workbench sys-bio.org web site,” SBW home web site, 2006.

99. P. Mendes, W. Sha, and K. Ye, “Artificial gene networks for objective comparison of analysis algorithms,” Bioinformatics, vol. 19, pp. ii122–9, 2003.

100. T. Van den Bulcke, K. L. Van Leemput, B. Naudts, P. van Remortel, M. Hongwu, A. Verschoren, B. De Moor, and K. Marchal, “SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms,” BMC Bioinformatics, vol. 7, 2006.

101. D. Tuck, H. Kluger, and Y. Kluger, “Characterizing disease states from topological properties of transcriptional regulatory networks,” BMC Bioinformatics, vol. 7, p. 236, 2006.

102. A. Vazquez, R. Dobrin, D. Sergi, J.-P. Eckmann, Z. N. Oltvai, and A.-L. Barabasi, “The topological relationship between the large-scale attributes and local interaction patterns of complex networks,” Proc. Natl. Acad. Sci. U.S.A., vol. 101, pp. 17940–5, 2004.

103. A. Wagner and D. Fell, “The small world inside large metabolic networks,” Proc. Roy. Soc. Lond. Ser. B, vol. 268, pp. 1803–10, 2001.

104. J. McDermott and R. Samudrala, “Bioverse: functional, structural, and contextual annotation of proteins and proteomes,” Nucleic Acids Res., vol. 31, pp. 3736–7, 2003.

105. S. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman, “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Res., vol. 25, pp. 3389–402, 1997.

106. M. Bansal, G. D. Gatta, and D. di Bernardo, “Inference of gene regulatory networks and compound mode of action from time course gene expression profiles,” Bioinformatics, vol. 22, pp. 815–22, 2006.

107. R. Bonneau, D. J. Reiss, P. Shannon, M. Facciotti, L. Hood, N. S. Baliga, and V. Thorsson, “The Inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo,” Genome Biol., vol. 7, 2006.

108. X.-W. Chen, G. Anantha, and X. Wang, “An effective structure learning method for constructing gene networks,” Bioinformatics, vol. 22, pp. 1367–74, 2006.

109. A. de la Fuente, N. Bing, I. Hoeschele, and P. Mendes, “Discovery of meaningful associations in genomic data using partial correlation coefficients,” Bioinformatics, vol. 20, pp. 3565–74, 2004.

110. B. Hayete, T. S. Gardner, and J. J. Collins, “Size matters: network inference tackles the genome scale,” Mol. Syst. Biol., vol. 3, 2007.

111. J. J. Rice, Y. Tu, and G. Stolovitzky, “Reconstructing biological networks using conditional correlation analysis,” Bioinformatics, vol. 21, pp. 765–73, 2004.

112. M. Zou and S. D. Conzen, “A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course data,” Bioinformatics, vol. 21, pp. 71–9, 2005.

113. F. Markowetz, “A bibliography on learning causal networks of gene interactions (July 31, 2006),” Princeton University, Lewis-Sigler Institute of Integrative Genomics, 2005.

114. R. Milo, S. S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, “Network motifs: simple building blocks of complex networks,” Science, vol. 298, pp. 824–827, 2002.

115. F. Schreiber and H. Schwobbermeyer, “MAVisto: a tool for the exploration of network motifs,” Bioinformatics, vol. 21, pp. 3572–4, 2005.

116. S. S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, “Network motifs in the transcriptional regulation network of Escherichia coli,” Nat. Genet., vol. 31, pp. 64–8, 2002.
