
Integrating data to acquire new knowledge: Three modes of integration in plant science

Date post: 18-Dec-2016
Sabina Leonelli

Department of Sociology, Philosophy and Anthropology & Egenis, University of Exeter, Byrne House, St Germans Road, EX4 4PJ Exeter, UK
E-mail address: [email protected]

Studies in History and Philosophy of Biological and Biomedical Sciences xxx (2013) xxx–xxx
Contents lists available at SciVerse ScienceDirect
Journal homepage: www.elsevier.com/locate/shpsc
1369-8486/$ - see front matter © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.shpsc.2013.03.020
Please cite this article in press as: Leonelli, S. Integrating data to acquire new knowledge: Three modes of integration in plant science. Studies in History and Philosophy of Biological and Biomedical Sciences (2013), http://dx.doi.org/10.1016/j.shpsc.2013.03.020

Article info

Article history: Available online xxxx
Keywords: Data; Integration; Plant biology; Translational research; Model organisms; Scientific knowledge

Abstract

This paper discusses what it means and what it takes to integrate data in order to acquire new knowledge about biological entities and processes. Maureen O'Malley and Orkun Soyer have pointed to the scientific work involved in data integration as important and distinct from the work required by other forms of integration, such as methodological and explanatory integration, which have been more successful in captivating the attention of philosophers of science. Here I explore what data integration involves in more detail and with a focus on the role of data-sharing tools, like online databases, in facilitating this process; and I point to the philosophical implications of focusing on data as a unit of analysis. I then analyse three cases of data integration in the field of plant science, each of which highlights a different mode of integration: (1) inter-level integration, which involves data documenting different features of the same species, aims to acquire an interdisciplinary understanding of organisms as complex wholes and is exemplified by research on Arabidopsis thaliana; (2) cross-species integration, which involves data acquired on different species, aims to understand plant biology in all its different manifestations and is exemplified by research on Miscanthus giganteus; and (3) translational integration, which involves data acquired from sources within as well as outside academia, aims at the provision of interventions to improve human health (e.g. by sustaining the environment in which humans thrive) and is exemplified by research on Phytophthora ramorum. Recognising the differences between these efforts sheds light on the dynamics and diverse outcomes of data dissemination and integrative research; and the relations between the social and institutional roles of science, the development of data-sharing infrastructures and the production of scientific knowledge.

© 2013 Elsevier Ltd. All rights reserved.

When citing this paper, please use the full journal title Studies in History and Philosophy of Biological and Biomedical Sciences

1. Introduction

The so-called 'data deluge', caused by the overwhelming quantity of information available to scientists through new technologies for the production, storage and dissemination of data, keeps making headlines.1 Perhaps unsurprisingly, Microsoft researchers have taken the lead in dubbing data-intensive approaches as a brand new approach to scientific research (Hey, Tansley, & Tolle, 2009). Equally unsurprising is the position of scholars in the history, philosophy and social studies of science, who are taking a more cautious stand on both the novelty and the revolutionary potential of these developments (see for instance Bowker, 2000; Lenoir, 1999; Callebaut, 2012 and other papers in the same special issue). Enthusiasts and skeptics tend to agree, however, that digital technologies are fuelling remarkable developments within most scientific fields, including new ways to mine large datasets to extract meaningful patterns; and that philosophers of science ought to take notice and carefully examine their epistemic characteristics and possible implications. This paper contributes to this effort by focusing on a central aspect of the data-intensive approach: the integration of large and diverse datasets in order to acquire new knowledge.

1 See the recent special themed issues of Nature (4 September 2008), Science (11 February 2011), the Economist (27 February 2011), and recurring coverage of the 'data deluge' in these magazines as well as the New York Times, the Wired Magazine and most national newspapers worldwide. The latest instantiation of these discussions is the report released by the Royal Society in June 2012 on 'open data', which takes a very strong stance on the importance of 'intelligent' open access, a position that many funding bodies, especially in Europe and the United States, are taking seriously (Royal Society, 2012).

Data vary greatly in their format and availability; in the ways in which they have been produced and the materials from which they have been extracted; in the geographical sites, temporal scales and epistemic goals of the scientists generating them; and, most trivially perhaps, in the objects and processes that they can be taken to document. Integrative research efforts need to bridge across these multiple dimensions, by bringing together data obtained in a variety of different settings so that they can be analyzed together and brought to bear on common questions. Data integration thus requires extensive scientific labor, including the development of apposite infrastructures, analytic tools, standards, methods and models. In their recent analysis of integration in molecular systems biology, Maureen O'Malley and Orkun Soyer point to the scientific work involved in data integration as important and distinct from the work required by other forms of integration, such as methodological and explanatory integration, which have been more successful in captivating the attention of philosophers of science (O'Malley & Soyer, 2012; see also O'Malley, 2013). This paper extends their argument by looking in more detail at what data integration involves and pointing to the implications of focusing on data as a unit of analysis (Sections 2 and 3). I then focus on specific practices of data sharing and re-use in plant science, and argue that focusing on the dynamics of data integration makes it possible to identify at least three distinct modes of integration at work in contemporary biology, which often co-exist within the same laboratory, but whose competing demands and goals cannot usually be accommodated to an equal extent within any one research project (Section 4). These three modes are: (1) inter-level integration, involving the assembling and interrelation of results applying to different levels of organization within the same species, with the primary aim to improve on existing knowledge of its biology; (2) cross-species integration, involving the comparison and co-construction of research on different species, again with the primary aim to widen existing biological knowledge; and (3) translational integration, involving the use of data from a wide variety of different sources in order to devise new forms of intervention on organisms which will improve human health (for instance through agricultural interventions, which are arguably as relevant as medical interventions in fostering human health, though plant biology typically receives less attention than medical research).

Paying attention to the differences and interplay between these modes of integration illuminates the mechanics and challenges of making data not only accessible online but also usable to the scientific community; the large amount of conceptual and material scaffolding needed to transform data available online into new scientific knowledge; and the different forms of scientific knowledge that may result from processes of data integration, depending on which communities, infrastructures and institutions are involved in scientific research. My analysis also underscores the importance of considering the whole spectrum of scientific activities, including so-called 'applied' research carried out by industry or governmental agencies, in order to develop and improve current philosophical understandings of scientific epistemology. As also stressed by Brigandt (2013) and Bechtel (2013) with reference to mechanistic explanations, any one component of scientific research (whether data, models or explanations) can potentially contribute to enhancing opportunities for biomedical as well as environmental and agricultural interventions, all of which are of potential value to the preservation and improvement of human health. I will argue that taking the potential social impact of scientific research into account has implications for how the philosophy of science makes sense of the different strategies that scientists may develop when pursuing a research programme.

I shall base my philosophical analysis on historical and ethnographic research that I carried out in the areas of model organism biology, bioinformatics and plant science over the last eight years (documented in detail in Leonelli, 2010a, 2012a; Leonelli & Ankeny, 2012). In particular, I will use three case studies in contemporary plant science as exemplary for the forms of integration that I wish to discuss: (1) the research activities centered around data gathered on model organism Arabidopsis thaliana; (2) the efforts to integrate Arabidopsis data with data gathered on the perennial crop family Miscanthus; and (3) current investigations of the biology of Phytophthora ramorum, a plant parasite that is wreaking havoc in the forests of the South-West of the UK, where I live.

Focusing on plant biology is an important choice here for several reasons. First, despite its enormous scientific and social importance, this is a relatively small area of research, especially in comparison to biomedicine, and has received very little attention within the philosophy of biology.2 It is also heavily funded by governmental agencies, particularly when it comes to research relating to molecular and genomic aspects of plants, a fact that enhanced opportunities for plant scientists to work on foundational questions, as well as pushing them to join forces and collaborate in order to attract the attention of sponsors and make the best of limited resources (Leonelli, 2007). Plant science has indeed been open to generalist and interdisciplinary thinking throughout its history, as even the most reductionist plant scientists tend to be greatly interested in the relations between molecular biology, cellular mechanisms, developmental biology, ecology and evolution (Botanical Society of America, 1994; Browne, 2001). Further, pre-publication exchanges have been strongly encouraged within this community, particularly again at the time of the introduction of molecular analysis on plants, which was spearheaded by a group of charismatic individuals who explicitly aimed to advance knowledge through open collaboration, thus effectively predating today's 'open science' movement (Rhee, 2004; Koornneef & Meinke, 2010). These factors have made the community of plant scientists into a relatively more cohesive and collaborative one than more powerful, socially visible and well-funded fields focusing on animal models, such as cancer research or immunology.3 Given this background, it is not surprising that plant science has produced some of the best available resources for scientific data management and integration to date. Plant scientists' interest in working together, and thus in finding efficient ways to assemble and disseminate their resources and results, long precedes the advent of digital technologies for data sharing, and many of these scientists were quick to seize the potential provided by those technologies to help them in their integrative efforts. As a result, plant science has at its disposal some of the most sophisticated databases and modeling tools in biology (The International Arabidopsis Informatics Consortium, 2012). Its contributions to systems biology are also very advanced, particularly in fostering the development of 'digital organisms' (i.e. the use of mathematical models and simulations to integrate qualitative data in order to predict organismal behavior and traits in relation to the environment).

2 Historical work by the likes of Harwood (2012), Müller-Wille (2007), Smocovitis (2006) and Kingsbury (2009) clearly shows the key role played by plant scientists in the development of several branches of biology, including evolutionary theory and genetics.

3 Of course this does not mean that plant science is a homogeneous field, or that there are no tensions and non-overlapping programmes at play within it. A major problem is the historical separation between agricultural and molecular approaches to plant science, which came into effect in the second half of the 20th century (Leonelli, Charnley, Webb, & Bastow, 2012). I also do not mean to assert that there are no comparable examples of cooperation in other fields (for an analysis of such cooperation surrounding model organisms, see Kohler, 1994 and Leonelli & Ankeny, 2012).

Yet another characteristic of plant science makes it a particularly fruitful terrain on which to explore ideas on data integration as a key process in producing new knowledge. Plant science produces results of direct interest to several sections of society, including farmers, forestry management, landowners, florists and gardeners, the food industry and the energy industry (through the production of first and second generation biofuels), social movements concerned about genetically modified foods, sustainable farming and population growth, breeders of new plant varieties for agricultural or decorative use, and of course national and international government agencies. As I will show, representatives of these groups can be and sometimes are called to participate in the development and planning of scientific projects, thus contributing to the choice of goals, opportunities and constraints associated with ongoing research. In significant ways, direct contributions to scientific research by non-scientists make a difference not only to the goals ultimately served by science, but also to its practice, methods and results, including what strategies are used to share and integrate data, and what comes to count as new scientific knowledge arising from such integration. As I hope to illustrate, recognizing the differences in the degrees to which scientific inquiry is brought in contact with other sections of society involves challenging the internalistic view of scientific knowledge that is still favored by many philosophers of science, thus bringing my arguments to bear on recent debates around the social relevance of philosophy of science (e.g., Kourany, 2010). Going beyond the view of science as aiming solely to acquire true knowledge of the world may seem a long shot when starting from an analysis of different forms of integration in contemporary plant science; and yet, as I show in this paper, looking at processes of integration in action immediately points to important differences in the types and sources of the data that are being integrated, the integrative processes themselves and the forms of knowledge obtained as a result of integration.4

2. Using data to produce knowledge

Before delving into an analysis of different forms of integration in plant science, let me elaborate on what I mean by data and knowledge and what I take to be the relation between these two key notions. I view data as mobile pieces of information, which are collected, stored and disseminated so as to be used as evidence for claims about specific processes or entities. Thus any material product of research activities, ranging from artefacts such as photographs to symbols, can be considered as a piece of data as long as (1) it is taken to constitute potential evidence for a range of phenomena, and (2) it is possible to circulate it across a community of scientists (through any means ranging from archives to databases, journal publications and stock centres or biobanks). This means first of all that the evidential value assigned to an observation or a measurement (the specific claims that it is taken to constitute evidence for) does not need to be specified in advance for that observation or measurement to count as data: what matters is that someone collects and stores that observation or measurement with the expectation that it may be used as evidence for one or more claims about the world at some point in the future. What matters in assessing the evidential value of data is their reliability and quality as potential evidence, which is judged by scientists through an evaluation of the ways in which data have been collected and disseminated, including the instruments and materials employed to that effect.

This definition of data is inspired by and compatible with Ian Hacking's (1992) work on 'marks', even if he limits his analysis to data produced through experiment, while my view includes field observations and the results of simulations and mathematical modelling. It is also largely compatible with James Griesemer's (2006) view on 'tracking' as a key form of scientific inquiry and with Hans-Jörg Rheinberger's (2011) work on the 'medial world of knowledge-making', where he elaborates on the idea of data as things that can be stored and retrieved, and are thus made durable. My approach emphasises the importance of the use that is made of data over their intrinsic properties: the evidential value of data comes from their interpretation in relation to specific contexts and goals, rather than as a context-independent quality (Leonelli, 2009a). It also separates the analysis of how scientists corroborate their claims about reality, which is the main issue I am concerned with here, from the analysis of how they develop representations of natural phenomena that are conducive to new understanding, which is the problem typically discussed by philosophers concerned with modelling.5 An important implication of this way of viewing data is that, depending on how and where they are used, they can be interpreted as constituting evidence for several different claims and can provide evidence on which to model several different phenomena; and that, given the context-dependence of their use, it is not possible to predict in advance exactly which claims data might be used to corroborate (Leonelli, 2009a).6

This broad characterization of data is crucial to understanding the link between data and scientific knowledge. Data are not, by themselves, a form of knowledge. Rather, data need to be interpreted in order to yield knowledge; and interpretation, in whichever form and through whichever process it is achieved, involves using data as evidence for a specific claim. Here I agree with Rheinberger's view that the scientific value of data lies in the extent to which they are taken to document aspects of the world: the important question for scientists is what kinds of patterns they are able to extract from data, in order to use them as evidence for specific claims about phenomena. It is that claim, rather than the data, which expresses knowledge about reality, and is therefore often referred to as 'knowledge claim' or propositional knowledge. This form of knowledge is also what scientists typically refer to as expressing the 'scientific significance' of data. Another form of scientific knowledge, which is complementary to the propositional form, is the embodied knowledge required to handle data and use them as evidence within a specific research context. Embodied knowledge captures what Gilbert Ryle refers to as 'knowing how', thus singling out the knowledge needed to carry out scientific research and distinguishing it from the propositional knowledge used to devise experiments and interpret results ('knowing that'; Ryle, 1949). Embodied knowledge includes, for instance, the skills involved in manipulating a mathematical model so that it can be fitted to a given dataset, as routinely done in metabolic control analysis; and the skills involved in understanding that the symbols used to capture sequence data refer to the order and quality of nucleic acid molecules within a genome (for a detailed discussion, see Leonelli, 2009b).

4 My work is thus aligned with other attempts at broadening the notions of scientific research, collaboration and knowledge traditionally supported within the philosophy of science, including for instance Longino (2002), Douglas (2009), Mitchell (2009), Elliott (2011), Nordmann, Radder, & Schiemann (2011), and Chang (2012).

5 It might be objected here that my account of data is very close to Morgan & Morrison's (1999) account of models as tool, and thus that if one adopts the 'mediators' approach to models, there is no difference between a model and data. On my account, the difference between data and models is not ontological, but rather lies in the use made of given artefacts. When something is used as evidence for a claim, it is fulfilling the role of data; when the same thing is used to represent a phenomenon, it is fulfilling the role of a model.

6 This is not because it is impossible to predict at least some future use of data as evidence: this is clearly the case, as one of the referees of this paper usefully pointed out. Rather, it is because one cannot predict all the possible claims that data might be used as evidence for in the future; and also because one cannot predict whether data will actually be used as evidence for specific claims until it happens.

Noting the tight relation between propositional and embodied knowledge involved in handling data is relevant to understanding current concerns surrounding the data deluge and the difficulties in transforming data available in public databases into scientific knowledge. Data are made available through public repositories and online databases on the strength of one crucial assumption: that different groups of scientists, including but not limited to the group that originally generates the data, will be able to pick them up, interpret them and use them as evidence for new claims. In this sense the Royal Society stresses the importance of 'intelligent' data access (Royal Society, 2012): data circulated online need to be formatted and visualised in ways that facilitate their use by scientists across the globe, independently of whether those scientists have been involved in the production and collection of those data. Scientists working within different labs across the globe possess their own distinct mix of propositional and embodied knowledge, which makes the re-use of data in new settings both very challenging and potentially fruitful, as different skills and expertise will be employed to interpret the scientific significance of the same dataset. Once a specific dataset is made available to several scientific communities at the same time, each of those communities may have a different perspective on how to interpret it. Hence, one dataset may end up being used as evidence for a much greater number of knowledge claims than if it were kept within the same research context in which it was originally produced. Scientific pluralism in both the practices (embodied knowledge) and the results (propositional knowledge) of research is thus essential to data-intensive processes of discovery: the generative power of data-intensive science comes from the wide variations in the propositional and embodied knowledge, as well as the goals and interests, possessed by different scientific communities.7

3. Standardising knowledge to integrate data

The above vision of data-intensive methods as efficiently channelling diverse research activities into new discoveries, which fuels much of the expectation and hype underlying data-intensive science, is immediately challenged when considering its practical realisation. Here is where the issue of data integration, and all the complications involved in achieving it, comes into the picture. Data cannot be stored and circulated without any organising principles; this basic requirement is ever more pressing when posting data online, given the potential for large storage capability, immediate dissemination and wide reach provided by the web. Data stored in digital databases need to be standardised, ordered and visualised, so that scientists can retrieve them from databases in ways that help them in their own research. These processes constitute an important form of data integration, which involves significant amounts of labour and expertise in biology as well as computer science, including the ability to conceptually order data, format them to fit specific programmes, and develop adequate software and models. Indeed, making data available through databases often requires a sophisticated understanding of what data might be used for, as well as extensive work on the classification and modelling of datasets so that they become compatible with each other, retrievable and re-usable by the wider scientific community. This is a form of scientific labour that I shall refer to as 'curatorial', as it is currently database curators who are largely responsible for taking care of data in this way.8 Data curation constitutes an integral part of processes of discovery, where conceptual and practical decisions about how to integrate and visualise data affect the form and quality of knowledge obtained as a result.

It should be noted here that the integrative efforts of curators are not well coordinated with the data integration carried out by database users, such as experimental biologists. This is, on the one hand, because scientists involved in generating knowledge through data re-use typically do not want to spend time participating in curatorial efforts, as it robs them of precious research time (the fact that curatorial activities are not recognised within current systems of credit attribution in science only adds to this problem; McCain, 1995; Ankeny & Leonelli, 2011; Leonelli & Ankeny, 2012). On the other hand, and perhaps more problematically from the epistemic viewpoint, many biologists do not fully appreciate that data dissemination via databases involves substantial integrative efforts, rather than being simply 'conducive' to integration. This is partly due to the pernicious idea that data found online are 'raw', i.e., that they are shown online in exactly the same format as when they were first produced. Biologists who are committed to this idea are reluctant to consider how the integration of data effected through databases is likely to affect the original format of data and the ways in which they are visualised. So databases are viewed by many scientists as a neutral territory, through which data travel without changing in any way; while, in order to work efficiently in disseminating data, databases need to function as a transformative platform, within which data are carefully selected, formatted, classified and integrated in order to be retrieved and used by the scientists who may need them.

7 My interpretation of pluralism is more encompassing than Sandra Mitchell's integrative [...] grounded in the complexity specific to biological systems (Mitchell, 2003); my position is c[...]axial regimes of practice encompassing all elements of research, including data, models a[...]

8 For a detailed study of the professional role of curators in making data travel in contemp[...] curators has evolved throughout 20th century biology, see Strasser (2011) and Leonelli & [...]

    In their analysis of data integration, OMalley and Soyer (2012)discuss one key aspect of this situation: the activity of makingcomparable different datasets from a huge variety of potentiallyinconsistent sources (p. 61). As they illustrate, data disseminationthrough databases is often expected to remove inconsistenciesamong sourcessuch as differences in the instruments used toproduce dataand thus make it possible to interpret data as a uni-ed whole, without worrying about their provenance and abouthow they were merged (e.g., Ideker, Bafna, & Lemberger, 2007).Other potential inconsistencies to be removed include differencesin data formats; standards for what count as valid or reliable data;and data types, ranging from genome sequences to metabolic andeven morphological data (as in the recent case of phenomics). Inte-gration within databases can thus be described as the gathering ofseveral different types of data, obtained by a variety of occasionallyinconsistent sources, so that they can be searched and analysed asa single body of information.

    This form of data integration is largely achieved through thedevelopment of standardised descriptions of both the propositionaland the embodied knowledge employed in producing the data in therst place. Database curators do not achieve data integration byavoiding worries about data provenance or manipulations of thedata that they collect into their resource. Rather, curators are wellaware that information about the provenance of datahow theywere produced, why and by whomis crucial to interpreting thedata and assessing their quality and reliability; they are also awarethat how data are formatted, annotated and visualised in a databasedetermined the extent to which data are re-widely usable. Placingvalue on the diversity of life histories of data, and of the settings inwhich they are produced, is thus crucial to the practice of data cura-tion. Thus, overcoming the problem of inconsistent sources, as wellas the broader question of how to integrate and visualise data thatcome in different formats and types, involves putting information

    e pluralism, which focuses chiey on the integration of multilevel explanations and islosest to Hasok Changs recent account of interactive pluralism, which involves pluri-

    nd a variety of possible aims (Chang, 2012, p. 272).orary biology, see Leonelli (2010, 2012). For historical analyses of how the role of dataAnkeny (2012).

    owledge: Three modes of integration in plant science. Studies in History and6/j.shpsc.2013.03.020

    about the provenance of data at the centre of the database itself. To this aim, curators are developing standardised descriptions of techniques, assumptions, methods and conditions under which data are produced, so that information about how data have been generated can be intelligible as widely as possible beyond the boundaries of the producers' lab.9

    One example of such standards concerns data formats and confidence rankings. Curators determine which formats (and thus which instruments and outputs) data producers should privilege, and study ways to translate between these formats and others used by data producers. Further, curators are increasingly pushed to provide at least some preliminary assessment of the quality and reliability of data (a process often referred to as 'data control'), such as confidence rankings where datasets are classified depending on the trust placed by the scientific community in the methods and instruments used for data generation (see for instance the controversies over the status of microarray data; Keating & Cambrosio, 2012). Another example is the selection of salient features of data production. Curators select and standardise information, usually referred to as 'meta-data', about where specific datasets originally come from, how they were collected and how they were formatted, annotated and visualised upon entry into the database. This involves making decisions about which knowledge, both embodied and propositional, is most relevant to situating data, integrating them with other data, and interpreting their significance. Several ongoing initiatives in bioinformatics, usually initiated by database curators, center upon the criteria for choosing which instruments, experimental conditions, techniques and procedures need to be reported when disseminating data online. One relatively successful effort is Minimum Information for Biological and Biomedical Investigations (MIBBI), which provides standardized descriptions of the embodied knowledge involved in biological experiments (Taylor et al., 2008). Another is the standardized description of features of seeds to be stored in seed banks provided by the Plant Ontology (Avraham et al., 2008). Notably, written descriptions are increasingly seen as insufficient ways to capture embodied knowledge, and are complemented by other media such as videos of data production processes (for instance, the Journal of Visualized Experiments). Even a brief overview of these standardising practices indicates that curators are contributing to the formalisation and expression of both the propositional and the embodied knowledge required to make sense of data and transform them into new knowledge, thus playing a key role in facilitating data integration both within and beyond databases (for a study of this process with regard to the formalisation of propositional knowledge through bio-ontologies, see Leonelli, 2012b).

    4. Data integration in plant science

    I will now consider three actual cases of data integration, each of which leads to the production of new knowledge. All three cases come from the field of plant science, which, as I stressed in my introduction, finds itself at the forefront of data integration and data-intensive discovery. One instance of this is The Arabidopsis Information Resource (TAIR), an online database devoted to the storage and dissemination of data collected on Arabidopsis thaliana, a key model organism for plant science. Data integration has been a central concern for TAIR curators from its inception, since the database contains several types of data about Arabidopsis, including sequence data, transcriptomics, proteomics and even phenotypic data, which all need to be searchable and retrievable by users; and aims to serve comparative research with other organisms, which involves extensive efforts to integrate data across species. TAIR curators, who include experts in various aspects of plant biology as well as computer programmers, have thus long been aware that their database is expected both to integrate data as part of its efforts to disseminate them, and to foster further integration (of data as well as other components of scientific research) by its users. For instance, a major achievement has been the integration of available data about genes and their functions, enzymes, compounds and reactions to visualize and study specific metabolic pathways (through two tools called MetaCyc and AraCyc; Zhang et al., 2005). More recent updates include a complete revision of the Arabidopsis genome, in which pseudogenes and transposon genes were re-annotated, and new data from proteomics and next generation transcriptome sequencing were incorporated into gene models and splice variants (Lamesch et al., 2012, p. 1).

    Sue Rhee, who directed the resource in its first five years, described the type of biology that TAIR is supposed to foster in the following way:

    If the next twenty years of biology could be summed up into one word, it would be integration. We will see integration of basic research with applied research in which plant biotechnology will play an essential role in solving urgent problems in our society such as developing renewable energy, reducing world hunger and poverty, and preserving the environment. We will see integration of disparate, specialised areas of plant research into more comparative, connected, holistic views and approaches in plant biology. We will also see more integration of plant research and other biological research, from microbes to humans, from a large-scale comparative genomics perspective. Bioinformatics will provide the glue with which all of these types of integration will occur (Rhee et al., 2006, p. 352).

    Rhee implicitly singles out different types of integration here, which I highlighted in italics. I find her preliminary classification illuminating, as it implicitly points to a difference in modes of integration that has not hitherto been explicitly discussed within philosophy of science literature, and which places data integration in the context of the variety of interests, epistemic and institutional structures and goals at play in contemporary plant science research. I thus want to expand on the three approaches to integration that Rhee distinguishes here, and which I shall call 'inter-level', 'cross-species' and 'translational'. In the next three subsections, I look in more detail at these types of integrative effort by picking case studies that exemplify each of them.

    9 Curators are able to do this by combining several research skills that typically include both propositional and embodied knowledge as a bench scientist and some training in information technology and programming (Leonelli 2010, 2012). In many cases, curatorial teams are composed of both biologists and computer scientists, in order to cover all relevant expertises; yet, it is often the vision of biologists, and their awareness of what prospective database users need, that guides the development of databases.

    10 I should note that this reading of model organism biology does not run counter to the observation that many research projects under that banner deal with isolated components and pathways. What Rachel Ankeny and I have defended is that the idea of focusing research on one species is historically grounded in the expectation of being able to integrate those disparate bits of knowledge at some point in the future. The attempt to achieve such holistic understanding functions as a key motivating factor for researchers in these communities, and shapes the methods used to pursue research.

    4.1. Inter-level integration and model organism research

    The idea of centering research efforts on a handful of species has been a hallmark of 20th century experimental biology, culminating in the sequencing of the genomes of the most popular of these organisms: Escherichia coli, Caenorhabditis elegans, yeast, Drosophila, zebrafish and, for plants, Arabidopsis. One of the aims of this research strategy was to integrate knowledge produced on different aspects of the biology of the same organism, so as to understand the biology of the organism as a whole rather than as an ensemble of disconnected parts (Ankeny & Leonelli, 2011).10 Despite the many

    criticisms leveled at the bias created by this strategy in funding allocation and research directions (Bolker, 1995; Davis, 2004), model organism research has been immensely successful, particularly in plant science where enormous advances were made in understanding complex processes such as photosynthesis, flowering and root development by focusing research efforts on Arabidopsis (Bevan & Walsh, 2004). Understanding these processes requires an interdisciplinary approach comprising several levels of organization, from the molecular to the developmental and morphological.11 Arabidopsis provided a relatively simple organism on which integration across these levels could be tried out under the controlled conditions of a laboratory setting. The use of Arabidopsis has been highly controversial within plant science at large, with scientists specializing in the study of other plants and/or plant ecology complaining that focusing on Arabidopsis took resources away from the study of plant biodiversity, evolution and the relation between plants and their environment (Leonelli et al., 2012). At the same time, considering Arabidopsis in relative isolation from its natural environments and other plants has been very successful in generating important insights about its inner mechanisms, particularly at the molecular and cellular levels. For instance, the detailed mechanistic explanations of photosynthesis achieved to date, and the resulting ability of scientists to manipulate starches and light conditions to favor plant growth, are largely due to successful attempts to bring the study of enzymes and other proteins involved in photosynthesis (which involves the analysis of molecular interactions within the cell nucleus) into relation with the study of metabolism (which involves the cellular level of analysis, since it focuses on post-translational processes outside the nucleus). The integration of these two levels of analysis is fraught with difficulties, since the evaluation of data about DNA molecules (as provided by genome analysis) needs to take account of their actual behavior within and interactions with the complex and dynamic environment of the cell, which makes it extremely difficult to model metabolic pathways (e.g. Stitt, Lunn, & Usadel, 2010; Bechtel and Brigandt, in this issue, also discuss the dynamic modeling of pathways).

    This is an excellent illustration of inter-level integration as aiming to understand organisms as complex entities, by combining data coming from different branches of biology in order to obtain holistic, interdisciplinary knowledge that cuts across levels of organization of the same organism.12 This case also instantiates the key role played by databases and curatorial activities in achieving inter-level integration, as the development of centralized depositories for data (first material archives, and then digital databases) has been central to the success of model organism research (Bult, 2006; Leonelli & Ankeny, 2012). Since its inception, TAIR has been heavily engaged in facilitating inter-level data integration, particularly through the development of software and modeling tools that enable users to combine and visualize several datasets acquired on two or more levels of organisation. Tools such as the above-mentioned AraCyc and MetaCyc, for instance, have been fundamental in enabling researchers to combine and visualize genomic, transcriptomic and metabolic data as a single body of information. This has made it possible to integrate data generated from the molecular and the cellular levels of organization, thus enabling researchers to visualise and study specific metabolic pathways. TAIR curators have also devoted much attention to developing 'evidence codes' to capture meta-data about experimental settings and techniques used to generate data in the first place. Evidence codes, which include labels such as 'inferred from mutant phenotype' and 'inferred from reviewed computational analysis', are used by TAIR users to gather information about the provenance of the data found online and thus assess their reliability and quality. These meta-data facilitate inter-level integration by enabling researchers working at a specific level (e.g. cellular) to assess and interpret data gathered at another level (e.g. molecular). Last but not least, TAIR curators have endeavored to collaborate with researchers from all corners of plant science in order to generate keywords to describe the biological objects and processes currently under investigation (keywords that could, in turn, be used to retrieve data relevant to those objects and processes from the database). A crucial requirement for these keywords was that they be intelligible to researchers from all branches of biology, ranging from genetics to immunology, ecology and, increasingly, evolutionary-developmental biology. To this aim, TAIR curators were involved in several 'content meetings', where biologists specializing in research at a variety of levels of organization met to discuss how to identify and define keywords to enable inter-level communication (Leonelli, 2010a). Two prominent results of these efforts are the Gene Ontology and the Plant Ontology, which have been implemented within TAIR as classification systems for the retrieval of data about, respectively, gene products and plant features (Avraham et al., 2008; for a philosophical analysis see Leonelli, 2010b, 2012b).

    11 I here adopt a broad definition of 'level of organisation', which is meant to reflect a way to organise and subdivide research topics that is still very popular within biology: that is, the focus on specific components and scales of biological organisation, each of which requires a specific set of methods and tools of investigation that suit the dimensions and nature of the objects at hand.

    12 What I here call inter-level integration has also been discussed by proponents of mechanistic explanation (Bechtel, 2013; Craver, 2005; Darden, 2005), though Craver (2005) has also pointed to the importance of intra-level integration ignored by proponents of reduction.

    13 This view is compatible with an emphasis on experimental intervention (learning by doing) and exploratory research as means to achieve biological discoveries (e.g., Burian, 1997; O'Malley, 2011).

    It is important for my purposes here to stress that inter-level research on Arabidopsis, as in the case of many other popular model organisms, was driven strongly by the scientific community, with the support of funding bodies such as the National Science Foundation, but with little influence from other parts of society which have stakes in plant science, such as agricultural research, farmers and industrial breeders (Leonelli et al., 2012). Attempting to integrate data resulting from different strands of plant research was seen as requiring expert consultations within the plant science community, aimed at resolving technical and conceptual problems in an effort to acquire an improved understanding of Arabidopsis biology. The very idea of using data from the same model organism to achieve inter-level integration exemplifies the image of science often celebrated within philosophy of science circles, as well as many popular accounts of discovery: these are scientists who wish to bring together their results within expert circles that are largely separate from other sections of society; and this is done in order to acquire a more accurate and truthful understanding of biological processes, resulting in the articulation of reliable explanations of those processes (and related forms of intervention).13 Indeed, inter-level integration is heavily concerned with the methodological and conceptual challenges deriving from the attempt to collaborate across disciplines, such as the challenge of communicating ideas across different research communities, which involves some standardization both of the propositional knowledge produced by each community (so that others can understand it, as in the case of keywords such as those in the Gene and Plant Ontologies) and of the embodied knowledge developed (so that experimental results obtained by one community can be reproduced by others if needed, as in the case of evidence codes in TAIR). In order to agree on the keywords and metadata used to classify and retrieve data on metabolic pathways, TAIR curators have engaged in extensive consultations with scientists working on all the relevant levels of biological organisation (including molecular and cellular), so as to ensure that data integration tools within TAIR

    were set up in ways agreeable to and compatible with research at several levels of analysis.

    The efforts of TAIR curators are thus reminiscent of the challenge of communicating propositional knowledge across different scientific groups, which many philosophers of science have focused upon when reflecting on scientific integration. This scholarship includes Lindley Darden and Nancy Maull's classic paper on inter-field theories (1977) and, most recently, Bechtel and Richardson (2010), Mitchell (2003, 2009) and Brigandt (2010) on integrative explanations. My focus on data integration, rather than on the integration of explanations, models and theories that these authors address, helps to highlight the importance of communicating embodied knowledge, for example in the form of meta-data, in order to achieve integration (a factor that tends to be overlooked in literature focused on explanations and conceptual structures). The focus on data also helps to stress the diversity of the epistemic goals and priorities driving integration, as well as the distinct forms of knowledge that may be achieved through pursuing these goals.14 To this end, I will now discuss two forms of integration that operate differently from inter-level integration, and whose results are distinct from the knowledge of plant biology acquired in this case.

    4.2. Cross-species integration and biofuels research

    In cross-species integration, scientists place more emphasis on comparing data available on different species, and using such comparisons as a springboard for new discoveries, rather than on integrating data across levels of organization of the same species.15

    Consider current research on the grass species Miscanthus giganteus. Miscanthus is a perennial crop, which means that it can be cultivated in all seasons, without interruptions to the production chain. It grows fast and tall, thus guaranteeing a high yield; and it grows easily on marginal land. These characteristics have made it a good candidate as a source of bioethanol, particularly because it poses less of a threat to food production than other popular sources of biofuels such as corn (whose cultivation for the purposes of biofuel production has taken big chunks of land in the United States away from agriculture, which is deemed to have affected the availability and price of agricultural produce worldwide; Babcock, 2011). The potential of Miscanthus as a source for biofuels is one of the factors that first spurred scientific research on this plant. And indeed, such research is ultimately aimed at engineering Miscanthus so that its growing season is extended (by manipulating early season vigor and senescence) and its light intake is optimized (by modifying its architecture, e.g. several sprawling stems, or by increasing stem height and number). In this broader sense, research on Miscanthus is a good example of research aimed at developing techniques for intervening in the world, and ultimately for improving human life. However, there is at least one other reason why Miscanthus has become an important organism in contemporary plant science, which has little to do with its energy output. This is the opportunity to efficiently cross-reference the study of Miscanthus with research on Arabidopsis.

    On the one hand, Arabidopsis provides the perfect platform on which to conduct exploratory experiments, given how much scientists already know about that system (thanks to inter-level integrative efforts) and the extensive infrastructure, standards for collecting data and metadata, and modeling tools already available on it (e.g. as incorporated in TAIR). On the other hand, Miscanthus provides a good test case for ideas first developed with reference to Arabidopsis, whose value for other species researchers have yet to explore. Many researchers trained on Arabidopsis biology have thus switched to comparative research on these two systems, which has hitherto proved very productive: many experiments needed to acquire knowledge about molecular pathways relating to abiotic stress can be more easily carried out on Arabidopsis than on Miscanthus; data collection and integration on Miscanthus itself is facilitated by the standards, repositories and curatorial techniques already developed for Arabidopsis; and new data types, such as data about how Miscanthus behaves in the field (e.g. its water intake), can be usefully integrated with data about Arabidopsis metabolism, resulting in new knowledge about how plants produce energy in both species.

    Note that this research requires more than simply the transfer of knowledge from one plant to the other: plant scientists need to iteratively move between the two species, compare results and integrate data at every step of the way, in order for new knowledge to be obtained. In other words, this research requires genuine integration between results obtained on Miscanthus and Arabidopsis. For instance, the consultation of TAIR data on Arabidopsis genes that regulate floral transition has been a crucial impetus for research on flowering time in Miscanthus, since those data provided Miscanthus researchers with a starting point for investigating the regulatory mechanisms for this process (Jensen, 2007); and the subsequent findings on the susceptibility of Miscanthus flowering to temperature and geographical sites are feeding back into the study of flowering time in Arabidopsis (Jensen et al., 2011).

    Perhaps the most important distinctive feature of cross-species integration is that it fosters studies of organismal variation and biodiversity in relation to the environment, with the aim of understanding organisms as relational entities, rather than as complex, yet self-contained, wholes (as in the case of inter-level integration). This is because as soon as similarities and differences between species become the focus of research, plant researchers need to identify at least some of the reasons for those similarities and differences, which unavoidably involves considering their evolutionary origins and/or the environmental conditions in which they develop. Hence, like inter-level integration, cross-species integration may be construed as aiming to develop new scientific knowledge of biological entities. However, the way in which it proposes to expand the realm of existing knowledge is not necessarily by extending the range of inter-level explanations available, but rather by extending the range of organisms to which these explanations may apply. Indeed, while researchers can and often do pursue both inter-level and cross-species integration at the same time, it is also possible to achieve cross-species integration without necessarily fostering inter-level integration. This is the case, for instance, when using comparisons of data about flowering time between Arabidopsis and Miscanthus to explore the respective responses of the two plants to temperature; such cross-species comparison can eventually be used to foster inter-level understanding of flowering that integrates molecular, cellular and physiological insights, but this is not necessary in order for cross-species integration to be regarded as an important achievement in its own right.

    14 My approach is thus in line with the analysis of epistemic goals developed by Ingo Brigandt (2013) and Maureen O'Malley's emphasis on forms of integration beyond the explanatory level (O'Malley & Soyer, 2012; O'Malley, 2013).

    15 In plant biology, comparisons between strains and varieties regarded as belonging to the same species can often be as interesting as comparisons among species. My analysis here is thus not meant to endorse strict taxonomic distinctions, but rather to capture the importance of comparing differences between groups of organisms. Understood in this broad sense, cross-species integration can also include cross-variety and cross-strain analysis.

    Further, cross-species integration poses a different set of challenges from inter-level integration, whose resolution can easily constitute the sole research focus of a research project (see e.g. Kelling et al., 2009). It requires accumulating data that are specifically relevant for the purposes of comparison (for instance, by making sure that data obtained on Miscanthus are generated with tools and on materials similar to the ones available on Arabidopsis,

    so as to make comparison tenable) as well as developing infrastructure, algorithms and models that enable researchers to usefully visualize and compare such data. Thus in our example, TAIR provides a key reference point, but it is not sufficient as a data infrastructure for such a project, for the simple reason that it focuses on Arabidopsis data alone. Indeed, the difficulties of using TAIR for cross-species integration have become so pronounced and visible within the plant science community that TAIR itself is now undergoing heavy restructuring to secure its future compatibility with both inter-level and cross-species analysis (The International Arabidopsis Informatics Consortium, 2012). This task is made even harder by the terminological, conceptual and methodological differences between communities working on different organisms, as well as differences in perceptions of what counts as good evidence and the degree to which specific traits are conserved across species through their evolutionary history. These differences need to be clearly signaled when constructing databases that include and integrate data acquired on different organisms. The Gene Ontology project, which as I mentioned earlier was started as a means to disseminate data within a handful of model organism databases such as TAIR, is now used extensively as a platform for the integration of gene products data across species (The Gene Ontology Consortium 2004) and exemplifies the difficulties and controversies involved in rising to this challenge (Leonelli et al., 2011). Even the comparison of different genome sequences, which should be among the easiest to accomplish given the highly automated and standardized production of this type of data, is fraught with problems (e.g., Quirin et al., 2012). In this context, norms such as the principle of genetic conservation, by which scientists see regions of the genome that are highly conserved across species as potentially linked to important functions (since less relevant regions are assumed to have been selected away through the evolutionary process), matter over and above the norms of validity and accuracy used to achieve inter-level integration.

    In conclusion, the increasing emphasis on cross-species integration can be seen as complementary to, and yet separate from, inter-level integration. The two forms of integration are clearly interconnected. I have illustrated how the inter-level integration achieved for Arabidopsis through model organism biology provides an important reference for cross-species integration in several areas of plant science. Indeed, inter-level integration of data within one species is often the starting point for cross-species investigation and for the integration of data about the same process as it manifests itself in different species. However, this does not necessarily mean that cross-species integration presupposes inter-level integration as a matter of principle, or in all cases. Further, these two forms of integration raise different epistemic challenges, which do not have to be addressed within the same research project; and, perhaps most importantly, they require different sets of data and infrastructures, as illustrated by the practical difficulties in using TAIR, whose primary focus is inter-level integration in Arabidopsis, to study other plant species such as Miscanthus.

    4.3. Translational integration and plant-pathogen interaction

    Research on Miscanthus could be seen as exemplifying research that has been targeted and structured to serve societal goals: in this case, the sustainable production of biofuels. However, while the goals set by funding agencies and industry have been crucial to the choice and funding of Miscanthus as an experimental organism, plant scientists engaged in Miscanthus research have not, at least until recently, worried much about how the plants that they are engineering could actually be transformed into biofuel, whether that process would be particularly sustainable and economical, and how those downstream considerations might affect upstream research. This set of considerations has largely been left to politicians and industry analysts, while plant scientists focus on the task of achieving new knowledge of Miscanthus biology. In other words, scientists and curators focusing on cross-species integration are primarily focused on producing knowledge about plant biology that is more accurate and all-encompassing than that already available to them. Their expectation is that knowledge produced in this way will eventually inform the mass engineering of Miscanthus plants, thus creating biomass from which bioethanol could be efficiently extracted. This is a reasonable expectation, and the knowledge obtained from Miscanthus research will undoubtedly inform biofuel production in the future. However, other research programs in plant science are explicitly planned and shaped so as to serve societal needs even before they improve on existing scientific knowledge of the organisms involved, thus de-emphasizing the production of new biological knowledge in favor of producing strategies for managing and manipulating organisms and environments so that they support human survival and well-being in the long term.

    Consider for instance research on plant pathogens, which are becoming a serious threat to ecosystems and agriculture worldwide because of global trade and travel that facilitate the dispersion of parasites well beyond their natural reach (Potter, Harwood, Knight, & Tomlinson, 2011). Dealing with plant pathogens that are new to a given territory is a matter of urgency, since targeted interventions need to be devised before the pathogen creates much damage; scientific research is a key contributor, as these pathogens are often relatively unknown within the scientific literature and are anyhow interacting with a whole new ecosystem, often with unprecedented results. Recently I participated in a workshop organized at my university to discuss how plant science can help to suppress an infestation of Phytophthora ramorum, a plant parasite that landed in the South-West of the UK in the early 2000s and has been ravaging the forests of Devon ever since. The infestation had gotten particularly worrisome in 2009, when it started to affect large chunks of the local population of larches. The workshop brought together representatives of relevant plant biology and data curation conducted at several research institutes in the UK; the UK Forestry Commission; the Food and Environment Research Agency; private landowners; social scientists; and representatives of other governmental agencies, such as the Biotechnology and Biological Sciences Research Council.

    At the start of the workshop, it was made clear that there are several alternative ways to tackle the Phytophthora infestation, including burning the affected areas, using pesticides, cutting down the trees, letting trees live and introducing predators, making affected areas inaccessible to humans, or simply letting the infection run its course. A focus of debate was then to determine which scientific approach would provide empirical grounds to choose an effective course of action among all those possible interventions. Acquiring novel understandings of the biology of Phytophthora was obviously important in this respect; but it was not the primary goal of the meeting, and it was made clear that choices concerning which research approach would be privileged in the short term should not be based on the long-term usefulness of that approach in providing new biological insights. This was particularly relevant in selecting strategies for data collection and types of data to be privileged in further analysis. For instance, whole genome sequencing was agreed to be an excellent starting point for a traditional research program seeking to understand the biology of Phytophthora through inter-level and cross-species integration, especially since data could be compared (through online databases) to data generated on other strains of Phytophthora by European and North-American labs. However, many participants questioned the efficiency of this strategy in quickly providing genetic markers for Phytophthora ramorum, which could be of immediate use to combat the infestation. It was argued that focusing genomic research on more specific parts of the genome, such as loci already known to be linked to pathogenic traits, would provide a way for the Forestry Commission to test trees in areas not yet affected and determine immediately whether the infestation is spreading (the merits and drawbacks of using PCR-based diagnostics were debated at length). Further, much debate surrounded the possible ecological, economic and societal implications of each mode of intervention under consideration, and the science related to it. Biological research was thus not the sole empirical ground to assess the quality and effectiveness of an intervention; other factors included local ecology, touristic value of the areas and the economic value of the wood being cut down (that is, factors that include the environmental considerations that I signaled as central to the cross-species approach, as well as economic and social elements that cross-species integration would not regard as relevant). Only through such an overall assessment could participants and scientists determine the overall sustainability of the research program that was being planned.

    Notably, each participant contributed not only a specific perspective on what the priorities are in dealing with Phytophthora, but also their own datasets for integration with the molecular and phenotypic data to be gathered by plant scientists. These included data of great relevance to scientific research, which however were collected for purposes other than the scientific study of Phytophthora: for instance, geographical data about the spread of the infestation, which was gathered by Forest Research (the research arm of the Forestry Commission) in the course of aerial surveillance and was picked up by mathematical modellers to help predict future spread patterns; and photographs of affected trees in several areas, collected by the Forestry Commission and local landowners, and seized upon by plant pathologists at the James Hutton Institute as evidence for plant responses to biotic stress. Acquiring access to those data constitutes an achievement in itself for plant scientists, since some of the stakeholders involved are more willing to disseminate their data than others. For instance, the Forestry Commission is more reluctant to share data than plant scientists working at the University of Exeter, for whom contributing to online sequence repositories such as the Sequence Read Archive or GenBank is a part of research routine. Further, scientists at the meeting were not sure about which existing online database, or combinations of databases, would best serve the desired integrative efforts. One obvious candidate would be PathoPlant, a database explicitly devoted to data on plant-pathogen interaction, but its use was not explicitly discussed at the meeting I attended, perhaps because it was not clear to participants that such a database would serve their immediate research goals.16

    16 Interestingly, the development of PathoPlant itself has involved considerable integrative efforts, both inter-level (by integrating data from different features of both host plants and pathogens) and cross-species (by integrating data coming from a variety of different species, though it must be noted that Arabidopsis research has provided much of the data used to start this initiative; Bülow, Schindler, & Hehl, 2007).

    Indeed, plant scientists involved in the meeting at Exeter University found themselves negotiating with stakeholders, some of whom are also arguably involved in scientific research (such as biologists working for the Forestry Commission), whose main aim was not the production of new insights on Phytophthora biology, but rather the achievement of a reliable body of evidence that would help in deciding how to tackle Phytophthora infestations. This negotiation, which is the key feature of this type of integrative effort, is not easy, especially given the tendency of plant scientists to reach for inter-level and/or cross-species integration too. Thus, plant scientists at the meeting strongly advocated the expansion of research on Phytophthora ramorum into a long-term program that would investigate the relative virulence of the pathogen on different hosts (which would involve detailed studies of the hosts, i.e. tree species, as well), assemble whole genome data on all available and emerging strains, and investigate the mechanisms that trigger virulence. All of these research programs, which clearly involve inter-level and cross-species integration, would provide knowledge about the biology of Phytophthora that scientists view as crucial to develop better interventions; however, these programs would require substantial funding and considerable time in order to yield results, and scientists were pushed by other parties to articulate more fully how the systematic whole genome sequencing of Phytophthora strains would eventually lead to effective interventions on the infection. In particular, it was argued that although the cross-species and inter-level integration acquired through these approaches was desirable, it was not necessary in order to facilitate decisions on how to eradicate Phytophthora. For example, PCR-based diagnostics, though arguably useless to the pursuit of a better understanding of the biology of Phytophthora and its hosts, might work perfectly well for the purposes of diagnosing infection.

    Negotiations among molecular biologists, scientists working at Forest Research, and other stakeholders are still ongoing at the time of writing, and engagement in these discussions is generating a shared research programme, part of which will involve the development of a database that fosters the integration of data relevant to the study of the virulence and potential environmental impact of Phytophthora. This case nicely exemplifies the characteristics of translational integration, which privileges the achievement of improvements to human health, for instance through targeted interventions on the environment and the use of existing resources, over the production of new scientific knowledge for its own sake.17 I take the term 'translational' from current policy discussions of the importance of making scientific research useful to wider society, as instigated for instance by the NIH in the early 2000s. However, I do not subscribe to the linear trajectory of research from basic to applied that is often used within such policy discussions. Rather, I wish to use the category of translation to focus on specific ways in which scientists frame their research so as to respond to a social challenge. In the case I considered, scientists aim to produce new forms of intervention that are targeted to the situation at hand. This is not in itself sufficient to differentiate translational integration from inter-level and cross-species integration. Many philosophers have rightly argued that learning to intervene in the world, and particularly to manipulate organisms in the case of biology, is part and parcel of scientific research and is inseparable from the process of acquiring new knowledge about the world; there is thus no clear epistemic distinction between making and understanding, since many scientists develop new types of experimental interventions as a way to acquire new knowledge, and vice versa.

    17 It is important to stress here that human health can hardly be construed separately from the health of the environments that support human survival and well-being, and particularly plants as key sources of air, fuel and nurture. In this paper, I thus endorse the idea of green biotechnology, and plant science, as playing a key role in protecting and improving human health. This may be perceived as counter-intuitive by scholars who view red biotechnology, and particularly the medical sciences, as the only form of knowledge that is concerned with human health; I hope that this analysis helps to correct this common misconception.

    What I think makes translational integration distinct from the other two modes is the strong commitment to producing results that affect (and hopefully improve) human health, which involves the development of research strategies and methods that are distinct from the ones employed to achieve inter-level and cross-species integration.18 My definition of translation is therefore narrower than the definition provided by Maureen O'Malley and Karola Stotz, according to which translational research consists of the capacity to transfer interventions from context to context during the pluralistic investigation of a system (O'Malley & Stotz, 2011). I agree with them that translational research involves such movement of knowledge, but I also think that these transfers can be geared to satisfying a variety of specific agendas, several of which are not primarily concerned with how scientific knowledge affects society. Many parts of biological research inherit and refashion techniques for intervening on organisms, without necessarily aiming to produce socially valuable results in the short term. Of course, all scientific research has the potential to ultimately improve human health, and yet some parts of science are not explicitly conducted to foster this goal in the short term (which is, incidentally, a very good thing, both because the potential social benefits of science are unpredictable, and because the social agenda for what counts as beneficial to humanity changes with time and across domains). I see the extent to which a group of scientists explicitly subscribes to the agenda of social change, and shapes its research accordingly, as marking the difference between more foundational scientific research and translational endeavours. I therefore would not agree with O'Malley and Stotz when they conclude that translation is involved whenever techniques for intervention are transferred from one scientific context to another. In their definition, translation is involved, potentially to the same degree, in all three modes of integration which I consider here; while in my analysis, translation becomes a primary concern, with important consequences for how research is conducted and with which outcomes, when scientists commit to fulfilling specific social roles in the short term.

    18 For arguments concerning the use of biological research, including the forms of mathematical modelling characteristic of systems biology, for medical purposes, see the papers by Bechtel (2013), Brigandt (2013) and Plutynski (2013).

    A key implication of the commitment to improving human health is that scientists engaged in translational integration need to pay attention to the sustainability of their research program, not only in the narrow sense of worrying about its financial viability, but also in the broader sense of considering the potential environmental and social impact of its outcomes. In practice, this typically involves engaging directly with contexts of production/use, so as to be able to assess the downstream applicability of specific research strategies and prospective results. Crucially, biologists do not possess the right expertise to determine, by themselves, what counts as sustainable research outcomes. This is why they need to collaborate with scientists in industry, state agencies and social scientists, among others, as it is through such engagements that scientists determine what constitutes human health and how to improve it in the case at hand. Indeed, the social agenda for translational research cannot be fixed, for the simple reason that it depends heavily on the ever-changing viewpoints and needs of the many stakeholders involved. Scientists who choose to take time to discuss the goal of their research with relevant parties outside the scientific world, and tailor their own research, tools and methods to fit those discussions, are investing a significant amount of their resources in producing results that might not be revolutionary in terms of their conceptual contribution to existing biological knowledge (though they may prove to be such!), but rather are primarily meant to serve a wider social agenda.

    Researchers involved in these exchanges are often also forced to compromise on their own views of what would constitute a productive research strategy and attractive research findings, in order to accommodate requirements and suggestions by other parties interested in achieving social, rather than scientific, goals. In particular, prioritizing the achievement of sustainable and efficient intervention (where what counts as sustainable and efficient is agreed upon among several different parties) over the acquisition of biological knowledge has important consequences for research practices, and particularly processes of integration, such as the choice of relevant data, and the speed with which data need to be collected and interpreted. This choice of priorities often comes at the expense of time dedicated to exploratory research, and yet this does not necessarily compromise the quality of research and its outcomes. On the contrary, the development of research strategies to pursue socially relevant goals, especially when it is coupled with the awareness that achieving such goals can sometimes be a long-term and complex endeavor, is a form of inquiry that philosophers of science should value and support.19

    19 In this sense, my approach is closely aligned with the socially relevant (philosophy of) science proposed by Longino (2002) and Kourany (2010) and other leading philosophers interested in feminist philosophy and social studies of science as key sources of insight for the development of the philosophy of science. See for instance the Synthese special issue edited by Fehr & Plaisance (2010); and the review symposium on Kourany (2010) in Perspectives on Science 2012, vol. 20, no. 3.

    5. Conclusions

    I have here identified and discussed three modes of integration at work in contemporary plant biology: inter-level, aimed at understanding organisms as complex wholes across a range of disciplinary approaches; cross-species, aimed at understanding organisms comparatively and in relation to their environment; and translational, aimed at intervening effectively over organisms in order to improve human health. In all three cases that I examined, data integration is required to generate new knowledge, and curators of online databases play an important role in facilitating it. Collaborative research across groups of scientists is key to this process, as is the commitment to data sharing and re-use (even if such commitment is compatible with a large spectrum of research practices, and some of the parties involved would rather access data produced by others than share their own); and the dissemination of data through databases, as well as their integration within research, entails the standardization of embodied and propositional knowledge, as well as the development of new forms of intervention in and conceptualization of the biological world. Beyond these commonalities, however, each of these cases exemplifies a different way to integrate data, which prioritizes specific methods and prospective results over others, takes different types of data as its starting point, and poses distinct challenges to the curators of online databases. This means that achieving one of these types of integration does not necessarily provide the means to pursue another type: inter-level, cross-species and translational integration are the result of significantly different research strategies and related clusters of instruments, concepts, methods, materials and data.

    Inter-level integration has been particularly prominent within 20th century plant molecular biology. This form of integration involves data produced by different subdisciplines on different levels of organization of the same plant species. Thus, most of the research efforts focus on finding ways to overcome disciplinary barriers, such as differences in methods and terminology between molecular and cellular biology, in order to collect and visualize those data within a single framework; and biologists involved in those efforts have tended to prioritize mechanistic understandings of organisms over the study of biodiversity. By contrast, the main challenge in cross-species integration is to find ways of making methods, terminologies, and material differences involved in the study of two different species (rather than disciplines) compatible with each other; and in exploring the reasons for the similarities and differences between species, which involves taking account of the evolutionary and environmental context of the organisms at hand. Finally, the bulk of research efforts within translational integration goes into the assessment, in dialogue with a wide range of expertises beyond plant science itself, of the sustainability and social impact of the methods and potential outcomes of research. This form of research might not necessarily yield new insights on the biology of organisms, or their evolutionary or developmental history; however, it does yield an understanding of how scientific research needs to be set up and developed so as to yield socially desirable outcomes. Given the context-dependent and ever-changing nature of what counts as socially desirable, this involves the constant re-assessment of which data are most relevant to achieving such goals, which expertises are involved in producing such data, and which methods and data infrastructures are used to disseminate, visualize and interpret them.

    It is important to note that the typology proposed here is not meant to be exhaustive of all the ways in which data integration can happen in plant research. These forms of integration are also not meant to be mutually exclusive, and very often they happen alongside each other, and in dialogue with each other, within the same scientific laboratory (as my examples have shown). Most research groups that I have come across are interested and involved in all three types of integration. Indeed, the interplay between those three modes is often crucial to the scientific success of a lab, especially at a time when both scientific excellence and the social impact of research are highly valued by funding bodies across the globe, and heated discussions surround the choice of metrics to assess how scientists fare on these two counts. The development of standards enabling the integration of data across species has been a key step in the development of model organism databases, thus signaling the willingness of researchers to move beyond inter-level integration and towards cross-species comparisons. Similarly, both inter-level and cross-species integration routinely generate results on which new forms of translational integration can be built. The very case of Miscanthus might work in this way if plant scientists actively engage in the search for efficient and sustainable ways to downstream the production of bioethanol, and use these interactions to inform their own bioengineering practices (which seems to be exactly what plant scientists are starting to do, for instance through initiatives such as the UK Plant Science Federation). There are even cases where all three types of integration are attempted simultaneously, such as the current effort to combine transcriptomic data obtained from large groups of plants (an ideal case of cross-species integration) with metabolic profiling, functional genomics, and systems biology approaches (inter-level integration) so as to reveal entire pathways for medicinal products, in ways that promise to revolutionize drug discovery and thus provide a perfect instance of translational integration (De Luca, Salim, Masada-Atsumi, & Yu, 2012, p. 1660).

    Given the multiple and complex interrelations between the three forms of integration that I have identified, one might wonder why it matters to distinguish them at all. The reason has to do with improving existing philosophical understandings of scientific practices and of the temporality and constraints within which research is carried out. Even if these three forms of integration are intertwined in the overall vision of what science is supposed to achieve for humanity, and in the overall trajectory of any one specific research group, their distinctive aims, methods, strategies and norms require that they are taken up to different degrees at any one point in time. All the plant scientists whose work I have discussed here are interested in understanding the biology of specific plants as an integrated whole; comparing different species so as to reach as encompassing an understanding of the plant kingdom as possible (including its evolution, inner diversity and environmental role); and eventually using their research results to address key challenges to human life in the 21st century, such as climate change, urbanization and population increase (the water-food-energy nexus, as defined by the World Economic Forum, 2011, p. 8). Yet, they are typically unable to pursue all of these goals in equal measure at the same time20; and the choices that they make when considering which data to view as relevant to their research, how to integrate those data and which expertises to involve in that process will be crucial factors in determining which form of knowledge they prioritize as the primary outcome of their efforts. In other words: the pursuit of different forms of integration gives rise to different forms of scientific knowledge, whose value and content shift in relation to the goals, expertises and methods involved in each research project.

    20 This is something that many funding bodies, particularly those subscribing to the Im

    Through this analysis, I hope to have shown the relevance of the infrastructure and standards used to integrate data to achieving different epistemic goals and thus different forms of knowledge; and, at the same time, that prioritizing specific epistemic goals over others might lead to structuring data integration, and the infrastructures and standards used to that effect, in different ways. Considering the procedures and standards developed to facilitate data integration provides important clues about the norms, practices and implications of integrative processes, and the epistemic role of the social and institutional contexts in which such efforts take place. Data integration and the production of scientific knowledge, both propositional and embodied, are strictly intertwined: a crucial question for both scientists and phil

