Sequencing through thick and thin: historiographical and philosophical
implications
James W. E. Lowe
Science, Technology and Innovation Studies, University of Edinburgh, UK.
Old Surgeons’ Hall, High School Yards, Edinburgh, EH1 1LZ. United Kingdom.
Email: [email protected]
Abstract
DNA sequencing has been characterised by scholars and life scientists as an example of ‘big’, ‘fast’
and ‘automated’ science in biology. This paper argues, however, that these characterisations are a
product of a particular interpretation of what sequencing is, what I call ‘thin sequencing’. The ‘thin
sequencing’ perspective focuses on the determination of the order of bases in a particular stretch of
DNA. Based upon my research on the pig genome mapping and sequencing projects, I provide an
alternative ‘thick sequencing’ perspective, which also includes a number of practices that enable the
sequence to travel across and be used in wider communities. If we take sequencing in the thin
manner to be an event demarcated by the determination of sequences in automated sequencing
machines and computers, this has consequences for the historical analysis of sequencing projects, as
it focuses attention on those parts of the work of sequencing that are more centralised, fast (and
accelerating) and automated. I argue instead that sequencing can be interpreted as a more open-
ended process including activities such as the generation of a minimum tile path or annotation, and
detail the historiographical and philosophical consequences of this move.
Highlights:
- DNA sequencing is primarily understood through a ‘thin sequencing’ perspective.
- I propose a ‘thick sequencing’ perspective.
- Thick sequencing includes different stages of assembly, evaluation and annotation.
- An alternative picture of the nature and organisation of sequencing is presented.
Keywords:
Genomics; sequencing; physical mapping; DNA; assembly; annotation
1. Introduction
Dominant narratives concerning genomics have hitherto focused on the process and product of
determining the order of bases (adenine, thymine, cytosine and guanine; A, T, C and G) along a given
DNA strand. I call this the ‘thin’ sequencing perspective, and contrast it to a ‘thick’ perspective that
encompasses all of the scientifically and technically important processes, procedures, materials and
stages leading to intermediary and never-quite-complete sequence products. These thick sequences
can potentially be used as a resource by various end-user communities not (necessarily) involved in
the practices leading to the production of those sequences. These practices may include improving
assemblies of sequences by closing gaps and correcting errors and annotating the sequences to
indicate where genes lie on chromosomes. Thick sequencing draws our attention to procedures such
as these just as much as the determination of the raw sequence that thin sequencing concentrates
on; they are vital in ensuring that sequences can be used more fruitfully as a resource. Thick
sequencing therefore calls attention to those processes and methods that do not themselves
constitute DNA sequencing, but condition what is sequenced, how sequence data are compiled into
assemblies and how those data are augmented so that they relay more information than a
long string of bases.
I argue that interpretations of the science of genomics that foreground speed, acceleration, large-
scale operations and automation are a product of a thin characterisation of sequencing that
primarily concerns the procedures that take place in automated machines and associated
computers. In the initial era of whole genome sequencing, sequencing machines were often
arranged in parallel in factory-style centralised genome sequencing centres, an approach pioneered
by J. Craig Venter at Celera and John Sulston at the Sanger Institute (Sulston and Ferry, 2002; Venter,
2008). This sequencing is therefore characterised by large-scale centralised facilities with automated
sequencers, mainly staffed by technical employees. The thin picture of genomics has shifted
somewhat, with the lowering of the cost of sequencing per base pair making work in large
centralised facilities seem less necessary. This work is additionally devolved (for reasons of
convenience or cost) to the corporations that build the machines (e.g. Illumina, Pacific Biosciences)
and service-oriented laboratories. The process is nonetheless mainly automated, fast, and organised
in an industrial way.
‘Thick sequencing’ is based on the idea that there is no final product, and that the work and insight
required to create any publicly available sequence cannot be fully captured under a thin
understanding of sequencing. In this interpretation, sequencing can include creating genome
libraries, establishing a detailed physical map, and producing and validating the statistical tools and
software required for analysis. As much as determining the base order, thick sequencing
encompasses ongoing assembly to increase the size of contiguous stretches of sequence and close
gaps, as well as revision, modification, resequencing of particular areas of interest, improvement of
quality and coverage, and verification and comparison with other sequences. It is about the creation of annotated
sequences indicating the position of genes and other potentially relevant genomic elements, which
is in part driven by the prospective uses of the sequence. Many of these stages require active
interpretation and intervention in the production of data.
Thick sequencing has no fixed referent: it does not denote a particular event, process, object or
project. It is, rather, a concept that encourages scholars of genomics to concentrate efforts on
understanding those aspects of genomics that, as I will demonstrate, are as crucial to the products
and processes of sequencing as the well-studied stages in which sequence reads are
generated and compiled in successive procedures to produce ever larger contiguous stretches of
DNA sequence.1
In developing this distinction between thin and thick characterisations of sequencing, I am building
upon Leonelli’s (2016) work on data, which highlights the importance of understanding the
processes involved in “packaging” data to enable it to be mobilised, integrated and employed by a
variety of potential users. The distinction between thick and thin perspectives does not disrupt the
centrality of data and data practices to our understanding of sequencing. Rather, the practices,
collaborations, infrastructure and data that are included in the thick sequencing perspective are
more varied, complex and networked than those associated with thin sequencing. Furthermore,
reinterpreting sequencing as thick allows us to adopt the rich conceptual apparatus that has been
developed to understand the epistemologies and pragmatics of data-centric or data-intensive
science.2
In this paper, I provide a thick account of the mapping and sequencing of the genome of the
domestic pig (Sus scrofa), encompassing more than just the production of ‘raw’ sequence. To quote
Christopher Tuggle, a pig genome researcher at Iowa State University: “the sequence itself isn’t very
useful, we need to know where the landmarks are.”3 This concern with the usability of the sequence
is allied with current policy directions in funding organisations that aim to improve and accelerate
the translation of genomic data into (usually clinical) outcomes (e.g. Wellcome Trust, 2010; for the
National Human Genome Research Institute, Green et al., 2011).
The pig genome sequencing work that I examine presents a well-resolved distinction between the
thin and thick sequencing perspectives. For example, if we just take the determination of the order
of bases, which was conducted between 2006 and 2009 at the Wellcome Trust Sanger Institute (in
Hinxton, UK), then it looks centralised and automated. But considering the broader conception of
sequencing provides a whole other picture in which a range of institutions contributed, over a longer
timeframe, to both work preceding sequence determination and the development and processing of
the Sanger Institute’s raw sequence. This thick picture of sequencing includes the obtaining of DNA
from several different breeds of pig, construction of four genomic DNA libraries containing clones of
parts of this DNA, physical mapping, distributed revised assembly, sequencing genomic regions of
particular interest in higher resolution, annotation and comparison with human and other species’
genomes. This picture reveals a different organisation of the work and roles of particular skills.
Pig genome sequencing represents a genomics that took established organising principles and
methodologies from prior projects, such as those on mouse, cattle and, especially, human genomes. The early stages of
a new form of work involve a considerable amount of improvisation and trial-and-error. Early
genome projects will therefore not necessarily be representative of sequencing once it became a
1 In asking for further clarification on the thick-thin distinction, a reviewer asked whether 'what Incyte and
Human Genome Sciences were doing with cDNA sequencing would be thin or thick sequencing,' referring to
two private sector genomics companies established in the early-1990s. My answer would be that, as thin
sequencing addresses a sub-set of operations encompassed by a thick sequencing perspective, these
companies were conducting both thin and thick sequencing. Depending on one's scholarly interest in the
workings of these companies, however, either a thin or a thick approach may be more pertinent.
2 In particular, the perspective on data-centric science that has been developed by Leonelli (2016) and others
(Stevens, 2013, for example, regarding genomics) that focuses on the active construction of the means to
produce and circulate data, and the effect of such infrastructures on the status and value of particular data.
This approach holds that understanding the role of data in data-centric areas of biology requires attending to
more than the circumstances of its immediate generation and eventual use.
3 Christopher Tuggle, Skype interview with author, 3rd March 2017.
more established part of biological research (García-Sancho, 2012, especially pp. 21-64). Pig genome
sequencing used the two dominant approaches to sequencing that arose out of the efforts to
sequence the human genome: map-based (hierarchical) shotgun and whole genome shotgun (see
figure 1). The map-based sequencing relied on a physical map produced from 2003 to 2005, and the
whole genome shotgun data supplemented the map-based sequence data. Additionally, the pig
genome community was and is relatively small and it is therefore possible to investigate most of
those who were involved.4 This helps me to avoid the reliance on accounts by prominent people
based in large centres or ethnographic research conducted in those same centres, which
methodologically structures a thin view of sequencing.5
Figure 1 - A simplified depiction of the two chief approaches to genomics during and after the human genome project. On
the left is the hierarchical map-based shotgun approach, which uses a physical map to produce a minimum tiling path to
inform which Bacterial Artificial Chromosome (BAC) clones to sequence and consequently assemble into contigs. BACs and
their yeast equivalent YACs are fragments of DNA sequences – clones – stored within the plasmids (circular DNA) of
microorganisms. On the right is the whole genome shotgun approach in which the DNA is sheared into fragments, which
are sequenced, and then assembled through high-powered computation to calculate the probabilities of overlaps between
fragments.
In expounding upon the thick perspective, I outline an expanded view of sequencing. The explication
of it in this paper concentrates, however, on only one aspect of genomics, the production of
reference sequences. This is only one motivation for sequencing among many, and of one species
among millions. The organisation of work and processes involved in the sequencing of other species,
and for other purposes such as examining biological diversity, tracing evolutionary history, food
testing, functional ecology and forensic investigation, may differ in significant respects from the
4 It was certainly considerably smaller and more cohesive than the human genome community. Compared with
the Swine Genome Sequencing Consortium, however, the sequencing consortia associated with the human
and mouse genomes (for example) had different histories, organisation and composition, and therefore the
data concerning their size and composition are not comparable. With colleagues who are working with data on
human and yeast sequence submissions to the European Nucleotide Archive database, I am currently working
on quantitative and qualitative analyses of data derived from pig sequence submissions to the same database,
with a view to characterising the communities and networks involved in sequencing (but not necessarily whole
genome sequencing) for each of the three species.
5 See García-Sancho (2016), on a possible strategy to avoid this in historical research on human genomics.
account detailed in this paper. I make no claims for the representativeness either of the pig as the
subject of sequencing, or of the production of reference sequences as its object. What I do aim to do
is to demonstrate the power of the distinction I present and the possibilities opened up by taking a
thick perspective on sequencing. Most pertinently, the thick perspective has the potential to
stimulate an examination of all of the relevant practices and operations associated with sequencing
conducted for different purposes, which would help to underpin more fine-grained comparative
analyses between them.
This paper is based on archival research, including in Alan Archibald’s personal papers; documents
and emails sent to me by key participants such as Lawrence Schook; examination of published
materials; and a series of oral history interviews that I have conducted with members of the pig
genetics community and people who worked at the Sanger Institute.
I begin the paper with a discussion of the historiography of genome sequencing, before providing
first an historical background to pig genomics, and then a detailed account of the sequencing of the
pig genome. I demonstrate that the range of actors, practices and outcomes that the thick
perspective covers is broader than that encompassed by thin sequencing. Throughout I will point
to the historiographical and philosophical consequences of adopting a thick approach and including
certain practices and processes in narratives of sequencing projects.
2. Historiographical background
The historiography of sequencing has been understandably dominated by the human genome
project, and the practices, institutions, and actors associated with it. The human genome project
lends itself to thin interpretations of sequencing due to the prominent role of the so-called G5
sequencing centres and the private company Celera.6 As a result of its scale and salience, human
genome sequencing has enabled the thin perspective to dominate scholarly interpretations of
sequencing in general. The account of the sequencing of the pig genome in the rest of this paper is
intended to help supplement and broaden the historiography that has been shaped considerably by
human genome sequencing.
A common theme in accounts of the human genome project by scholars and participants alike was
that this enterprise imported ‘big science’ and its associated characteristics into the biological
sciences (Collins et al., 2003; Davis and Colleagues, 1990; Glasner, 2002; Hilgartner, 2013). Big
science is characterised by the use of “large, expensive instruments, industrialization, centralization,
multi-disciplinary collaboration, institutionalization, science-government relations, cooperation with
industry and internationalization” (Vermeulen, 2016, pp. 199-200; see also Galison & Hevly, 1992).
The effort to sequence the human genome has been compared to the Manhattan Project (Lenoir
and Hays, 2000) and the US Space Program, in that “an immense, generalized capacity for technical
action has been created” by the establishment and evolution of institutions, the training and
deployment of personnel, and the development of techniques, instruments and protocols (Barnes &
Dupré, 2008, p. 43). Sequencing the human genome, after all, involved large teams working towards
an ambitious goal.
6 I have chosen not to capitalise the words of the ‘human genome project’ to reflect that there was no such
single organisational entity as the ‘Human Genome Project’ responsible for the sequencing, but a shifting
collaboration of laboratories, centres, and funding and coordination initiatives.
Sequencing centres sought to industrialise the processes, and the focus was on the improvement of
the efficiency of production and pipelines.7 This implied a greater role for automation,
standardisation and improving the flow from one part of the process to the next (Hilgartner, 2013;
Stevens, 2011). There was not a uniform approach to sequencing the human genome, however, with
two different approaches pursued by the ‘official’ public project and the main private sector
initiative. The preference for hierarchical shotgun sequencing on the part of the ‘official’ human
genome project allowed sequencing to be coordinated, with different centres sequencing different
parts of the genome. This also permitted laboratories to try alternative methods and strategies
(Bostanci, 2004, pp. 169-170). It therefore allowed different research interests and capabilities
between laboratories and across nations to be accommodated.
As the human genome project proceeded, “automated machine rooms were established in a
triumph of organization and routinization” (Barnes & Dupré, 2008, pp. 42-43). The automated
sequencers were deemed crucial to reducing human intervention and, later, to making
sequencing more ‘efficient’ by being more cost-effective and requiring less (skilled) labour-intensive
work (García-Sancho, 2012; especially pp. 131-143 and 163-168). An interpretation of this is that
sequencing came to rely less on the labour of highly skilled scientists, and more on the routinised,
standardised labour of technicians trained to operate machines developed and built by companies
such as Applied Biosystems. Interestingly, however, rather than de-skilling sequencing, the
establishment of large high-throughput genome centres has tended to foster the development of
new skilled work. For instance, the sequencing process itself still requires highly skilled “careful
laboratory work, testing, and judgment calls” (Stevens, 2013, p. 112). The work involved in creating
and maintaining data infrastructures and pipelines also requires highly skilled teams (Leonelli, 2016).
The rapid automation of sequencing, and the remarkable development of newer machines able to
sequence longer sections of DNA, often increasingly in parallel, increased the speed of sequence
production, and contributed to a dramatic decline in the cost of sequencing.8 This acceleration has
led some scholars to characterise the human genome project as ‘Fast Science’ as well as ‘Big Science’
(Fortun, 1999). The speed and quantity of data production led many areas of the biosciences to
develop means to store, manage, make accessible and make sense of the produced data. As a
consequence, computers and other information technological infrastructure became central to the
storage, transmission and management of the large amounts of data generated through sequencing
work (García-Sancho, 2012; Stevens, 2013; Strasser, 2011).9 In addition, software was developed to
enable manual and automated approaches towards the handling, integration, analysis, comparison
and interpretation of data.
In some areas, this has led to new forms of data-centric science, in which the practices, organisation
of research and the formulation of knowledge claims are reshaped to make use of the large amounts
of genomic information that has become available. Various scientific communities have developed
7 With respect to the narrative of ‘industrialisation,’ Bartlett (2008), p. 99, comments that “The Human
Genome Project appears, therefore, to be an island of Modernity in a perceived sea of post-Fordism, post-
industrialism, and post-Modernism.” However, as Stevens (2013), pp. 86-105, has shown, large-scale
industrialised sequencing centres such as the Broad Institute are very much post-Fordist in their organisation
of work and space.
8 See, for example, https://www.genome.gov/sequencingcostsdata/
9 In this paper I take no sides in the debate over the extent to which the importation of computers and
information technological practices into biology has shaped the reconfiguration of biological research towards
the production and handling of certain forms of data suited to those technologies and associated practices
(Chow-White and García-Sancho, 2012; Lenoir, 1999; Leonelli, 2016; Stevens, 2013).
standards for the production, labelling and circulation of data, and its entry into data infrastructures
such as databases and ontologies (Leonelli, 2016).
As a result of the nature of the organisation of the human genome project, sequencing has been
conceived as an activity centred on large international collaborations. Despite this, the
collaborations could also be characterised as more networked and decentralised (or more locally
centralised on particular instruments) than projects in ‘big physics’, which typically require
instruments that are orders of magnitude vaster in scale, cost and associated organisational
complexity. Access to information infrastructures and the data contained therein is also widely
dispersed and decentralised (Vermeulen, 2016, pp. 204-205).
Changes in the attribution of credit have been associated with the development of large
collaborations, most strikingly the dramatic increase in the number of authors listed on sequencing
papers (Glasner, 2002). This is certainly the case for the pig sequencing that will be explored in this
paper, with the thick sequencing perspective in particular revealing extra dimensions to
international collaboration beyond the steering committee of the formal Swine Genome Sequencing
Project through which the thin sequencing was coordinated. The thick perspective reveals more
actors, and different regional organisational patterns among those actors. For instance, there were
clusters of collaborative networks associated with the overall project that contributed particular
aspects, for example the sequencing of cDNA to aid with annotation, a task performed by Japanese
scientists.
Much of the historiography of genomics has been based on the sequencing of the human genome,
due to the scale and political and cultural salience of this enterprise. However, examining other work
beyond this will be important to furthering our understanding of the range, nature and development
of genomics and sequencing. Examining the sequencing of other organisms can help us identify
different ways in which the organisation and conduct of this work can occur (on yeast, for example,
see Parolini, 2018). To that end, I first provide some background to genomic research involving pigs
before detailing sequencing work involving those animals, and how its organisation differs from the
large-scale, automated and centralised models of human genomics. As will be shown, a thick
approach to assessing the sequencing of the pig genome allows one to elucidate different models of
how genomics is and has been organised, an approach which may be applicable for genome projects
involving other species.
3. Pig genomics – background
Collaborative projects to systematically investigate the pig genome began in the early 1990s. One
project was based at the United States Department of Agriculture Meat Animal Research Center
(USDA-MARC) and another was funded by the extramural Cooperative State Research, Education,
and Extension Service of the USDA and took place largely in universities. In Europe, there was a
collaboration funded by national agencies, ministries and the European Commission (the Pig Gene
Mapping Project; PiGMaP). Initiatives also involved groups in Japan, Korea, Australia and
Scandinavia. The aim was to identify genetic markers on each of the 20 distinct chromosomes of the
pig and locate them, primarily to produce maps that could be used to advance the detection of
Quantitative Trait Loci (QTL): areas of the genome linked to quantitative variation in phenotypic
features such as fatness or meat quality. The idea was that once these were identified and mapped,
breeders could use markers to select for or against traits with greater precision than previous
livestock improvement efforts.
There are two main means of mapping QTL: genetic (or linkage) mapping and physical mapping.
Genetic mapping allows one to ascertain the relative order of genes (or genomic markers) in linkage
groups – i.e. areas of the chromosome whose genes tend to be inherited together (see figure 2).
Physical mapping can ascertain the precise positions of genes and genomic markers on
chromosomes (see figure 3).
Figure 2 - Linkage maps for three pig chromosomes, taken from a 1995 paper authored by key participants in the European
Commission funded PiGMaP consortium, together with collaborators outside Europe (Archibald et al., 1995).
Figure 3 - Physical map of 12 pig chromosomes depicting the exact positions of markers. In the early 1990s, ‘physical’ and
‘cytogenetic’ were both used for this form of work. The map is from a 1995 paper authored by key participants in the
European Commission funded PiGMaP consortium, together with collaborators outside Europe (Yerle et al., 1995).
Genetic mapping primarily involved the identification of microsatellites (what are called type II
markers), regions of repetitive sequences that are highly variable across individuals. Different breeds
were crossed to maximise the chances of revealing polymorphisms, for example different lengths of
repetitive sequence, to enable the degree of linkage between markers to be ascertained. The results
were then integrated into databases, and linkage relationships and groups were identified using software
developed and adapted for the purpose. The assignment of groups of linked type II markers to
relative positions on chromosomes was aided by the physical mapping of predominantly type I loci
to regions of the chromosome – type I loci are known genes linked to variation in particular
phenotypic traits. Genetic and physical mapping therefore generated and related sets of markers
derived from different techniques. Where they produced markers in common, these would be used
to integrate different maps (e.g. Rohrer et al., 1996).
The collaborative work required to make maps using genotyping data generated in different
locations and to integrate the resultant maps helped to consolidate an international network of pig
geneticists. From the 1990s, the members of this network met and coordinated their efforts at
international events, such as the annual Plant and Animal Genome conference and the meetings
of the International Society of Animal Genetics, in addition to more regular contact between core
members of the community by email, telephone and teleconference. The community was further
nourished by the creation and distribution of genomic resources such as the primers made available
by Max Rothschild, then pig coordinator of the USDA’s National Animal Genome Research Program,
and the IMpRH radiation hybrid panel produced by a collaboration between L'Institut National de la
Recherche Agronomique (INRA) in France and Lawrence Schook’s group at the University of
Minnesota in the US.
In the early 2000s, this international community of pig geneticists sought to secure funds to produce
a reference sequence of the pig genome. At first, this was under the auspices of an ‘agricultural
genome’ mooted at around the same time (e.g. Pool & Waddell, 2002). The agricultural orientation
reflected the research agenda of much of the community and their prior and ongoing funding
streams, which were concerned with producing data, tools and methods to be able to identify genes
and other markers that could then be used in programmes of selective breeding in the livestock
industry. Some of these researchers combined this work with an interest in the biology of
domestication.
The community soon shifted its arguments, however, towards securing National Institutes of Health
(NIH) funding. The NIH were looking to fund the sequencing of a mammalian genome to aid in the
analysis of the human genome. A White Paper published by key figures in the pig genome research
community therefore cited the genetic similarity between pigs and humans and emphasised the
potential of the pig as a model for biomedical research (Rohrer et al., 2002). Although they were
unsuccessful in acquiring NIH funding, the positioning of the pig as a biomedical model by members
of the pig genome research community has continued (e.g. Groenen et al., 2012; Kuzmuk and
Schook, 2011; Schook et al., 2005a).
The efforts to advance research into the pig genome built upon prior work on the human genome.
Participants in human genome mapping and sequencing efforts also attended pig genome mapping
meetings, for example Peter Marynen at a meeting of European pig genome mappers in Ghent in
1993 and Aravinda Chakravarti at the First International Workshop on Swine Chromosome 7 in
Wisconsin in 1995 (Chakravarti, 1996).10 Alan Archibald, based at the Roslin Institute near Edinburgh
in Scotland and the co-ordinator of PiGMaP, served on the Co-ordinating Committee of the Medical
Research Council-funded UK Human Genome Mapping Project (HGMP) in the 1990s.11 The
techniques and standards employed in human genome mapping were adopted and adapted by the
pig genome research community, for instance by using the same kind of nomenclature and adhering
to the Bermuda Principles, Fort Lauderdale agreement and the Toronto statement regarding the
release of data (Archibald et al., 2010). The more comprehensive maps of human genes and
10 Marynen is listed as a speaker in: Chris Haley (ed.) ‘4th EC PiGMaP MEETING 17-19 June 1993 HET PAND,
University of Ghent, Belgium’ report, in Alan Archibald’s personal papers.
11 ‘MRC HGMP Co-ordinating Committee’, page 27, G Nome News, Number 16, February 1994, edited by Nigel
K. Spurr and originally published by the UK Human Genome Mapping Project. Cold Spring Harbor Laboratory
Archives Repository, Identifier SB/9/2/54. Available online at: http://libgallery.cshl.edu/items/show/75888
Accessed 20th October 2017.
sequences of human DNA were used as an important basis of comparison for the pig mappers, and
probes containing human DNA were used to identify markers in the pig genome. Comparisons took
advantage of the evolutionary relatedness and conservation among mammals, and therefore the
relative similarity of their DNA.
In September 2003, the community’s efforts to coordinate the strategy and funding of pig genome
sequencing led to the formation of the Swine Genome Sequencing Consortium (SGSC). The meeting
to launch the SGSC was held at the INRA facility in Jouy-en-Josas near Paris and co-hosted by INRA
and the University of Illinois Urbana-Champaign. The co-hosting duties recapitulated the
partnership that had produced the IMpRH panel, as Lawrence Schook had moved to the University
of Illinois Urbana-Champaign from the University of Minnesota in 2000. After attempts to obtain
funds from the NIH failed, funding for the SGSC was acquired from institutions that had previously
sponsored pig genetic research. These were funders of agricultural rather than biomedical research
and included the Biotechnology and Biological Sciences Research Council (BBSRC) and the
Department for Environment, Food and Rural Affairs (DEFRA) in the UK; the European Union; and
the US Department of Agriculture (USDA), the National Pork Board, the Iowa Pork Board, the North
Carolina Pork Council, Iowa State University and North Carolina State University in the United States.
On the award of $10 million towards the sequencing of the swine genome, USDA Under
Secretary for Research, Education and Economics, Joseph Jen, commented that “[b]y decoding the
sequence of the pig genome, scientists can explore new ways to improve swine health and to
increase the efficiency of swine production.”12 While biomedical applications were still hoped for,
agricultural research priorities now dominated the sequencing project.
The SGSC had decided that the pig sequencing would be primarily map-based and hierarchical, with
some additional whole-genome shotgun sequencing to provide some data for the final genome
assembly. The first task was to construct a high resolution physical map of the pig genome. There
was continuity in the personnel involved in previous projects to map the pig genome, and producing
a high resolution physical map of the precise position of markers drew upon previous work in
genetic, physical and comparative mapping.
A comprehensive physical map of the pig genome was produced, and clones from four genome
libraries were sent to the Sanger Institute for hierarchical map-based shotgun sequencing. There was
also therefore continuity between this new mapping work and sequencing. Sequencing understood
thinly took place primarily at the Sanger Institute from 2006 to 2009. A thick approach to
sequencing, however, allows me to expand the historical narrative beyond the Sanger Institute.
4. Pig genomic sequencing
A thick approach to sequencing provides the opportunity to identify all of the steps in a particular
sequencing process. In doing so, there is also the potential to foreground the iterative and recursive
qualities of sequencing. Therefore, while I will detail my account of the different aspects of thick
sequencing in separate sections, the different processes can and in some cases do overlap.
Furthermore, when one moves beyond specific projects such as the one that took place at the
Sanger Institute, the linearity and unidirectionality of the sequencing process is challenged still
further.
12 Joseph Jen, quoted in ‘USDA AWARDS $10 MILLION TO SEQUENCE THE SWINE GENOME’, USDA News
Release, Washington DC, January 13th 2006. Found in Alan Archibald’s personal papers.
4.1. Elucidating a minimum tile path
An attention to the thickness of sequencing helps break down the distinction between sequencing
and other forms of work such as gene mapping. Scholarly literature on genomics has paid close
attention to the practices and conceptual inputs, developments and implications associated with
both mapping (e.g. Gaudillière & Rheinberger, eds., 2004; Hogan, 2014; Rheinberger & Gaudillière,
eds., 2004) and sequencing (e.g. Barnes & Dupré, 2008; García-Sancho, 2012), yet this work has still
largely been partitioned. Here I include the production of a physical map of the pig genome in my
thick discussion of the sequencing.
Funding for the production of a high-utility integrated physical map was provided by grants
awarded to Alan Archibald at the Roslin Institute by the BBSRC, DEFRA, the private company Sygen
and the Roslin Institute itself. Funds were available from 2003 to 2005 to enable the work to take
place. Two programmes of the USDA also provided support. Archibald first approached the chief
executive of the BBSRC with a proposal in August 2000, after consultation with figures in the USDA’s
Agricultural Research Service, prompted by the announcement by Wes Warren of Monsanto that the
company were developing BAC contig maps for swine and cattle. Archibald and colleagues
emphasised the importance of ensuring such maps were in the public domain, and detailed the
potential uses of the data, including “improving the resolution of trait gene mapping” in part by
being better able to characterise and then map Single Nucleotide Polymorphisms (SNPs) that may
then be associated with variation in traits of interest.13
To aid with mapping efforts, researchers created first yeast artificial chromosome (YAC) and later bacterial artificial
chromosome (BAC) libraries of cloned pig DNA in several laboratories around the world during the
late 1990s and early 2000s.14 Four BAC libraries were used in the construction of the high-resolution
physical map used in the sequencing of the pig genome. Two (CHORI-242 and RPCI-44) were from
the Children’s Hospital Oakland Research Institute BACPAC Resources Center in the United States,
led by Pieter de Jong.
The CHORI-242 BAC library was produced by Baoli Zhu from the DNA extracted from the white blood
cells of a single Duroc (a North American domestic breed) sow named TJ Tabasco, who was born at
the University of Illinois at Urbana-Champaign in 2001. The cloning was conducted according to a
protocol developed in de Jong’s laboratory (Osoegawa et al., 1998). The clones were inserted into a
vector (a DNA construct) called pTARBAC1.3, and then E. coli cells were transformed to host the
vector and the cloned pig DNA contained within. The CHORI-242 BAC library incorporates nearly
200,000 recombinant clones and was preferentially sequenced mainly due to its greater coverage of
the genome (the overlapping DNA fragments contained in it were equivalent to 11 whole pig
genomes). The other library developed in Oakland, RPCI-44, was funded by USDA-MARC and
constructed by Chung-Li Shu. DNA for this was isolated from the white blood cells of four boars
(each a cross of the Yorkshire, Landrace and Meishan breeds).
The third library, PigE BAC, was constructed and developed in the UK by Susan Anderson and Alan
Archibald at ARK-Genomics, a unit of the Roslin Institute, and distributed from the Human Genome
Mapping Project Resource Centre in Hinxton. The DNA was derived from the white blood cells of
male crosses between Chinese Meishan and European Large White pigs (Anderson et al., 2000).
Finally, the INRA Porcine BAC library from Laboratoire de Radiobiologie et d'Etude du Génome
(LREG) at INRA in France was constructed using DNA from the skin fibroblasts (connective tissue cells
that synthesise collagen and other fibres) of a Large White male. The group was primarily interested
in identifying retroviral elements, viral sequences incorporated into porcine DNA that it was thought
could infect humans if pig tissues were transplanted into them for therapeutic purposes (Rogel-Gaillard
et al., 1999). All four libraries involved the transformation of E. coli bacteria to host the
libraries of clones.
13 Alan Archibald, 17th August 2000, ‘International Farm Animal Genome Projects’, in Alan Archibald’s personal
papers.
14 Initially, YAC libraries were developed due to the large numbers of recombinants required in BAC libraries.
Due to problems with YACs, however, including their stability, chimerism and the presence of repeat
sequences in the yeast genome, it was eventually decided to develop and use BAC libraries (Gary Rohrer,
Skype interview with author, 30th March 2017).
The BAC libraries were sent to the Sanger Institute, which was contracted to perform the majority of
the physical mapping work. Work was also conducted at The Keck Center for Comparative and
Functional Genomics at University of Illinois at Urbana-Champaign under the auspices of the
Livestock Genome Sequencing Initiative. Genoscope, the French national sequencing centre,
sequenced the BAC-ends of the INRA BAC library.
The clones contained in the BAC libraries were digested with the restriction enzyme HindIII. The
fragments thus generated were fingerprinted by electrophoresis on agarose gels.15 This process
involves running an electric current through the gel, separating the negatively charged DNA
molecules according to size. Banding patterns produced and detected by a fluorimager as well as
images entered into a fingerprint database were used as inputs into the software programme
WebFPC to identify overlaps between fragments from different clones. Through using this
programme, the 267,884 individual fingerprints were initially assembled into over 12,000 contigs,
fragments containing unbroken stretches of base pairs. To reduce the number of contigs while
increasing the average size, several procedures were used.
Firstly, sequences comprising an average of 707 bases at the end of each cloned fragment were
determined (Groenen et al., 2012). These BAC-end sequences (BES) were deposited in the Ensembl
and GenBank trace repositories, which stored raw data. Sequencing is therefore also a key part of
this form of mapping, as it is with others. By aligning the BES with the human genome, using the
database searching programme BLASTN, they were able to order them and thus merge contigs. This
stage drew heavily on the established structural and sequence similarity between pigs and humans,
and upon detailed prior studies of the synteny (the conservation of blocks of genomic order
between two chromosomes) of pig and humans. Secondly, the statistical thresholds used in
calculating the overlapping of clones and the merging of contigs could also be relaxed to merge the
remaining contigs still further. As Alan Archibald put it to me, however, “you don’t want to produce
a humanised pig genome,” so contigs were only joined if already supported by the fingerprint data.16
Through these procedures and others involving the use of radiation hybrid maps, the initial
thousands of contigs were reduced to 172, greatly increasing the contiguity of the map (Humphray
et al., 2007).
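The logic of building contigs from fingerprints can be illustrated with a minimal sketch in Python. It is not the WebFPC algorithm: it simply counts restriction bands of similar size shared by two clones and groups clones that share enough of them, whereas the software described above applies statistical thresholds to the overlaps. The clone names, band sizes, tolerance and cut-off are all hypothetical.

```python
from itertools import combinations


def shared_bands(a, b, tolerance=2):
    """Count bands in fingerprint a that match a band in b within the tolerance."""
    remaining = sorted(b)
    count = 0
    for band in sorted(a):
        for i, other in enumerate(remaining):
            if abs(band - other) <= tolerance:
                count += 1
                del remaining[i]  # each band in b can be matched only once
                break
    return count


def build_contigs(fingerprints, min_shared=3, tolerance=2):
    """Group clones into contigs: clones linked by enough shared bands end up together."""
    parent = {name: name for name in fingerprints}

    def find(x):
        # Follow parent links to the representative of x's group.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(fingerprints, 2):
        if shared_bands(fingerprints[a], fingerprints[b], tolerance) >= min_shared:
            parent[find(a)] = find(b)

    groups = {}
    for name in fingerprints:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical fingerprints: band sizes in arbitrary units.
example = {
    "clone_A": [100, 150, 210, 300, 420],
    "clone_B": [150, 211, 299, 500, 640, 700],
    "clone_C": [499, 641, 700, 820],
    "clone_D": [50, 75, 333, 910],
    "clone_E": [51, 76, 600, 980],
}

print(build_contigs(example, min_shared=3))
print(build_contigs(example, min_shared=2))
```

Lowering `min_shared` plays the role of relaxing the threshold: more clones are merged into fewer contigs, at the cost of a greater risk of false joins, which is why supporting evidence was required before contigs were merged.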
Physical maps are important tools and resources in and of themselves. In pig research, they have
been developed and used for the identification of QTL. Either the molecular basis of the QTL can
then be investigated, or genetic markers situated close to the QTL identified. Pigs can then be
genotyped for these and other markers, and these data can be used to inform which pigs to
incorporate in selective breeding programmes. Identification and mapping of QTL and associated
genetic markers are key elements in practices aiming to effect phenotypic improvement in
populations.
15 Not to be confused with the DNA fingerprinting developed by Alec Jeffreys of the University of Leicester,
which is most famously used in forensic science and the determination of paternity. Jeffreys, incidentally, had
some minor involvement in the early years of pig gene mapping. His colleague Esther Signer was a key
participant and contributor to PiGMaP.
16 Alan Archibald, interview with author, Roslin Institute, 17th November 2016.
The physical mapping just described had its uses for these purposes, but was also an integral part of
the overall project of sequencing. Through using “information about the extent of clone overlaps
derived from the finger-print data and re-assessing the relative positions of paired BES alignments to
the human genome,” the physical mappers were “able to optimize the selection of an initial tilepath
of minimally redundant clones through assembled clone contigs across the pig genome” (Humphray
et al., 2007). They thus enabled the sequencers to identify the clones to be sequenced, optimised
the sequencing operation (using the minimum number of clones) and helped to assemble the
sequenced fragments.
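The idea of a tile path of minimally redundant clones can be sketched as a classic greedy interval cover. This is only an illustration under simplifying assumptions: each clone is reduced to a hypothetical start and end coordinate on a contig, whereas the mappers worked from fingerprint overlaps and BES alignments rather than known coordinates. The function and clone names are invented for the example.

```python
def minimal_tile_path(clones):
    """Pick a small set of overlapping clones spanning the mapped region.

    clones maps a clone name to a (start, end) pair. At each step the greedy
    rule takes, among clones starting at or before the covered position, the
    one that reaches furthest.
    """
    if not clones:
        return []
    ordered = sorted(clones.items(), key=lambda item: item[1][0])
    position = ordered[0][1][0]
    target = max(end for _, end in clones.values())
    path = []
    i = 0
    while position < target:
        best = None
        # Consider every not-yet-examined clone that starts within covered territory.
        while i < len(ordered) and ordered[i][1][0] <= position:
            if best is None or ordered[i][1][1] > best[1][1]:
                best = ordered[i]
            i += 1
        if best is None or best[1][1] <= position:
            raise ValueError("gap in clone coverage at position %d" % position)
        path.append(best[0])
        position = best[1][1]
    return path


# Hypothetical clone coordinates (in kilobases) along one contig.
example = {
    "bac1": (0, 150),
    "bac2": (40, 180),
    "bac3": (100, 260),
    "bac4": (170, 300),
    "bac5": (250, 400),
}

print(minimal_tile_path(example))
```

Clones that are skipped (here those lying wholly within the reach of a longer neighbour) are the redundancy that the tile path is meant to avoid sequencing.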
Sequencing has been described as simply the production of an “ultimate map,” more finely grained
than is possible by genetic or physical mapping (McKusick & Ruddle, 1987; McKusick, 1991 and
1997). When physical maps are used as resources for sequencing, there is therefore no firm
distinction between the work of sequencing and mapping; sequencing is a form of mapping. As
historian Soraya de Chadarevian observes, in the same database for the nematode worm
Caenorhabditis elegans, “a simple click on the mouse allows users to move from a locus on the
genetic linkage map to its representation on the physical map and on to the sequence of the
corresponding gene or, vice versa” (de Chadarevian, 2004, p. 95). This integration of different kinds
of representation, as well as the value added to the earlier maps by the newer sequence data,
however, came despite key differences between linkage and physical mapping and between
mapping and sequencing that can be attributed to the different cultures, institutions and
organisation of researchers involved. The different forms of mapping and sequencing are thus still
considered by de Chadarevian to be distinct domains of activity with different products. A thick
perspective enables one to encompass these different activities and cultures under one analytical
umbrella.
4.2. Determination of ‘raw’ sequence
The thin sequencing perspective focuses on the determination of a ‘raw’ sequence of DNA bases.
Frederick Sanger and his colleagues pioneered the sequencing of DNA in the 1970s.17 ‘Sanger
sequencing’ was the most prevalent technique used in sequencing before the development of ‘next-
generation’ sequencing methods in recent years. It was based on the ‘chain-termination’ or ‘dideoxy’
technique. Initially, this was a laborious process that took a great deal of skill and time. From the
early 1980s, efforts were underway to automate this process to improve the practicability of
sequencing genomes of organisms.
17 This is a simplified account of the history of DNA sequencing; for more historical and analytical detail,
including the history of sequencing before the advances mentioned here, see García-Sancho (2012), pp. 21-64.
Additionally, Onaga (2014) recovers the contribution of Ray Wu to early sequencing techniques.
Sequencing has historically depended on the sequencing of fragments of DNA rather than whole
strands, and this is true currently, although methods to sequence whole strands have been proposed
and are in development. Some automated sequencing machines can sequence longer fragments,
though they are typically more expensive to use. This technical limitation means that approaches
have had to be developed to sequence relatively small stretches of DNA at a time, and then
integrate those sequenced stretches or fingerprints to produce ever larger and fewer contigs.
In hierarchical map-based shotgun sequencing, the chromosomes are cut up into pieces of around
100,000 to 150,000 base pairs, which are then inserted into BACs. The clones from these BACs are
then cut up using enzymes, and the fragments are then sequenced. Finally, the order and location of
the fragments is determined by an automated assembly method in which a computer programme
identifies complementary sequences of DNA exposed at the end of the fragments produced by the
enzymatic cutting. With the sequenced fragments placed in order, the sequence of the larger piece
of chromosome is now known. In whole-genome shotgun sequencing, the genome as a whole is cut
into small fragments, which are sequenced and then reassembled back into a whole genome. In the
competing projects to sequence the human genome around the turn of the century, advocates of
the map-based approach cited its accuracy, while partisans of whole-genome shotgun emphasised
its speed (Bostanci, 2004, pp. 172-173; Brown, 2006, pp. 119-124; Wade, 2001, pp. 81-84). There
were also deeper differences based on different organisational models and moral economies as well
as different conceptions of the nature and structure of the genome which affected what partisans on
either side saw as a feasible or valid approach (Bostanci, 2004, pp. 169-172).
The competition between two different models of sequencing demonstrates that choices, within
particular material, social, disciplinary, political and policy constraints, have been made as to how
thin sequencing is conducted. These choices have consequences in terms of the organisational and
technical models by which they are realised.
In the case of the thin-focused sequencing for the pig genome conducted at the Sanger Institute, the
basic approach outlined in a 2005 paper and described in more detail below was followed
throughout (Schook et al., 2005b). The technical instantiation of that approach, however, changed
over the course of the Sanger Institute’s contribution to the project. The technology platforms
changed, but also the organisation of the work, the latter inspired by the demands of sequencing the
whole genome of an organism with 36 autosomes (18 pairs of non-sex chromosomes), with limited funding and
time available. Chief amongst these changes to the organisation of the work was a shift to a stricter
division of labour, and greater automation of certain tasks, including the use of robot colony-pickers.
Even at the point at which the pig genome was being sequenced, therefore, there was scope to
automate previously non-automated tasks, and to institute changes to make the organisation of the
sequencing work more like that of a factory than a traditional laboratory.18 At the time of the pig
genome project, the Sanger Institute research and development team would develop bespoke
protocols appropriate to the genome being sequenced.19 Since then, their aim has been to generate
protocols and processes that are more generic and widely-applicable, and therefore standardised.
Considered thinly, there has been a tendency in sequencing towards the greater standardisation of
protocols and procedures, more factory-based organisation in which individual tasks are separated
in space and conducted by individuals working only on that particular task and increased automation
of particular tasks. However, these tendencies are uneven and partial. They reflect particular
decisions, made on grounds of finance, policy, community standards and interests, disciplinary
make-up, intellectual and practical aims, challenges and outputs, relationships and other factors.
Manual work requiring experience and skill is interspersed throughout automated work. Sequencing,
even considered thinly, is “an active process of extraction and construction shot through with
difficult manual tasks and active judgment calls” (Stevens, 2013, p. 115).
18 Stephen Hilgartner, writing on the human genome project, argues that the promoters of genome projects
aimed to build large-scale specialised genome centres with factory-like organisation precisely to carve out a
domain separate from molecular biology and genetics conducted in smaller-scale laboratories, so as to appear
unthreatening to the existing organisational modes and moral economies in those disciplines (Hilgartner, 2017,
especially chapter 4).
19 This remains the case for organisms with genomes with potentially problematic properties, for example
Plasmodium falciparum. This example and the information about the protocol for the pig was given to me by
Carol Churcher, Head of Sequencing Operations at the Sanger Institute from 2008 to 2011, interview with
author, Wellcome Trust Sanger Institute, 9th March 2017.
Based on the experience of the physical mapping, the relationships that had developed, the fact that
the Sanger Institute already had the clones and the ability of the USDA to fund work outside the US,
the formal sequencing project was also to take place at the Sanger Institute. Several respondents
described this as “logical,” although according to correspondence and proposals dating from 2004 in
Alan Archibald’s personal papers, the possibility of having Baylor Human Genome Center in the US
lead the project and conduct assembly and annotation of sequence data generated by them and five
other centres was briefly considered.20 To apprehend why the choice of the Sanger Institute as the
sequencing centre with a different model to that of Baylor seemed logical, one therefore needs to
turn back to the physical mapping, and so here the thick perspective is valuable. Once mapping is
brought into the picture, the number of actors, institutions and practices multiply beyond the Sanger
Institute and a centralised model based on it being the main sequencing centre for this project. The
thin part of the sequencing was therefore dependent upon activities included in the thick
interpretation of sequencing.
Following the model of the human genome project, the idea was to determine the sequence of base
pairs using the BACs that were the shortest route through the physical map, the minimum tile path.
This constituted 98.3% of the physical map. Once again, for this part of the process, corresponding
to the ‘thin sequencing’, the four BAC libraries from the US, UK and France were used. In addition, a
fosmid library (in which the DNA is inserted into a vector based on a circular bacterial DNA molecule called the
F-plasmid) was used, which incorporated DNA from the same sow used to construct the CHORI-242
library. Once again, clones from that library were preferentially used.
The (Sanger-method) sequencing was capillary-based rather than gel-based, which obviated the
need for gel pouring or lane tracking (detecting the lanes in the gel was an extremely thorny
problem for computers and therefore required human intervention to help the software read the
gels). The DNA fragments pass through a capillary tube. Each chain-terminating base is tagged with
a fluorescent dye whose colour depends on the base, and these dyes fluoresce when hit with a laser. A camera
records this and “traces” in the form of graphs of the four different colours are transmitted and
recorded, and a computer programme detects the peaks and therefore assigns a base. Paired-end
sequencing was employed, which meant that each fragment was sequenced from both ends.
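The step from traces to bases can be illustrated with a deliberately naive sketch: at each time point the strongest of the four colour channels is found, and a base is assigned wherever that channel shows a local peak above a noise floor. The intensities, the threshold and the function name are hypothetical, and production base-calling software handles problems (uneven peak spacing, overlapping peaks, confidence scores) that are not attempted here.

```python
def call_bases(trace, min_height=10):
    """Assign bases from four intensity channels sampled at the same time points.

    trace maps each base letter to a list of intensities of equal length.
    """
    length = len(next(iter(trace.values())))
    calls = []
    for t in range(1, length - 1):
        # Strongest channel at this time point.
        base, signal = max(((b, ch[t]) for b, ch in trace.items()), key=lambda p: p[1])
        channel = trace[base]
        # Call a base only at a local peak that rises above the noise floor.
        if signal >= min_height and channel[t - 1] < signal >= channel[t + 1]:
            calls.append(base)
    return "".join(calls)


# Hypothetical trace with one clear peak per channel.
example = {
    "A": [0, 5, 40, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "C": [0, 0, 0, 0, 6, 35, 4, 0, 0, 0, 0, 0, 0],
    "G": [0, 0, 0, 0, 0, 0, 0, 3, 50, 2, 0, 0, 0],
    "T": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 45, 3],
}

print(call_bases(example))
```

Even in this toy form the noise threshold is a judgement call, which hints at why human checking of automated output remained part of the work.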
In addition to the map-based approach, some whole-genome shotgun data were generated. Some of
these data were incorporated into the assembly of the pig genome that was heralded with a paper
published in Nature in 2012. The paper analysed the evolutionary implications of the data, and
attempted to demonstrate the usefulness of the pig for biomedical research (Groenen et al., 2012).21
The large number of authors (136, with 54 institutional affiliations) listed on the 2012 Nature paper
seems to indicate that sequencing is indeed a more large-scale effort than previous modes of
biological work. What is striking, however, is the extent to which the authorship of the paper reflects
the involvement of many members of the pig genetics research community in many of the aspects of
the initiation and coordination of the project, as well as involvement in the sequencing itself through
the production of libraries, physical mapping, assembly and annotation; that is, if we are to
understand sequencing from a thick perspective.
20 11th January 2004, ‘Proposed Hybrid Model for Swine Genome Sequencing’, in ‘Swine Genome Sequencing
Project’ folder, Alan Archibald’s personal papers.
Following the determination of the order of base pairs and initial assembly, the genome then
underwent further assembly and annotation. If one looked at sequencing thinly, the account would
end here or early in the assembly section, rather than accounting for the fuller practices
encompassed by assembly and annotation. Considering these produces a different picture of the
topography and temporality of sequencing.
4.3. Assembly
The purpose of assembly is to build ever larger stretches of DNA from the sequence reads coming
out of the sequencing machines. The average read length for sub-clones generated from each BAC
was 707 base pairs. This clone-based sequencing generated 4X coverage, the equivalent of four
whole genomes. The greater the coverage, the less likely that errors will make it into the final
sequence assembly. A piece of software called Phrap was used to analyse the sequence data to
assemble it into contigs. As well as the main body of the work at the Sanger Institute, some clones
were also sequenced at the National Institute of Agrobiological Sciences in Japan, and some
assembly work took place at The Genome Analysis Centre in Norwich, UK, after some Sanger
Institute staff moved there following a strategic re-orientation at the Sanger Institute.22
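The notion of coverage used above can be made concrete with a short calculation: fold coverage is the total number of bases read divided by the size of the genome. The sketch below assumes a round genome size of 2.5 billion base pairs purely for illustration and takes the 707 base pair read length and 4X target from the text; the function names are hypothetical.

```python
def fold_coverage(num_reads, mean_read_length, genome_size):
    """Average number of times each base of the genome is read."""
    return num_reads * mean_read_length / genome_size


def reads_needed(target_coverage, mean_read_length, genome_size):
    """Smallest whole number of reads that reaches the target coverage."""
    total_bases = target_coverage * genome_size
    return -(-total_bases // mean_read_length)  # ceiling division on integers


# Assumed round figure for the genome size; read length and target from the text.
GENOME_SIZE = 2_500_000_000
READ_LENGTH = 707

needed = reads_needed(4, READ_LENGTH, GENOME_SIZE)
print(needed, fold_coverage(needed, READ_LENGTH, GENOME_SIZE))
```

This is an average: individual regions may be read more or fewer times, which is one reason gaps and errors persist even at nominally adequate coverage.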
With the stage of assembly into contigs complete, there were 279 contiguous pig clones. To further
improve the quality of the genomic sequence, it had to undergo automated pre-finishing, gap
closure and finishing by additional sequencing of selected BAC clones or genomic regions. The
automated pre-finishing was accomplished by “primer walking” from the ends of contigs, by
introducing a short strand of DNA called a primer with a sequence complementary to that to be
determined at the end of the contig. From this a complementary DNA strand is synthesised, which is
then itself sequenced. This enables contigs to be joined into fewer and larger DNA fragments, and
therefore to reduce the number of gaps in the sequence. After automated pre-finishing, 1,681 pig
clones were contiguous.
21 A meeting of the Swine Genome Consortium in January 2011 reviewed 48 proposals for ‘companion papers’
to the main Nature paper. Of these, 12 were under the heading ‘Application focused’ and were oriented
towards agricultural applications. 36 were under the heading ‘Genomics Focused’, most being concerned with
further development of genomic data and resources and comparative and phylogenetic-style studies, with
some biomedically-oriented papers included as well (Source: document dated 13th January 2011, ‘Swine
Genome Sequencing Consortium (SGSC) Genome and Companion Manuscripts Meeting’ agenda, in ‘Swine
Genome Sequencing Project’ folder, Alan Archibald’s personal papers). In the end, 17 companion papers were
published, of which 3 were wholly oriented towards agricultural applications, 2 directed towards biomedical
applications, and the rest concerned with further development of genomic data and resources and
comparative and phylogenetic-style studies, though with several of these being potentially relevant for
agriculture and biomedicine (see: https://www.biomedcentral.com/collections/swine Accessed 10th July
2018).
22 Over the period of Allan Bradley’s directorship (2000-2010), the Sanger Institute moved “from sequencing genomes to using
sequence data to answer important biological questions” (Wellcome Trust, 2005).
Figure 4 - Comparative map depicting the 18 distinct porcine non-sex chromosomes, with equivalent parts of human
non-sex chromosomes indicated adjacently (Meyers et al., 2005).
Assembly also requires checking that the sequence is substantially complete, and in addition to the
automated methods it required judgement to do that. The judgement was informed by prior
mapping efforts, including comparative maps detailing the correspondence between parts of the pig
and human genomes (see figure 4). The Genome Evaluation Browser – gEVAL – produced and
managed by the Genome Reference Informatics Team (GRIT) at the Sanger Institute was made
available to the pig community to allow them to assess particular regions and suggest how to correct
the assembly to improve it. Alan Archibald worked closely with GRIT to identify and correct errors.
Evaluating and improving the quality of assembly is key to its potential use as data.23 When a draft
assembly was produced, Alan Archibald was able to check it using gEVAL. He scrolled through the
assembly 2 megabases at a time, comparing the orientation of the genes with a comparative
map of the pig and human genomes. Archibald, when examining the screen (figure 5), would ask
himself “does the pattern I'm seeing here fit with the expectations of that, if you like, that rough
comparative map?” If something did not seem right, he would try to rearrange parts in different
ways around, in his head or on scraps of paper, “but I’m not going to do that unless I've got some
pig-specific information,” ensuring that knowledge of the genetics of the pig disciplined the use of
comparative data, “because I don't want this to be a human genome.”24 This was perceived to be a
danger due to the extensive use of data and materials derived from human genomic research from
the earliest days of the systematic mapping of the pig genome.
Figure 5 - Alan Archibald depicted in his office at the Roslin Institute. On the left screen he has a pdf document depicting the
pig-human comparative map shown in figure 4. On the right screen he has the Sscrofa11.1 genome assembly open in a
genome browser. He used the comparative map to identify equivalent regions in the human genome to the parts of the pig
genome where gaps still exist, to indicate what may have caused problems in the assembly and thus identify which BAC
clones to order and re-sequence. Photograph taken by author, 25th May 2017.
23 Kerstin Howe, interview with author, Wellcome Trust Sanger Institute, 4th October 2017.
24 The picture in figure 5 was taken from a video recording of Archibald taken by the author at the Roslin
Institute on 25th May 2017. In the video, he is systematically working through gaps in a new sequence
assembly. The purpose of the recording was to document usually undocumented scientific work, and to
provide empirical materials concerning the role of comparison and homology in genomic research.
Other participants assisted in this work. For example, when at Roslin for a faculty sabbatical,
Christopher Tuggle of Iowa State University noticed that there was something not right about the
Sscrofa10 assembly while examining the interleukin-1 beta gene (IL-1β). Upon investigation, it was
found that the programme used to assemble the BACs had not been written correctly. This meant
that the way the algorithm was assembling did not impart information about the orientation – the
way round relative to the sequence being assembled – that the BAC should be in. The problem was
thus identified (by Archibald) and the algorithm fixed.25 There was therefore a need for expert
manual judgement to assess the validity of the computational tools being used.
Figure 6 - Genome assemblies for the pig submitted to GenBank. As GenBank is based in the USA, the date format is
MM/DD/YYYY. Table adapted from the GenBank website: https://www.ncbi.nlm.nih.gov/assembly/organism/9823/all/
Accessed 09/07/2018.
25 Christopher Tuggle, Skype interview with author, 3rd March 2017.
As of July 2018, there are 25 genome assemblies that have been submitted to the publicly-accessible
database GenBank. They are categorised in a number of ways, and have been submitted by multiple
groups. Seven have been submitted by the SGSC. Other submitters include BGI-Shenzhen (formerly
the Beijing Genomics Institute), the Sanger Institute (‘SC’ in the table), a genome sequencing
company called Novogene that has conducted sequencing in China with collaborators from
universities and has links with the Chinese Ministry of Agriculture, and two pharmaceutical
companies, Hoffmann-La Roche and GlaxoSmithKline. The most recent submission comes from the
USDA. Most of the submitted assemblies are full representations of the genome, meaning that the
data were acquired from the whole genome, rather than just a part of it, but with different levels of
assembly and assigned status. The assemblies submitted by the SGSC and the USDA are
chromosome-level assemblies, meaning that there is a sequence for at least one chromosome. This
sequence may still contain gaps. The most recent SGSC submission, which the National Center for
Biotechnology Information that runs GenBank has selected as the representative genome for the
pig, is not however designated as a complete genome. That designation would require that all
chromosomes be sequenced without gaps in the sequence, and fulfil other criteria that will be
discussed below. The other two assembly levels listed in the table are scaffold and contig. A contig is
a continuous sequence in which there is a high confidence level in the order of the bases. A scaffold
is a section of sequence that incorporates more than one contig, together with the gaps of unknown
sequence known to exist between them. The aim of sequencers is to reduce the number of gaps,
and therefore the number of contigs and scaffolds, and also to localise and place the scaffolds on the
chromosome. To qualify for complete genome assembly level, the sequence must have no
unlocalised or unplaced scaffolds. The following table shows how the submitted assemblies have
changed over time for the SGSC submissions.
Name (date submitted)      Coverage  Number of    Total sequence  Gaps between scaffolds  No. of
                                     chromosomes  length          (number of scaffolds)   contigs
Sscrofa5 (11.07.2008)                10           813,033,904     1,584 (1,585)           44,057
Sscrofa9 (02.11.2009)      4X        19           2,262,579,801   3,133 (3,133)           101,117
Sscrofa9.2 (23.02.2010)    4X        19           2,262,484,801   3,116 (3,135)           101,112
Sscrofa10 (19.05.2011)     24X       21           2,772,757,746   3,915 (8,519)           266,137
Sscrofa10.2 (07.09.2011)   24X       21           2,808,525,991   5,323 (9,906)           243,033
Sscrofa11 (06.12.2016)     65X       19           2,456,768,445   24 (626)                705
Sscrofa11.1 (07.02.2017)   65X       20           2,501,895,775   93 (705)                1,117
Figure 7 - Table providing figures for certain key measurements of successive genome assemblies submitted to GenBank
by the Swine Genome Sequencing Consortium. The row for Sscrofa5 has been highlighted in grey as it is only a partial
assembly. In this table I have used the DD.MM.YYYY date format used in Europe and much of the rest of the world.
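The relationship between contigs, scaffolds and gaps defined before the table can be illustrated with a minimal sketch. It assumes, purely for illustration, that a scaffold is stored as a string in which runs of the letter N stand for gaps of unknown sequence; the function name and the toy sequences are hypothetical and are not taken from any SGSC pipeline. Note that the sketch counts gaps within scaffolds (between contigs), whereas the table reports gaps between scaffolds.

```python
import re


def split_scaffold(scaffold: str) -> tuple[list[str], int]:
    """Split a scaffold into its contigs and count the gaps between them.

    Assumption: gaps of unknown sequence are written as runs of 'N'.
    """
    # Contigs are the stretches of known sequence between runs of N.
    contigs = [c for c in re.split(r"N+", scaffold) if c]
    # Only count gaps that lie between contigs, not leading/trailing Ns.
    gaps = len(re.findall(r"N+", scaffold.strip("N")))
    return contigs, gaps


# Toy scaffolds (illustrative only).
scaffolds = ["ACGTACGTNNNNGGCTA", "TTGACCA", "CCGANNTTAGNNNACGT"]

total_contigs = 0
total_gaps = 0
for s in scaffolds:
    contigs, gaps = split_scaffold(s)
    total_contigs += len(contigs)
    total_gaps += gaps

print(total_contigs, total_gaps)  # 6 contigs and 3 gaps across 3 scaffolds
```

Closing a gap joins two contigs into one, which is why the contig count falls as gaps are closed.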
The statistics can be confusing, because the assemblies are not necessarily directly comparable: for
example, the number of chromosomes sequenced and assembled may not be the same. Although
they are not equivalent in length, Sscrofa11 has a considerably smaller number of scaffolds and gaps
between scaffolds compared with Sscrofa9, and the same is true for contigs. This is an improvement,
as the fewer scaffolds or contigs there are, the greater the confidence we can have that the
assembly is correct. More contigs and more scaffolds mean a greater likelihood of mistaken
placement. Despite greater coverage of the genome (the number of reads of any given nucleotide in
a sequence), Sscrofa10 has a higher number of scaffolds and gaps between scaffolds. This does not
mean that the assembly is of a lower quality, but that extra chromosomes and extra-nuclear DNA
have been included in the assembly. In particular, it includes an assembly for the Y chromosome
which, to give an example in the Sscrofa11.1 assembly, contains 69 of the 93 gaps between scaffolds
across all chromosomes. The Y chromosome notoriously contains many repetitive sequences that
are consequently difficult to assemble. Any assembly including the Y chromosome is therefore likely
to have its metrics negatively affected.
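Coverage, glossed above as the number of reads of any given nucleotide, can be made concrete with a small sketch. The representation of a read as a (start, length) pair, the function name and the toy values are all assumptions made for illustration; they do not describe how coverage was calculated for the pig assemblies.

```python
def per_base_depth(genome_length: int, reads: list[tuple[int, int]]) -> list[int]:
    """Count how many reads cover each position of a toy genome.

    Each read is a (start, length) pair with a 0-based start position.
    """
    depth = [0] * genome_length
    for start, length in reads:
        # Clip reads that would run past the end of the genome.
        for pos in range(start, min(start + length, genome_length)):
            depth[pos] += 1
    return depth


# Four toy reads over a ten-base genome (illustrative only).
reads = [(0, 5), (2, 5), (4, 6), (5, 5)]
depth = per_base_depth(10, reads)
mean_coverage = sum(depth) / len(depth)

print(depth)          # [1, 1, 2, 2, 3, 3, 3, 2, 2, 2]
print(mean_coverage)  # 2.1
```

A headline figure such as 24X is an average of this kind, so individual positions may be covered by more reads, fewer reads, or none.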
Although a representative genome is designated based on coverage and assembly statistics
(including those relating to gaps and error rates) there is not any one complete or final sequence.26
There are corrections to existing assemblies. There are numerous sequences of different breeds and,
although the SGSC assemblies show greater quality over time, the standard of what constitutes a
gold-standard assembly also changes over time. If we examine sequencing from a thick perspective,
we can therefore qualify prior historical accounts that take the completion of a determined raw
sequence as the end-point of genome projects in two ways. Firstly, by including the work done on a
particular sequence assembly beyond the initial stages of assembly, for example through later stages
and iterations (for the ‘same’ genome) of assembly and improvement. Secondly, by investigating
sequencing activities separate from the generation and development of reference genomes.
Sequencing understood thickly is not only concerned with an improvement in the statistics over time
for a representative genome. This is shown by the Novogene submissions, which are sequence
assemblies for breeds of pig different from that of the Duroc sow on which the SGSC assemblies are primarily based. There
is an interest in the sequences of different breeds, and, for the purposes of animal breeding
genetics, an acute interest in the variation in sequences. Therefore, we may anticipate that new
needs will arise for which new categorisations and statistics will be generated for assemblies to be
judged against.
Sequencing and assembly is an open-ended process, which involves the periodic submission of
sequences and assemblies to databases like GenBank. At every stage, decisions are made about what to
sequence (for instance, the breed), which parts of the genome receive particular (or little, or no)
attention, the coverage, the sources (particular individuals, particular BAC libraries), the method, the
machines used (which can vary in technique, chemistry and operation), and how the assembly is
conducted.
For the assembly, different software can be used, and decisions are made about the statistical
confidence levels. Lower the stringency of these, and one can reduce the number of contigs, but at
the price of an increased likelihood of assembly errors. Additional coverage and better techniques
can aid the assemblers in reducing the likelihood of errors. Access to a high-quality physical
map for the species being sequenced, together with sequence data from related species
with known synteny, can also be a vital aid in assembly, even if using these resources is
time-consuming, labour-intensive and cognitively
demanding. A consequence of the foregoing argument is that there is no a priori point at which we
can say that sequencing ends. This point was echoed by Kerstin Howe of GRIT, who reflected that
“there is always something to correct with a genome assembly, it’s never done; it’s only abandoned
at a certain point,” for instance because particular quality targets have been reached or resources
26 Similar points have been made by Adam Bostanci with regard to the human genome project, on the
publication of two versions of genomes produced using different methods and organisational models (2004),
and the problematic of acknowledging and accommodating intra-specific sequence variation (2006).
have run out.27 We may wish to define sequencing as the activities that occur under the aegis of a
particular project, but this would be problematic unless we were to explicitly acknowledge that a
study based on this definition is one of a particular sequencing project, rather than sequencing more
generally. To answer questions like what the Sanger Institute sequenced or where the pig genome
was sequenced requires shifting from a thin to a thick perspective, as only then can the constellation
of inputs (such as BAC libraries and physical maps), outputs and decisions (concerning strategy and
the division of labour) be captured.
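The trade-off noted above, in which lowering stringency reduces the number of contigs at the price of more assembly errors, can be illustrated with a toy example. The sketch below is a deliberately naive greedy overlap merger, not the method used by the SGSC or by any production assembler, and it models 'stringency' as a minimum overlap length, which is an assumption made here for simplicity. The four reads are constructed so that they come from two separate regions.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that is a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def assemble(reads: list[str], min_overlap: int) -> list[str]:
    """Greedily merge the pair of sequences with the largest overlap until none qualifies."""
    contigs = list(reads)
    while True:
        best = (0, -1, -1)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return sorted(contigs)
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)] + [merged]


# Two reads from each of two separate regions (toy data).
reads = ["ACGTTG", "GTTGCA", "CATTAG", "TTAGGC"]

print(assemble(reads, min_overlap=4))  # ['ACGTTGCA', 'CATTAGGC']
print(assemble(reads, min_overlap=2))  # ['ACGTTGCATTAGGC']
```

With a minimum overlap of four bases the reads yield two contigs. Lowering the threshold to two yields a single contig, but only by joining the two regions on the strength of a two-base match, which in this constructed example is a mis-assembly.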
In addition to published genome assemblies, there are sequence data submitted to the DNA Data Bank
of Japan, GenBank and the European Nucleotide Archive (ENA).28 These are primarily sequences of
chromosomal regions relevant to particular research. Rather than being superseded by a published,
complete assembly, some of these sequences may still serve as reference sequences for specific
areas of research. This can be because of previous sequencing of a defined region being of greater
quality than the overall builds, or because of resequencing after the initial builds. An example of the
former is the sequence of the swine Major Histocompatibility Complex, which occupies a region on
chromosome 7 (Renard et al., 2006). For researchers interested in the porcine immune
system, the reference sequences were not those generated at the Sanger Institute under the
auspices of the SGSC. Examples of de novo sequencing include the next generation sequencing of the
DNA of eight breeds of domesticated pigs and wild boars from across Europe and Asia, the data from
which were used in an investigation of the evolution and demography of the pig (Groenen et al.,
2012).29
4.4. Annotation
In this paper, I have provided an account of the thick sequencing perspective through detailing the
production of a sequence capable of wider travel and use. Annotation is a key part of making the
assembled sequence capable of this. It is the process of attaching contextual information, for
instance by identifying and assigning genes, to particular parts of a sequence assembly.30 Without
annotation, the sequence data is “by itself neither informative nor particularly interesting;”
information needs to be attached to it in order that it may be able to circulate and be incorporated
into the research of a variety of potential end-users (Nadim, 2016, p. 505; see also Leonelli, 2016).
The way in which the annotation takes place is not uniform across different genome projects. The
assignment of the task may differ, as may the precise balance of automated annotation and manual
annotation. If there is the time and resources to do it, manual annotation is preferred, but
sometimes where these are lacking, automated annotation may be the only option.
27 Kerstin Howe, interview with author, Wellcome Trust Sanger Institute, 4th October 2017.
28 These three databases synchronise new and updated data submitted to each, and together comprise the
International Nucleotide Sequence Database Collaboration. There are hundreds of thousands of individual
submissions of sequence data of varying lengths to publicly-accessible sequence databases. Furthermore,
sequence data that are not publicly-accessible will also likely be held, for example in private repositories for
proprietorial reasons.
29 The eight breeds of pig were Duroc, Hampshire, Jiangquhai, Landrace, Large White, Meishan, Pietrain, and
Xiang. The wild boars were from Sumatra, Japan, two locations in the Netherlands, France, Switzerland, South
China and North China. The study accession number at the European Nucleotide Archive is PRJEB1683. Other
sequence deposits for pigs and closely related species are listed in Groenen (2016).
30 This is what structural annotation consists of. Functional annotation involves attaching meta-data to
structural annotations, and therefore depends upon this initial form of annotation.
As well as augmenting automated annotation, manual annotation can feed back into the evaluation
and improvement of an assembly. For instance, if a known gene is not located in the process of
manual annotation, this suggests a possible mis-assembly and therefore highlights a potential region
that can be re-evaluated and corrected.31
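The feedback from annotation to assembly described here amounts to a simple check, sketched below under stated assumptions. The gene names, region labels and function name are hypothetical placeholders rather than real pig genes or Sanger Institute tooling; the sketch only shows how expected-but-missing genes could be grouped by the region in which they were expected, so that those regions can be re-examined.

```python
def flag_missing_genes(expected: dict[str, str], annotated: set[str]) -> dict[str, list[str]]:
    """Group genes that were expected but not found by their expected region."""
    flagged: dict[str, list[str]] = {}
    for gene, region in sorted(expected.items()):
        if gene not in annotated:
            flagged.setdefault(region, []).append(gene)
    return flagged


# Hypothetical expectations, e.g. derived from a physical map or a related species.
expected = {
    "GENE_A": "chr7:region1",
    "GENE_B": "chr7:region1",
    "GENE_C": "chr1:region2",
}
# Genes actually located during manual annotation (toy data).
annotated = {"GENE_A"}

flagged = flag_missing_genes(expected, annotated)
print(flagged)
```

A flagged region is only a candidate for re-evaluation: a missing gene suggests, but does not establish, a mis-assembly.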
To refine automated annotation approaches, the ongoing contribution of data is required, for
example the tagging of certain transcripts (Harrow et al., 2014). Automated annotation therefore
requires informed manual work in order to function, and to be maintained and improved. It also
requires people in relevant communities to identify the need for particular software, to guide and
validate the development of that software, and to determine and discipline the particular inputs and
outputs of the software operation. Software that is being developed, including the ongoing
development of the Ensembl genome browser, involves constant interaction between the
developers and the research community. The browser was developed by the Ensembl project, which
was initially a joint initiative of the Sanger Institute and the European Molecular Biology Laboratory’s
European Bioinformatics Institute (EBI, which hosts the ENA, based on the same site as the Sanger
Institute). The Ensembl project team now works wholly within the EBI. The project was created to
formulate methods and tools for automated annotation, and the browser shows visualisations of
sections of the genome with details of genes and other potentially relevant information shown
parallel to the parts of the genome with which it has been associated through annotation.
In the pig genome project, several computational tools were used for annotation, including: scans
for sequence patterns; mapping of pig protein sequences (acquired from public databases) to the
genome; processing and alignment to the genome of cDNAs and Expressed Sequence Tags (ESTs)
downloaded from GenBank, many of which for the pig were generated by groups in Japan based at the
National Institute of Agrobiological Sciences, Animal Genome Research Program and Japan Institute
of Association for Techno-innovation in Agriculture, Forestry and Fisheries; and
alignment of RNA sequencing data (RNA-Seq) to the genome. Redundancy was then removed, the
set of genes was screened for pseudogenes (similar but non-functional versions of genes) and Stable
Identifiers (identifying codes) were assigned to the genes and other elements of interest (Groenen et
al., 2012).32
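The final steps of this process (removing redundancy, screening for pseudogenes and assigning identifiers) can be sketched in simplified form. Everything in the example is an assumption made for illustration: the GeneModel fields, the use of identical coordinates as the test for redundancy, the single boolean standing in for a pseudogene screen, and the "TOYG" identifier prefix. The actual pipeline is considerably more involved, as the Supplementary Information of Groenen et al. (2012) makes clear.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    chrom: str
    start: int
    end: int
    source: str           # evidence type, e.g. "protein", "cDNA/EST", "RNA-Seq"
    has_intact_orf: bool  # crude stand-in for a pseudogene screen


def build_gene_set(models: list[GeneModel], prefix: str = "TOYG") -> dict[str, GeneModel]:
    """Deduplicate models, drop putative pseudogenes, and assign identifiers."""
    # 1. Remove redundancy: keep the first model seen at each locus.
    unique: dict[tuple[str, int, int], GeneModel] = {}
    for m in models:
        unique.setdefault((m.chrom, m.start, m.end), m)
    # 2. Screen out putative pseudogenes.
    kept = [m for m in unique.values() if m.has_intact_orf]
    # 3. Assign identifiers in genomic order (hypothetical ID format).
    kept.sort(key=lambda m: (m.chrom, m.start, m.end))
    return {f"{prefix}{i:06d}": m for i, m in enumerate(kept, start=1)}


models = [
    GeneModel("7", 100, 500, "protein", True),
    GeneModel("7", 100, 500, "RNA-Seq", True),    # redundant with the model above
    GeneModel("7", 900, 1200, "cDNA/EST", False), # treated as a pseudogene
    GeneModel("1", 50, 300, "RNA-Seq", True),
]

gene_set = build_gene_set(models)
for stable_id, model in gene_set.items():
    print(stable_id, model.chrom, model.start, model.end, model.source)
```

Even in this toy form, each step embodies a choice (what counts as redundant, what counts as a pseudogene) that has to be made by someone.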
In addition to the automated annotation, the manual annotation team then at the Sanger Institute
(and now based in the EBI), HAVANA (Human and Vertebrate Analysis and Annotation), provided
additional support to the pig project. Jim Reecy of Iowa State University interested the HAVANA
team in manual annotation of the pig genome when he spent a sabbatical there. As HAVANA did the
work at no extra cost to the SGSC, their approach was to supply the pig genetics community with the
tools and training to be able to annotate the genome themselves. In July 2008, the team at the
Sanger Institute organised a workshop (a ‘jamboree’) to train scientists associated with the SGSC on
how to annotate.33 Targeted annotation aimed at regions of particular interest to researchers was
pursued. In the case of the assemblies that took place during the initial swine genome sequencing
31 Jane Loveland, interview with author, Wellcome Trust Sanger Institute, 4th October 2017.
32 This is a simplified and abbreviated account of the annotation process. See the Supplementary Information
of Groenen, et al. (2012) for a fuller account.
33 The annotation ‘jamboree’ was pioneered by Celera Genomics and the Berkeley Drosophila Genome Project
to annotate the Drosophila melanogaster genome. The July 2008 jamboree mentioned here was more of a
lengthy training course than the “frontal charge on the genome” represented by the kinds of jamborees
inaugurated by the fruit fly collaboration. The pig genome annotation was more like a combination of the
factory and cottage industry models of annotation discerned by bioinformatician Lincoln Stein (2001), pp. 500-
501, comprising automated pipelines and distributed smaller-scale annotation by researchers in their own
laboratories, respectively.
project, annotation was conducted “on the fly,” as Craig Beattie (a member of the SGSC then based
at USDA-MARC) put it to me, while the assembly was ongoing, as well as after it was complete.34
There was additional training organised by one of the working groups that focused on annotating
genes relevant to the immune system. Many of the genes were annotated by leading members of
the Consortium in collaboration with the HAVANA group. The immune system genes were
distributed to multiple individuals, who together comprised the ‘Immune Response Annotation
Group.’ Researchers based in 13 institutions across 6 countries (China, France, India, Italy, Japan, UK
and USA) had manual annotations attributed to them in this project. They used the manual
annotation software Otterlace (developed at the Sanger Institute) to annotate 1,300 genes, including
confirming automated annotations (Dawson et al., 2013).
The annotated sequences that are produced through the automated and manual processes that
form the Ensembl pipeline are published in the Ensembl database. Additional manual annotation is
published on the HAVANA-led Vertebrate Genome Annotation (VEGA) database, which is built on
the Ensembl database. Using the browsers, researchers can therefore access the annotated
genomes (see figure 8), and additionally make comparisons between regions of genomes (for more
on VEGA, see Harrow et al., 2014).
Figure 8 - Annotated region of chromosome 7 of Sscrofa10.2 in the Ensembl browser. This is just one part of the
visualisations depicted on the browser page for this region. This summary displays genes and contigs. The more detailed
one is capable of depicting multiple tracks pertaining to different kinds of data, and allows the user to zoom in until the
order of bases in the sequence can be shown. This particular image is obtained from:
http://may2017.archive.ensembl.org/Sus_scrofa/Location/View?r=7%3A60107914-60305245 Accessed 20.10.2017.
The way annotation works in practice shows that, as with assembly, automated processes form only
one part of this work that is included in the thick sequencing perspective I am proposing. The choices
to use, adapt or develop particular automated procedures are made by skilled practitioners, usually
in ways appropriate to the kinds of data available for a particular organism. Until recently, specific
operating procedures have been used for each organism sequenced at the Sanger Institute.
Although more standardised protocols are now being developed and used, judgements still need to
be made about what annotation tools and strategies should be used to complete the sequencing of
particular organisms.
When annotation, and sequencing in general, are viewed from a thick rather than a thin
perspective, the following historiographical consequences arise: 1. sequencing still requires
considerable tacit knowledge, and expert (discipline-specific) interpretation and input; 2. it is not
34 Craig Beattie, Skype interview with author, 23rd March 2017.
necessarily centralised, top-down organised or fast/accelerating; and 3. automated processes are
only one part of the activity of sequencing. Even the ‘black-boxes’ of sequencing machines or
software may be better characterised as ‘grey-boxes’, because they are not fully closed; there is
constant dialogue between the relevant biological research communities and manufacturers, and a
choice of machines with different capacities and capabilities (and prices).
The thin perspective captures one stage of sequencing. As the preceding account of pig sequencing
demonstrates, however, the work emblematic of a thin focus on sequencing – the determination of
the raw sequence – relied on practices that both preceded and succeeded it, and that only a thick
perspective shows. These include the development of libraries of clones, physical mapping using
those clones to generate a minimum tile path, sequencing of selected clones and then iterative
stages of assembly to close gaps. Finally, the sequence required annotation in order to be of use to,
and tailored to, different biological communities. Throughout this process, constant comparisons
were made with data from the human genome, and other resources such as cDNA, ESTs and RNA-
Seq data generated by groups across the world were drawn upon.
The pig project drew upon the methods and organisation pioneered in the human genome project:
in particular, the map and clone-based approach. Pig genomics enabled the Sanger Institute to
accentuate the ongoing drive to further improve the efficiency of the sequencing process through
increased automation and greater division of labour. There was continuity in staffing. Jane Rogers,
who had been Head of Sequencing at the Sanger Institute since 1993, was involved in the planning
and early stages of the pig sequencing. She was replaced by Carol Churcher, who had also been at
the Sanger Institute since its founding. Many of the key figures in the pig genetics community who
participated in various stages of sequencing had been involved in prior mapping projects, such as
Alan Archibald, Denis Milan and Lawrence Schook. There was even continuity in the source of
libraries – Pieter de Jong’s group was a source for the human project as well as the pig one. We may
thus consider this analysis of pig sequencing to also be relevant to analyses of human genomic
sequencing that preceded it, both the public and the private arms.35
35 Although the private Venter-led initiative famously featured a whole-genome shotgun approach, it also used
maps and data from the clone-based sequencing that the publicly (and charitably)-funded effort pursued.
Figure 9 - Diagram illustrating the principles of the distinction between thin and thick sequencing, through an illustrative
but not exhaustive or necessarily representative depiction of the institutions (in bold) and individuals or groups involved in
different activities related to the sequencing of the pig genome. Red lines indicate institutional affiliation, the dates given
are those for the activities associated with the production of the Sscrofa10.2 assembly, and black dotted lines indicate
involvement in activities.
5. Conclusions
Through detailing an account of pig genome sequencing using a thick sequencing perspective, this
paper has demonstrated that sequencing can be understood as a process that is open-ended:
spatially, temporally and intellectually. Sequencing as an ongoing process involves the creation of
libraries and maps, the working and workings of automated sequencing machines, and associated
decision-making related to the use of them. The process also involves the assembly of the sequence,
the development and improvement of statistical and computational tools, of chemistry and
machinery, annotation, extra sequencing of certain parts of the genome, improvement of the
contiguity and quality of the data, new reads, uploading, circulation and interpretation of data,
management, curation and maintenance of data and data infrastructures.
There may be instances at which start or end-points of the sequencing process can be ascertained.
One might conceive the approval by a body such as the Wellcome Trust, USDA or NIH of a proposal
to sequence a particular species as a start. The initial receipt of clone libraries at sequencing centres,
and the first entry into automated sequencing machines, may both be conceived as starting points.
Or one might wish to identify how the tools, organisational capacity and desire by the community to
sequence some or all of the genome came to be. The endpoints might be a published paper, or the
online publication of a completed sequence.
Yet these starting points and endpoints are less discrete and definitive than they appear on first inspection. In
culinary terms, genome sequencing is more like cooking a perpetual stew in which ingredients can
be added and the pot kept constantly on the boil, never fully complete. Firstly, the product is almost
always incomplete. Gaps may remain, and there remain some errors in final published sequences.
Secondly, either the product is an abstraction (purporting to be a reference sequence for a species,
breed or strain where there is known to be genomic variation) or the product incorporates (or is
built to allow the incorporation of) genomic variants such as single nucleotide polymorphisms
(SNPs). The former is not definitive and is therefore subject to contestation and revision, the latter
can never be definitive.36 As well as not being able to (historiographically or philosophically) privilege
one stage of the sequencing process over any other, it is not possible to determine a priori the start
and end points of sequencing.
Any attempt to methodologically or epistemically delimit sequencing therefore requires a specific
historiographical (or philosophical) basis, and the limitations of this choice need to be acknowledged
and used to inform any conclusions drawn. In the case of pig genome sequencing, an attempt to
reduce sequencing to thin sequencing precludes one from understanding or appreciating many of
the key decisions and research directions, especially concerning the purported ‘logic’ of the location
and strategy of the thin-centred sequencing. The Sanger Institute was chosen – and seemed ‘logical’
– because significant parts of the physical mapping work had been conducted there, the clones were
already there, human genome sequencing had been conducted there, the pig community had adopted the
human model, and relationships had already been established. So even to
understand the objects of a thin perspective of sequencing, one must invoke the work and actors
revealed by the thick perspective. In its attention to the production of a sequence with added value
and usability, the thick perspective will also allow us to apprehend how genomic research may
contribute to strategic policy directions concerning translation. It also helps one to recognise key
differences between institutions and their consequences, for example of the production of sequence
at institutions that devote different levels of resources to adding value to the sequence through
comprehensive evaluation and annotation.
An attention to the thickness of sequencing leads one to characterise the geography, the
temporality and the nature of sequencing work in a fundamentally different way than for the thin
perspective. Understood more thickly, sequencing takes longer, has less well-defined start and end
points, is more institutionally-diverse, involves a plethora of different skill sets and background
knowledge, and involves considerably more actors in general. A thick examination of sequencing
reveals the active interpretation, intervention, assessment, evaluation and creativity of scientists. It
requires an appreciation of the relationships between scientists, technical staff, project managers,
administrators, industries and funders. Throughout the sequencing process detailed in this paper,
there was an interplay and interpenetration between adapting and refining protocols and processes
and using standardised tools and procedures. Where elements of work have been automated, the
manual, creative and interpretive work of scientists may still be required both in and around the
automated processes. These scientists work in the processes to evaluate, maintain and refine them,
and around them to take advantage of the ‘black-boxing’ in order to concentrate on new problems.
In sequencing interpreted in a thicker manner, some of the features of this reconfiguration may be
discerned in the apparently centralised work conducted in massive genome centres. For example, in
the pig genome project described above, the development of the principles and processes of
assembly and annotation had culminated in the use of automated pipelines, yet there was still room
for manual intervention both in the later stages of assembly and also in annotation.
Lower costs have made the geography of automated sequencing more
diffuse and less centralised. Citing the sequencing services offered by shared facilities in research
institutions as enabling sequencing to be “reconfigured as a small-scale, slower and artisanal form of
work, subordinated to concrete research necessities,” García-Sancho observes that “other
36 See Bostanci (2006) for a discussion of the notion of ‘the human genome’ and its supersession by the
investigation of ‘human genomic variation’.
sequencing is possible” (2012, pp. 176-177). Even thin sequencing requires attention to the
particular (often-shifting) assemblages of people, institutions, machines and materials that are
involved in any particular project. We may therefore develop Fortun’s (1999) analysis of the
temporalities of genomics. In that analysis, he drew a connection between speed and other factors such as
concentration, scale, capital intensivity, and the organisation of labour and space that accelerate the
speed of sequencing as well as driving the development and intensification of particular
organisational forms such as large-scale sequencing centres.
When we consider sequencing activities more thickly, we may observe different drivers of
temporality. Rather than the ever-enhanced speed driven in the thin parts of sequencing by the
factors Fortun identifies, alternative priorities may be exhibited. Different organisation of projects
and different temporal regimes may be apparent depending on whether we interpret sequencing in
a thick or thin manner. In the pig project at the Sanger Institute, the speed of sequencing was halved
due to issues of scale and some institutional opposition to the project. This was viewed by many in
the pig genome research community as beneficial, as processing and analysis had become –
according to SGSC Technical Committee member Craig Beattie – the “rate-limiting step.” The speed
of production of sequence data meant that they “were overwhelming the information pipeline.”37 So
a reduction in speed of production was not a problem. This was, still, fast science, although it was
not necessarily so at all stages of the sequencing described in the thick sense. By attending to the
thicker understanding of sequencing, one is able to grasp the institutional, collaborative,
translational and infrastructural contexts more fully.
In addition, the particularities of the sequencing work in a given community are defined in a sharper
and more finely-grained manner, enabling one to identify the conditions that guided particular
decisions and actions. In so doing, one can make comparisons between particular objects of study
with the aim of defining more precisely how, and to what extent, the conclusions drawn from one
may be applicable to the other. To provide one example of this, we may consider two potential
objects of study for historians, philosophers and sociologists of science: pig genome research and
human genome research. If we were to conduct research based on a thin interpretation of
sequencing, both of these objects of study look much the same. The work forming the focus of the
thin perspective on sequencing was conducted by specialist teams at large-scale, highly-automated,
high-throughput sequencing centres (one of them, the Sanger Institute, participated in both human
and pig genome sequencing). One might therefore expect that findings concerning one project will
likely be transferable to the other; to re-quote Alan Archibald “to produce a humanised pig
genome.” Yet based on a thicker study of sequencing, we not only de-humanise the pig genome
(research) but genome research altogether. We reveal important differences in library construction,
the continuing and leading role of the pig genetics community in the sequencing work (as against the
marginalisation of medical geneticists in the human genome project), the rationale for the
production of sequence data and the use of allied annotation. The thick perspective also leads us to
different characterisations of the projects in terms of scale and velocity.
In this paper I do not claim to establish what sequencing or genomics is, nor to base any of the
claims that I do make on a supposed representativeness or significance of pig genome sequencing. I
would suggest, however, that the characterisation of sequencing and genomics in much of the
scholarly literature is – understandably – dominated by human genome sequencing, and in
particular, the efforts that fall under the narrative umbrella of ‘The Human Genome Project’. In
human genome sequencing, the kinds of work and objects foregrounded by a thin account of
37 Craig Beattie, Skype interview with author, 23rd March 2017.
sequencing appear to be central, the object of the competition and race between the ‘private’ and
the ‘public’ initiatives, the area of the work most associated with charismatic and forceful individuals
such as John Sulston and Craig Venter, who themselves have helped shape the narratives
dominating journalistic and activist discourse (e.g. Sulston and Ferry, 2002; Venter, 2008; see
Hilgartner, 2017, chapter 7, for an acute dissection of the narratives; see also a discussion of the
“narrative gap” in accounts of the Human Genome Project in Bartlett, 2008, pp. 124-125).
It is precisely those prominent aspects of human genome sequencing that have been associated with
scale, automation, speed and many other properties attributed to genomics. Due to the level of
funding and the political stakes involved, the imperative to produce sequence data as quickly as
possible was more acute for this project than any other sequencing initiative. As the objects and
processes highlighted by a thin interpretation of sequencing are proximate to the immediate
production of sequence data – the traces transmitted to computers from the bases – it is the stage
encompassing these that has gained prominence. To a lesser extent, assembly garnered attention
insofar as it was the draft products of this that were announced at the White House in June 2000.
Thus, the Human Genome Project was primarily understood in a thin way, and this is consequently
how sequencing has become understood.
Finally, I want to emphasise the iterative and recursive nature of sequencing. Sequencing is the
production of a tool as well as a dataset. The products of sequencing are intended to be used in
scientific investigation for the production of knowledge claims, but also to further improve tools that
can be used in investigation and intervention. In this sense, further investigation of how
sequences and sequencing practices are developed towards their intended use and re-use as tools
for research and intervention could improve understanding of translational
research processes. A thick perspective enables us to open up those processes.
Acknowledgements
For their comments on drafts at various stages, I would like to thank Miguel García-Sancho, Giuditta
Parolini and Mark Wong of the TRANSGENE group at the University of Edinburgh, participants in the
Perspectives on Genetics and Genomics group, Dominic Berry, Jane Calvert and Doug Lowe. I am
additionally grateful for the constructive and helpful comments of the anonymous reviewers. I
would like to thank all those who have participated in oral history interviews as part of my research
into pig genomics, in particular Alan Archibald of the Roslin Institute and Lawrence Schook of the
University of Illinois at Urbana-Champaign who have been generous with their time and allowed me
to examine their personal papers. The research for this paper was conducted through the
‘TRANSGENE: Medical translation in the history of modern genomics’ project which is funded by the
European Research Council (ERC) under the European Union's Horizon 2020 research and innovation
programme under grant agreement No. 678757. This European research funding is deeply
appreciated.
References
Anderson, S. I., Lopez-Corrales, N. L., Gorick, B., & Archibald, A. L. (2000). A large-fragment porcine
genomic library resource in a BAC vector. Mammalian Genome, 11, 811–814.
Archibald, A. L., Haley, C. S., Brown, J. F., Couperwhite, S., McQueen, H. A., Nicholson, D., Coppieters,
W., Van De Weghe, A., Stratil, A., Winterø, A. K., Fredholm, M., Larsen, N. J., Nielsen, V. H., Milan, D.,
Woloszyn, N., Robic, A., Dalens, M., Riquet, J., Gellin, J., Caritez, J.-C., Burgaud, G., Ollivier, L.,
Bidanel, J.-P., Vaiman, M., Renard, C., Geldermann, H., Davoli, R., Ruyter, D., Verstege, E. J. M.,
Groenen, M. A. M., Davies, W., Høyheim, B., Keiserud, A., Andersson, L., Ellegren, H., Johansson, M.,
Marklund, L., Miller, J. R., Anderson Dear, D. V., Signer, E., Jeffreys, A. J., Moran, C., Le Tissier, P.,
Muladno, Rothschild, M. F., Tuggle, C. K., Vaske, D., Helm, J., Liu, H.-C., Rahman, A., Yu, T.-P., Larson,
R. G., & Schmitz, C. B. (1995). The PiGMaP consortium linkage map of the pig (Sus scrofa).
Mammalian Genome, 6, 157–175.
Archibald, A. L., Bolund, L., Churcher, C., Fredholm, M., Groenen, M. A., Harlizius, B., Lee, K.-T.,
Milan, D., Rogers, J., & Rothschild, M. F. (2010). Pig genome sequence-analysis and publication
strategy. BMC Genomics, 11, 1.
Barnes, B., & Dupré, J. (2008). Genomes and What to Make of Them. Chicago, IL: The University of
Chicago Press.
Bartlett, A. (2008). Accomplishing Sequencing the Human Genome. Unpublished PhD thesis. Cardiff
University. Available online at: http://orca.cf.ac.uk/54499/1/U584600.pdf - last accessed
12.05.2017.
Bostanci, A. (2004). Sequencing Human Genomes. In: J.-P. Gaudillière & H.-J. Rheinberger (Eds.)
From Molecular Genetics to Genomics: The mapping cultures of twentieth-century genetics, pp. 158–
179. Abingdon, UK: Routledge.
Bostanci, A. (2006). Two drafts, one genome? Human diversity and human genome research. Science
as Culture, 15, 183–198.
Brown, T. A. (2006). Genomes 3. New York and London: Garland Science Publishing.
de Chadarevian, S. (2004). Mapping the worm’s genome. Tools, networks, patronage. In: J.-P.
Gaudillière & H.-J. Rheinberger (Eds.) From Molecular Genetics to Genomics: The mapping cultures of
twentieth-century genetics, pp. 95–110. Abingdon, UK: Routledge.
Chakravarti, A. (1996). Genetic and Physical Map Integration. In: Abstracts: Swine Chromosome 7
Workshop. Animal Biotechnology, 7, 81–98.
Chow-White, P. A., & García-Sancho, M. (2012). Bidirectional Shaping and Spaces of Convergence:
Interactions between Biology and Computing from the First DNA Sequencers to Global Genome
Databases. Science, Technology, & Human Values, 37, 124–164.
Collins, F. S., Morgan, M., & Patrinos, A. (2003). The Human Genome Project: Lessons from Large-
Scale Biology. Science, 300, 286–290.
Davis, B. D. and colleagues (1990). The Human Genome and Other Initiatives. Science, 249, 342–343.
Dawson, H. D., Loveland, J. E., Pascal, G., Gilbert, J. G. R., Uenishi, H., Mann, K. M., Sang, Y., Zhang, J.,
Carvalho-Silva, D., Hunt, T., Hardy, M., Hu, Z., Zhao, S.-H., Anselmo, A., Shinkai, H., Chen, C., Badaoui,
B., Berman, D., Amid, C., Kay, M., Lloyd, D., Snow, C., Morozumi, T., Cheng, R. P.-Y., Bystrom, M.,
Kapetanovic, R., Schwartz, J. C., Kataria, R., Astley, M., Fritz, E., Steward, C., Thomas, M., Wilming, L.,
Toki, D., Archibald, A. L., Bed’Hom, B., Beraldi, D., Huang, T.-H., Ait-Ali, T., Blecha, F., Botti, S.,
Freeman, T. C., Giuffra, E., Hume, D. A., Lunney, J. K., Murtaugh, M. P., Reecy, J. M., Harrow, J. L.,
Rogel-Gaillard, C., & Tuggle, C. K. (2013). Structural and functional annotation of the porcine
immunome. BMC Genomics, 14, 332.
Fortun, M. (1999). Projecting Speed Genomics. In: M. Fortun & E. Mendelsohn (Eds.) The Practices of
Human Genetics, pp. 25–48. Dordrecht, The Netherlands: Springer.
Galison, P., & Hevly, B. (Eds.) (1992). Big Science: The Growth of Large-Scale Research. Stanford, CA:
Stanford University Press.
García-Sancho, M. (2012). Biology, Computing and the History of Molecular Sequencing: From
Proteins to DNA, 1945-2000. Basingstoke, UK: Palgrave Macmillan.
García-Sancho, M. (2016). The proactive historian: Methodological opportunities presented by the
new archives documenting genomics. Studies in History and Philosophy of Biological and Biomedical
Sciences, 55, 70–82.
Gaudillière, J.-P., & Rheinberger, H.-J. (Eds.) (2004). From Molecular Genetics to Genomics: The
mapping cultures of twentieth-century genetics. Abingdon, UK: Routledge.
Glasner, P. (2002). Beyond the genome: Reconstituting the new genetics. New Genetics and Society,
21, 267–277.
Green, E. D., Guyer, M. S., & National Human Genome Research Institute (2011). Charting a course
for genomic medicine from base pairs to bedside. Nature, 470, 204–213.
Groenen, M. A. (2016). A decade of pig genome sequencing: a window on pig domestication and
evolution. Genetics Selection Evolution, 48, 23.
Groenen, M. A., Archibald, A. L., Uenishi, H., Tuggle, C. K., Takeuchi, Y., Rothschild, M. F., Rogel-
Gaillard, C., Park, C., Milan, D., Megens, H.-J., Li, S., Larkin, D. M., Kim, H., Frantz, L. A. F., Caccamo,
M., Ahn, H., Aken, B. L., Anselmo, A., Anthon, C., Auvil, L., Badaoui, B., Beattie, C. W., Bendixen, C.,
Berman, D., Blecha, F., Blomberg, J., Bolund, L., Bosse, M., Botti, S., Bujie, Z., Bystrom, M., Capitanu,
B., Carvalho-Silva, D., Chardon, P., Chen, C., Cheng, R., Choi, S.-H., Chow, W., Clark, R. C., Clee, C.,
Crooijmans, R. P. M. A., Dawson, H. D., Dehais, P., De Sapio, F., Dibbits, B., Drou, N., Du, Z.-Q.,
Eversole, K., Fadista, J., Fairley, S., Faraut, T., Faulkner, G. J., Fowler, K. E., Fredholm, M., Fritz, E.,
Gilbert, J. G. R., Giuffra, E., Gorodkin, J., Griffin, D. K., Harrow, J. L., Hayward, A., Howe, K., Hu, Z.-L.,
Humphray, S. J., Hunt, T., Hornshøj, H., Jeon, J.-T., Jern, P., Jones, M., Jurka, J., Kanamori, H.,
Kapetanovic, R., Kim, J., Kim, J.-H., Kim, K.-W., Kim, T.-H., Larson, G., Lee, K., Lee, K.-T., Leggett, R.,
Lewin, H. A., Li, Y., Liu, W., Loveland, J. E., Lu, Y., Lunney, J. K., Ma, J., Madsen, O., Mann, K.,
Matthews, L., McLaren, S., Morozumi, T., Murtaugh, M. P., Narayan, J., Nguyen, D. T., Ni, P., Oh, S.-J.,
Onteru, S., Panitz, F., Park, E.-W., Park, H.-S., Pascal, G., Paudel, Y., Perez-Enciso, M., Ramirez-
Gonzalez, R., Reecy, J. M., Rodriguez-Zas, S., Rohrer, G. A., Rund, L., Sang, Y., Schachtschneider, K.,
Schraiber, J. G., Schwartz, J., Scobie, L., Scott, C., Searle, S., Servin, B., Southey, B. R., Sperber, G.,
Stadler, P., Sweedler, J. V., Tafer, H., Thomsen, B., Wali, R., Wang, J., Wang, J., White, S., Xu, X., Yerle,
M., Zhang, G., Zhang, J., Zhang, J., Zhao, S., Rogers, J., Churcher, C., & Schook, L. B. (2012). Analyses
of pig genomes provide insight into porcine demography and evolution. Nature, 491, 393–398.
Harrow, J. L., Steward, C. A., Frankish, A., Gilbert, J. G., Gonzalez, J. M., Loveland, J. E., Mudge, J.,
Sheppard, D., Thomas, M., Trevanion, S., & Wilming, L. G. (2014). The Vertebrate Genome
Annotation browser: 10 years on. Nucleic Acids Research, 42, D771–D779.
Hilgartner, S. (2013). Constituting Large-Scale Biology: Building a Regime of Governance in the Early
Years of the Human Genome Project. BioSocieties, 8, 397–416.
Hilgartner, S. (2017). Reordering Life: Knowledge and Control in the Genomics Revolution.
Cambridge, MA: The MIT Press.
Hogan, A. J. (2014). The ‘Morbid Anatomy’ of the Human Genome: Tracing the Observational and
Representational Approaches of Postwar Genetics and Biomedicine. Medical History, 58, 315–336.
Humphray, S. J., Scott, C. E., Clark, R., Marron, B., Bender, C., Camm, N., Davis, J., Jenks, A., Noon, A.,
Patel, M., Sehra, H., Yang, F., Rogatcheva, M. B., Milan, D., Chardon, P., Rohrer, G., Nonneman, D., de
Jong, P., Meyers, S. N., Archibald, A., Beever, J. E., Schook, L. B., & Rogers, J. (2007). A high utility
integrated map of the pig genome. Genome Biology, 8, R139.
Kuzmuk, K., & Schook, L. (2011). Pigs as a Model for Biomedical Sciences. In: M. Rothschild & A.
Ruvinsky (Eds.) The Genetics of the Pig, Second Edition, pp. 426–444. Cambridge and Oxford, UK:
CABI.
Lenoir, T. (1999). Shaping Biomedicine as an Information Science. In: M. E. Bowden, T. B. Hahn & R. V.
Williams (Eds.) Proceedings of the 1998 Conference on the History and Heritage of Science
Information Systems, pp. 27–45. Medford, NJ: Information Today, Inc.
Leonelli, S. (2016). Data-Centric Biology: A Philosophical Study. Chicago, IL: The University of Chicago
Press.
McKusick, V. A. (1991). Current trends in mapping human genes. The FASEB Journal, 5, 12–20.
McKusick, V. A. (1997). Mapping the Human Genome: Retrospective, Perspective and Prospective.
Proceedings of the American Philosophical Society, 141, 417–424.
McKusick, V. A., & Ruddle, F. H. (1987). Toward a Complete Map of the Human Genome. Genomics,
1, 103–106.
Meyers, S. N., Rogatcheva, M. B., Larkin, D. M., Yerle, M., Milan, D., Hawken, R. J., Schook, L. B., &
Beever, J. E. (2005). Piggy-BACing the human genome II. A high-resolution, physically anchored,
comparative map of the porcine autosomes. Genomics, 86, 739–752.
Nadim, T. (2016). Data Labours: How the Sequence Databases GenBank and EMBL-Bank Make Data.
Science as Culture, 25, 496–519.
Onaga, L. A. (2014). Ray Wu as Fifth Business: Deconstructing collective memory in the history of
DNA sequencing. Studies in History and Philosophy of Biological and Biomedical Sciences, 46, 1–14.
Osoegawa, K., Woon, P. Y., Zhao, B., Frengen, E., Tateno, M., Catanese, J. J., & de Jong, P. J. (1998).
An Improved Approach for Construction of Bacterial Artificial Chromosome Libraries. Genomics, 52,
1–8.
Parolini, G. (2018). Building Human and Industrial Capacity in European Biotechnology: The Yeast
Genome Sequencing Project (1989–1996). Technische Universität Berlin preprint.
http://dx.doi.org/10.14279/depositonce-6693 (Accessed 11 April 2018).
Pool, R., & Waddell, K. (2002). Exploring Horizons for Domestic Animal Genomics: Workshop
Summary. Washington, D.C.: National Academy Press.
Renard, C., Hart, E., Sehra, H., Beasley, H., Coggill, P., Howe, K., Harrow, J., Gilbert, J., Sims, S.,
Rogers, J., Ando, A., Shigenari, A., Shiina, T., Inoko, H., Chardon, P., & Beck, S. (2006). The genomic
sequence and analysis of the swine major histocompatibility complex. Genomics, 88, 96–110.
Rheinberger, H.-J., & Gaudillière, J.-P. (Eds.) (2004). Classical Genetic Research and its Legacy: The
mapping cultures of twentieth-century genetics. London and New York: Routledge.
Rogel-Gaillard, C., Bourgeaux, N., Billault, A., Vaiman, M., & Chardon, P. (1999). Construction of a
swine BAC library: application to the characterization and mapping of porcine type C endoviral
elements. Cytogenetics and Cell Genetics, 85, 205–211.
Rohrer, G. A., Alexander, L. J., Hu, Z., Smith, T. P. L., Keele, J. W., & Beattie, C. W. (1996). A
Comprehensive Map of the Porcine Genome. Genome Research, 6, 371–391.
Rohrer, G., Beever, J. E., Rothschild, M. F., Schook, L., Gibbs, R., & Weinstock, G. (2002). Porcine
Sequencing White Paper: Porcine Genomic Sequencing Initiative. Available online at:
https://www.genome.gov/pages/research/sequencing/seqproposals/porcineseq021203.pdf - last
accessed 02.11.2017.
Schook, L., Beattie, C., Beever, J., Donovan, S., Jamison, R., Zuckermann, F., Niemi, S., Rothschild, M.,
Rutherford, M., & Smith, D. (2005a). Swine in biomedical research: creating the building blocks of
animal models. Animal Biotechnology, 16, 183–190.
Schook, L. B., Beever, J. E., Rogers, J., Humphray, S., Archibald, A., Chardon, P., Milan, D., Rohrer, G.,
& Eversole, K. (2005b). Swine Genome Sequencing Consortium (SGSC): a strategic roadmap for
sequencing the pig genome. Comparative Functional Genomics, 6, 251–255.
Stein, L. (2001). Genome annotation: from sequence to biology. Nature Reviews Genetics, 2, 493–
503.
Stevens, H. (2011). On the means of bio-production: Bioinformatics and how to make knowledge in a
high-throughput genomics laboratory. BioSocieties, 6, 217–242.
Stevens, H. (2013). Life Out of Sequence: A Data-Driven History of Bioinformatics. Chicago, IL: The
University of Chicago Press.
Strasser, B. J. (2011). The Experimenter's Museum: GenBank, Natural History, and the Moral
Economies of Biomedicine. Isis, 102, 60–96.
Sulston, J., & Ferry, G. (2002). The Common Thread: A Story of Science, Politics, Ethics and the Human
Genome. London, UK: Bantam Press.
Venter, J. C. (2008). A Life Decoded: My Genome, My Life. London, UK: Penguin Books.
Vermeulen, N. (2016). Big Biology: Supersizing Science During the Emergence of the 21st Century.
NTM Zeitschrift für Geschichte der Wissenschaften, Technik und Medizin, 24, 195–223.
Wade, N. (2001). Life Script: How the Human Genome Discoveries Will Transform Medicine and
Enhance Your Health. New York: Simon & Schuster.
Wellcome Trust (2005). Strategic Plan 2005-2010: Making a Difference. London, UK: Wellcome Trust.
Wellcome Trust (2010). Strategic Plan 2010-2020: Extraordinary Opportunities. London, UK:
Wellcome Trust.
Yerle, M., Lahbib-Mansais, Y., Mellink, C., Goureau, A., Pinton, P., Echard, G., Gellin, J., Zijlstra, C., De
Haan, N., Bosma, A. A., Chowdhary, B., Gu, F., Gustavsson, I., Thomsen, P. D., Christensen, K.,
Rettenberger, G., Hameister, H., Schmitz, A., Chaput, B., & Frelat, G. (1995). The PiGMaP consortium
cytogenetic map of the domestic pig (Sus scrofa domestica). Mammalian Genome, 6, 176–186.