Genotype analysis and graph databases

Andrew Stephen Law

August 17, 2017

MSc in High Performance Computing with Data Science
The University of Edinburgh
Year of Presentation: 2017


Abstract

The continued improvement in efficiency of meat, milk and egg production from farmed animals increasingly relies on the use of molecular marker or genome sequencing technologies. The inheritance of particular versions of DNA sequences through pedigree-recorded generations can be analysed against performance measurements within individual animals to identify regions of the genome which appear to harbour genes involved in the trait(s) under study. Variants of the genome sequence characteristic for those regions can subsequently be used in animal breeding programs to predict the value of particular animals and hence to guide selective breeding programs.

Whilst progress can be made in this fashion, a better understanding of the molecular processes underpinning the expression of the trait of interest and the identification of more tightly-linked markers – possibly even the causative mutations themselves – could lead to greater or better-targeted gains. This better understanding requires the incorporation of a great number of additional sources of information. As the volume of data and the diversity of types of data to be incorporated increase, managing and querying the data becomes a non-trivial process.

Traditionally, data for such analyses have been housed within relational database management systems (DBMS), but some data types represent an awkward fit to the strict, table-based models inherent within that approach. Specifically, pedigree information and annotations relating to gene function represent challenges to the data modeller, with their tree-like structure and variable leaf-node depths. For these graph-form data types, a more natural fit would be a graph database, and recent developments in that arena mean that there are now several possible candidate graph database engines that might be more performant across animal breeding datasets. This project aims to investigate the quantitative and qualitative performance characteristics of one such graph database engine – Neo4J – versus those of a more traditional SQL-based relational database engine (PostgreSQL) when faced with representative simulated data loads and queries.


Contents

1 Introduction
    1.1 The Importance of Animal Breeding to the UK Economy
    1.2 History of databases
    1.3 Use of DBMS for Biological and Genome Data Handling
    1.4 Choice of Database Engines for Testing

2 Description of Development and Test Systems
    2.1 Hardware
    2.2 Software

3 Installation of Database Systems
    3.1 PostgreSQL
    3.2 Neo4J

4 Data Model
    4.1 Genes, Transcripts, Exons, UTRs
    4.2 Markers, Pedigrees and Genotypes
    4.3 Genes and the Gene Ontology

5 Creation of Simulated Data Sets
    5.1 Choice of Simulation Program
        5.1.1 Reproducibility
    5.2 Generation of Minimal Exemplar Data Set
    5.3 Generation of Representative Medium Data Set
        5.3.1 Population Parameters
        5.3.2 Genome Parameters
        5.3.3 Outputs
    5.4 Generation of Representative Large Data Set
    5.5 Gene Build Data
    5.6 Gene Ontology Data

6 Loading Data
    6.1 Pre-processing of Data
        6.1.1 Gene Ontology Data
        6.1.2 Marker Data
        6.1.3 Pedigree Data
        6.1.4 Genotype Data
    6.2 PostgreSQL
        6.2.1 Gene Build Data
        6.2.2 Gene Ontology Data
        6.2.3 Marker Data
        6.2.4 Mapping Marker Data to the Gene Build
        6.2.5 Pedigree Data
        6.2.6 Genotype Data
    6.3 Neo4J
        6.3.1 Gene Build Data
        6.3.2 Gene Ontology Data
        6.3.3 Marker Data
        6.3.4 Mapping Marker Data to the Gene Build
        6.3.5 Pedigree Data
        6.3.6 Genotype Data

7 Comparative Analysis
    7.1 Disk Space Used
    7.2 Speed of Data Import
    7.3 Speed of Data Export
    7.4 Qualitative Assessment

8 Discussion and Conclusions

Appendix A Database Software
    A.1 PostgreSQL
    A.2 Neo4J

Appendix B Data Simulation
    B.1 QMSim Software
    B.2 Minimal Exemplar Data Set
        B.2.1 Input Files
        B.2.2 Output Files
    B.3 Representative Medium Data Set
        B.3.1 Pre-processing Files
        B.3.2 Input Files
        B.3.3 Output Files
        B.3.4 Processed Output Files
    B.4 Representative Large Data Set
        B.4.1 Input Files
        B.4.2 Output Files
        B.4.3 Processed Output Files

Appendix C Pre-Processing Prior to Loading

Appendix D Gene Build Data Files

Appendix E Gene Ontology Data Files and Scripts

Appendix F PostgreSQL Data Loading Script Details

Appendix G Neo4J Data Loading Script Details

Appendix H Neo4J Unmanaged Extension Code


List of Tables

7.1 Comparison of disk usage required to store the core Gene Build and Gene Ontology data. Figures in the Neo4J and PostgreSQL columns are in kilobytes.

7.2 Comparison of disk usage required to store the Pedigree/Phenotype data. Figures in the Neo4J and PostgreSQL columns are in kilobytes.

7.3 Comparison of disk usage required to store the marker data for the Medium Data Set. Figures in the Neo4J and PostgreSQL columns are in kilobytes.

7.4 Comparison of disk usage required to store the marker data for the Large Data Set. Figures in the Neo4J and PostgreSQL columns are in kilobytes.

7.5 Comparison of disk usage required to store the Genotype data for the Medium Data Set. Figures in the Neo4J and PostgreSQL columns are in kilobytes.


List of Figures

3.1 Layout of directories and key files in a default PostgreSQL installation

3.2 Layout of directories and key files in a default Neo4J installation

4.1 Relationship between chromosomes, genes, transcripts, exons, 5-prime Untranslated Regions (5’-UTRs), 3-prime Untranslated Regions (3’-UTRs) and markers

6.1 Structured Query Language (SQL) table structure used to model chromosomes, markers, genes, transcripts, 5’-UTRs, 3’-UTRs and exons in PostgreSQL

6.2 SQL table structure used to model Gene Ontology terms in PostgreSQL

6.3 Graph structure used to model chromosomes, markers, genes, transcripts, 5’-UTRs, 3’-UTRs and exons in Neo4J

6.4 Graph structure used to model Gene Ontology terms in Neo4J

7.1 Disk space required (in MB) to store the Gene Ontology terms, chromosomes, genes, transcripts, 5’-UTRs, 3’-UTRs and exons in Neo4J and PostgreSQL

7.2 Disk space required (in MB) to represent the internal relationships within the Gene Ontology data in Neo4J and PostgreSQL

7.3 Disk space required (in MB) to store the mappings between the various Gene Build data objects and between genes and the Gene Ontology terms in Neo4J and PostgreSQL

7.4 Time required (ms) to store the Gene Ontology terms, chromosomes, genes, transcripts, 5’-UTRs, 3’-UTRs and exons in Neo4J and PostgreSQL

7.5 Time required (ms) to create the internal relationships within the Gene Ontology data in Neo4J and PostgreSQL

7.6 Time required (ms) to store the mappings between the various Gene Build data objects and between genes and the Gene Ontology terms in Neo4J and PostgreSQL

7.7 Time required (ms) to load genotypes for varying numbers of individuals and markers using the “unmanaged extension” method in Neo4J


Listings

6.1 SQL script to load Chromosome data

6.2 Neo4J Cypher script to load Chromosome data

7.1 SQL script to count the number of markers that overlap Exons

7.2 Cypher script to create an “is in exon” relationship between overlapping markers and Exons

H.1 Java code to read genotypes from a file and import them into Neo4J


Glossary

ACID Atomicity, Consistency, Isolation, Durability – key properties of database transactions.

allele The name given to individual variants at a given location in the genome. A marker must – by definition – have at least two alleles.

chromosome A linear or circular DNA molecule which constitutes a unit of inheritance from generation to generation within a species.

contig A contiguous sequence of DNA, usually in reference to a sequence assembly process.

diploid A diploid organism is one that has two copies of each chromosome, one inherited from each of its parents.

DNA Deoxyribonucleic acid – the chemical molecule that carries the genetic instructions required to create a functioning organism.

exon A stretch of DNA within a gene that will be retained after splicing to form part of the mature product of that gene. Often, these exons are responsible for coding for the sequence of a protein product of the gene.

gamete The specialised sex cells (eggs, sperm) which fuse in the process of fertilisation to create an embryo. Each gamete contains half the DNA required (a single copy of each chromosome) to make the full complement for the species.

gene A sequence of DNA bases on a chromosome that is transcribed to perform some function within a cell, be that coding for a protein product or some other function.

Gene Ontology A controlled vocabulary of terms describing the function of genes.


genetic variation The differences between individuals that can be accounted for solely by differences in genome sequence between those individuals.

genome The sum total of all the DNA contained within a species’ or individual’s chromosomes, which forms the complete set of instructions for generating an instance of that species.

genotype The pair of alleles for a specific marker in a single individual.

heterozygous The name of the state whereby an individual carries two different alleles at a given marker location.

homozygous The name of the state whereby an individual carries two identical alleles at a given marker location.

intron Stretches of DNA that exist within genes and between exons. They are removed from transcripts by the process of splicing.

marker Some variation that occurs in the DNA sequence of a genome such that different versions of the sequence exist and can be assayed. The inheritance of the different versions can be tracked from parent to offspring.

microsatellite A type of genetic marker, the variation within which is the result of different numbers of short repeating sequences (e.g. ‘ACACACACAC’ versus ‘ACACAC’) at a particular point in the genome.

mutation The name given to a change in DNA sequence arising as a result of a mistake in the DNA replication process (or by other means).

pedigree A record of the parentage of individuals within a population over many generations.

phenotype A measurable or observable characteristic in an organism (e.g. growth rate, hair colour etc.).

restriction enzyme An enzyme, usually extracted from a bacterial source, that cuts DNA at a characteristic (for the specific enzyme) sequence.

selective breeding The process of managing the breeding of a controlled population of individuals, choosing to allow only those individuals that display a particular desired phenotype or trait to breed.


selective sweep The phenomenon whereby a population becomes homozygous for the alleles in a region of the genome surrounding a gene under selection.

silent mutation A mutation in the DNA sequence that has no effect on the gene product. This occurs due to redundancy in the DNA to amino acid coding such that multiple DNA sequences code for the same amino acid.

species Often defined as a collection of all of a particular type of organisms that can interbreed and produce viable offspring.

splicing A cellular process that acts to excise intron sequences (those that lie between exons) from immature transcripts.

trait An alternate term for phenotype.

transcript RNA copies of chromosomal DNA derived from genes. Transcripts are generated within a cell by the process of transcription and are then processed by splicing to remove introns. One gene may have many transcripts.


Acronyms

3’-UTR 3-prime Untranslated Region.

5’-UTR 5-prime Untranslated Region.

CSV Comma-Separated Value.

DBMS database management system.

QTL Quantitative Trait Locus.

SQL Structured Query Language.


Data and script files relevant to this thesis can be found in a Git repository at:

https://bitbucket.org/studentlaw/s1578554-msc-august-2017/overview.


Acknowledgements

I’d like to take the opportunity to thank my supervisors, Dr. Andreas Kranis and Dr. Amy Krause, for their support and input into this dissertation project.

I’d also like to thank my employers – The University of Edinburgh – and my line managers at The Roslin Institute for allowing me the time and for providing the funding for me to study for this Master’s degree.

Thanks are also due to the members of the Neo4J Slack channel, particularly Max De Marzi, Michael Hunger and Andrew Bowman, for their help in getting to grips with the Neo4J graph database engine.

Finally, I’d like to thank my family – and particularly my wife, Dawn – for their patience, understanding and continuous unwavering support, especially throughout the past two years of study.


Chapter 1

Introduction

1.1 The Importance of Animal Breeding to the UK Economy

Animal production in the United Kingdom constitutes approximately 55% of the economic output from farming at around £12.7 billion [1, 2]. Whilst the headline real-terms figure is falling year on year, that is primarily through reductions in price rather than volume of production, as cheaper products from outside the UK force Europe-wide prices lower. The harsh economic realities that the UK farming industry faces demand increases in the efficiency of production simply to stand still. It is important that UK animal producers take advantage of any and all opportunities that may be available to them to increase their competitiveness.

Within the UK, the pig and poultry production systems are dominated by a small number of animal-breeding companies, which supply breeding animals to producers. Producers cross these breeding animals to produce fast-growing meat animals which can be grown efficiently to market weight. Animal-breeding companies have a competitive interest themselves in producing better breeding stock, i.e. ones that will produce more efficient animals for their customers, as it is only by marketing themselves better to the producers that they will maintain their edge.

Aviagen® is one of the major suppliers of breeder chickens in the world, marketing a variety of lines of birds under the Ross®, Arbor Acres® and Rowan Range® brand names. These birds are the product of an extensive selective breeding program that has been in operation for many years. Initially driven by simple selection on measurable traits, interest has now shifted to molecular selection techniques whereby the genotype of an animal can be used as a predictor of its – and its offspring’s – phenotype performance. Underpinning these selection programmes are comprehensive data collection and management operations. Optimising the data pipelines and incorporating more extensive and complex data types into the process may assist in improving the selection criteria on which the programmes may operate. However, some of the data types are problematic in conventional relational databases either because of their volume (in the case of genotypes) or tree-like structure (gene function annotations).

1.2 History of databases

The process of storing data in organised record systems pre-dates the development of the computer: publications from 1951 describe the seemingly routine use of Hollerith cards as a physical database for storing and analysing chemical structure data [3]. It is in more recent times, however, that databases have become an indispensable business and scientific tool. For brief reviews of the history of databases – particularly SQL databases – see [4] and [5].

Modern SQL-based relational databases trace their origins back to the seminal paper by British mathematician E.F. Codd [6]. All modern relational DBMS build on the designs proposed nearly 50 years ago, albeit with extensive implementation improvement through the years. The relational model requires a fairly rigid, ‘table’-based data model with distinct data model types stored in separate compartmentalised blocks. Relationships between data types are built via table joins. Although data columns may be constrained to only hold values that exist in other, ‘foreign’ tables, there is nothing in the relational database design and implementation that explicitly defines the relationships. Instead, the relationships are formed within the query code and by column naming convention. The better-defined a data model can be, the better its fit to a relational schema.

Graph databases, on the other hand, are a kind of NoSQL database. Technically, NoSQL databases have existed since before relational DBMS, as all databases prior to the development of the first SQL-based systems used methods other than SQL to store and query their contents. However, the NoSQL phrase appears to have been coined in recent times as the new generation of free-form databases have emerged to address the high volume, high throughput data processing challenges faced by companies such as Facebook and Twitter. See [7] for a discussion of the history of the NoSQL terminology.

Graph databases are a specific type of NoSQL database, focussing on between-object links as a central tenet. Rather than define tables to represent each object type and then join between them based on object ids stored in columns in tables each side of the join, graph databases store links or pointers to the related objects directly in the storage of each object itself. Thus, it is claimed, relationships can be traversed much more rapidly than via traditional relational DBMS. In addition, the syntax of the various graph database access languages is claimed to intuitively convey the relationships between objects much better than relational SQL commands [8].
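The contrast between the two storage styles can be illustrated with a small sketch. This is not code from the thesis: the five-animal pedigree, the table and column names, and the `Node` class are all hypothetical, and Python's built-in `sqlite3` module stands in for a relational engine. The first function finds ancestors by repeatedly joining a table to itself on id columns; the second follows object references held directly on each node.

```python
import sqlite3

# Hypothetical pedigree: (animal, sire, dam); None marks an unknown parent.
PEDIGREE = [
    ("A", None, None),
    ("B", None, None),
    ("C", "A", "B"),
    ("D", None, None),
    ("E", "C", "D"),
]


def ancestors_relational(animal):
    """Relational style: parents are plain id columns, and the relationship
    only comes into being inside the query, here as a recursive self-join."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE animal (id TEXT PRIMARY KEY, sire TEXT, dam TEXT)")
    con.executemany("INSERT INTO animal VALUES (?, ?, ?)", PEDIGREE)
    rows = con.execute(
        """
        WITH RECURSIVE anc(id) AS (
            SELECT id FROM animal WHERE id = ?
            UNION
            SELECT p.id
            FROM anc
            JOIN animal AS c ON c.id = anc.id
            JOIN animal AS p ON p.id IN (c.sire, c.dam)
        )
        SELECT id FROM anc WHERE id <> ?
        """,
        (animal, animal),
    ).fetchall()
    con.close()
    return {row[0] for row in rows}


class Node:
    """Graph style: each node stores direct references to its parent nodes."""

    def __init__(self, name):
        self.name = name
        self.parents = []


def build_graph(pedigree):
    nodes = {name: Node(name) for name, _, _ in pedigree}
    for name, sire, dam in pedigree:
        nodes[name].parents = [nodes[p] for p in (sire, dam) if p is not None]
    return nodes


def ancestors_graph(nodes, animal):
    """Walk parent references directly; no id lookup or join is needed."""
    seen, stack = set(), list(nodes[animal].parents)
    while stack:
        node = stack.pop()
        if node.name not in seen:
            seen.add(node.name)
            stack.extend(node.parents)
    return seen
```

Both functions return the same set of ancestors for a given animal. The sketch says nothing about relative speed, which depends on the engine, indexing and data volume.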

1.3 Use of DBMS for Biological and Genome Data Handling

The journal Nucleic Acids Research publishes a special issue each year dedicated to reporting biological databases. The 2017 issue [9] reports 54 new databases and provides updates on nearly 100 others. The specifics of implementation are not reported in all cases – the focus of the publication is on the data that each system contains rather than the engine in which it is stored – but where the information is available we see a very wide range of databases being deployed. These range from “simple flat file databases” [10] to SQLite [11] and various versions of MySQL [12, 13], right up to massive systems that operate with Oracle internally, presenting data to the wider scientific community via dumps into public-facing MySQL instances [14].

Closer to home, we have previously developed systems for handling genetic linkage mapping and QTL-mapping data at the Roslin Institute using PostgreSQL [15, 16].

Given that we frequently encounter network-like interactions within biological systems, graph databases would seem to be a logical fit to many biological problem domains. However, there do not currently seem to be many systems deploying graph database solutions. This is unsurprising, as the biological scientific community appears traditionally conservative in its approach and has little time to spare to adapt functional systems to new technologies unless there is a significant advantage to doing so.

A recent workshop [17] hosted by Neo4J to address the life sciences community did showcase a few graph-based systems, including the EMBL-EBI ontology lookup service [18], the Reactome pathway database [19] and the MetaProteomeAnalyzer [20], suggesting that there is some interest in putting these tools to use. However, it is probably fair to say that these are currently the exception rather than the rule.

1.4 Choice of Database Engines for Testing

The DB-Engines website provides a set of rankings of popular (and less popular) relational and non-relational DBMS. The rankings are based on perceived popularity in online fora, including for example the number of questions asked and answered on the “StackOverflow” programmers’ web site, the number of Tweets mentioning each system and several other measures of activity and interest. The most popular relational DBMS as of August 2017 [21] was Oracle [22], followed by MySQL [23], Microsoft SQL Server [24] and then PostgreSQL [25]. Of these, Oracle and Microsoft SQL Server are commercial products whereas MySQL and PostgreSQL are free, open source products. Traditionally, PostgreSQL – derived as it is from the INGRES DBMS (now Actian X; [26]) first developed by the University of California, Berkeley – has been viewed as the more robust of the two, but MySQL has had fully-functional transactional ACID [27] compliance via the InnoDB engine [28], which has been the default engine since version 5.5, so either of these is a good choice for implementations that wish to be free from licensing fees. In the graph database rankings, Neo4J [29] is far and away the most popular graph database system with a calculated score more than 4 times that of its closest rival [30].

Based on current popularity and apparent cleanness of syntax of the database query language, we chose to test Neo4J as our graph database system. Given our knowledge of the platform, we chose PostgreSQL as the relational DBMS.


Chapter 2

Description of Development and Test Systems

2.1 Hardware

The majority of the development and preparation for the work described in this thesis was performed on a Late 2012 27-inch ‘all-in-one’ iMac system running OS X ‘El Capitan’ Version 10.11.6 (development machine). The system was equipped with a single Intel Core i5 running at 3.2 GHz and 24GB DDR3 RAM. The single internal disk was an Apple-supplied Western Digital SATA Hard Disk spinning at 7200 rpm.

Performance testing was carried out on a Dell Linux Server running Ubuntu 16.04.1 LTS (performance-testing system). This machine was equipped with four Intel® Xeon® E7-8870 v3 CPUs running at 2.10GHz (1.2 GHz to 2.0 GHz) and 1TB DDR4 RAM. Internal storage was in the form of a 1TB disk, with a further 15TB of enterprise-class storage provisioned via a Dell NAS server, presented as four separate partitions of 3.3TB, 6.6TB, 1.7TB and 3.3TB respectively. Unless stated otherwise, all work carried out on the performance-testing system utilised the 6.6TB external partition for data storage.

2.2 Software

The version of make on the performance-testing system was GNU Make 4.1 and the gcc compiler was from the GNU Compiler Collection gcc version 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609. The version of tar used to unpack downloaded database software distributions was GNU tar 1.28.

The generation of the chromosome-specific QMSim parameter files using the Python script file described in Section 5.3.2 was run on the performance-testing system using Python version 2.7.13 :: Anaconda 4.3.1 (64-bit).

The pre-processing of the Gene Ontology data described in Section 6.1.1 and the splitting of the Medium Data Set simulated genotype data file described in Section 6.1.4 were run on the performance-testing system using the version of perl supplied by the operating system (version v5.22.1).

The conversion of ‘pretty-printed’, space-delimited pedigree and marker data files to Comma-Separated Value (CSV) files (Sections 6.1.3 and 6.1.2) was performed using the command line scripting program awk. The versions of awk available were GNU Awk 4.1.3 and ‘awk version 20070501’ on the performance-testing system and the development machine respectively.
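For readers who want to see what this kind of conversion involves, the following is a minimal Python sketch of the same idea. It is not the awk command used in the thesis, and the column names and values in the sample are invented for illustration. It assumes that no field contains embedded spaces.

```python
import csv
import io


def space_delimited_to_csv(lines, out):
    """Write each non-blank, whitespace-delimited line to `out` as a CSV row."""
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        fields = line.split()   # splits on any run of spaces or tabs
        if fields:              # skip blank lines
            writer.writerow(fields)


# Hypothetical pretty-printed pedigree excerpt (illustrative columns only).
sample = """\
Progeny   Sire   Dam   Sex
      1      0     0     M
      2      0     0     F
      3      1     2     F
"""

buffer = io.StringIO()
space_delimited_to_csv(sample.splitlines(), buffer)
print(buffer.getvalue())
```

Using the `csv` module rather than a plain string join means any field that happens to contain a comma or quote character is escaped correctly.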

Java code was written (see Section 6.3.6) using the NetBeans IDE version 8.2 [31] on the development machine described above. Code was compiled on the development machine against Oracle Java version 8, Update 131 build 11 [32] using the version of maven (mvn) supplied with NetBeans (Apache Maven 3.0.5 [33]) as the build automation tool. Code was run on the performance-testing system using Oracle Java version 8, Update 101 build 13.


Chapter 3

Installation of Database Systems

Although both PostgreSQL and Neo4J are available in pre-packaged, installer form forMacintosh and Windows systems and in apt-format for Debian-based Linux systems,the tar, gzipped bundles that are also available provide much greater flexibility. Inparticular, for Neo4J, multiple instances of the database can be installed alongsideeach other and managed under a normal user account whereas the centrally-installedpackaged versions require super-user permissions to administer. This is particularlyimportant when multiple clean Neo4J instances are required in rapid succession, suchas when assessing the speed and storage requirements for the loading of different datashapes (see Section 6.3.6).

In addition, for both systems, the control over the location of data storage is much more easily managed in a custom-built and configured installation. This is important when considering the performance of the underlying disk infrastructure and permits direct system-to-system comparisons to be made.

3.1 PostgreSQL

We downloaded the source code of the latest production version (version 9.6.3) of the PostgreSQL software from the PostgreSQL website on 7th August 2017. Details of the file downloaded are presented in Appendix A.1. The source code is supplied with a standard ./configure script which allows the specification of a top-level installation directory using the --prefix command-line argument. We configured the build process with a prefix argument pointing to a directory on the 6.6TB NAS-mounted partition mentioned in Section 2.1 (PG_HOME) and then ran the make and make install commands. The code compiled and installed successfully, giving us a local, user-owned installation. We then initialised the data store: we set an environment variable called PGDATA to a subdirectory called data within the PG_HOME directory, making it the location of all data and configuration files for this instance of PostgreSQL (this is the default location), and ran the initdb command. Having done so, we edited the postgresql.conf file thus created to specify a non-standard port number (so as not to clash with any versions of PostgreSQL already running on the system) and started the instance using the pg_ctl command.

Details of the directory and file layout for a default PostgreSQL installation are shown in Figure 3.1. Helpful command line utilities to create and delete databases are provided. A PostgreSQL installation may contain multiple separate database instances, each of which may be isolated or integrated with other instances within the single, managed environment. The installation is started and stopped by means of the pg_ctl command, and the server process can be accessed via the psql binary which provides a SQL-based shell interface.

As mentioned earlier, all data within the PostgreSQL database(s) are stored in the data subdirectory and configuration is by means of the postgresql.conf text file. Further access control is configured in the pg_hba.conf and pg_ident.conf files.

3.2 Neo4J

We downloaded the 3.2.1 Community Edition version of the Neo4J software from the Neo4J website on 25th June 2017. Details of the file downloaded are presented in Appendix A.2.

Subsequently, we uncovered a number of bugs in this version of the software which hampered our ability to perform queries correctly. We therefore downloaded the 3.2.3 Community Edition on 14th August 2017 and used that for later testing. Details of the file downloaded for this later version of the software are also presented in Appendix A.2.

Installation is a simple matter of unpacking the gzipped tar file into a suitable location and renaming the directory thus extracted. All instances of Neo4J used during this study were created on the same 6.6TB NAS-mounted partition mentioned in Section 2.1 that was used for the PostgreSQL data storage. The location of key files and directories within a default Neo4J installation is shown in Figure 3.2.

Figure 3.1: Layout of directories and key files in a default PostgreSQL installation

base-directory
    bin
        createdb
        dropdb
        pg_ctl
        pg_dump
        postgres
        psql
        ...
    data
        ...
        pg_hba.conf
        pg_ident.conf
        postgresql.conf
    include
    lib
    logfile
    share

The binary files necessary to start, stop and administer the server instance (neo4j) are found in the bin directory, along with tools that allow query and manipulation of the data graph using the Neo4J graph query language, Cypher (cypher-shell and the now-deprecated neo4j-shell). Data to be imported via Cypher commands must, by default, be located in the import directory and log files are found in the logs directory. Custom code written in Java and accessible by means of POST requests to the server instance must be compiled, and the resulting jar file placed in the plugins directory. Configuration of all the above, as well as all other aspects of the Neo4J instance, is by means of key:value pair entries in the neo4j.conf file found within the conf directory.

The data themselves (nodes, relationships, indexes etc.) are stored in files and sub-directories under the data/databases/graph.db tree. The data/dbms tree contains user authentication and role-based data.

Figure 3.2: Layout of directories and key files in a default Neo4J installation

base-directory
    bin
        cypher-shell
        neo4j
        neo4j-admin
        neo4j-import
        neo4j-shell
    certificates
    conf
        neo4j.conf
    data
        databases
            graph.db
        dbms
    import
    lib
    logs
    plugins
    run

Chapter 4

Data Model

4.1 Genes, Transcripts, Exons, UTRs

We attempted to design the data model to map closely to the core biological concepts underpinning the data sets. The inherited biological material containing the “blueprint” or “set of instructions” for making a given individual is in the form of DNA. DNA is carried within the cell in long, chain-like structures known as chromosomes. Collectively, the full set of DNA held as chromosomes and characteristic of a given species is known as the genome of that species.

Only some of the DNA can be considered to be “blueprint” material in the conventional sense – a large proportion of the chromosome sequence is structural or serves some other purpose. The parts of the genome that provide the blueprint instructions are known as genes. The role of the structural DNA that sits between genes is complex and beyond the direct scope of this study.

Genes themselves do not always represent single, contiguous stretches of DNA within a chromosome, and may be composed of multiple, smaller stretches of DNA called exons. The cell contains enzymatic machinery that works along the genes within the chromosome, making copies of the DNA sequence and producing what are known as transcripts. Transcripts are then further processed to excise the portions of the sequence that sit between the exons, which are known as introns. Exons – in large part – contain the instructions that can be translated to produce proteins, which are the ultimate output of the majority of genes within the cell. However, some parts of exons at the start and end of transcripts contain further sequences that are not translated. These are known as the 5’-UTR and 3’-UTR respectively and they act to modulate the translation process, amongst other things.

Thus, one gene will give rise to one or more transcripts, each of which is composed of a number of exons. Each transcript for a protein-coding gene will have a 5’-UTR, followed by a series of exons spliced together, followed by a 3’-UTR.

The relationships between these multiple biological entities are illustrated in Figure 4.1.

Variations (mutations) in the genome sequence naturally occur as mistakes are made in the DNA replication process. The different variants at a particular location in the genome are known as the alleles for that location. Variations that occur in the cells that go on to form the gametes are subsequently inherited down through the generations, assuming that the changes do not introduce deleterious effects in the offspring. The effect of these variations on the functioning of genes is what underpins genetic variation. Selective breeding results in the gradual replacement of the ‘poorer’ allele with the more favourable one as animals carrying the favourable allele are allowed to breed whereas those that do not carry it are not.

The location of a mutation influences the likely effect of the mutation on the functioning of the gene. The red numbered arrows in Figure 4.1 indicate some possible locations of mutations in the genome. Mutation 3 occurs inside an exon. Exons provide the coding for the protein products of the genes; changes in the sequence here may, potentially, alter the protein sequence or disrupt it entirely, resulting in the production of a different protein structure. This altered protein structure may perform its expected role better or worse than the non-mutated version, but the majority of non-silent mutations result in the removal of the protein function entirely. Mutations in exons therefore tend to have major effects.

Mutations 1 and 4 occur in the 5’-UTR and 3’-UTR respectively. Mutations in the untranslated regions can alter the rate at which the transcripts are converted into protein, as well as the lifespan of the transcript within the cell. Thus mutations in the UTRs can result in more (or less) functional protein being produced for a longer (or shorter) time.

Mutation 2 lies between exons in an intron. Mutations here can alter the splicing of a transcript. Indeed, mutations at location 2 may alter the relative proportions of the two transcripts shown in Figure 4.1. Since the two transcripts code for different functional proteins, this also has the potential to change the phenotype of the animal in subtle or not-so-subtle ways.

Finally, mutation 5 lies outside of the gene. Mutations in these locations may alter the places and times that a gene is expressed (not all genes are expressed in all cells at all times) and so these mutations may also result in phenotypic changes.

Figure 4.1: Relationship between chromosomes, genes, transcripts, exons, 5’-UTRs, 3’-UTRs and markers

Although scientists do not claim to have a full understanding of the nature of genome functionality, the above classes of mutation locations do provide some insight into the potential effects that each might have. Conversely, if we have an observed phenotypic change, we may be able to infer something about the type of location of the mutation that underpins it and so home in on possible candidate variants and alleles.

4.2 Markers, Pedigrees and Genotypes

As described in the previous section, it is genetic variation that underpins differences between individuals within a population. Selective breeding on the basis of one or more phenotypes will result in the favouring of the ‘good’ alleles within that population, resulting in a gradual increase in the proportion of the ‘good’ allele at the expense of the ‘poor’ allele. Because there are a small, finite number of chromosomes with a consistent, fixed order, genes and variants close to the selected ‘good’ allele will also be favoured.

The genetic variation at a given point in the genome can be measured, or assayed, by a number of different methods. Being able to repeatedly and reliably assay the variants present at a specific location (whether that location is known or not) affords that variant/location combination the status of marker. Early marker techniques involved amplification of the genomic DNA, followed by digestion of the amplified product with a restriction enzyme and the separation and visualisation of the resulting fragments on an acrylic gel (e.g. see [34] and [35]). Other methods, for example using microsatellites, were also developed (e.g. [36], [37]) which allowed a greater number of markers to be measured simultaneously. Following the sequencing of farm animal genomes (chicken [38], pig [39], cow [40]) and the identification of whole-genome population-level variation, a number of high throughput ‘SNP-chips’ became available (chicken [41], pig [42], cow [43]) which further increased the number of markers that could be assayed in a given study.

Because higher organisms are diploid, each individual has two copies of each location in the genome. Assaying the marker for any given location will therefore measure both copies. The description of the two variants or alleles of a marker in one individual is known as the individual’s genotype for that marker. If both alleles are the same, then the individual is said to be homozygous for the marker. If the individual has two different alleles for a marker, then the individual is said to be heterozygous.
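The distinction between homozygous and heterozygous genotypes can be expressed as a one-line comparison of the two alleles. The following is a minimal illustrative sketch; the function name and the use of integers to label alleles are assumptions made for the example and are not taken from the study's code.

```python
def classify_genotype(allele_a, allele_b):
    """Classify a diploid genotype from its two alleles.

    Allele labels are arbitrary (integers here, purely for illustration).
    """
    return "homozygous" if allele_a == allele_b else "heterozygous"


# Two copies of the same allele versus two different alleles.
print(classify_genotype(1, 1))
print(classify_genotype(1, 2))
```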

By recording pedigree and genotype data during selective breeding experiments, it is possible to analyse changes in the proportions of alleles to look for regions of the genome that are associated with (or are being altered by selection for) the trait. Such experiments are known as Quantitative Trait Locus (QTL)-mapping or selective sweep studies. Databases to support smaller-scale examples of these kinds of studies have previously been established [16].

4.3 Genes and the Gene Ontology

Approaching the experiment from the other point of view, there is a wealth of data in the published scientific literature that describes the effects of removing (“knocking out”) or over-expressing specific gene products in cells or in entire organisms. Given knowledge of the gene targeted, and the effect of the targeting on the phenotype observed in the targeted cell or organism, it is possible to ascribe a function to the gene. In an attempt to impose some systematic rigour over the terms being used to describe the phenotypes, a consortium of experimental scientists working primarily on Drosophila fruit flies developed a controlled vocabulary known as the Gene Ontology [44]. This ontology has since expanded to incorporate terms and phenotypes from a large number of other species and has become the de facto standard for annotating gene function.

In its simplest form, the Gene Ontology represents a directed graph of terms, split into three general classifications, namely: ‘biological process’, ‘cellular component’ and ‘molecular function’. Terms are related amongst themselves by “is a”, “is part of”, “is alias of” and “regulates” relationships. The tree structure allows functions to be annotated at the lowest level possible whilst still inheriting their terminology from higher nodes in the tree.
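The idea that an annotation made at a low-level term is inherited by all higher terms can be illustrated by walking the “is a” edges upwards from a term. The sketch below uses a hypothetical toy ontology with made-up identifiers (not real Gene Ontology terms) and is not the mechanism used in either database system in this study.

```python
def ancestors(term, is_a):
    """Collect every term reachable from `term` by following 'is a' edges upwards.

    `is_a` maps a term identifier to a list of its direct parents.
    """
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical toy ontology: child -> direct parents. A term may have
# more than one parent, so the structure is a directed graph.
toy_is_a = {
    "term:leaf": ["term:mid_a", "term:mid_b"],
    "term:mid_a": ["term:root"],
    "term:mid_b": ["term:root"],
}

print(sorted(ancestors("term:leaf", toy_is_a)))
```

A gene annotated with the leaf term would therefore also be found by a query for any of the returned ancestor terms.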

Chapter 5

Creation of Simulated Data Sets

5.1 Choice of Simulation Program

The National Cancer Institute (part of the United States government’s National Institutes of Health) have a specific Genetic Simulation Resources team who maintain an interactive reference/comparison page with which to compare genetic simulation software packages [45]. This provides an extensive list of criteria on which to compare, including the type of data and population to be simulated, the method used, the inputs and outputs and the interface provided. Using these search criteria, combined with previous experience and recommendation from project sponsor and supervisor Dr Andreas Kranis, we elected to use the QMSim package [46] for generating data.

The most up-to-date version of the QMSim software (Version 1.10 – dated 12th July 2013) was downloaded from the University of Guelph web server [47] on 30th June 2017. We downloaded both the Mac and Linux versions of the software. The checksums of the files downloaded are reported in Appendix B.

The software is supplied with a comprehensive User’s Guide which describes the numerous configuration options available, along with a number of example parameter files illustrating typical simulation scenarios.

5.1.1 Reproducibility

Any process that generates simulated data usually incorporates some element of randomness. This stochastic element can make simulations difficult to repeat, which means that users must store and publish the output files from the simulations in order to allow others to repeat or extend their analyses. Given that genotype files in particular can be large, this imposes an unwelcome burden on scientists in the form of the cost of long-term data storage.

Fortunately, the QMSim software has been designed to alleviate this problem. It uses a seed file to initialise the random number generator, so is guaranteed to produce identical results if given the same seed file and parameter file in the future (subject to the correct configuration of the parameter file). The reproducibility of the outputs, however, is also dependent on running the code in single-threaded mode. There is therefore a trade-off between reproducibility and speed of simulation. In all our simulations, we chose to retain the seed file and configure the parameter file to run single-threaded using that seed file. In that way we need only retain the parameter file and seed file and can regenerate our simulated data at any point in the future.

5.2 Generation of Minimal Exemplar Data Set

The software has an extensive range of configuration options. These fall into five distinct groupings, namely global parameters, those that describe the history of the main population to be simulated, some defining the details of the selective breeding process, a further set describing the genome structure and a final collection of parameters that define the outputs to be generated.

In order to investigate the running of the program and to allow us to examine typical output files ahead of data modelling and import into the database systems, we ran a simple simulation using a set of parameters based on the QMSim-supplied example parameter file ex01.prm. This describes a population selected over 10 generations from a simple historical set of animals. Each generation comprises 20 males and 400 females selected from the previous generation based on their performance trait score, which is derived from their QTL genotype complement. Pairs of selected animals are mated at random and produce litters of 2 animals per generation. The ratio of males to females amongst the animals produced in each generation is 50:50.

In our minimal simulation, we reduced the number of chromosomes to 2, each of which was 50cM long with 100 markers and 25 QTL apiece. All markers had 2 alleles with equal likelihood in the initial population and were positioned at random. Quantitative trait loci had 2, 3 or 4 variants with equal likelihood and were also positioned at random across the chromosomes.

We chose to keep the outputs as defined in the supplied parameter file ex01.prm, namely data relating to individuals (pedigree, sex, generation number, number of progeny grouped by sex, inbreeding coefficient, degree of homozygosity, performance trait value and residual value), brief statistical data, genotype data for each of generations 8, 9 and 10 and linkage maps for markers and QTLs.

The simulation ran effectively instantaneously on both the development machine and the performance-testing system and generated identical output files on each. The names and checksums of the input and output files relating to this data simulation are listed in Appendix B.2.

The output files from the QMSim program are space-delimited, ‘pretty-printed’ files designed to be easy to view using mono-spaced fonts in a command-line or text editor environment. Of relevance to us in the initial phase of data loading and system testing are the lm_mrk_001.txt, p1_data_001.txt and p1_mrk_001.txt files.

The lm_mrk_001.txt file (Linkage Map Marker file) contains details of the simulated genetic markers. The file presents the markers in serial order along the chromosomes, detailing marker name, chromosome to which it is mapped and position on that chromosome (expressed in genetic linkage map coordinate units).

The p1_mrk_001.txt file (Marker genotype data for population p1) contains a grid of genotype values for the generations of animals stipulated in the QMSim parameter file (generations 8, 9 and 10 in this instance). The genotypes for each individual are presented on a separate line, with marker genotypes presented in columns in the same order as the markers are listed in the lm_mrk_001.txt file. The first column on each line is the unique identifier pertaining to the individual (animal) in question and each subsequent pair of columns represents the two alleles of the genotype for the next marker in sequence. Thus, if there are M markers, each row in the p1_mrk_001.txt file will contain (M × 2) + 1 columns (2 for each marker plus one for the individual identifier).
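The row layout just described, one identifier followed by M pairs of allele columns, can be sketched as a small parsing function. This is an illustrative sketch only: it assumes the whitespace-padded layout described above, ignores any header line, and the function name and sample values are hypothetical rather than taken from the study's loading code.

```python
def parse_genotype_line(line, marker_count):
    """Split one genotype row into (individual_id, [(allele1, allele2), ...])."""
    fields = line.split()  # split() with no argument collapses runs of padding spaces
    expected = marker_count * 2 + 1
    if len(fields) != expected:
        raise ValueError("expected %d columns, found %d" % (expected, len(fields)))
    individual_id = fields[0]
    alleles = fields[1:]
    # Consecutive pairs of columns are the two alleles of one marker.
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return individual_id, genotypes


# Hypothetical row for an animal genotyped at M = 3 markers (7 columns).
animal, genotypes = parse_genotype_line("  4021   1 2   2 2   1 1", 3)
print(animal, genotypes)
```

Checking the column count against (M × 2) + 1 gives an early warning if the marker map and the genotype file have drifted out of step.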

The p1_data_001.txt file (performance and pedigree data for population p1) contains details of parentage, sex, generation number and trait performance, along with a number of measurements relating to the genetic makeup of each individual. Again, each individual is represented exactly once in the file. The individual identifiers in this file correspond directly to those in the p1_mrk_001.txt file.

Having characterised the types of output to expect from the QMSim software, we moved on to simulate a data set more representative of a typical commercial breeding programme.

5.3 Generation of Representative Medium Data Set

5.3.1 Population Parameters

To simulate a typical avian selective breeding programme, we defined an initial historical population of 2000 individuals, evenly split between male and female. We then set parameters to require QMSim to simulate 1000 generations of random matings between individuals, with no selection. Having done that, we further defined a small, ‘bottleneck’ population of some 550 individuals (50 male, 500 female) selected on high trait performance and bred for one generation with a litter size per mating of 10 animals. From this population, 100 males and 1000 females were used to establish a selective breeding population which was then bred and selected over a further 20 generations, again with a litter size of 10 per mating.

5.3.2 Genome Parameters

Avian genomes in general and the chicken genome in particular are far more complex than typical mammalian genomes. The size of the chromosomes varies enormously from the largest to the smallest, so in order to simulate the effects of a selective breeding programme on the chicken genome we needed to configure the simulation to take account of this variation in size. Therefore, rather than defining a single chromosome configuration section to represent the entire genome, we configured one per chromosome as appropriate to the actual chromosome sizes.

To do this, we extracted the sizes of the chicken chromosomes (as defined by the sequence assembly Gallus_gallus-5.0 [48]) as listed on the relevant NCBI Genome Overview page [49]. Not all chicken chromosomes have been definitively identified and allocated to the sequence assembly (see Section 5.5) and the sex chromosomes introduce unnecessary complexity. We therefore restricted the chromosomes detailed in the sequence lengths file accordingly. We then wrote a simple Python script (build_params.py) to take the chromosome lengths, convert them from sequence base counts to approximate genetic map lengths and distribute a number of markers (supplied as a command-line parameter) proportionately across those chromosomes. The output from the script is the text necessary for the genome configuration section of a QMSim parameter file. The details of the Python script and the chicken chromosome lengths input file are listed in Appendix B.3.
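The proportional allocation step can be sketched as follows. This is not the build_params.py script itself (which is listed in Appendix B.3); it is a minimal illustration that assumes each chromosome receives a share of the total map length and of the markers in proportion to its sequence length, with marker counts rounded down. The function name, the rounding rule and the chromosome lengths are all hypothetical.

```python
def allocate_markers(chrom_lengths_bp, total_map_cm, total_markers):
    """Give each chromosome a share of map length and markers by sequence length.

    Returns {chromosome: (map_length_cm, marker_count)}.
    """
    total_bp = sum(chrom_lengths_bp.values())
    plan = {}
    for name, length_bp in chrom_lengths_bp.items():
        map_cm = total_map_cm * length_bp / float(total_bp)
        # Integer floor division: rounding down is an assumption of this
        # sketch, so the allocated total can fall slightly short of the request.
        markers = total_markers * length_bp // total_bp
        plan[name] = (map_cm, markers)
    return plan


# Hypothetical chromosome lengths, not the real chicken assembly values.
hypothetical_lengths = {"chrA": 600, "chrB": 300, "chrC": 100}
plan = allocate_markers(hypothetical_lengths, 100.0, 1000)
print(plan["chrA"], plan["chrB"], plan["chrC"])
```

Any rounding rule applied per chromosome means the final marker total may differ slightly from the number requested.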

For the purposes of the representative medium data set, we configured the script to assume a total genetic map length of 3,000 cM and distributed 50,000 markers. This corresponds to the approximate size of the Illumina 60K SNP chip described by Groenen et al. [41].

5.3.3 Outputs

As previously, we configured the QMSim simulation program to export data relating to individuals, brief statistical data, genotype data for animals in the last 3 generations (18, 19 and 20 in this instance) and linkage maps for markers and QTLs.

The simulation ran on the Linux performance-testing system in just under 3 minutes. Details of the files thus generated are listed in Appendix B.3.

After distributing the markers proportionately across the chicken genome, as described in Section 5.3.2 above, we were left with a dataset containing 49,985 markers after rounding. The medium (50k) population therefore contains 1,499,550,000 (approximately 1.5 billion) genotypes.
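The genotype total can be checked with a short calculation. The sketch below assumes, as an inference from the population parameters in Section 5.3.1, that each exported generation consists of 1000 dams × 10 offspring = 10,000 animals, and that three generations are exported; the function and parameter names are hypothetical.

```python
def genotype_count(markers, dams, litter_size, generations_exported):
    """Number of genotypes = genotyped animals x markers per animal."""
    animals_per_generation = dams * litter_size
    genotyped_animals = animals_per_generation * generations_exported
    return genotyped_animals * markers


# Medium data set: 49,985 markers, 30,000 genotyped animals.
print(genotype_count(49985, 1000, 10, 3))  # 1499550000
```

Under these assumptions the calculation reproduces the figure of 1,499,550,000 quoted above.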

5.4 Generation of Representative Large Data Set

For the large data set, we used population parameters identical to those for the medium data set (see Section 5.3.1). We also used the same chromosome sizes and Python parameter-building script but configured to distribute 5 million markers rather than 50 thousand. This corresponds to the range of marker variants that are likely to be detectable using whole genome re-sequencing methods. With this number of markers, in conjunction with the skewed distribution of the chromosome sizes in the chicken genome, we ran into limitations of the QMSim simulation software, which restricts the number of markers to 400 thousand per chromosome. We therefore modified the code we had written to generate the genome section of the QMSim parameter file in order to accommodate this limit. The excess markers left over from the allocations to the larger chromosomes were evenly distributed across the smaller chromosomes to ensure the overall total number was unaffected. Thus the marker density on the larger chromosomes for the Large Data Set is lower than that across the smaller chromosomes.
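The capping and redistribution step can be sketched as below. This is an illustration of the idea rather than the modified script itself: it assumes that allocations above the cap are trimmed and that the excess is shared evenly among the chromosomes that were below the cap. The function name, the tie-breaking of the remainder and the example counts are hypothetical.

```python
def apply_marker_cap(allocation, cap):
    """Trim allocations above `cap` and share the excess among the other chromosomes.

    `allocation` maps chromosome name to its proportional marker count.
    """
    capped = {name: min(count, cap) for name, count in allocation.items()}
    excess = sum(allocation.values()) - sum(capped.values())
    recipients = sorted(name for name, count in allocation.items() if count < cap)
    if excess and not recipients:
        raise ValueError("no chromosome has room for the excess markers")
    if recipients:
        share, remainder = divmod(excess, len(recipients))
        for index, name in enumerate(recipients):
            # The first `remainder` recipients take one extra marker so that
            # the overall total is preserved exactly. This sketch does not
            # re-check the cap after topping up.
            capped[name] += share + (1 if index < remainder else 0)
    return capped


# Hypothetical proportional allocation with a cap of 400 markers.
result = apply_marker_cap({"big": 700, "mid": 250, "small": 50}, 400)
print(result)
```

Because the excess is shared evenly rather than proportionately, the smaller chromosomes end up with a higher marker density than the capped ones, as noted above.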

Again, we configured the simulation to export pedigree, trait, statistics, genotype data for the last 3 generations (18, 19 and 20) and linkage maps. The simulation ran on the Linux performance-testing server in just under 5 hours and 30 minutes. Details of the files thus generated are listed in Appendix B.4.

This simulation distributed 5 million markers across the genome, resulting in 4,999,971 markers after rounding in the distribution calculations. Thus, the large (5m) population contains 149,999,130,000 (approximately 150 billion) genotypes. It is of note that the genotype output file for this simulation takes up 560GB of disk space.

5.5 Gene Build Data

Data pertaining to the current state of knowledge of the gene content of the chicken genome were downloaded from the Biomart system [50] provided as a part of Ensembl genomes [51, 52]. This corresponds to Ensembl gene build 89 (May 2017 – [53]) which is based on chicken genome assembly Gallus_gallus-5.0 [48].

As discussed in Section 5.3.2, the chicken genome differs from mammalian genomes in that the lengths of chromosomes cover a very wide range, with the longest chromosome being orders of magnitude bigger than the smaller, “micro” chromosomes. The smaller chromosomes are so small that it has been difficult to identify sequences that map to some of them. Consequently, not all the chromosomes are represented in named form within the sequence assembly, which contains a huge number of short, unmapped ‘contigs’. For the purposes of this study, we elected to ignore all non-named contig sequences and focus on the better-defined, named chromosome sequences. We also explicitly ignored the sex chromosomes (‘Z’ and ‘W’) as the management of genotypes on the sex chromosomes introduces a further level of complexity which is beyond that which we aim to address in this study. Thus all data exported from the Ensembl Biomart were restricted to chromosomes 1-28 and 30-33. Chromosomes 29 and 34-37 are not currently distinguishable in the chicken gene build [48].

We downloaded data files to describe chromosomes, genes, transcripts, exons and 5’-UTRs and 3’-UTRs, as well as files containing data to link the various biological entities together. Details of the files downloaded are presented in Appendix D.

Briefly, the gene build data contains details of 17,717 genes, of which 14,624 are designated as ‘protein coding’. Collectively those genes are represented by 28,307 transcripts and 185,395 exons. There are 24,322 5’-UTRs and 16,511 3’-UTRs. These combined genome features map to locations on 32 named chromosomes (1-28, 30-33).

5.6 Gene Ontology Data

We downloaded the basic version of the Gene Ontology in OBO format using the link provided on the Gene Ontology website. The Gene Ontology is constantly updated, corrected and improved. We accessed the file on 21st June 2017 at 11:33 am. Details of the file thus obtained are listed in Appendix E.

Chapter 6

Loading Data

6.1 Pre-processing of Data

Whilst some data were obtained in a format suitable for immediate loading into either PostgreSQL or Neo4J, others required some pre-processing. Details of the manipulations performed on specific parts of the data sets are described below.

6.1.1 Gene Ontology Data

The OBO file format is not well-suited for direct loading into databases and requires some conversion. We constructed a perl script for the purpose which was designed to extract the data into several distinct, but related files. In processing the file and attempting to load the data, we discovered some additional anomalies within the data. This required re-working of the perl script to deal with the special cases detected.

Each separate output file dealt with a single aspect of the Gene Ontology data, namely the terms themselves and their aliases, the “is a” relationships, “part of” relationships and the regulatory relationships between terms. Details of the perl script and the files generated as a result of processing the raw data file are listed in Appendix E.
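The kind of extraction involved can be sketched in a few lines. The study used a perl script (listed in Appendix E); the Python sketch below is not that script. It assumes the standard OBO stanza layout of `tag: value` lines under a `[Term]` header, handles only the `id`, `name`, `is_a` and `part_of` cases, and uses made-up `EX:` identifiers rather than real Gene Ontology terms. It does not attempt to deal with the anomalies mentioned above.

```python
def parse_obo_terms(lines):
    """Return (terms, is_a, part_of) parsed from the [Term] stanzas of OBO text."""
    terms = {}    # term id -> term name
    is_a = []     # (child id, parent id) pairs
    part_of = []  # (part id, whole id) pairs
    term_id = None
    in_term = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are of interest here.
            in_term = (line == "[Term]")
            term_id = None
            continue
        if not in_term or ": " not in line:
            continue
        tag, value = line.split(": ", 1)
        if tag == "id":
            term_id = value
            terms[term_id] = None
        elif term_id is None:
            continue
        elif tag == "name":
            terms[term_id] = value
        elif tag == "is_a":
            # Drop the trailing "! human-readable name" comment.
            is_a.append((term_id, value.split("!")[0].strip()))
        elif tag == "relationship":
            parts = value.split("!")[0].split()
            if len(parts) == 2 and parts[0] == "part_of":
                part_of.append((term_id, parts[1]))
    return terms, is_a, part_of


# Hypothetical OBO fragment with made-up identifiers.
sample = """format-version: 1.2

[Term]
id: EX:0000001
name: example parent process

[Term]
id: EX:0000002
name: example child process
is_a: EX:0000001 ! example parent process
relationship: part_of EX:0000001 ! example parent process

[Typedef]
id: part_of
name: part of
""".splitlines()

terms, is_a, part_of = parse_obo_terms(sample)
print(terms, is_a, part_of)
```

Each of the three returned collections corresponds to one of the separate output files described above and could be written out as a CSV file for loading.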

6.1.2 Marker Data

The names and genetic locations of the markers were generated as part of the data simulation exercise described in Chapter 5. As described in Section 5.2, these marker data were generated in ‘pretty-printed’ format, padded with a large number of spaces to make them more readily human-readable. Before they can be efficiently imported into either database system, they must first be converted into a CSV format with no unnecessary embedded spaces. This was done by piping the QMSim output files through the Unix command line utility program awk, converting the field separators to commas. The details of the processed files from medium and large data sets are reported in Appendices B.3.4 and B.4.3 respectively.
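The effect of this conversion on a single line can be illustrated as follows. The study performed the step with awk; the Python function below is only an equivalent sketch of the transformation, and the function name and sample marker line are hypothetical.

```python
def to_csv_line(line):
    """Collapse runs of padding spaces and join the fields with commas."""
    # split() with no argument discards leading, trailing and repeated whitespace.
    return ",".join(line.split())


# Hypothetical 'pretty-printed' marker line: name, chromosome, position.
print(to_csv_line("   M1      1     0.52000  "))
```

This simple approach is safe only because none of the fields themselves contain spaces or commas.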

In addition, the marker data were generated with genetic distances rather than sequence coordinates. In order to map the markers against the chromosomes and to compare their locations against those of genes, transcripts and exons, those genetic distances must be converted into sequence coordinates.

For simplicity, this can be done using a linear conversion based on the following formula:

\[
\mathrm{markerSeqPos} = \mathrm{markerGenPos} \times \frac{\mathrm{chromSeqLen}}{\mathrm{chromGenLen}}
\]

where markerSeqPos is the required marker position expressed in sequence coordinates, markerGenPos is the marker position in genetic linkage distance units, chromSeqLen is the length of the chromosome to which the marker is mapped in base pair units and chromGenLen is the length of the chromosome when expressed in genetic linkage units (centiMorgans).

Since this requires information from both marker and chromosome data files, this transformation step is best performed inside the database systems. The method of doing so is described in the relevant database-specific sections below.
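As a worked illustration of the linear conversion, the following Python sketch applies the formula to a single marker. The function name, the rounding to a whole base pair and the example chromosome dimensions are assumptions made for illustration only.

```python
def marker_seq_pos(marker_gen_pos, chrom_seq_len, chrom_gen_len):
    """Linearly rescale a genetic position (cM) to a sequence position (bp)."""
    if chrom_gen_len <= 0:
        raise ValueError("chromosome genetic length must be positive")
    # Rounding to the nearest base pair is an assumption of this sketch
    return round(marker_gen_pos * chrom_seq_len / chrom_gen_len)


# Hypothetical chromosome: 100,000,000 bp long and 100 cM long
pos = marker_seq_pos(50.0, 100_000_000, 100)
```

A marker half-way along the genetic map is placed half-way along the sequence, which is exactly the simplification that a linear conversion implies.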

6.1.3 Pedigree Data

The identities of simulated individuals and the pedigree relationships between them were also generated by the data simulation program as described in Chapter 5. The population sizes were identical in both medium and large data simulations (which differed only in the number of markers). Each data file contained a total of 201,100 individual records. Again, the data were exported as ‘pretty print’, space-delimited files which had to be converted into CSV format using awk. Each line within the file corresponds to a single animal, listing the animal’s id, the ids of its sire and dam, its sex, its generation number and the number of male and female progeny obtained from the individual in question along with a number of statistical and phenotype measurements.

In addition to converting the runs of spaces separating columns within the data file into commas, animals from generation zero (i.e. those at the very start of the pedigree for whom no parental information is available) had to have their sire and dam ids converted from the supplied ‘0’ value to a blank value. The database systems consider an id of 0 to be perfectly valid, whereas the blank, ‘NULL’ value can be detected and dealt with sensibly. This prevents the loading scripts from incorrectly creating an individual record with an id of ‘0’. Again, the details of the processed pedigree/phenotype files from medium and large data sets are reported in Appendices B.3.4 and B.4.3 respectively.
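The blanking of unknown parents can be sketched as follows. The column positions (sire in the second field, dam in the third) follow the order given in the text, but the function name and the shortened example rows are hypothetical.

```python
def blank_unknown_parents(row, sire_col=1, dam_col=2):
    """Replace a '0' sire or dam id with an empty string (loaded as NULL)."""
    cleaned = list(row)
    for col in (sire_col, dam_col):
        if cleaned[col] == "0":
            cleaned[col] = ""
    return cleaned


# Shortened, made-up rows: id, sire, dam, sex, generation
founder = blank_unknown_parents(["17", "0", "0", "M", "0"])
offspring = blank_unknown_parents(["905", "17", "23", "F", "1"])
```

Only the sire and dam columns are touched, so a legitimate zero elsewhere in the row (such as generation 0) is left alone.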

6.1.4 Genotype Data

Genotypes represent the intersection between individual and marker – thus there are N × M genotypes, where N is the number of individuals in the generations for which genotypes are reported and M is the number of markers. In both our simulated data sets, we specified the same pedigree history and selection parameters. Thus, both data sets report genotypes for 30,000 individuals.

The data are exported from the simulation program in lines, each corresponding to a single individual. Within each line, the first column is the animal’s unique id (corresponding to entries in the pedigree data) followed by pairs of columns containing the alleles that describe the genotype for that animal for each marker in turn. Thus, columns 2 and 3 contain the alleles for the genotype corresponding to Marker 1, columns 4 and 5 correspond to Marker 2, etc. Data lines for the medium-scale simulation contain 99,971 columns while data lines for the large-scale simulation contain 9,999,943 columns.

SQL data loading syntax requires a direct mapping of input file field to database column. Thus, to directly load these data using standard SQL constructs would require us to define a temporary table with around 10 million columns, before running 5 million SELECT statements to transform the data into a useful table structure. The Neo4J Cypher language will allow us to import lines of arbitrary length from suitably formatted files, but does not have any loop-based constructs that we might use to iterate through the columns on a given row in order to create the genotype relationships. Thus, in order to load the data directly using Cypher or SQL constructs, we must first reformat the file such that each line contains data for a single genotype.

This reformatted file must specify the individual animal id and the marker name on each line, whereas the generated genotypes file presents the data as a matrix with animal id specified once and the marker name implicit based on column indices. The reformatted file therefore contains a large amount of redundancy of information and thus takes up significantly more disk space.

We wrote a small perl script to process the simulated genotype data files. An initial trial using the medium-scale data set showed that the data duplication caused by the reformatting resulted in a more than four-fold increase in file size from 5.5GB to 24GB. The reformatting process took 25 minutes to complete. Extrapolating that to the large data set suggests that we would require 2.4TB of disk space and approximately 42 hours of serial processing time, although the wall-clock time could be reduced by parallelising the process on suitable infrastructure. We did not test these hypotheses.
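The matrix-to-rows reformatting can be sketched in Python as below. This is an illustration only and not the perl script used in the study; the function name, marker names and sample line are hypothetical.

```python
def wide_to_long(marker_names, line):
    """Yield one (animal, marker, allele1, allele2) tuple per genotype."""
    fields = line.split()
    animal_id, alleles = fields[0], fields[1:]
    if len(alleles) != 2 * len(marker_names):
        raise ValueError("expected two allele columns per marker")
    for i, marker in enumerate(marker_names):
        yield (animal_id, marker, alleles[2 * i], alleles[2 * i + 1])


# One made-up individual genotyped at two markers
rows = list(wide_to_long(["M1", "M2"], "1001 1 2 2 2"))

# Column count per line: one id column plus two allele columns per marker
medium_columns = 1 + 2 * 49_985
```

Each output row repeats the animal id and names the marker explicitly, which is the source of the redundancy and of the increase in file size.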

6.2 PostgreSQL

PostgreSQL provides a command-line shell utility, psql, with which to access the database engine. Script files containing PostgreSQL SQL commands can be piped into this psql command for execution, and this provides a repeatable means of executing complex “ETL” (Extraction, Transformation, Loading) tasks. In addition, PostgreSQL is fully transactional and ACID compliant and queries can therefore be embedded in BEGIN TRANSACTION/(ROLLBACK|COMMIT) blocks which makes developing and testing script files a relatively safe process.

PostgreSQL also provides a mode of operation which reports the timing (to microsecond level) of each query executed. This is activated and deactivated by means of a simple \timing (on|off) statement passed to the psql shell. In addition, PostgreSQL maintains internal catalogues – themselves stored in database tables – of the current sizes of all relations (tables) and indexes within a schema. Collectively these make comparing the disk requirements and processing times for different data sets, shapes and sizes a trivial matter.

Details of all SQL scripts used in the loading and transforming of data are provided in Appendix F.

6.2.1 Gene Build Data

We loaded the Gene Build data into PostgreSQL directly from the downloaded files detailed in Section 5.5 using a SQL query script. Figure 6.1 illustrates the SQL table structure we used to model this portion of the data.

Figure 6.1: SQL table structure used to model chromosomes, markers, genes, transcripts, 5’-UTRs, 3’-UTRs and exons in PostgreSQL

Data files were located in a single directory, the path to which was substituted into placeholders within the SQL script by piping through the sed command (see the script header for details). The processed script file was then piped into the psql binary. Details of a portion of the script are presented in Listing 6.1. This represents a typical PostgreSQL import task, albeit with the added complication that data must be transformed from those in the initial data file. This necessitates the use of a temporary table from which data can be copied into its final location in the permanent table. For the majority of the data loaded in this study, no temporary table was required and the raw data were simply copied directly into the database. Note the creation of the index on the ‘name’ field in the chromosome table. The use of indexes both in PostgreSQL and in Neo4J speeds up queries and calculations against the fields indexed. In this instance, the table is very small (only 32 records) so the index is probably inconsequential but for larger and more complex data sets, the indexes are indispensable.

Listing 6.1: SQL script to load Chromosome data

    -- --------------------------------------------------
    -- Load the Chromosome list and index it
    --
    -- Note that we have to load into a temporary table and then
    -- transform the length coding

    CREATE TEMPORARY TABLE tchromosome (
        id INTEGER,
        name CHARACTER VARYING(5),
        length REAL,
        genetic_length INTEGER
    );

    COPY tchromosome (
        id,
        name,
        length,
        genetic_length
    )
    FROM '<<BASEDIR>>/data-for-import/chromosomes-with-lengths.txt'
    WITH (
        HEADER TRUE,
        FORMAT CSV
    );

    --
    -- Create the real table

    CREATE TABLE chromosome (
        id INTEGER NOT NULL PRIMARY KEY,
        name CHARACTER VARYING(5) NOT NULL,
        length INTEGER,
        genetic_length INTEGER
    );

    --
    -- Copy the data across, transforming the length field as we go

    INSERT INTO chromosome (
        id,
        name,
        length,
        genetic_length
    )
    SELECT
        id,
        name,
        length * 1000000,
        genetic_length
    FROM
        tchromosome;

    --
    -- Create an index on name to speed lookups later

    CREATE UNIQUE INDEX chromosome_name_idx
        ON chromosome (name);

6.2.2 Gene Ontology Data

We loaded the Gene Ontology data from the data files prepared in Section 6.1.1 into PostgreSQL using a second SQL script. Again, the script file was passed through a sed filter to substitute the path to the data file directory into placeholders embedded within it. With the Gene Build and Gene Ontology data loaded, the Genes could be matched up to their assigned Gene Ontology terms.

Figure 6.2 illustrates the SQL table structure we used to model the Gene Ontology data.

6.2.3 Marker Data

Marker data generated as part of the simulation process were pre-processed as described in Section 6.1.2 and then loaded by means of a SQL script. SQL syntax requires that data be mapped directly from fields within the input data file (a CSV file in this instance) to columns within a table. Thus the transformation of the position of the markers cannot be performed as part of the loading process. We therefore loaded the data initially into a temporary table, then transformed them as we copied the values from the temporary table into the marker table which was their ultimate destination.
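The staging-table pattern described here can be sketched with Python's built-in sqlite3 module. SQLite is used purely as a stand-in for PostgreSQL so that the example is self-contained, and the tmarker and marker table definitions, column names and sample values are hypothetical rather than taken from the scripts in Appendix F.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Reference table with one made-up chromosome: 100,000,000 bp and 100 cM
CREATE TABLE chromosome (
    id INTEGER PRIMARY KEY, name TEXT, length INTEGER, genetic_length INTEGER
);
INSERT INTO chromosome VALUES (1, '1', 100000000, 100);

-- Staging table mirroring the columns of the (hypothetical) CSV file
CREATE TEMPORARY TABLE tmarker (name TEXT, chromosome TEXT, gen_pos REAL);
INSERT INTO tmarker VALUES ('M1', '1', 25.0), ('M2', '1', 50.5);

-- Final table, filled while converting cM positions to base pairs
CREATE TABLE marker (name TEXT PRIMARY KEY, chromosome_id INTEGER, seq_pos INTEGER);
INSERT INTO marker (name, chromosome_id, seq_pos)
SELECT t.name,
       c.id,
       CAST(ROUND(t.gen_pos * c.length / c.genetic_length) AS INTEGER)
FROM tmarker AS t
JOIN chromosome AS c ON c.name = t.chromosome;
""")
markers = conn.execute(
    "SELECT name, chromosome_id, seq_pos FROM marker ORDER BY name"
).fetchall()
conn.close()
```

The join against the chromosome table is what makes the in-database transformation convenient: both lengths needed for the conversion are available at the moment the rows are copied across.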

6.2.4 Mapping Marker Data to the Gene Build

Marker data were also mapped against the Gene Build data to identify markers that might be of additional biological significance (see Section 4.1). In particular we may need to extract marker data for those that fall inside coding regions of genes (exons), those that are in genes/transcripts but not in exons (intronic markers), those that are in 5’- or 3’-UTRs etc. This is best effected after the creation of suitable indexes on the mapped positions of the markers and the separate Gene Build entities, against which the position of the markers will be assessed.

Figure 6.2: SQL table structure used to model Gene Ontology terms in PostgreSQL

We wrote a script to create the indexes and perform the mapping and ran it against the database.
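To illustrate why an index on mapped positions helps, the following Python sketch finds the markers that fall inside each interval by binary search over sorted positions, which is loosely analogous to an ordered index lookup. It assumes a single chromosome and inclusive coordinates; the function name and sample data are hypothetical, and it does not reproduce the SQL script used in the study.

```python
from bisect import bisect_left, bisect_right


def markers_in_intervals(marker_positions, intervals):
    """Map each interval name to the markers lying within [start, end]."""
    ordered = sorted(marker_positions, key=lambda item: item[1])
    positions = [pos for _, pos in ordered]
    hits = {}
    for name, start, end in intervals:
        lo = bisect_left(positions, start)   # first marker at or after start
        hi = bisect_right(positions, end)    # one past last marker at or before end
        hits[name] = [marker for marker, _ in ordered[lo:hi]]
    return hits


# Made-up markers (name, position) and exons (name, start, end)
markers = [("M1", 150), ("M2", 900), ("M3", 420)]
exons = [("exonA", 100, 450), ("exonB", 500, 800)]
overlaps = markers_in_intervals(markers, exons)
```

Once the positions are sorted, each interval costs two binary searches rather than a scan of every marker.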

6.2.5 Pedigree Data

Similarly, pedigree (and phenotype) data from the simulation program were pre-processed as described in Section 6.1.3 and then loaded with a further SQL script. The pedigree data are self-referencing, as sires and dams of individuals within the data set are themselves represented as individual records. In order to optimise the querying of these data subsequently, we chose to create an index on the sire and dam id columns within the table.

6.2.6 Genotype Data

We loaded the reformatted Medium Data Set genotype data from Section 6.1.4 directly, again by means of a script. Because of the size of the data set, we elected to create indexes for future searching prior to loading the data.

6.3 Neo4J

Neo4J also provides a command-line shell utility, cypher-shell, to enable direct system interaction. Script files containing Cypher commands can be piped through this utility, which again provides a repeatable method of data loading, transformation and query. Cypher also provides transactional support, but with some caveats. Firstly, commands that manipulate data (e.g. those that load data from CSV files) cannot exist in the same transaction as commands that manipulate what Neo4J refers to as the database ‘schema’ (i.e. those that create or modify indexes or constraints). Thus it is frequently impossible to completely isolate a given script and have the system roll back to the exact state it was in before execution as at least one transaction usually must commit before another can begin. Secondly, Neo4J recommends using a PERIODIC COMMIT clause in data loading scripts because otherwise the system will attempt to hold the entire loading data set in memory and may therefore fail to load large data sets. Seemingly, transactional commits are viewed more as a cache-flushing mechanism in Neo4J than as a genuine transaction-isolating safety mechanism.


Figure 6.3: Graph structure used to model chromosomes, markers, genes, transcripts, 5’-UTRs, 3’-UTRs and exons in Neo4J

Neo4J also provides no built-in disk usage reporting commands. Disk usage can apparently only be assessed by use of operating system commands aimed at the data subdirectory, and these are only accurate if Neo4J has first been shut down (in order to flush the current graph from memory to disk). Care must also be taken to exclude transaction logs from the usage calculation, as these are also stored in the data directory.
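A measurement of this kind can be sketched in Python as below. The file-name prefix used to recognise transaction logs is an assumption of this sketch and should be checked against the installation being measured; the function name and the demonstration files are hypothetical.

```python
import os
import tempfile


def data_dir_size(root, exclude_prefixes=("neostore.transaction.db",)):
    """Sum file sizes (bytes) under root, skipping files whose names start
    with any of exclude_prefixes (assumed transaction-log naming)."""
    prefixes = tuple(exclude_prefixes)
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.startswith(prefixes):
                continue
            total += os.path.getsize(os.path.join(dirpath, filename))
    return total


# Tiny demonstration on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "neostore.nodestore.db"), "wb") as handle:
        handle.write(b"12345")       # 5 bytes standing in for graph data
    with open(os.path.join(demo_dir, "neostore.transaction.db.0"), "wb") as handle:
        handle.write(b"x" * 100)     # 100 bytes standing in for a log file
    demo_size = data_dir_size(demo_dir)
```

As the text notes, a figure obtained this way is only meaningful once the database has been shut down, so the sketch would be run against a stopped instance.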

Details of the script used are provided in Appendix G.

6.3.1 Gene Build Data

Figure 6.3 illustrates the Neo4J graph structure we used to model this portion of the data.

We loaded the Gene Build data into Neo4J directly from the downloaded files detailed in Section 5.5 by means of a Cypher query script. Data files were copied into the import directory which is the default location for file-based data sources in a Neo4J instance. A script file was then piped into the cypher-shell command, specifying the --format verbose command line flag to instruct the shell to report query timings amongst other things.

A portion of a typical Cypher script is presented in Listing 6.2. In this script, which is the functional equivalent of the PostgreSQL script detailed in Listing 6.1, we read in from the raw data file and create a series of nodes labelled with the :Chromosome label. Each has a series of properties characteristic to that chromosome. Because we can manipulate the input data as part of the loading process, we do not need to resort to temporary structures as we did when loading into PostgreSQL. We can also refer to the input data file columns by name (extracted from the header line) and include or exclude columns at will.

Similarly to the PostgreSQL process, we create indexes on name and index properties to speed future queries.

Listing 6.2: Neo4J Cypher script to load Chromosome data

    :begin
    // --------------------------------------------------
    // Load the Chromosome list and index it
    //
    LOAD CSV WITH HEADERS FROM "file:///chromosomes-with-lengths.txt" AS line
    CREATE (
        :Chromosome {
            index: toInteger(line.`Index`),
            name: line.`Chromosome`,
            length: toInteger(toFloat(line.`Length`) * 1000000),
            genetic_length: toInteger(line.`Cmlength`)
        }
    );
    :commit

    :begin
    CREATE INDEX ON :Chromosome(name);
    CREATE INDEX ON :Chromosome(index);
    :commit

    :begin
    // Wait for the indexes to become available
    CALL db.awaitIndex(":Chromosome(name)", 10);
    CALL db.awaitIndex(":Chromosome(index)", 10);
    :commit

6.3.2 Gene Ontology Data

Similarly we loaded the Gene Ontology data into Neo4J using a Cypher script targeting the data files described in Section 6.1.1 which were copied into the Neo4J import data folder. Again, timings were extracted by means of the cypher-shell --format verbose command line flag. Figure 6.4 illustrates the Neo4J graph structure we used to model the Gene Ontology data.

Having loaded both the Gene Build and the Gene Ontology data, we ran a further script to map Genes to their assigned Gene Ontology terms.

6.3.3 Marker Data

Marker data generated as part of the simulation process were pre-processed as described in Section 6.1.2 and then loaded by means of a Cypher script. Because Cypher permits the manipulation of data as part of the loading of CSV files, the transformation from genetic distances to sequence coordinates (see Section 6.1.2) was performed directly.

6.3.4 Mapping Marker Data to the Gene Build

As mentioned in Section 4.1, and reiterated in Section 6.2.4, the location of a marker in relation to the genes, transcripts, exons and UTRs alters the relative interest that we may or may not have in that marker. This is based on the potential biological effect the genetic variation may cause.

We wrote a script to create suitable relationships between markers and the Gene Build data objects.

6.3.5 Pedigree Data

CSV-formatted pedigree and phenotype data (see Section 6.1.3) were also loaded with a Cypher script.

6.3.6 Genotype Data

We were unable to load the reformatted Medium Data Set genotype data (Section 6.1.4) successfully. Initial attempts on the development machine failed due to lack of disk space as the disk requirements ran to over 560GB (the initial, unmodified genotype data file was 6GB and the reformatted version was 24GB). Attempts to run code on the performance-testing system also failed, with unlogged crashes killing the load process before it could complete. Investigation suggested that the disk space problem is caused by Neo4J writing excessive transaction log files to disk during the load process. The cause of the crashes on the performance-testing system remains unknown, and the lack of debugging or logging output made further investigation impossible. Reconfiguring the installation by adding appropriate transaction log configuration directives to the neo4j.conf file failed to address the problem.

Figure 6.4: Graph structure used to model Gene Ontology terms in Neo4J

We subsequently discovered that Neo4J has a technical limit on the number of relationships/edges that it can support. That limit is reportedly 2^35 relationships, approximately 34 billion (source: Neo4J online sizing calculator), but we strongly suspect that the actual limit may be considerably less than that.

In the face of our inability to load the genotype data we therefore elected to adopt an alternative, hopefully more efficient route.

After discussion with Neo4J experts in the public Neo4J Slack channel [54] (see [55] for access details), we decided that the best and most flexible option would be to create an ‘unmanaged extension’ to Neo4J. Neo4J is written in Java and provides an extension point via the http-based ‘Bolt’ interface. Java code written against the Neo4J API can be plugged in to the interface by placing the compiled jar file into the plugins directory (see Figure 3.2) and adding suitable configuration directives to the neo4j.conf file before starting the Neo4J instance.

Our first version attempted to load the entire, unmodified Medium Data Set genotype file as created by QMSim. Again this failed on the development machine due to lack of disk space, with the transaction logging problems still unresolved. We therefore modified and extended our code to allow us to load different shaped subsets of data and to test a number of different means of representing the genotypes in order to better understand the relationship between data representation, volume and disk requirements.

The key part of the code is shown in Listing H.1. This code reads in a file, the location of which is specified in the POST request submitted to the Neo4J http interface. It skips the first, header line and then reads lines until it reaches the data for the first individual requested by the user (again specified in the POST request). Each data line is then split and the first column taken to be the name of an individual animal. The remaining columns are assumed to be pairs of alleles pertaining to an ordered set of markers. To save time, markers are retrieved or created (depending on the state of the graph prior to loading) when dealing with the first line of real input data and cached locally for rapid access during the remainder of the loading process. The genotype for each marker in turn is then processed for the current individual, before moving on to the next individual (row).

The user can specify the row number of the first and last individuals to be loaded, and the first and last index numbers of the markers to be processed. Thus, should an appropriate computing environment be available, it should be possible to submit multiple simultaneous requests and have each process deal with separate, non-overlapping portions of the genotype data in parallel.
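The row and marker windowing described above can be sketched in Python as follows. This is not the Java code of Listing H.1 and it performs no graph operations; it only shows how a rectangular subset of the genotype matrix might be selected. The function name, the 1-based inclusive numbering and the sample lines are assumptions of the sketch.

```python
def iter_genotypes(lines, first_row, last_row, first_marker, last_marker):
    """Yield (animal, marker_index, allele1, allele2) for a rectangular subset.

    Rows and markers are numbered from 1, inclusive; lines[0] is the header.
    """
    for row_number, line in enumerate(lines):
        if row_number == 0 or row_number < first_row:
            continue                      # skip header and earlier individuals
        if row_number > last_row:
            break                         # past the requested window
        fields = line.split()
        animal = fields[0]
        for marker in range(first_marker, last_marker + 1):
            # Marker m occupies (0-based) fields 2m-1 and 2m
            yield (animal, marker, fields[2 * marker - 1], fields[2 * marker])


# Made-up file content: a header, then three individuals at three markers
lines = [
    "ID M1 M2 M3",
    "101 1 1 1 2 2 2",
    "102 2 1 1 1 2 1",
    "103 1 2 2 2 1 1",
]
subset = list(iter_genotypes(lines, 2, 3, 2, 3))
```

Because each request touches only its own rows and marker columns, separate requests with non-overlapping windows would not process the same genotype twice.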

It is also possible to specify one of two different methods for coding the genotypes which allows for testing of the suitability and performance characteristics of those methods.

We compiled the code, installed it in the Neo4J plugin directory and tested the loading performance of several different shapes and sizes of genotype data.


Chapter 7

Comparative Analysis

7.1 Disk Space Used

Details of the disk space required by PostgreSQL to store the core Gene Build and Gene Ontology data, the pedigree and phenotype data, the marker data for the Medium and Large Data sets and the Medium Data set genotype data are shown in Tables 7.1 to 7.5.

Figure 7.1 plots the disk usage required for the core Gene Ontology and Gene Build objects (data in Table 7.1). As can be seen, there is little difference between the two systems. PostgreSQL requires more disk to store the UTR data, but this is readily explained as these data are stored differently in the two systems. Because multiple UTRs may map to the same transcript, the data are stored as a table for the UTRs and a separate table mapping the UTRs to the transcripts in PostgreSQL. In Neo4J, the nodes are linked directly with no intermediate mapping.

Figure 7.2 shows the disk space required to record the internal relationships within the set of Gene Ontology terms. These relationships are the “is a”, “part of”, “is an alias of” and “regulates” relationships that give the ontology its tree-like structure. Here, PostgreSQL requires significantly more disk space than Neo4J. Clearly handling the relationships as joins through a mapping table requires a greater amount of storage than expressing the relationships as links directly in the data structure, which is the Neo4J approach.


Figure 7.1: Disk space required (in MB) to store the Gene Ontology terms, chromosomes, genes, transcripts, 5’-UTRs, 3’-UTRs and exons in Neo4J and PostgreSQL


Table 7.1: Comparison of disk usage required to store the core Gene Build and Gene Ontology data. Figures in the Neo4J and PostgreSQL columns are in kilobytes.

Data Grouping                   Number of     Neo4J     Neo4J  PostgreSQL
                                 Entities    (Rels)   (Nodes)
Gene Ontology (GO) terms           48,878    20,616   20,616*      17,304
GO ’is a’ relationships            76,261     2,544    2,544*       6,936
GO ’part of’ relationships          6,865       232      232*         640
GO ’alias of’ relationships         2,021        64       64*         200
GO ’regulates’ relationships        8,723       640      640*         968
Chromosomes                            32        20       20*          40
Genes                              17,717     3,528    3,528*       2,688
Gene to Chromosome mappings        17,717     1,292    3,540*       4,592
Transcripts                        28,307     3,016    3,016*       2,752
Transcript to Chrom. mappings      28,307     2,076    4,764*       7,296
Transcript to Gene mappings        28,307       940      940*       1,888
5’-UTR (mapped to Transcripts)     24,322     2,964    5,624*       6,280
3’-UTR (mapped to Transcripts)     16,511     2,008    3,952*       4,288
Exons                             185,395    14,588   14,588*      16,824
Exon to Chromosome mappings       185,395    13,568   35,644*      47,608
Exon to Transcript mappings       276,999    20,316   20,316*      40,488
Gene to GO term mappings          108,181     7,972    7,972*      13,736


Figure 7.2: Disk space required (in MB) to represent the internal relationships within the Gene Ontology data in Neo4J and PostgreSQL


Table 7.2: Comparison of disk usage required to store the Pedigree/Phenotype data. Figures in the Neo4J and PostgreSQL columns are in kilobytes.

Data Grouping    Number of Entities     Neo4J    PostgreSQL
Individuals                 201,100    53,456        33,328

Table 7.3: Comparison of disk usage required to store the marker data for the Medium Data Set. Figures in the Neo4J and PostgreSQL columns are in kilobytes.

Data Grouping                    Number of Entities     Neo4J    PostgreSQL
Markers (medium set)                         49,985     7,512         2,888
Marker to Chromosome mappings                49,985                   3,632
Total                                        49,985     7,512         6,520

Figure 7.3 shows the additional disk space required in PostgreSQL and Neo4J after mapping objects one to another (based on their genomic coordinates). In this instance, PostgreSQL does require more disk space, but this is because we have created indexes in PostgreSQL on the mapping locations themselves. Because we handle the mapping between objects (e.g. between genes and chromosomes) as a relationship in Neo4J, we are unable to create usable indexes through Cypher alone. This is because the properties – in this case, the mapping start, end and strand – of graph edges (relationships) are not indexable in Neo4J, in direct contrast to the properties of nodes. This inability to index relationships is stated to be a deliberate design philosophy, as graph theory traditionally only considers an edge to be able to have a single weight property. However, given that Neo4J allows the user to specify multiple, arbitrary properties on an edge, it is somewhat surprising to discover this limitation. This limitation has sizeable consequences (see Section 7.3).

Table 7.4: Comparison of disk usage required to store the marker data for the Large Data Set. Figures in the Neo4J and PostgreSQL columns are in kilobytes.

Data Grouping                    Number of Entities       Neo4J    PostgreSQL
Markers (large set)                       4,999,971     759,980       353,392
Marker to Chromosome mappings             4,999,971                   370,176
Total                                     4,999,971     759,980       723,568


Figure 7.3: Disk space required (in MB) to store the mappings between the various Gene Build data objects and between genes and the Gene Ontology terms in Neo4J and PostgreSQL

Table 7.5: Comparison of disk usage required to store the Genotype data for the Medium Data Set. Figures in the Neo4J and PostgreSQL columns are in kilobytes.

Data Grouping                      Number of Entities    Neo4J     PostgreSQL
Genotypes (medium set)                  1,499,550,000      N/A     74,097,144
Genotype to Chromosome mappings         1,499,550,000              87,703,056
Total                                   1,499,550,000      N/A    161,800,200


7.2 Speed of Data Import

Although PostgreSQL requires more disk than Neo4J to handle the same data, it does so at a far higher speed. Histogram plots of the speed of data import for loading the core Gene Build and Gene Ontology data objects, for creating the relationships between Gene Ontology data objects and for mapping the Gene Build objects one to another and to the Gene Ontology terms are shown in Figures 7.4 to 7.6.

As can be seen, PostgreSQL is orders of magnitude faster at loading these smaller data sets than the Cypher-based queries. Whilst the volumes of data here considered are modest in terms of the genotype data, the relative performance of the PostgreSQL system is impressive. The majority of the speed advantage is likely to stem from the structured nature of the relational database design. With each table record rigidly defined in advance, disk writes can more readily be optimised than if ultimate flexibility in schema is the goal. This hypothesis is supported by the huge discrepancy that we see with loading the Gene Ontology data. These data have widely differing lengths of descriptive text which needed to be parsed and measured before the table definition could be created in PostgreSQL. The pre-defined table structure then allows for much more efficient disk access.

We were unable to directly compare the loading performance of the two systems with regard to the genotype data as the Neo4J system was unable to complete the task. Our PostgreSQL installation loaded the Medium Data Set genotype data – some 1.5 billion records – in just short of 12 hours, a rate of 125 million records per hour or around 35,000 per second.

The time taken to load differing numbers of markers and individuals into Neo4J using our “unmanaged extension” loader is shown in Figure 7.7. As expected, loading larger numbers of individuals or markers increased the total load time in rough proportion to the number of genotypes being read from and written to disk. Using this code, we managed to load 25 million records in around 525 seconds, a rate of nearly 48,000 per second. This is certainly comparable with the speed that we obtained with the PostgreSQL installation. This contrasts with the speed of data loading that we see when using Cypher constructs, which is consistently considerably slower than the PostgreSQL equivalent. However, we should note that this data loading cached the relevant marker objects in memory and performed no error-checking.
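The two loading rates quoted above follow from simple arithmetic, which the short Python sketch below reproduces. The record count is taken from Table 7.5; treating "just short of 12 hours" as exactly 12 hours is an assumption, so the PostgreSQL figure is a slight underestimate. The function name is hypothetical.

```python
def records_per_second(records, seconds):
    """Average throughput in records per second."""
    return records / seconds


# PostgreSQL: Medium Data Set genotypes, taking the load time as 12 hours
postgres_rate = records_per_second(1_499_550_000, 12 * 3600)

# Neo4J unmanaged extension: 25 million genotypes in about 525 seconds
extension_rate = records_per_second(25_000_000, 525)
```

On these figures the extension loader is roughly 1.4 times as fast per record, although it was measured on a far smaller subset of the data.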


Figure 7.4: Time required (ms) to store the Gene Ontology terms, chromosomes, genes, transcripts, 5’-UTRs, 3’-UTRs and exons in Neo4J and PostgreSQL


Figure 7.5: Time required (ms) to create the internal relationships within the Gene Ontology data in Neo4J and PostgreSQL


Figure 7.6: Time required (ms) to store the mappings between the various Gene Build data objects and between genes and the Gene Ontology terms in Neo4J and PostgreSQL


Figure 7.7: Time required (ms) to load genotypes for varying numbers of individuals and markers using the “unmanaged extension” method in Neo4J


7.3 Speed of Data Export

As with the general loading performance, we find the speed of query and export from PostgreSQL to be at least as fast as that in Neo4J and occasionally orders of magnitude faster. The queries on which PostgreSQL gains the most advantage are – unsurprisingly – those that exploit indexes on the mapping relationships.

For example, with the benefit of indexes on the relevant tables, which themselves took a total of less than 1 second to generate in PostgreSQL, we were able to extract lists of markers overlapping all the key Gene Build objects (genes, transcripts, exons and UTRs) in less than 1 second per query. The equivalent queries in Neo4J typically took between 3 and 5 minutes, extending to over 30 minutes when mapping against exons. Admittedly, these are extreme examples but they highlight the importance of indexes and the limitations caused by the lack of indexes on relationship properties that we highlighted in Section 7.1. Examples of the queries to identify the marker objects overlapping exons are shown in Listings 7.1 and 7.2.

Note that the SQL query completes in just 727 ms, whereas the Cypher query takes 1,876,312 ms (approximately 31 minutes). The difference in the number of rows reported is due to rounding differences when calculating marker positions based on the mapping from genetic distance to physical, sequence coordinates.

Listing 7.1: SQL script to count the number of markers that overlap Exons

    SELECT
        count(*)
    FROM
        marker m,
        marker_to_chromosome mtc,
        chromosome c,
        exon_to_chromosome etc,
        exon e
    WHERE
        mtc.marker_name = m.name
    AND
        mtc.chromosome_id = c.id
    AND
        c.id = etc.chromosome_id
    AND
        etc.exon_id = e.id
    AND
        mtc.position <= etc.map_end
    AND
        mtc.position >= etc.map_start;

     count
    -------
      2157
    (1 row)

    Time: 727.988 ms


Listing 7.2: Cypher script to create an “is in exon” relationship between overlapping markers and Exons

    match (m:Marker)-[mm:MAPS_TO]->(c:Chromosome)<-[gm:MAPS_TO]-(g:Exon)
    where mm.position <= gm.end
    and mm.position >= gm.start
    merge (m)-[:IS_IN_EXON]->(g);
    0 rows available after 1876312 ms, consumed after another 0 ms
    Created 2158 relationships
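Both listings apply the same test: a marker overlaps an exon when the two map to the same chromosome and the marker position lies between the exon start and end. The following is a minimal sketch of that test on invented toy data, using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The table and column names follow Listing 7.1, but the marker, chromosome and exon tables are omitted for brevity, and the index names are hypothetical rather than those used in the project.

```python
import sqlite3

# Toy in-memory database; the schema mirrors the names used in Listing 7.1,
# but the rows below are invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE marker_to_chromosome (
        marker_name TEXT, chromosome_id INTEGER, position INTEGER);
    CREATE TABLE exon_to_chromosome (
        exon_id INTEGER, chromosome_id INTEGER,
        map_start INTEGER, map_end INTEGER);
    -- Hypothetical index names; they support the join on chromosome/position.
    CREATE INDEX mtc_chr_pos
        ON marker_to_chromosome (chromosome_id, position);
    CREATE INDEX etc_chr_start_end
        ON exon_to_chromosome (chromosome_id, map_start, map_end);
""")
conn.executemany("INSERT INTO marker_to_chromosome VALUES (?, ?, ?)", [
    ("m1", 1, 150),   # inside exon 10
    ("m2", 1, 500),   # between exons
    ("m3", 1, 900),   # on the end coordinate of exon 11 (bounds are inclusive)
    ("m4", 2, 150),   # matching position but on a different chromosome
])
conn.executemany("INSERT INTO exon_to_chromosome VALUES (?, ?, ?, ?)", [
    (10, 1, 100, 200),
    (11, 1, 800, 900),
])

# Same overlap condition as Listing 7.1, written with explicit JOIN syntax.
overlapping = [row[0] for row in conn.execute("""
    SELECT mtc.marker_name
    FROM marker_to_chromosome mtc
    JOIN exon_to_chromosome etc
      ON etc.chromosome_id = mtc.chromosome_id
     AND mtc.position BETWEEN etc.map_start AND etc.map_end
    ORDER BY mtc.marker_name
""")]
print(overlapping)
```

Only the markers on the same chromosome as an exon and within its coordinates are returned. The sketch says nothing about relative timings, which depend on the engine and data volumes reported above.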

7.4 Qualitative Assessment

We were interested to assess how readily understandable the Cypher syntax is, when compared to SQL code. To this end, we designed an unscientific survey questionnaire which we presented to a number of local bioinformatics colleagues to see what level of comprehension they could achieve coming to the language ‘cold’.

There was a range of prior database experience, from virtually none to SQL expert level. Experience of graph databases was, as expected, more limited with only one respondent having used Neo4J before.

Universally, the respondents failed to find Cypher queries any more understandable than their SQL equivalents. Rather than the language syntax, per se, the most important factor in conveying the meaning behind a given query appeared to be the naming conventions used. Thus, the Cypher query MATCH (m:M) RETURN (m); was far less readily understood than the functionally equivalent MATCH (m:Marker) RETURN (m);. Similar observations were made for the equivalent SQL code.

The more complex queries were noted to be ‘more compact’ in Cypher than in SQL, but several programmers bemoaned the ‘lack of implementation detail’ in the Cypher query, stating that they much preferred the explicit, step-by-step nature of the SQL statements.


Chapter 8

Discussion and Conclusions

There are many factors that must be taken into consideration when choosing a database engine for a project. The data generated from experimental research or commercial operations can be the most valuable asset a scientist or company owns and it is critical that it be entrusted to a solid infrastructure. The measure of a good database infrastructure is hard to define fully, but it must include a full assurance that the data are backed up and safe and that the operations performed across those data generate correct results. In addition, the ability to monitor resource usage is helpful in determining capacity and growth requirements.

Added on top of that are considerations such as “ease of use”, which is always difficult to define, and “performance”, which is easier to judge for certain sets of use cases.

We were disappointed with the lack of “enterprise” grade tools in the Neo4J distribution. Admittedly we tested the free Community Edition of the software, but the lack of online backup tools makes it difficult to recommend the use of this version for real world work. Pricing for the Enterprise version of the software is not transparently available – presumably it is negotiated on a case by case basis.

In addition, we found the lack of tools for reporting actual disk space usage troublesome. Whilst it is clear that the best performance can be obtained by keeping as much of the graph in memory as possible, the lack of urgency on the part of the software to write to disk does seem strange to someone who comes from a traditional relational DBMS background. Presumably this is a design feature to minimise disk traffic in rapidly-changing data environments. The side-effect of this feature is that it is virtually impossible to monitor the effect of data loading on disk requirement without shutting the server down. Another side-effect is the need for large amounts of disk space – considerably more than is required to actually store the data – to maintain transaction logs which should be able to recover the state of a database in the event of the server failing before it has finally written the memory contents back out to main storage. The retention of these transaction log files is supposed to be user-configurable, but we found that the directives intended to manage this aspect of the system performance seemed to be ignored.

Worse still, though, was the failure of the Neo4J system to load the genotype data at all – or at least to fail gracefully. In addition, we found a serious bug in the original version of the software that we tested such that incorrect results were returned in response to some queries. Whilst this has been fixed in later versions, it is not clear from discussions with Neo4J representatives in their support forum if the fix is deliberate and in response to a bug report or if it is coincidental and thus may be symptomatic of an intermittent and unidentified bug still lurking in the system code.

PostgreSQL, meanwhile, worked commendably and dealt with everything that was thrown at it. We managed successfully to load 1.5 billion records into a single table – albeit slowly – and it performed as well as Neo4J on the majority of queries that we ran. Key to this performance, though, was the creation of indexes on the table structures, as it is with all database engines (Neo4J included). In this, there is a trade-off between disk space and query performance, but in a system where disk usage can readily be monitored this is easy to manage.

In general, the issue of genotype loading was not one where either database was comfortable. We did not benchmark any genotype queries against either system, as the Neo4J system failed to load the full data set under any circumstances that we tested. Given the availability of other efficient genotype storage systems that use custom file formats and indexed file reads, e.g. PLINK [56], it seems likely that a hybrid approach using a database engine for gene build, annotation and pedigree information and a separate, file-based genotype storage and retrieval system would provide a better all-round solution to the growing problem of dealing with genotype data.

The one area where Neo4J has a clear advantage is in querying variable depth trees, such as we see in the Gene Ontology data. Here, a given node has zero or more parent nodes identified as “is a” or “part of”. A query that wants, for example, to retrieve all genes that are annotated against Gene Ontology term ‘Y’ will not only wish to retrieve all those that are annotated directly against that term, but also those genes that are annotated against any other term from which ‘Y’ is reachable by traversing the “is a” relationships. So, if we have a relationship (here expressed in Cypher syntax) such that

(X:Goterm)-[:IS_A]->(Y:Goterm)-[:IS_A]->(Z:Goterm)

and that

(A:Gene)-[:HAS_GO_TERM]->(X),

(B:Gene)-[:HAS_GO_TERM]->(Y) and


(C:Gene)-[:HAS_GO_TERM]->(Z)

then a query to retrieve all the genes that are annotated as (Y:Goterm) should return (B:Gene) directly and (A:Gene) transitively. The provision of the [0..*] variable path length query operator in Cypher (written *0.. inside the relationship pattern, as in [:IS_A*0..]) makes expressing, and limiting, this query trivial.
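To make the expected result concrete, the transitive lookup can be sketched in plain Python on the three-term example above. This is an illustrative in-memory traversal, not how either database executes the query; the dictionaries and the function name are hypothetical and hold only the toy terms X, Y, Z and genes A, B, C from the text.

```python
from collections import deque

# Hypothetical toy data matching the example in the text.
# IS_A edges point from child term to parent term: X -> Y -> Z.
is_a = {"X": ["Y"], "Y": ["Z"], "Z": []}
has_go_term = {"A": "X", "B": "Y", "C": "Z"}


def genes_for_term(term: str) -> set[str]:
    """Genes annotated to `term` or to any term that reaches it via IS_A."""
    # Invert the edges so that we can walk from a term to its descendants.
    children: dict[str, list[str]] = {}
    for child, parents in is_a.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)

    # Breadth-first walk over zero or more IS_A steps, starting at `term`.
    reachable = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in reachable:
                reachable.add(child)
                queue.append(child)

    return {gene for gene, annotated in has_go_term.items()
            if annotated in reachable}


print(sorted(genes_for_term("Y")))
```

Gene C is excluded because its term Z is a parent of Y rather than a descendant, which matches the behaviour described above.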

However, ultimately it is the same information that must be joined up in either database system and the Cypher syntax is simply that – a syntactic layer on top of a computing engine. As one delves under the hood, writing Java code to hook directly into the Neo4J engine, it becomes apparent that there is still a lot of processing going on behind the scenes. Whilst the syntax is enticing, the work done to return the results of that concise query is as complex as would be required in any other database system. As an example, if we wish to see whether a specific node has a particular type of relationship to another specific node, then we must retrieve all the relationships from the first node, check their type and then look at the opposite end of each relationship to compare it against the target node. The API to the database does not provide ready access to that information directly – the code must do it. The difference, therefore, between the Cypher query and the SQL query is simply the syntax.
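The three steps described above (retrieve the relationships, check the type, compare the opposite end) can be modelled with a few lines of Python. The classes below are hypothetical in-memory stand-ins written for illustration only; they are not the Neo4J Java API, and the sketch ignores relationship direction.

```python
from dataclasses import dataclass, field


# Hypothetical in-memory stand-ins for graph nodes and relationships;
# these are NOT the Neo4J Java API, only a model of the steps described.
@dataclass(eq=False)
class Node:
    name: str
    relationships: list["Relationship"] = field(default_factory=list, repr=False)


@dataclass(eq=False)
class Relationship:
    rel_type: str
    start: Node
    end: Node

    def other_end(self, node: Node) -> Node:
        """Return the node at the opposite end of this relationship."""
        return self.end if node is self.start else self.start


def connect(start: Node, rel_type: str, end: Node) -> None:
    """Create a relationship and register it on both of its end nodes."""
    rel = Relationship(rel_type, start, end)
    start.relationships.append(rel)
    end.relationships.append(rel)


def has_relationship(source: Node, rel_type: str, target: Node) -> bool:
    # 1. retrieve every relationship attached to the first node,
    # 2. check its type,
    # 3. compare the node at the opposite end with the target
    #    (direction is ignored in this sketch).
    for rel in source.relationships:
        if rel.rel_type == rel_type and rel.other_end(source) is target:
            return True
    return False


marker, exon, chrom = Node("marker_1"), Node("exon_1"), Node("chr1")
connect(marker, "MAPS_TO", chrom)
connect(marker, "IS_IN_EXON", exon)
print(has_relationship(marker, "IS_IN_EXON", exon))
print(has_relationship(marker, "MAPS_TO", exon))
```

The cost of the check grows with the number of relationships on the first node, which is the hidden work that the concise query syntax does not show.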

In any large scale system – a system large enough to put a user-facing application in front of – all the data wrangling would likely be delegated to a Data Access Object or Service layer in any case. The Neo4J interface to those software engineering-prescribed layers is primarily direct through the Java API, which renders moot the advantages of the Cypher language.

Ultimately, Neo4J offers a relatively simple to use database system with a particularly attractive browser-based front end. The Cypher query language is simple and expressive, albeit not as intuitively obvious as the marketing claims would have users believe. However, shortcomings in backup and monitoring capabilities in the free version and the presence of potentially data-damaging bugs in the code base make it impossible to recommend the adoption of this version of the software at present.

PostgreSQL, meanwhile, continues to provide a solid database engine at no up-front cost, with a performance in our limited testing environment that comfortably exceeded that extracted from the graph database.


Bibliography

[1] DEFRA. Agriculture in the United Kingdom. https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/629226/AUK-2016-17jul17.pdf – retrieved 10th August 2017, April 2017.

[2] DEFRA. Corrections to the Total Income from Farming in the United Kingdom. https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/615850/agriaccounts-tiffstatsnotice-25may17.pdf – retrieved 10th August 2017, May 2017.

[3] L. E. Kuentzel. New codes for Hollerith-type punched cards – to sort infrared absorption and chemical structure data. Analytical Chemistry, pages 1413–1418, 1951.

[4] Kristi L. Berg, Tom Seymour, and Richa Goel. History of databases. International Journal of Management and Information Systems (Online), 17(1):29, 2013.

[5] Gerard O’Regan. History of Databases, pages 275–283. Springer International Publishing, Cham, 2016.

[6] E. F. Codd. A relational model of data for large shared data banks. Commun. ACM, 13(6):377–387, June 1970.

[7] Wikipedia NoSQL page. https://en.wikipedia.org/wiki/NoSQL – Accessed 15th August.

[8] Graph DB vs RDBMS - Neo4J Developer Documentation – Accessed 15th August 2017.

[9] Nucleic Acids Research - Database Issue. https://academic.oup.com/nar/issue/45/D1 – Accessed 15th August 2017, 2017.


[10] Christina A. Vivelo, Ricky Wat, Charul Agrawal, Hui Yi Tee, and Anthony K. L. Leung. ADPriboDB: The database of ADP-ribosylated proteins. Nucleic Acids Research, 45(D1):D204–D209, 2017.

[11] Pawel Dabrowski-Tumanski, Aleksandra I. Jarmolinska, Wanda Niemyska, Eric J. Rawdon, Kenneth C. Millett, and Joanna I. Sulkowska. LinkProt: a database collecting information about biological links. Nucleic Acids Research, 45(D1):D243–D249, 2017.

[12] Pora Kim, Junfei Zhao, Pinyi Lu, and Zhongming Zhao. mutLBSgeneDB: mutated ligand binding site gene database. Nucleic Acids Research, 45(D1):D256–D263, 2017.

[13] Andrei L. Lomize, Mikhail A. Lomize, Shean R. Krolicki, and Irina D. Pogozheva. Membranome: a database for proteome-wide analysis of single-pass membrane proteins. Nucleic Acids Research, 45(D1):D250–D255, 2017.

[14] Bronwen L. Aken, Premanand Achuthan, Wasiu Akanni, M. Ridwan Amode, Friederike Bernsdorff, Jyothish Bhai, Konstantinos Billis, Denise Carvalho-Silva, Carla Cummins, Peter Clapham, Laurent Gil, Carlos García Girón, Leo Gordon, Thibaut Hourlier, Sarah E. Hunt, Sophie H. Janacek, Thomas Juettemann, Stephen Keenan, Matthew R. Laird, Ilias Lavidas, Thomas Maurel, William McLaren, Benjamin Moore, Daniel N. Murphy, Rishi Nag, Victoria Newman, Michael Nuhn, Chuang Kee Ong, Anne Parker, Mateus Patricio, Harpreet Singh Riat, Daniel Sheppard, Helen Sparrow, Kieron Taylor, Anja Thormann, Alessandro Vullo, Brandon Walts, Steven P. Wilder, Amonida Zadissa, Myrto Kostadima, Fergal J. Martin, Matthieu Muffato, Emily Perry, Magali Ruffier, Daniel M. Staines, Stephen J. Trevanion, Fiona Cunningham, Andrew Yates, Daniel R. Zerbino, and Paul Flicek. Ensembl 2017. Nucleic Acids Research, 45(D1):D635–D642, 2017.

[15] Neil Bartley, Andy Law, David Speed and Trevor Paterson. The ArkDB genetic linkage map drawing system. http://www.thearkdb.org/arkdb/ – Accessed 15th August 2017.

[16] Andy Law, David Speed, Neil Bartley, Fahad Ifthkar, Pauline Ward, Paul Nelson,Paul Shaw, and Trevor Paterson. The ResSpecies database system. http://www.resspecies.org/.

[17] Graph databases in life and health sciences workshop. https://www.eventbrite.com/e/neo4j-life-health-sciences-day-berlin-tickets-33238223421# – Accessed 15th August 2017.

[18] EMBL-EBI Ontology Lookup Service. http://www.ebi.ac.uk/ols/index – Accessed 15th August 2017.


[19] The Reactome pathway database. http://reactome.org – Accessed 15th August 2017.

[20] Thilo Muth, Alexander Behne, Robert Heyer, Fabian Kohrs, Dirk Benndorf, Marcus Hoffmann, Miro Lehtevä, Udo Reichl, Lennart Martens, and Erdmann Rapp. The MetaProteomeAnalyzer: a powerful open-source software suite for metaproteomics data analysis and interpretation. Journal of Proteome Research, 14(3):1557–1565, 2015. PMID: 25660940.

[21] DB Engines Relational Database Rankings. https://db-engines.com/en/ranking/relational+dbms – Accessed on 13th August 2017.

[22] Oracle Database. https://www.oracle.com/database/index.html – Accessed 15th August 2017.

[23] MySQL Database. https://www.mysql.com – Accessed 15th August 2017.

[24] Microsoft SQL Server. https://www.microsoft.com/en-us/sql-server/ – Accessed 15th August 2017.

[25] PostgreSQL Database. https://www.postgresql.org – Accessed 15th August 2017.

[26] Actian X Database. https://www.actian.com/data-management/ingres-sql-rdbms/ – Accessed 15th August 2017.

[27] Wikipedia – ACID. https://en.wikipedia.org/wiki/ACID.

[28] Introduction to InnoDB - MySQL Documentation. https://dev.mysql.com/doc/refman/5.6/en/innodb-introduction.html – Accessed 15th August 2017.

[29] Neo4J website. https://neo4j.com – Accessed 15th August 2017.

[30] DB Engines Graph Database Rankings. https://db-engines.com/en/ranking/graph+dbms – Accessed on 13th August 2017.

[31] NetBeans website. https://netbeans.org, 2017.

[32] Oracle Java website. https://java.com/en/.

[33] Apache Maven website. https://www.apache.org. Accessed 7th August 2017.

[34] Juan Fernando Medrano and Estuardo-Aguilar Cordova. Genotyping of bovine kappa-casein loci following DNA sequence amplification. Nat Biotech, 8(2):144–146, 02 1990.


[35] Masao Ota, Takeshi Seki, Hirofumi Fukushima, Kimiyosi Tsuji, and Hidetoshi Inoko. HLA-DRB1 genotyping by modified PCR-RFLP method combined with group-specific primers. Tissue Antigens, 39(4):187–202, 1992.

[36] Janet S. Ziegle, Ying Su, Kevin P. Corcoran, Li Nie, P. Eric Mayrand, Louis B. Hoff, Lincoln J. McBride, Mel N. Kronick, and Scott R. Diehl. Application of automated DNA sizing technology for genotyping microsatellite loci. Genomics, 14(4):1026–1031, 1992.

[37] H. Khatib, E. Genislav, M. Soller, L. B. Crittenden, and N. Bumstead. Sequence-tagged microsatellite sites as markers in chicken reference and resource populations. Animal Genetics, 24(5):355–362, 1993.

[38] LaDeana W. Hillier, Webb Miller, Ewan Birney, Wesley Warren, Ross C. Hardison, Chris P. Ponting, Peer Bork, David W. Burt, Martien A. M. Groenen, Mary E. Delany, Jerry B. Dodgson, Asif T. Chinwalla, Paul F. Cliften, Sandra W. Clifton, Kimberly D. Delehaunty, Catrina Fronick, Robert S. Fulton, Tina A. Graves, Colin Kremitzki, Dan Layman, Vincent Magrini, John D. McPherson, Tracie L. Miner, Patrick Minx, William E. Nash, Michael N. Nhan, Joanne O. Nelson, Lachlan G. Oddy, Craig S. Pohl, Jennifer Randall-Maher, Scott M. Smith, John W. Wallis, Shiaw-Pyng Yang, Michael N. Romanov, Catherine M. Rondelli, Bob Paton, Jacqueline Smith, David Morrice, Laura Daniels, Helen G. Tempest, Lindsay Robertson, Julio S. Masabanda, Darren K. Griffin, Alain Vignal, Valerie Fillon, Lina Jacobbson, Susanne Kerje, Leif Andersson, Richard P. M. Crooijmans, Jan Aerts, Jan J. van der Poel, Hans Ellegren, Randolph B. Caldwell, Simon J. Hubbard, Darren V. Grafham, Andrzej M. Kierzek, Stuart R. McLaren, Ian M. Overton, Hiroshi Arakawa, Kevin J. Beattie, Yuri Bezzubov, Paul E. Boardman, James K. Bonfield, Michael D. R. Croning, Robert M. Davies, Matthew D. Francis, Sean J. Humphray, Carol E. Scott, Ruth G. Taylor, Cheryll Tickle, William R. A. Brown, Jane Rogers, Jean-Marie Buerstedde, Stuart A. Wilson, Lisa Stubbs, Ivan Ovcharenko, Laurie Gordon, Susan Lucas, Marcia M. Miller, Hidetoshi Inoko, Takashi Shiina, Jim Kaufman, Jan Salomonsen, Karsten Skjoedt, Gane Ka-Shu Wong, Jun Wang, Bin Liu, Jian Wang, Jun Yu, Huanming Yang, Mikhail Nefedov, Maxim Koriabine, Pieter J. deJong, Leo Goodstadt, Caleb Webber, Nicholas J. Dickens, Ivica Letunic, Mikita Suyama, David Torrents, Christian von Mering, Evgeny M. Zdobnov, Kateryna Makova, Anton Nekrutenko, Laura Elnitski, Pallavi Eswara, David C. King, Shan Yang, Svitlana Tyekucheva, Anusha Radakrishnan, Robert S. Harris, Francesca Chiaromonte, James Taylor, Jianbin He, Monique Rijnkels, Sam Griffiths-Jones, Abel Ureta-Vidal, Michael M. Hoffman, Jessica Severin, Stephen M. J. Searle, Andy S. Law, David Speed, Dave Waddington, Ze Cheng, Eray Tuzun, Evan Eichler, Zhirong Bao, Paul Flicek, David D. Shteynberg, Michael R. Brent, Jacqueline M. Bye, Elizabeth J. Huckle, Sourav Chatterji, Colin Dewey, Lior Pachter, Andrei Kouranov, Zissimos Mourelatos, Artemis G. Hatzigeorgiou, Andrew H. Paterson, Robert Ivarie, Mikael Brandstrom, Erik Axelsson, Niclas Backstrom, Sofia Berlin, Matthew T. Webster, Olivier Pourquie, Alexandre Reymond, Catherine Ucla, Stylianos E. Antonarakis, Manyuan Long, J. J. Emerson, Esther Betrán, Isabelle Dupanloup, Henrik Kaessmann, Angie S. Hinrichs, Gill Bejerano, Terrence S. Furey, Rachel A. Harte, Brian Raney, Adam Siepel, W. James Kent, David Haussler, Eduardo Eyras, Robert Castelo, Josep F. Abril, Sergi Castellano, Francisco Camara, Genis Parra, Roderic Guigo, Guillaume Bourque, Glenn Tesler, Pavel A. Pevzner, Arian Smit, Lucinda A. Fulton, Elaine R. Mardis, and Richard K. Wilson. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature, 432(7018):695–716, 12 2004.

[39] Martien A. M. Groenen, Alan L. Archibald, Hirohide Uenishi, Christopher K. Tuggle, Yasuhiro Takeuchi, Max F. Rothschild, Claire Rogel-Gaillard, Chankyu Park, Denis Milan, Hendrik-Jan Megens, Shengting Li, Denis M. Larkin, Heebal Kim, Laurent A. F. Frantz, Mario Caccamo, Hyeonju Ahn, Bronwen L. Aken, Anna Anselmo, Christian Anthon, Loretta Auvil, Bouabid Badaoui, Craig W. Beattie, Christian Bendixen, Daniel Berman, Frank Blecha, Jonas Blomberg, Lars Bolund, Mirte Bosse, Sara Botti, Zhan Bujie, Megan Bystrom, Boris Capitanu, Denise Carvalho-Silva, Patrick Chardon, Celine Chen, Ryan Cheng, Sang-Haeng Choi, William Chow, Richard C. Clark, Christopher Clee, Richard P. M. A. Crooijmans, Harry D. Dawson, Patrice Dehais, Fioravante De Sapio, Bert Dibbits, Nizar Drou, Zhi-Qiang Du, Kellye Eversole, Joao Fadista, Susan Fairley, Thomas Faraut, Geoffrey J. Faulkner, Katie E. Fowler, Merete Fredholm, Eric Fritz, James G. R. Gilbert, Elisabetta Giuffra, Jan Gorodkin, Darren K. Griffin, Jennifer L. Harrow, Alexander Hayward, Kerstin Howe, Zhi-Liang Hu, Sean J. Humphray, Toby Hunt, Henrik Hornshoj, Jin-Tae Jeon, Patric Jern, Matthew Jones, Jerzy Jurka, Hiroyuki Kanamori, Ronan Kapetanovic, Jaebum Kim, Jae-Hwan Kim, Kyu-Won Kim, Tae-Hun Kim, Greger Larson, Kyooyeol Lee, Kyung-Tai Lee, Richard Leggett, Harris A. Lewin, Yingrui Li, Wansheng Liu, Jane E. Loveland, Yao Lu, Joan K. Lunney, Jian Ma, Ole Madsen, Katherine Mann, Lucy Matthews, Stuart McLaren, Takeya Morozumi, Michael P. Murtaugh, Jitendra Narayan, Dinh Truong Nguyen, Peixiang Ni, Song-Jung Oh, Suneel Onteru, Frank Panitz, Eung-Woo Park, Hong-Seog Park, Geraldine Pascal, Yogesh Paudel, Miguel Perez-Enciso, Ricardo Ramirez-Gonzalez, James M. Reecy, Sandra Rodriguez-Zas, Gary A. Rohrer, Lauretta Rund, Yongming Sang, Kyle Schachtschneider, Joshua G. Schraiber, John Schwartz, Linda Scobie, Carol Scott, Stephen Searle, Bertrand Servin, Bruce R. Southey, Goran Sperber, Peter Stadler, Jonathan V. Sweedler, Hakim Tafer, Bo Thomsen, Rashmi Wali, Jian Wang, Jun Wang, Simon White, Xun Xu, Martine Yerle, Guojie Zhang, Jianguo Zhang, Jie Zhang, Shuhong Zhao, Jane Rogers, Carol Churcher, and Lawrence B. Schook. Analyses of pig genomes provide insight into porcine demography and evolution. Nature, 491(7424):393–398, 11 2012.

[40] The Bovine Genome Sequencing and Analysis Consortium, Christine G. Elsik, Ross L. Tellam, and Kim C. Worley. The genome sequence of taurine cattle: A window to ruminant biology and evolution. Science, 324(5926):522–528, 2009.

[41] Martien AM Groenen, Hendrik-Jan Megens, Yalda Zare, Wesley C. Warren, LaDeana W. Hillier, Richard PMA Crooijmans, Addie Vereijken, Ron Okimoto, William M. Muir, and Hans H. Cheng. The development and characterization of a 60K SNP chip for chicken. BMC Genomics, 12(1):274, May 2011.

[42] Antonio M. Ramos, Richard P. M. A. Crooijmans, Nabeel A. Affara, Andreia J. Amaral, Alan L. Archibald, Jonathan E. Beever, Christian Bendixen, Carol Churcher, Richard Clark, Patrick Dehais, Mark S. Hansen, Jakob Hedegaard, Zhi-Liang Hu, Hindrik H. Kerstens, Andy S. Law, Hendrik-Jan Megens, Denis Milan, Danny J. Nonneman, Gary A. Rohrer, Max F. Rothschild, Tim P. L. Smith, Robert D. Schnabel, Curt P. Van Tassell, Jeremy F. Taylor, Ralph T. Wiedmann, Lawrence B. Schook, and Martien A. M. Groenen. Design of a high density SNP genotyping assay in the pig using SNPs identified and characterized by next generation sequencing technology. PLOS ONE, 4(8):1–13, 08 2009.

[43] Didier Boichard, Hoyoung Chung, Romain Dassonneville, Xavier David, André Eggen, Sébastien Fritz, Kimberly J. Gietzen, Ben J. Hayes, Cynthia T. Lawley, Tad S. Sonstegard, Curtis P. Van Tassell, Paul M. VanRaden, Karine A. Viaud-Martinez, George R. Wiggans, and for the Bovine LD Consortium. Design of a bovine low-density SNP array optimized for imputation. PLOS ONE, 7(3):1–10, 03 2012.

[44] Gene Ontology Consortium. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research, 32(supplement 1):D258–D261, 2004.

[45] NCI genetic simulation software packages comparison webpage. https://popmodels.cancercontrol.cancer.gov/gsr/search/.

[46] Mehdi Sargolzaei and Flavio S Schenkel. QMSim: a large-scale genome simulator for livestock. Bioinformatics, 25(5):680–681, 2009.

[47] QMSim package website. http://www.aps.uoguelph.ca/~msargol/qmsim/.

[48] Wesley C. Warren, LaDeana W. Hillier, Chad Tomlinson, Patrick Minx, Milinn Kremitzki, Tina Graves, Chris Markovic, Nathan Bouk, Kim D. Pruitt, Francoise Thibaud-Nissen, Valerie Schneider, Tamer A. Mansour, C. Titus Brown, Aleksey Zimin, Rachel Hawken, Mitch Abrahamsen, Alexis B. Pyrkosz, Mireille Morisson, Valerie Fillon, Alain Vignal, William Chow, Kerstin Howe, Janet E. Fulton, Marcia M. Miller, Peter Lovell, Claudio V. Mello, Morgan Wirthlin, Andrew S. Mason, Richard Kuo, David W. Burt, Jerry B. Dodgson, and Hans H. Cheng. A new chicken genome assembly provides insight into avian genome structure. G3: Genes, Genomes, Genetics, 7(1):109–117, 2017.

[49] NCBI genome overview for Gallus gallus. https://www.ncbi.nlm.nih.gov/genome/?term=gallus%20gallus.

[50] Damian Smedley, Syed Haider, Steffen Durinck, Luca Pandini, Paolo Provero, James Allen, Olivier Arnaiz, Mohammad Hamza Awedh, Richard Baldock, Giulia Barbiera, Philippe Bardou, Tim Beck, Andrew Blake, Merideth Bonierbale, Anthony J. Brookes, Gabriele Bucci, Iwan Buetti, Sarah Burge, Cédric Cabau, Joseph W. Carlson, Claude Chelala, Charalambos Chrysostomou, Davide Cittaro, Olivier Collin, Raul Cordova, Rosalind J. Cutts, Erik Dassi, Alex Di Genova, Anis Djari, Anthony Esposito, Heather Estrella, Eduardo Eyras, Julio Fernandez-Banet, Simon Forbes, Robert C. Free, Takatomo Fujisawa, Emanuela Gadaleta, Jose M. Garcia-Manteiga, David Goodstein, Kristian Gray, José Afonso Guerra-Assunção, Bernard Haggarty, Dong-Jin Han, Byung Woo Han, Todd Harris, Jayson Harshbarger, Robert K. Hastings, Richard D. Hayes, Claire Hoede, Shen Hu, Zhi-Liang Hu, Lucie Hutchins, Zhengyan Kan, Hideya Kawaji, Aminah Keliet, Arnaud Kerhornou, Sunghoon Kim, Rhoda Kinsella, Christophe Klopp, Lei Kong, Daniel Lawson, Dejan Lazarevic, Ji-Hyun Lee, Thomas Letellier, Chuan-Yun Li, Pietro Lio, Chu-Jun Liu, Jie Luo, Alejandro Maass, Jerome Mariette, Thomas Maurel, Stefania Merella, Azza Mostafa Mohamed, Francois Moreews, Ibounyamine Nabihoudine, Nelson Ndegwa, Céline Noirot, Cristian Perez-Llamas, Michael Primig, Alessandro Quattrone, Hadi Quesneville, Davide Rambaldi, James Reecy, Michela Riba, Steven Rosanoff, Amna Ali Saddiq, Elisa Salas, Olivier Sallou, Rebecca Shepherd, Reinhard Simon, Linda Sperling, William Spooner, Daniel M. Staines, Delphine Steinbach, Kevin Stone, Elia Stupka, Jon W. Teague, Abu Z. Dayem Ullah, Jun Wang, Doreen Ware, Marie Wong-Erasmus, Ken Youens-Clark, Amonida Zadissa, Shi-Jian Zhang, and Arek Kasprzyk. The BioMart community portal: an innovative alternative to large, centralized data repositories. Nucleic Acids Research, 43(W1):W589, 2015.

[51] Rhoda J. Kinsella, Andreas Kähäri, Syed Haider, Jorge Zamora, Glenn Proctor, Giulietta Spudich, Jeff Almeida-King, Daniel Staines, Paul Derwent, Arnaud Kerhornou, Paul Kersey, and Paul Flicek. Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database, 2011:bar030, 2011.

[52] Paul Julian Kersey, James E. Allen, Irina Armean, Sanjay Boddu, Bruce J. Bolt, Denise Carvalho-Silva, Mikkel Christensen, Paul Davis, Lee J. Falin, Christoph Grabmueller, Jay Humphrey, Arnaud Kerhornou, Julia Khobova, Naveen K. Aranganathan, Nicholas Langridge, Ernesto Lowy, Mark D. McDowall, Uma Maheswari, Michael Nuhn, Chuang Kee Ong, Bert Overduin, Michael Paulini, Helder Pedro, Emily Perry, Giulietta Spudich, Electra Tapanari, Brandon Walts, Gareth Williams, Marcela Tello-Ruiz, Joshua Stein, Sharon Wei, Doreen Ware, Daniel M. Bolser, Kevin L. Howe, Eugene Kulesha, Daniel Lawson, Gareth Maslen, and Daniel M. Staines. Ensembl Genomes 2016: more genomes, more complexity. Nucleic Acids Research, 44(D1):D574, 2016.

[53] Ensembl release 89 archive website. http://may2017.archive.ensembl.org/index.html.

[54] Neo4J Channel on Slack. http://neo4j-users.slack.com.

[55] Sign-up page for Neo4J Slack channel. https://neo4j.com/developer/slack/.

[56] Shaun Purcell, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel A. R. Ferreira, David Bender, Julian Maller, Pamela Sklar, Paul I. W. de Bakker, Mark J. Daly, and Pak C. Sham. PLINK: A tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics, 81(3):559–575, 2007.


Appendix A

Database Software

A.1 PostgreSQL

postgresql-9.6.3.tar.gz (183622c0f7e283ef47ca6831a2740ea8)

A.2 Neo4J

neo4j-community-3.2.1-unix.tar.gz (d7f9792ae0b0d12f3886d013353bab0c)

neo4j-community-3.2.3-unix.tar.gz (95db078cd70e2f5978590b6d331161d4)
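
The 32-hexadecimal-digit values listed beside each file name have the length of MD5 digests. Assuming that is what they are, a downloaded archive could be checked against the listed value with a sketch along the following lines. The helper names `md5_of_file` and `matches_listed_checksum` are illustrative and do not come from the thesis.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed_checksum):
    """Compare a file's MD5 digest with a value listed in an appendix.

    Assumption: the listed values are lower-case MD5 hex digests.
    """
    return md5_of_file(path) == listed_checksum.strip().lower()
```

For example, `matches_listed_checksum("postgresql-9.6.3.tar.gz", "183622c0f7e283ef47ca6831a2740ea8")` would report whether a local copy of the PostgreSQL archive agrees with the value given in A.1.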


Appendix B

Data Simulation

B.1 QMSim Software

QMSim_Linux.zip (a28c25f4183dddfc06a9b4ef56d24737)

QMSim_Mac.zip (277b7c6cbf9f23139715fbb43bebb83f)

B.2 Minimal Exemplar Data Set

B.2.1 Input Files

small-test-population.prm (0886f28681ad195d7c52a79323839ab0)

small-test-population-seed-file (ed740803e92cb14496def8d0669aa87f)

B.2.2 Output Files

lm_mrk_001.txt (fc4263c0f6816b72f9d3c67156d7cb4d)

lm_qtl_001.txt (3099243be28789e220e675517bf27cbc)

p1_data_001.txt (a7fa001d671cc2d15cf633db6574cec4)

p1_mrk_001.txt (5057a38b5d293ee00c0630f08f664ce2)

p1_qtl_001.txt (2d375b4ea737f32849f62855baaf2443)

p1_stat_001.txt (b5030dbcd599ffa4b5b15b84b7c644c7)


report.txt (5d41a897f78f1c6a19a77a7a4c83845b)¹

B.3 Representative Medium Data Set

B.3.1 Pre-processing Files

chicken-chromosome-sizes.txt (ec2e82402278731d90d1ca570a08701e)

build_params.py (c388c138dfa323b8c527f9563c5cdebb)

B.3.2 Input Files

main-50k.prm (981c7dd95e410bf8b14c7fdf71c40549)

main-50k-seed-file (9ba9d7419baab015afa1298ccfbb55de)

B.3.3 Output Files

lm_mrk_001.txt (924e3d9aa5aaef9fb29db5c8edd282be)

lm_qtl_001.txt (82e0133318d984fb7fa967b3500806e2)

p2_data_001.txt (b2302097e4c884c4cb33b9f8a8ee3531)

p2_mrk_001.txt (61eba14fd1aad75125165881c6f8c5b0)

p2_qtl_001.txt (bc89d36fe322103e75db18d1c0b87b1e)

p2_stat_001.txt (64e357c204094940611e7de12343a4fa)

report.txt (14eab4f1ec3802494ad3539ad5afb100)²

B.3.4 Processed Output Files

medium-markers.txt (066f5d6c5b7ef24f9d8780d5b3cd462d)

medium-pedigree.txt (235e218b00dfa55cfbc6a69e7abd59e7)

¹ The report.txt file contains a timestamp on the first line and a record of total execution time, and thus is not directly reproducible. The checksum reported here is that of the file obtained after our simulation runs. Equivalent files should be comparable with one another after removal of the relevant lines of the file using a standard Linux utility such as awk.

² See footnote 1
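
The comparison described in footnote 1 can be sketched in Python instead of awk. The sketch below skips the first line (the timestamp) and any line flagged by a caller-supplied predicate, then hashes what remains. The wording of the execution-time record in report.txt is not given here, so the `mentions_elapsed_time` rule is purely an assumption and would need adjusting to the real file.

```python
import hashlib


def stable_md5(path, is_volatile):
    """MD5 of a file, ignoring line 1 and any line flagged by `is_volatile`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for line_number, line in enumerate(handle, start=1):
            if line_number == 1 or is_volatile(line):
                continue  # skip the timestamp line and run-specific lines
            digest.update(line)
    return digest.hexdigest()


def mentions_elapsed_time(line):
    # Hypothetical rule: the real wording of the execution-time record
    # in report.txt is an assumption of this sketch.
    return b"time" in line.lower()
```

Two report files from equivalent runs would then be expected to give the same `stable_md5` value. That value will not equal the checksum listed in this appendix, which was taken over the whole file.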


B.4 Representative Large Data Set

B.4.1 Input Files

main-5-million.prm (181eac5a6e599895ed8653d2aaaf1caa)

main-5-million-seed-file (8dc22ec8b9743fde66df0cafbaeabd0c)

B.4.2 Output Files

lm_mrk_001.txt (113c9ba9b54d004fd961ac339858e4ef)

lm_qtl_001.txt (d2af6f9edce1d407c51daa30973c121e)

p2_data_001.txt (0543fa4b48aede8a5fc817436132a68e)

p2_mrk_001.txt (c95e839f3ffdc6dc98ff28cf15fc1271)

p2_qtl_001.txt (798902c05c7e248e3fafb632b7b48fb0)

p2_stat_001.txt (9484abf0a23c7477068566b70a18e61a)

report.txt (82c6f50082246ff4dc077fb198026250)³

B.4.3 Processed Output Files

large-markers.txt (3933e9a4b387e4e694eb84d7293051b2)

large-pedigree.txt (73d9fd47caac5b29d0ef29885b5439f2)

³ See footnote 1


Appendix C

Pre-Processing Prior to Loading

split-marker-genotype.pl (1cf28abe73579e1b3e97dd674d7b35e5)


Appendix D

Gene Build Data Files

The following comma-separated text files were downloaded from the Ensembl BioMart system as described in Section 5.5. Listed are the file names, checksums and the names of the columns as defined in the first line of the file contents.

chromosomes.txt (4f2dc853c870f3bc9102284099ff11d0)
Contains: Chromosome/scaffold name

exons-to-chromosomes.txt (9cf9347068ac73f294726aaeda12e27f)
Contains: Exon stable ID, Chromosome/scaffold name, Strand, Exon region start (bp), Exon region end (bp)

exons-to-transcripts.txt (df574da8295848912af73637950194fa)
Contains: Exon stable ID, Transcript stable ID, Exon rank in transcript

exons.txt (4703ffc0a7575f73f8567708cfbc8302)
Contains: Exon stable ID

genes-to-chromosomes.txt (044fa1940181b23ebf864bacda85f465)
Contains: Chromosome/scaffold name, Gene stable ID, Gene start (bp), Gene end (bp), Strand

genes-to-goterms.txt (fd69542c4db1e83b601c02c0aa0680a0)
Contains: Gene stable ID, GO term accession, GO term evidence code

genes.txt (ef960b49ffe28caac27f7343fc7db427)
Contains: Gene stable ID, Gene name, Gene description, Gene type


transcript-3prime-ends.txt (846f987a0d05784001012026d7926ca8)
Contains: Transcript stable ID, Chromosome/scaffold name, Strand, 3’-UTR start, 3’-UTR end

transcript-5prime-ends.txt (c543c0365eebe27d60c61c848a5cb3e1)
Contains: Transcript stable ID, Chromosome/scaffold name, Strand, 5’-UTR start, 5’-UTR end

transcript-ends.txt (3ede2f0ad6c4da77ca391c94629fd666)
Contains: Transcript stable ID, Chromosome/scaffold name, Strand, 5’-UTR start, 5’-UTR end, 3’-UTR start, 3’-UTR end

transcripts-to-chromosomes.txt (78afe4c540a3d839aed985b77aca67ba)
Contains: Transcript stable ID, Chromosome/scaffold name, Transcript start (bp), Transcript end (bp), Strand

transcripts-to-genes.txt (c18eaba044522ee528b2504a942866ec)
Contains: Transcript stable ID, Gene stable ID

transcripts.txt (a89a3fdc7fabe102ab74f3caea18b13b)
Contains: Transcript stable ID, Transcript name
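
Because each of these files names its columns on the first line, a reader can check the header before using the rows. The following is a minimal sketch, assuming the files are standard comma-separated text with quoted fields where needed. The function names are illustrative and are not part of the loading scripts listed in Appendices F and G.

```python
import csv


def read_rows(path, expected_columns):
    """Read a comma-separated file whose first line names the columns."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Refuse to continue if the header differs from the documented one.
        if reader.fieldnames != list(expected_columns):
            raise ValueError(f"unexpected header in {path}: {reader.fieldnames}")
        return list(reader)


def transcripts_by_gene(rows):
    """Group transcript IDs under the gene each one belongs to."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["Gene stable ID"], []).append(row["Transcript stable ID"])
    return grouped
```

For transcripts-to-genes.txt, the expected columns would be `["Transcript stable ID", "Gene stable ID"]`, as listed above.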


Appendix E

Gene Ontology Data Files and Scripts

go-basic.obo.txt (19bb3fbaf68a59dca17a121eccf71610)
File containing the Gene Ontology definitions and descriptions, downloaded from the Gene Ontology website on 21st June 2017.

split-go-obo.pl (4e31fb78a9aa60a4633de8919128e57b)
A Perl script written to process the above Gene Ontology OBO-format file prior to data loading into the database systems.

go-terms.txt (d2b43977d9d34098739eeee9af5047c1)
Contains: The core GO terms, their unique GO ID, a description, the controlled GO name and the type of GO term (‘biological_process’, ‘molecular_function’ or ‘cellular_component’)

go-aliases.txt (0f66948bd112494e706fd20048c4c2b0)
Contains: Alias relationships between terms in the form ‘A’ is an alias of ‘B’

go-is_a.txt (d4a3e91ea65b368a93c182cf17795def)
Contains: “Is A” relationships between terms in the form ‘A’ is a ‘B’

go-part_of.txt (a85160bc3551d1df33d1311fc57dbe2a)
Contains: “Part Of” relationships between terms in the form ‘A’ is part of ‘B’

go-regulates.txt (a2d897f2fcbf0d8b0918b7c2ad1d3f17)
Contains: “Regulates” relationships between terms in the form ‘A’ regulates ‘B’, followed by an indication of any directionality of the regulation (positive, negative, unspecified)
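
The kind of splitting that split-go-obo.pl performs can be illustrated in Python. This is not the author's Perl script: it is a minimal sketch that assumes the usual OBO layout of `[Term]` stanzas made of `tag: value` lines, and it extracts only the ID, name, namespace, `is_a` and `part_of` links. The function name `parse_obo_terms` is hypothetical.

```python
def parse_obo_terms(lines):
    """Yield one dict per [Term] stanza with its id, name, namespace and links."""
    term = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; emit the previous [Term], if any.
            if term is not None:
                yield term
            # Only [Term] stanzas are collected; others (e.g. [Typedef]) are skipped.
            term = {"is_a": [], "part_of": []} if line == "[Term]" else None
        elif term is not None and ": " in line:
            tag, value = line.split(": ", 1)
            value = value.split(" ! ", 1)[0]  # drop the trailing human-readable comment
            if tag in ("id", "name", "namespace"):
                term[tag] = value
            elif tag == "is_a":
                term["is_a"].append(value)
            elif tag == "relationship" and value.startswith("part_of "):
                term["part_of"].append(value.split()[1])
    if term is not None:
        yield term
```

Each yielded dictionary could then be written out as one row of a terms file plus one row per `is_a` or `part_of` link, which matches the division into separate files described above.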


Appendix F

PostgreSQL Data Loading Script Details

postgres-load-gene-build.sql (2dd86f4efcee261e9de05d7a150f6f7c)

postgres-load-go-terms.sql (cb5b0c3b8a23016171376ae5e1b574cc)

postgres-map-gene-to-go-terms.sql (b6f404da84d01ef365349d9de13b3a83)

postgres-load-marker.sql (f7473a7c32bd40ac705146b74c047fe6)

postgres-load-pedigree.sql (f54a12c36db07e098c7e77c2ef30aba8)

postgres-map-marker-to-gene-build.sql (9356a5a3ecaf3b5a76f6dac12f501baa)

postgres-load-genotypes.sql (831ee3bc9b246fabf5a7829fcf92d4f4)


Appendix G

Neo4J Data Loading Script Details

cypher-load-gene-build.cql (00039df446c8fc2d89ec4cea3fb52e76)

cypher-load-go-terms.cql (9daa65901ea55c07e645d74bcdd180fe)

cypher-map-gene-to-go-terms.cql (158f1d7f03517ed661628c73d864463a)

cypher-load-markers.cql (260e845f650b483b29cc095db3045d70)

cypher-map-marker-to-gene-build.cql (eae765ed1c5d4107f223af371f67a956)

cypher-load-pedigree.cql (0e09cfb503be47b14d48b442ca0075ab)

cypher-load-genotypes.cql (70ec21a2d9bbf0d9f43416f8e54b5576)


Appendix H

Neo4J Unmanaged Extension Code

Listing H.1: Java code to read genotypes from a file and import them into Neo4J

    @POST
    @Path("/genotypes")
    public Response importGenotypes(String body, @Context GraphDatabaseService db)
            throws Exception {
        int count = 0;
        int lineNumber = 0;

        // Build an ArgumentParser
        ArgumentParser args;
        HashMap<String, Object> input;
        try {
            input = objectMapper.readValue(body, HashMap.class);
            args = new ArgumentParser(input);
        } catch (IOException e) {
            // throw Exceptions.invalidInput;
            throw new Exceptions(400, e.getLocalizedMessage());
        }

        long[] markers = null;

        int startIndividual = args.getStartIndividualLine();
        int endIndividual = args.getEndIndividualLine();

        int startMarker = args.getStartMarkerIndex();
        int endMarker = args.getEndMarkerIndex();

        // Make us a GenotypeBuilder object
        GenotypeBuilder gtBuilder = new WithPropertyGenotypeBuilder();
        GenotypeCodings val = args.getGenotypeCodedAs();
        switch (val) {
            // - default so we don't need to specify
            // case PROPERTIES:
            //     gtBuilder = new WithPropertyGenotypeBuilder();
            //     break;
            case RELATIONSHIP:
                gtBuilder = new NoPropertyGenotypeBuilder();
                break;
            case NONE:
                gtBuilder = new NoGenotypeBuilder();
                break;
        }

        Instant startLoad;
        Instant start = Instant.now();
        try (BufferedReader br = new BufferedReader(new FileReader(args.getFilename()))) {
            // Skip lines until we find the first Individual that we want to work with.
            // We can do using a less-than comparison because the first line is
            // header line
            while (lineNumber < startIndividual) {
                br.readLine();
                lineNumber++;
            }

            startLoad = Instant.now();

            Transaction tx = db.beginTx();
            try {
                for (String line; (line = br.readLine()) != null; ) {
                    // If we've done all the individuals that we need then
                    // break out from the file-reading loop
                    if (lineNumber > endIndividual) {
                        break;
                    }
                    lineNumber++;

                    String[] split = line.split("\\s+");

                    // Going to pre-create all the Markers and load them into an
                    // array for un-indexed access later.
                    if (markers == null) {
                        endMarker = Math.min(endMarker, split.length / 2);

                        markers = new long[split.length / 2];
                        for (int i = startMarker; i <= endMarker; i++) {
                            String markerName = "M" + (i + 1);
                            Node marker = db.findNode(Labels.Marker, NAME_PROPERTY, markerName);
                            if (marker == null) {
                                marker = db.createNode(Labels.Marker);
                                marker.setProperty(NAME_PROPERTY, markerName);
                            }
                            markers[i] = marker.getId();

                            if (i % 1000 == 0) {
                                tx.success();
                                tx.close();
                                tx = db.beginTx();
                            }
                        }

                    }

                    Node individual = db.findNode(Labels.Individual, ID_PROPERTY, split[0]);
                    if (individual == null) {
                        individual = db.createNode(Labels.Individual);
                        individual.setProperty(ID_PROPERTY, split[0]);
                    }
                    //
                    for (int i = startMarker; i <= endMarker; i++) {
                        Node marker = db.getNodeById(markers[i]);
                        // it's + 1 because first item is individual id
                        String a1 = split[(i * 2) + 1];
                        String a2 = split[(i * 2) + 2];
                        gtBuilder.BuildGenotype(individual, marker, a1, a2);
                        count++;
                        if (count % 1000 == 0) {
                            tx.success();
                            tx.close();
                            tx = db.beginTx();
                        }
                    }
                }

                tx.success();
            } finally {
                tx.close();
            }

        }
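
The indexing used in Listing H.1 implies a line layout of one individual ID followed by two allele columns per marker, with marker `i` (counting from zero) named `M(i + 1)`. The Python sketch below illustrates that layout only and involves no database. The function name is hypothetical, and clamping the upper bound to the last marker present on the line is a choice made for this sketch, not a transcription of the listing.

```python
def genotypes_from_line(line, start_marker=0, end_marker=None):
    """Return (individual_id, [(marker_name, allele1, allele2), ...]) for one line."""
    fields = line.split()
    individual_id = fields[0]
    marker_count = (len(fields) - 1) // 2
    # Clamp the requested range to the markers actually present on the line.
    last = marker_count - 1 if end_marker is None else min(end_marker, marker_count - 1)
    calls = []
    for i in range(start_marker, last + 1):
        # Offsets are +1 and +2 because field 0 holds the individual ID.
        calls.append((f"M{i + 1}", fields[2 * i + 1], fields[2 * i + 2]))
    return individual_id, calls
```

Each returned tuple corresponds to one call of `gtBuilder.BuildGenotype(individual, marker, a1, a2)` in the listing.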
