UNIVERSITY OF COPENHAGEN
FACULTY OR DEPARTMENT
PhD Thesis
Ye Yin
Evolution and Adaptation of Baboon and Mandrill Revealed by
Genome Sequencing
Academic advisor: Karsten Kristiansen, University of Copenhagen,
Denmark
This thesis has been submitted to the PhD School of The Faculty of Science, University of
Copenhagen
Submitted: March 2018
Dissertation for the degree of philosophiae doctor (PhD)
Department of Biology, University of Copenhagen
Copenhagen, Denmark
and
BGI-Research, BGI-Shenzhen
Shenzhen, China
March 2018
Author: Ye Yin
Title: Evolution and Adaptation of Baboon and Mandrill Revealed by Genome Sequencing
Academic advisor: Karsten Kristiansen, Department of Biology, University of Copenhagen,
Denmark
Submitted March 2018
Preface
This PhD project started in 2015 as a collaboration between the Department of Biology, University of
Copenhagen and BGI-Shenzhen. The work presented here was performed at both institutions
under the supervision of Professor Karsten Kristiansen.
Acknowledgements
I would like to thank my supervisor, Professor Karsten Kristiansen, for introducing me to the
cutting-edge research areas of genomics and bioinformatics and for his kind guidance on conducting
academic research.
I would also like to thank Chenglin Zhang, Deputy Director of Beijing Zoo, and Professor Rasmus
from the University of Copenhagen for kindly providing the baboon and mandrill samples used in this
study.
Additionally, I would like to thank those who participated in the crowdfunding for the baboon and
mandrill sequencing projects.
Abstract
Baboons (genus Papio) and the mandrill (Mandrillus sphinx) are phylogenetically close to humans and
can serve as valuable models for studies of primate evolution and of human disease. However, genetic
studies and genomic resources for baboon and mandrill are limited, especially compared with those
for chimpanzee and gorilla. Genome sequencing of baboon and mandrill was therefore carried out here
to construct reference genomes for these remarkable Old World monkeys.
Following sampling, DNA extraction and sequencing, 414 Gb and 426 Gb of raw sequencing data from
different libraries were generated for baboon and mandrill, respectively, using a second-generation
sequencing platform. Genome assembly was then carried out for both species. The baboon genome
assembly was 3.11 Gb, with a contig N50 of 21,659 bp and a scaffold N50 of 1,070,645 bp; the
mandrill genome assembly was 2.88 Gb, with a contig N50 of 20,483 bp and a scaffold N50 of
3,564,730 bp. In the assembled genomes, repeat content was annotated first and accounted for 42.3%
and 40.4% of the baboon and mandrill genomes, respectively. After the repeats had been masked,
evidence-based and ab initio gene annotation were combined to predict 23,867 genes in baboon and
21,906 genes in mandrill. By searching 3,023 BUSCO (Benchmarking Universal Single-Copy Orthologs)
genes against the predicted genes, the completeness of the gene sets was estimated to be 97%
(baboon) and 98% (mandrill). These comprehensive assemblies and near-complete gene sets provide new
biological insight into genetic diversity, structural variation and behavioral characteristics.
Comparative genomic analysis among primates was conducted to reveal synteny between primates and
the expansion and contraction of gene families, especially in baboon and mandrill. There were
9,930, 11,418 and 14,318 gene pairs between baboon and mandrill, macaque and baboon, and human and
macaque, respectively. In the baboon and mandrill lineage, there were 545 expanded and 618
contracted gene families; the expanded genes were significantly enriched in biosynthetic process,
structural constituent of ribosome, nucleosomal DNA binding, G-protein coupled receptor activity,
olfactory receptor activity, glucose catabolic process and peptidyl-prolyl isomerization, as well
as in the carbon fixation in photosynthetic organisms and electron transport chain pathways.
Molecular mechanisms of adaptation in baboon and mandrill, including immune, language-competence
and olfactory characters, were also investigated through comparative genomics. This study provides
genomic resources for primate species and comprehensive insights into the adaptation and evolution
of baboon and mandrill.
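As a brief illustration of the N50 statistic quoted for the assemblies above, the following minimal
Python sketch computes N50 from a list of contig or scaffold lengths. The function name n50 and the
toy lengths are illustrative assumptions; they are not taken from the assembly pipeline used in
this thesis.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)  # longest sequences first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Hypothetical toy contig lengths (bp), for illustration only.
    toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
    print(n50(toy_contigs))  # 8
```

A larger N50 means that half of the assembly is contained in longer sequences, which is why the
contig and scaffold N50 values are reported side by side for each species.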
Table of Contents
Preface
Acknowledgements
Abstract
Table of Contents
Abbreviations
List of Tables
List of Figures
1. Introduction
1.1 Baboon and its biology
1.2 Mandrill and its biology
1.3 Genomic studies on baboon and mandrill
1.3.1 Genomics of primates
1.3.2 Genomics of baboon
1.3.3 Genomics of mandrill
1.3.4 Comparative genomics in primates
1.4 Objectives
2. Materials and Methods
2.1 Sampling and sample preparation
2.2 Genome sequencing
2.2.1 Library construction and sequencing
2.2.2 Data filtering
2.2.3 Overlapping library data merging
2.2.4 K-mer analysis
2.3 Genome assembly and annotation
2.3.1 Genome assembly
2.3.2 Genome annotation
2.4 Evolutionary analysis
2.4.1 Gene family cluster
2.4.2 Phylogenetic analysis
2.4.3 Positively gene selection analysis
2.5 Comparative genomics
2.5.1 Synteny analysis of human, macaque, baboon and mandrill
2.5.2 Gene family contraction and expansion
2.5.3 Segmental duplications
2.6 Investigating molecular mechanisms of adaptation/phenotype
2.6.1 Immune character
2.6.2 Language competence
2.6.3 Olfactory character
2.6.4 Predicting binding sites of transcription factors
3. Results
3.1 Landscapes of baboon and mandrill genomes
3.1.1 Sequencing data
3.1.2 K-mer analysis
3.1.3 Genome assembly
3.1.4 Annotation results
3.2 Evolution of baboon and mandrill
3.2.1 Gene families
3.2.2 Phylogenetic analysis
3.3 Synteny among primates
3.3.1 Synteny analysis of human, macaque, baboon and mandrill
3.3.2 Gene family contraction and expansion
3.3.2 Segmental duplications
3.4 MHC comparison between human and baboon/mandrill
3.5 Language related genomic features
3.6 Olfactory receptor genes analysis
3.7 Positively selected genes
4. Discussion
5. Conclusions
6. Future perspectives
7. References
8. Appendix
Abbreviations
4D Fourfold Degenerate
AIDS Acquired Immune Deficiency Syndrome
BP Biological Process
BUSCO Benchmarking Universal Single-Copy Orthologs
CC Cellular Component
CEGMA Core Eukaryotic Genes Mapping Approach
ChIP-seq Chromatin Immunoprecipitation Sequencing
CMV Cytomegalovirus
DBG De Bruijn Graph
EBV Epstein-Barr Virus
EC Enzyme Commission
EST Expressed Sequence Tag
GO Gene Ontology
HAV Hepatitis A Virus
HGNC HUGO Gene Nomenclature Committee
HIV Human Immunodeficiency Virus
HLA Human Leukocyte Antigen
IPEX Immunodysregulation Polyendocrinopathy Enteropathy X-Linked
KEGG Kyoto Encyclopedia of Genes and Genomes
LINE Long Interspersed Nuclear Element
LTR Long Terminal Repeat
MF Molecular Function
MHC Major Histocompatibility Complex
MYA Million Years Ago
NGS Next Generation Sequencing
OR Olfactory Receptor
PCR Polymerase Chain Reaction
PGC Primordial Germ Cells
PPIA Peptidylprolyl Isomerase A
PSMC Pairwise Sequentially Markovian Coalescent
QTL Quantitative Trait Loci
ROS Reactive Oxygen Species
SD Segmental Duplication
SINE Short Interspersed Nuclear Element
SIV Simian Immunodeficiency Virus
SMRT Single-Molecule Real-Time Sequencing
SNPRC Southwest National Primate Research Center
TE Transposable Element
WSSD Whole-Genome Shotgun Sequence Detection
List of Tables
Table 1.1 Baboons as animal models for studies of human diseases and vaccines.
Table 1.2 Summary of mandrill as models in human diseases and vaccines studies/tests.
Table 1.3 Published primate genome sequences.
Table 3.1 Statistics of baboon and mandrill raw sequencing data.
Table 3.2 The information of 17-mer statistics.
Table 3.3 Statistics of the genome assemblies.
Table 3.4 Repeat contents of baboon, mandrill, human and mouse.
Table 3.5 Summary of gene annotation in baboon genome.
Table 3.6 Summary of gene annotation in mandrill genome.
Table 3.7 Assessment of gene sets using BUSCO.
Table 3.8 Function annotation of the final gene sets.
Table 3.9 Gene family clustering in the seven species.
Table 3.10 Olfactory receptor gene copy number in five species.
Table 8.1 Statistics of baboon and mandrill clean/filtered sequencing data.
Table 8.2 Prediction of the repeats in baboon genome.
Table 8.3 General statistics of repeats in mandrill genome.
Table 8.4 Categories of TEs in baboon genome.
Table 8.5 Categories of TEs in mandrill genome.
Table 8.6 Non-coding RNA genes in baboon genome.
Table 8.7 Non-coding RNA genes in mandrill genome.
Table 8.8 GO enrichment of unique gene families in baboon.
Table 8.9 GO enrichment of unique gene families in mandrill.
Table 8.10 GO enrichment result of unique gene families for mandrill.
Table 8.11 Repeat content of MHC class I region for mandrill and human.
Table 8.12 GO and KEGG enrichment of the positively selected genes (PSGs).
List of Figures
Figure 1.1 Geographical distribution of baboons.
Figure 2.1 Photos of the samples selected for sequencing.
Figure 2.2 Overall process of genome annotation.
Figure 3.1 Orthologous gene clusters in the five related species.
Figure 3.2 Comparison of orthologous genes among 12 primates and mouse.
Figure 3.3 Phylogenetic tree based on single copy gene families in the 13 species.
Figure 3.5 Synteny relationship of human, macaque, baboon and mandrill.
Figure 3.6 Gene family contraction and expansion for 12 primates and mouse.
Figure 3.7 Segmental duplications in seven primate species.
Figure 3.8 Synteny between human and mandrill MHC regions.
Figure 3.9 Alignment of HLA genes with amino acid sequence for human, baboon and mandrill.
Figure 3.10 Structure of MICA and MICB gene for human and mandrill.
Figure 3.11 Amino acid sequence alignment of FOXP2 gene from human, chimpanzees, mouse, baboon and mandrill.
Figure 3.12 Expansion of the olfactory receptor gene family in baboon and mandrill.
Figure 3.13 Interaction between innate immunity for positively selected genes in mandrill.
Figure 8.1 The distribution of 17-mer frequency of baboon and mandrill.
Figure 8.2 Colinearity analysis of chr 3 for mandrill.
Figure 8.3 Sequencing depth and the location relationships of pair-end reads on MHC class I region for mandrill.
Figure 8.4 OR7E24 genes on chromosome 19 in mandrill.
1. Introduction
There are two suborders of primates, the Strepsirrhini and the Haplorhini. Haplorhines are further
split into tarsiers and simians. Simians comprise two groups, one of which is the catarrhines;
catarrhines are further split into the Old World monkeys (Cercopithecoidea) and the apes
(Hominoidea). Baboons (genus Papio) and the mandrill (Mandrillus sphinx) belong to the tribe
Papionini; they are Old World monkeys that are widely distributed in Africa. Like chimpanzees and
gorillas, which belong to Hominidae, baboon and mandrill are also closely related to humans [1].
Papio, Mandrillus and Macaca were formerly clustered in the tribe Cercocebini of the subfamily
Cercopithecinae [2], but Papio is currently grouped with Lophocebus and Mandrillus with Cercocebus
on the basis of the postcranial skeleton and the dentition [3]. The tribe Papionini diverged from
Cercopithecini around 11.5 million years ago (Mya) and comprises the subtribe Papionina, with the
genera Papio and Mandrillus, and the subtribe Macacina, with the genus Macaca [4]. Complete mtDNA
genome sequences have also supported similar phylogenetic relationships between Macaca and
Mandrillus [5].
1.1 Baboon and its biology
Baboons are Old World monkeys belonging to the genus Papio. They have close-set eyes, powerful
jaws, short tails, long muzzles, thick fur, and rough spots on their protruding buttocks. Baboon
species also show sexual dimorphism, usually in size but sometimes also in color or canine
development [6]. Baboons can live for more than 40 years: individuals in captivity are known to
reach 45 years, whereas lifespan in the wild is about 30 years. They live in open savannah,
woodland and hills across Africa, and they occasionally eat insects or fish. There are five species
in Papio: P. ursinus (chacma baboon), P. papio (Guinea baboon), P. hamadryas (hamadryas baboon),
P. anubis (olive baboon) and P. cynocephalus (yellow baboon), which are predominantly found in
southern Africa, western Africa, southwestern Arabia, north-central Africa and eastern Africa,
respectively [7] (Figure 1.1).
Figure 1.1 Geographical distribution of baboons. Distribution based on the map in Kingdon [8]
(modified from a figure in a previous study [9]).
Baboons live in hierarchical troops of 5 to more than 200 individuals, considerably larger than
most chimpanzee groups. Troop size varies considerably among baboon species and across the year.
The social structure of hamadryas baboons is remarkably different from that of the other baboon
species, which are collectively termed savanna baboons. For example, hamadryas baboons typically
form very large troops composed of many small harems, whereas other baboons often have a more
promiscuous structure in which the hierarchy is determined by the matriline. In hamadryas harems,
the males jealously guard their females, and some males also raid other harems for females, which
causes fights between males. Visual threats such as quick flashing of the eyelids and showing the
teeth are commonly used during fights, and in some species infants are taken as hostages.
Baboons can determine the dominance relations between individuals from vocal exchanges. In savanna
baboons, any male may mate with any female, and the mating order among males depends partly on
their rank in the hierarchy. Individuals of higher rank benefit in both health and reproduction.
High-ranking males have higher levels of testosterone and lower levels of glucocorticoid than other
males, and the top-ranking males have higher levels of both testosterone and glucocorticoid than
the second-ranking males [10]. Females also prefer friendly males as mates. A male baboon can
therefore also gain the opportunity to mate with a female by exhibiting friendly behaviors such as
grooming the female or supplying her with food. Gestation in baboons lasts six months and usually
produces a single infant. The mother is the primary caretaker, but other females also share the
duties of caring for all the offspring. Young baboons are weaned after about one year. Males have
to leave their group before they reach sexual maturity, at about five or six years of age, whereas
females stay in the same group.
Studies of complex human diseases are difficult because it is very challenging to control human
pedigree structure and environmental conditions. Limited access to tissues also greatly hampers
such studies. To overcome these limitations, nonhuman primates are often used as valuable
resources. Baboons share many genetic, biochemical, physiological and anatomical characteristics
with humans [11] and are naturally infected with numerous human pathogens. They therefore have the
potential to be used as animal models for research on physiology and pathophysiology [12],
including cardiovascular disease, obesity, hypertension, age-related skeletal disease, epilepsy,
infectious disease and intrauterine research [13, 14] (Table 1.1). Transplantation and drug therapy
studies have also been conducted in baboons [15-17].
Table 1.1 Baboons as animal models for studies of human diseases and vaccines.
Pathogen or disease | Experimental objective | Reference
Viral diseases
Ebola | Studies in pathogenesis | [18]
Encephalo-myocarditis virus | Vaccine | [19]
HIV-1 | HIV-1 vaccine candidates | [20]
Hepatitis A virus (HAV), cytomegalovirus (CMV), Epstein-Barr virus (EBV) | Infections transmissible between baboons and human beings | [21]
Bacterial infections
Bacillus anthracis | Infections in nonhuman primate model | [22]
Francisella tularensis | Outer membrane live, attenuated LPS-17 | [23]
Angioinvasive aspergillus | Baboon-to-human liver transplantation infection | [15]
Parasite infections
Leishmania major | Infection model | [24]
Schistosoma mansoni | Live, irradiated cercariae vaccine | [25, 26]
Zoonotic gastrointestinal | Baboon as zoonotic reservoirs | [27]
At the same time, some characteristics of baboons differ clearly from those of humans, including
language ability and sensory capabilities [28]. On the basis of comparative studies of articulatory
anatomy, many researchers [29] have claimed that nonhuman primates are incapable of producing
systems of vowel-like sounds because of their high larynx position, but recent discoveries have
begun to challenge this view for three reasons. First, some animal species have a low larynx but no
documented ability to produce systems of vowel-like sounds [30]. Second, human infants, with their
larynx still high, produce the same range of vowel qualities as adults [31]. Third, modeling
suggests that the production of vocalic sounds depends not on the position of the larynx, but
rather on the control of the tongue muscles and lips [32].
1.2 Mandrill and its biology
The mandrill (here referring specifically to Mandrillus sphinx) is a primate of the Old World
monkey (Cercopithecidae) family. Along with the drill, the mandrill was once classified as a baboon
(Papio) because the two are superficially similar [33]. Mandrills live in tropical rainforests and
in rocky, riparian, flooded or gallery forests, as well as in cultivated areas and stream beds, and
are usually found in southern Cameroon, Gabon, Equatorial Guinea and Congo [34]. The distribution
is bounded by the Sanaga River to the north and by the Ogooué and White Rivers to the east. There
are remarkable genetic differences between the populations north and south of the Ogooué River. As
a result, these two populations have been classified as different subspecies.
The mandrill is omnivorous, with a diet ranging from fruits to insects. The diet is generally
composed of fruits (50.7%), seeds (26.0%), leaves (8.2%), pith (6.8%), flowers (2.7%) and animal
foods (4.1%), with other foods making up the remainder (1.4%). Mandrills consume more than a
hundred plant species, with a preference for fruits, and they also eat mushrooms and soil. Besides
plants, they eat animals, mostly invertebrates such as ants, beetles, termites, crickets, snails
and scorpions. The diet also includes eggs and small vertebrates such as birds, frogs, rats and
shrews, as well as juveniles of larger vertebrates such as bay duikers and other antelope [35]. The
life expectancy of mandrills in captivity can be up to 31 years, shorter than that of baboons.
Mandrills also exhibit strong sexual dimorphism. The species has undergone long and strong sexual
selection; as a result, male mandrills are larger and more brightly colored than females. The
mandrill's face is generally hairless, with an elongated muzzle. Mandrills also have distinctive
characters, such as protruding blue ridges on the sides, and red nostrils and lips. The areas
around the genitals are multi-colored. Dominant male mandrills in particular have more pronounced
coloration. The mandrill is the largest and heaviest monkey in the world. Typically, male mandrills
weigh 19–37 kg, with an average of 32.3 kg, while females weigh roughly half as much, at 10–15 kg
with an average of 12.4 kg. Males are typically 75–95 cm long and females 55–66 cm. The shoulder
height ranges from 45–50 cm in females to 55–65 cm in males. These sizes and weights surpass even
those of the largest baboons. Furthermore, the mandrill is more ape-like in body structure than the
baboons, with a muscular and compact build, shorter, thicker limbs that are longer in the front,
and almost no tail.
Mandrills are mostly terrestrial, but they are more arboreal than baboons [36]. Mandrills live in
large, stable groups of up to hundreds of individuals [34]. The largest horde that has been
verifiably observed contained more than 1,300 mandrills, which is the largest aggregation of
nonhuman primates ever documented. Mandrills are diurnal and sleep in trees at night. They use
tools and have been observed using sticks in captivity.
Mandrills breed every two years, and the mating season extends from June to October. Male
mandrills sometimes fight for mating rights. Testicular volume increases when a male gains
dominance (becomes the alpha male) and decreases if dominance is lost. Similar changes are also
observed in the color of the sexual skin on the face and genitalia, which becomes red in alpha male
mandrills. Physiologically, the secretion of the sternal cutaneous gland also increases accordingly
[37, 38]. A dominance hierarchy also exists among females [38].
The way these monkeys select their mates can be attributed to smell rather than color, and it is
mainly influenced by a group of genes called the major histocompatibility complex (MHC) [39]. The
MHC is a cluster of genes that determines a mandrill's individual scent; it helps to build proteins
involved in the body's immune system and affects body odour by interacting with bacteria on the
skin. A series of experiments found that particular odour types were consistent with particular MHC
gene patterns, suggesting that mandrills use odour as an indication of genetic compatibility [40].
The mandrill is one of two species in Papionini that possess a sternal gland, a triangular area in
the middle of the chest that is the structural basis for scent [41]. The mandrill is also widely
used in immune system research and is a natural host for SIV strains (Table 1.2).
Table 1.2 Summary of mandrill as models in human diseases and vaccines studies/tests.
Pathogen or disease | Objectives | Reference
Viral diseases
Simian Immunodeficiency virus (SIV) | Studies in pathogenesis | [42]
Simian T-lymphotropic Virus Type 1 (STLV-1) | Transmission modes | [43, 44]
Bacterial infections
Paratuberculosis | Infections in mandrill | [45]
Helicobacter heilmannii | Bacterial pathogen model | [46]
Parasite infections
Amebic, ciliate, nematodes | Parasite prevalence in mandrill | [47]
Loa loa | Irradiated vaccine | [48]
1.3 Genomic studies on baboon and mandrill
1.3.1 Genomics of primates
Since the completion of the human genome assembly [49], and with the reduced cost of genome
sequencing and the greatly increased throughput of new sequencers, more and more genomic data for
primates are becoming available, including both catarrhines, such as chimpanzee [50], and New World
monkeys, such as marmoset (Callithrix jacchus). The genomics of non-human primates has attracted
wide interest for two reasons: their use as models for the analysis of human disease, and the study
of genetic conservation and divergence over evolutionary history through comparative genomics [51].
Generally, the species selected for genome sequencing meet one of two criteria: 1) an important
evolutionary position within the phylogeny (e.g. chimpanzee, gibbon and orangutan); 2) biomedical
relevance to humans. For example, macaque and baboon, although the genome sequencing of the latter
has not yet been completed, were selected because they are often used to study the genetic basis of
numerous human diseases [13], and squirrel monkey is used for studies of neurobiology and
infectious disease. The size of primate genomes varies little, ranging from 2.7 Gb in bonobo (Pan
paniscus) [52] to 3.4 Gb in tarsier (Tarsius syrichta) [53]. Repetitive regions occupy about 50% of
human, ape and monkey genomes, but the number of species-specific insertions varies substantially,
from ~5,000 in human to ~2,300 in chimpanzee and only 250 in orangutan [54]. Genomic studies have
also been conducted on extinct hominins, the Neanderthals (Green et al., 2010) and the Denisovans
[55]. Together, these genomic data resources enable comparisons between human and other primates,
or between primates and other mammals.
The sequencing and assembly of non-human primate genomes went through different stages in pace
with the development of sequencing technologies. The genomes of chimpanzee (Pan
troglodytes) and rhesus macaque (Macaca mulatta) were sequenced by shotgun sequencing that
relied exclusively on Sanger methods, at considerable cost and effort
[56, 57]. Next generation sequencing (NGS) then came into wide use and accelerated progress in
genomics; many primates were sequenced and assembled (Table 1.3), which improved our
understanding of genome content, evolution and diversity [51]. Since 2013, the
development and application of single-molecule, real-time (SMRT) sequencing technology has
considerably improved assemblies of human and other genomes. Compared with the NGS
assembly version gorGor3, the gorilla (Gorilla gorilla) assembly built from SMRT sequences shows a
significant decrease in assembly fragmentation: the contig N50 increased >819-fold (from
11.8 kb to 9.6 Mb, Table 1.3), and 94% of gorGor3 gaps were closed [58, 59].
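The contig and scaffold N50 values quoted here and in Table 1.3 follow the usual definition: the length L such that sequences of length L or longer contain at least half of all assembled bases. A minimal sketch of that calculation is given below; the toy contig lengths are hypothetical and are not taken from any assembly discussed in this thesis.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half of the
        # total assembled bases is the N50.
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths given")


# Hypothetical toy assembly: total 300 bp, half is 150 bp,
# which is reached after the contigs of 80 bp and 70 bp.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

Because N50 depends only on the length distribution, it measures contiguity rather than correctness, which is why it is reported alongside gap statistics.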
Understanding the origin of the human genome is one of the most important reasons for sequencing
primates closely related to human. Inter- and intra-species comparisons have provided insights into
gene exchange among the early human and chimpanzee ancestors, and allowed the identification of
genes or regions positively selected during the evolution of human or other primates. These genes
often indicate genetic or phenotypic changes that are critical for the adaptation of human or
non-human primates, as well as hominins. It has been clearly shown that genes involved in the immune
system and resistance to pathogens, as well as those involved in reproductive biology, were
commonly positively selected in many non-human primates [58, 60, 61]. This might be a result of
long-term exposure to various pathogens in the wild, while genes related to gametogenesis are
beneficial in competition within species. On the other hand, within species, signals of
positive selection have been found in genes related to a wide range of phenotypes. Positively
selected genes shared among human, chimpanzee and gorilla are related to neural and brain
development; genes related to glycolipid metabolism and hearing are positively selected in
orangutans [54]; in marmosets and other callitrichine primates, genes involved in phyletic reduction
of body size were positively selected [62]. The common ancestor of dated back to 12-5 million
years ago [63]. The reciprocal gene flow has lasted for ~3 million years between those lineages [63],
suggesting the divergence process is a long period with extensive gene flow, instead of a short event.
Similar evidence of gene exchange was also detected in Bornean and Sumatran orang-utan
genomes [54, 63].
Long-read sequencing indeed improved the completeness and accuracy of assemblies, but the
scaffolds remained at the Mb level. Recently, a technology named Hi-C has helped assemble contigs
into chromosome-scale scaffolds [64-66]. Hi-C and related technologies were developed to detect the
three-dimensional folding of chromosomes within the nucleus [67]; this information was then used
to assist assembly. The results indicated that combining shotgun fragments and mate-pair
sequences with Hi-C data could generate chromosome-scale assemblies with 98% accuracy in
assigning scaffolds to chromosome groups for human [64]. As far as we know, no Hi-C-assisted
assembly has been reported for non-human primates.
Table 1.3 Published primate genome sequences (modified based on [51]).
Common name                      Species name           Bases in contigs  Contig N50  Scaffold N50  Reference
Chimpanzee                       Pan troglodytes        2.7 Gb            15.7 kb     8.6 Mb        [56]
Chimpanzee (updated)             P. troglodytes         2.9 Gb            50.7 kb     8.9 Mb        [68]
Bonobo                           Pan paniscus           2.7 Gb            67 kb       9.6 Mb        [69]
Gorilla                          Gorilla gorilla        2.7 Gb            11.8 kb     914 kb        [58]
Gorilla (updated)                Gorilla gorilla        2.8 Gb            9.6 Mb      23.1 Mb       [59]
Orang-utan                       Pongo abelii           3.1 Gb            15.5 kb     739 kb        [54]
Indian rhesus macaque            Macaca mulatta         2.9 Gb            25.7 kb     24.3 Mb       [57]
Indian rhesus macaque (updated)  M. mulatta             3.1 Gb            107.2 kb    4.2 Mb        [70] https://www.ncbi.nlm.nih.gov/assembly/GCA_000772875.3
Chinese rhesus macaque           M. mulatta             2.8 Gb            11.9 kb     891 kb        [71]
Vietnamese cynomolgus macaque    M. fascicularis        2.9 Gb            12.5 kb     652 kb        [71]
Aye-aye                          D. madagascarensis     3.0 Gb            NA          13.6 kb       [72]
Vervet                           C. aethiops            2.8 Gb            90.4 kb     81.8 Mb       [73]
Olive baboon                     P. anubis              2.9 Gb            149.8 kb    585.7 kb      https://www.ncbi.nlm.nih.gov/assembly/GCF_000264685.3/
Gibbon                           Nomascus leucogenys    2.8 Gb            35.1 kb     22.7 Mb       [74]
Marmoset                         Callithrix jacchus     2.3 Gb            29 kb       6.7 Mb        [75]
Mouse lemur                      Microcebus murinus     2.4 Gb            210.7 kb    108.2 Mb      https://www.ncbi.nlm.nih.gov/assembly/GCF_000165445.2
Pig-tailed macaque               Macaca nemestrina      2.8 Gb            106.9 kb    15.2 Mb       https://www.ncbi.nlm.nih.gov/assembly/GCF_000956065.1/#/st
Sifaka                           Propithecus coquereli  2.1 Gb            28.1 kb     5.6 Mb        https://www.ncbi.nlm.nih.gov/assembly/GCF_000956105.1/#/st
Sooty mangabey                   Cercocebus atys        2.8 Gb            112.9 kb    12.8 Mb       https://www.ncbi.nlm.nih.gov/assembly/GCF_000955945.1/
Squirrel monkey                  Saimiri boliviensis    2.5 Gb            38.8 kb     18.7 Mb       https://www.ncbi.nlm.nih.gov/assembly/GCF_000235385.1/#/def
Bushbaby                         Otolemur garnettii     2.4 Gb            27.1 kb     13.9 Mb       https://www.ncbi.nlm.nih.gov/assembly/GCF_000181295.1/
Mouse lemur                      Microcebus murinus     2.4 Gb            182.9 kb    3.7 Mb        https://www.ncbi.nlm.nih.gov/assembly/GCF_000165445.1/
Tarsier                          Tarsius syrichta       3.4 Gb            38.2 kb     401 Mb        [76]
1.3.2 Genomics of baboon
Baboons (Papio) shared a common ancestor with humans ~30 million years ago and are genetically
closer to human than New World monkeys are, but less closely related than the African
apes. Although a genome assembly of baboon was lacking before our study, a comparison between
a short region (~1.5 Mb) of the baboon and human genomes showed very limited substitutions, most of
which are relatively enriched in exons [77].
Baboon has been commonly used as a primate model for genetic studies of complex traits
and human diseases because of the high similarity between human and baboon in transcriptome,
physiology and genetics [78]. The Southwest National Primate Research Center (SNPRC) at the Texas
Biomedical Research Institute maintains ~2,000 baboons for biomedical research. The pedigree
contains over 16,000 individuals across seven generations, with 384 founders of P. h. anubis, P. h.
cynocephalus and their hybrid progeny. Tissues and blood clots from over 8,000 individuals have
been stored, and DNA, serum and buffy coats from ~4,000 members have been banked [13].
Among these pedigreed baboons, more than 2,000 individuals have been
genotyped using microsatellite markers, followed by the construction of a whole-genome linkage
map, which contains 294 ordered loci with an average interval between markers of 7.2 cM [79, 80].
Alongside the genotypes, several hundred quantitative traits have been
phenotyped and used to localize genomic regions containing genes that control
these traits. These data have been used in studies of atherosclerosis, hypertension,
obesity, the craniofacial complex, etc. Taking advantage of this genetic map, scientists have scanned
the genome for quantitative trait loci (QTL) associated with over 200 traits related to
cardiovascular disease. Several important QTL were found using this approach. For example,
Kammerer et al. (2002) found a QTL on chromosome 6 influencing the response of low-density
lipoprotein cholesterol to dietary cholesterol; the following year, Rainwater et al. (2003) found
several lipid-/lipoprotein-related QTL, including three for low-density lipoprotein cholesterol
size fractions located on chromosomes 5, 10q and 17, respectively. A region on chromosome 17 was
also found to be associated with cholecystokinin. This region is known to harbor the genes for a
glucose transporter, glucagon-like peptide 2 receptor, and sterol regulatory element binding
transcription factor 1, which are related to adiposity [81].
Transcriptomic study is also an efficient approach to advance our understanding of many human
diseases and traits. Northcott et al. (2012) developed a cross-species array (rat and baboon)
targeting 328 genes possibly related to blood pressure. Among these genes, 74 were
expressed in both rat and baboon kidney, while 41 were specifically expressed in rat and
34 were specific to baboon. This study provided evidence of similarities and differences in gene
expression profiles between primate and rodent, and therefore highlighted the importance of an
appropriate primate model in studies of human complex diseases as well as other traits, such as
neurology and social behaviors.
Several investigators have also combined the transcriptome and linkage map in their studies and
found that, for example, the mRNA level of adiponectin, which is correlated to body weight, serum
triglycerides, adipocyte volume and glucose levels, is significantly heritable and the heritability is
associated with a region on chromosome 4p [82]. Similarly, the abundance of resistin mRNA is also
heritable and the QTL is located on chromosome 19p [83].
1.3.3 Genomics of mandrill
As a natural host of SIV, the mandrill (M. sphinx) is on a list of species whose genomes are to be
sequenced. In some cases, mandrills are able to tolerate SIV infection for long periods of time, and
their responses to viral infection are sometimes quite different from those of other hosts, such as
mangabey and African green monkey. Additionally, mandrills are adapted to two different SIV
strains: SIVmnd1, which possibly originated from a virus in Cercopithecus lhoesti, and SIVmnd2,
from a virus in M. leucophaeus. Furthermore, a correlation between the low rate of vertical
transmission and the expression of CCR5 has been found in mandrill [84]. Heterozygous individuals
have greater reproductive success regardless of sex, producing more offspring. However, this
advantage has only been observed in alpha males and not in beta males. This correlation
between heterozygosity and reproductive success and tenure has been reasonably explained by
multi-locus effects [85].
1.3.4 Comparative genomics in primates
With genomic data across different primate species, genome regions or elements common or
specific to several species (e.g. human) can be identified and analyzed in detail by systematic
comparisons between these genomes. Sequence alignment and comparison show a strong correlation
between pairwise differences and time of divergence inferred from other information, such
as fossils. The divergence between human and chimpanzee sequences is 1.1-1.4% [58]. The
difference between human and rhesus macaque is larger, ~6.5% [61], which is consistent
with the longer time of divergence between these two species (28-35 million years ago). The
alignments also show indels among species. Indels are preferentially located in intronic and
intergenic regions, which are more tolerant of small indels than protein-coding regions.
Besides nucleotide substitutions and small indels, insertions of fragments and larger segmental
rearrangements were also detected in primate genomes. The most extensively investigated process
is the insertion of retrotransposons, such as Alu, which is ongoing in primate genomes and has
played a very important role in shaping genome structure. For example, Alu insertions are a major
driver of genome change [86]. Retroposition has broader effects on genome evolution because of its
potential to induce segmental duplication or deletion [87].
Segmental duplication is vital for genome evolution. Segmental duplications collectively make up
~5% of the human and chimpanzee genomes, and ~3.8% of the orang-utan genome [54, 88]. Segmental
duplications are not randomly located on the chromosomes: they are enriched on human
chromosome 22 (11.9%) and the non-recombining chromosome Y (50.4%) but depleted on
chromosome 3 (1.7%) [89]. Segmental duplications in primate
genomes are categorized into three groups according to their locations: pericentromeric,
subtelomeric and interstitial duplications. Duplications in these three classes differ in type and
frequency [90]. Pericentromeric duplicates make up about 47.6 Mb, a third of all the
duplicates in the human genome [89]. The ratio of inter- to intra-chromosomal duplication in this
class is about 6:1 [89]. Furthermore, more than 30% of the pericentromeric sequences are occupied by
duplicons from other chromosomes. A two-step model has been proposed to explain the process of
segmental duplication in pericentromeric regions [91]. Subtelomeric regions are similar in that
they also contain many duplicates from other chromosomes, although the total amount of duplicated
sequence is much smaller (2.6 Mb). Inter-chromosomal segmental duplicates are present in 30 of 42
subtelomeric regions [92]. Subtelomeric segmental duplicates are typically 50 to 100 kb long and,
in contrast to the origin of pericentromeric duplicates, their origin involves exchanges
between subtelomeric regions, and a larger part of the relative orientation between non-homologous
chromosomes has been retained [89, 93]. Interstitial segmental duplicates are located in
euchromatin. In contrast to the predominance of tandem duplicate clusters found in most genomes,
primate genomes contain a large number of interspersed interstitial duplicates. Although
interspersed duplicates are located along the euchromatin, they are not randomly distributed
either [89]. Comparative analysis of genomes across primates shows that many interspersed
intra-chromosomal duplicates can be dated to the evolution of the great apes. Their origin is
typically associated with chromosome rearrangements [94].
1.4 Objectives
Given the background described above, whole genome sequences will facilitate current
biological research on baboon and mandrill. I therefore proposed to use second generation
sequencing to construct reference genomes for both baboon and mandrill, and to conduct repeat and
gene annotation, gene family clustering, evolutionary analyses and comparative
genomic analyses for both species. The study objectives are to:
i) Provide genomic resources for future studies on baboon and mandrill;
ii) Describe detailed genomic features of baboon and mandrill, including repeat content, protein
coding genes, gene families, etc.;
iii) Compare the genomic/genetic features of baboon and mandrill with those of human and other
primates to provide further insights for primate integration;
iv) Identify possible genetic mechanisms for Old World monkey adaptation;
v) Provide additional insights for human diseases/health.
2. Materials and Methods
2.1 Sampling and sample preparation
One baboon (olive baboon, Papio anubis) and one mandrill (Mandrillus sphinx) were selected for
sampling (Figure 2.1). With the assistance of a zoologist, veterinarians drew 5 mL of whole blood
from the left jugular vein of the twenty-year-old male baboon and of the eighteen-year-old male
mandrill. The blood was collected into a plastic collection
tube with 4% (w/v) sodium citrate. The blood samples were then snap frozen in liquid nitrogen and
stored at -80 °C until further processing. Genomic DNA was extracted from the whole blood
samples with the AXYGEN Blood and Tissue Extraction Kit (Corning, USA) according to the
manufacturer’s instructions. The extracted DNA was subjected to electrophoresis in a 2% agarose gel
and stained with ethidium bromide to assess the overall quality. DNA concentration was determined
with the Quant-iT™ PicoGreen® dsDNA Reagent and Kits (Thermo Fisher Scientific, USA) according
to the manufacturer’s instructions.
Figure 2.1 Photos of the samples selected for sequencing. The baboon (a) and the mandrill (b)
are both from Beijing Zoo.
2.2 Genome sequencing
2.2.1 Library construction and sequencing
DNA of baboon and mandrill was used for library construction, following protocols
described in previous publications [95]. A total of 12 libraries were constructed for each of the
two species. Sequencing was then carried out on the Illumina HiSeq 2000 sequencer. For each species,
6 libraries were designed in paired-end configuration, comprising 2 libraries with reads of 100 bp in
length and a mean target insert size of 250 bp and 2 libraries of 100 bp reads with insert sizes of 500
bp and 800 bp, respectively. The other 6 libraries were designed and processed in mate-pair
configuration, all with 100 bp reads: 1 library with insert size 2 kbp, 4 kbp and 5 kbp and 1
library each of 10 kbp and 20 kbp insert sizes.
2.2.2 Data filtering
De novo assembly requires high-quality input, so data filtering was carried out to obtain
high quality reads for assembly. During sample preparation, adapters were ligated and amplification
was conducted. Adapter-contaminated reads, duplicated reads introduced during amplification
and reads with high sequencing error rates (low sequencing quality) therefore need to be filtered,
as in a previous study [96]. Here, raw reads from the sequencer were filtered using SOAPnuke (v.
1.5.6; https://github.com/BGI-flexlab/SOAPnuke). The filtering criteria were as follows: i) reads
in which >10 percent of bases were N (uncertain/ambiguous bases) were filtered; ii) reads with >40
percent of low quality bases (quality score <=10) were filtered; iii) reads contaminated by adaptor
(adaptor matched 50%, allowing one base mismatch) or produced by PCR duplication (identical reads at
both ends) were filtered.
parameter
SOAPnuke filter -f adapter1.list -r adapter2.list -1 reads1.fq.gz -2 reads2.fq.gz -l 10 -q 0.4 -n 0.1 -M 1 -o ./
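Criteria i) and ii) above can be illustrated with a small sketch. This is not SOAPnuke's implementation; it only restates the two thresholds from the text (more than 10% N bases, more than 40% of bases with quality score <= 10). The function names are hypothetical, quality scores are assumed to be already decoded to integers, and dropping a pair when either mate fails is an assumption.

```python
def passes_filter(seq, quals, max_n_frac=0.10, low_q=10, max_low_q_frac=0.40):
    """Return True if a single read satisfies criteria i) and ii)."""
    if not seq or len(seq) != len(quals):
        return False
    # Criterion i): fraction of ambiguous (N) bases.
    n_frac = seq.upper().count("N") / len(seq)
    # Criterion ii): fraction of bases with quality score <= low_q.
    low_frac = sum(1 for q in quals if q <= low_q) / len(quals)
    return n_frac <= max_n_frac and low_frac <= max_low_q_frac


def keep_pair(read1, read2):
    """Keep a read pair only if both mates pass (assumed behaviour)."""
    return passes_filter(*read1) and passes_filter(*read2)


good = ("ACGTACGTAC", [30] * 10)
too_many_n = ("ACNTACNTAC", [30] * 10)            # 20% N
low_quality = ("ACGTACGTAC", [5] * 5 + [30] * 5)  # 50% low-quality bases
print(keep_pair(good, good), keep_pair(good, too_many_n), keep_pair(good, low_quality))
```

Adapter matching and duplicate removal (criterion iii) need the adapter lists and a comparison across the whole read set, so they are left out of this sketch.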
2.2.3 Overlapping library data merging
Overlapping libraries are designed so that the ends of the paired reads overlap with each
other, and the fragments are thus sequenced through. For the overlapping libraries, the insert size
(Si) of the library should be shorter than the total read length (twice the length of one read end,
Lr), and the expected overlap length can be calculated as (2 Lr - Si). The overlap information can
be used to merge the paired reads into one longer sequence. Merging reads benefits downstream
analysis by providing longer sequences with fewer sequencing errors. Here, merging of the
overlapping reads was performed using FLASH [97] v1.2.10 with default parameters.
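The overlap arithmetic and the idea of merging can be sketched as follows, taking Lr as the length of one read end and Si as the insert size. The merge function is a deliberately simplified exact-match version written for illustration; it is not how FLASH scores overlaps (FLASH tolerates mismatches), and the example fragment is hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def expected_overlap(read_length, insert_size):
    """Expected overlap 2*Lr - Si, or 0 when the two ends do not overlap."""
    return max(2 * read_length - insert_size, 0)


def merge_pair(read1, read2, min_overlap=10):
    """Merge a read pair by the longest exact suffix/prefix overlap, or return None."""
    mate = reverse_complement(read2)  # bring read 2 onto the strand of read 1
    for olap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-olap:] == mate[:olap]:
            return read1 + mate[olap:]
    return None


# Hypothetical 30 bp fragment sequenced with 20 bp reads from both ends.
fragment = "ACGTTGCAAGGCTTACCGATGCATTGACCA"
r1 = fragment[:20]
r2 = reverse_complement(fragment[-20:])
print(expected_overlap(20, 30), merge_pair(r1, r2) == fragment)
```

In this toy case the merged sequence has the length of the insert, which is the property that makes merged reads useful as longer single sequences.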
2.2.4 K-mer analysis
To estimate genome features including genome size, repeat content and heterozygosity,
K-mer analysis was first performed. A K-mer is a sub-sequence of a read with length k.
Formula 2.1 was used for estimating the genome size (G). In this formula, k_num is the total number
of K-mers, k_depth is the expected depth of K-mers, b_num is the total number of bases, and b_depth
is the expected depth of bases. According to Formula 2.2, k_depth follows a Poisson distribution.
Thus, the peak of the K-mer depth distribution was used as the expected K-mer depth (λ).
In this analysis, k was 17, with command: kmerfreq -k 17 -m 1 -o 1 -l fq.list [98].
$$G = \frac{k_{num}}{k_{depth}} = \frac{b_{num}}{b_{depth}} \tag{2.1}$$

$$P(k_{depth} = x) = \frac{\lambda^{x} e^{-\lambda}}{x!} \tag{2.2}$$
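As an illustration of Formula 2.1, the sketch below estimates genome size from a K-mer depth histogram (depth mapped to the number of distinct K-mers observed at that depth). The histogram values are hypothetical, and the `min_depth` cutoff used to ignore low-depth K-mers, which mostly arise from sequencing errors, is an assumption rather than a setting taken from this study.

```python
def estimate_genome_size(histogram, min_depth=2):
    """Estimate G = k_num / k_depth from a K-mer depth histogram.

    histogram maps depth -> number of distinct K-mers seen at that depth.
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth and n > 0}
    if not usable:
        raise ValueError("histogram has no usable depths")
    # Total number of K-mers (k_num), counting each K-mer once per occurrence.
    k_num = sum(depth * count for depth, count in usable.items())
    # Peak of the depth distribution, used as the expected depth (lambda).
    k_depth = max(usable, key=usable.get)
    return k_num / k_depth


# Hypothetical histogram: an error peak at depth 1 and a main peak at depth 20.
toy_histogram = {1: 1000, 19: 50, 20: 100, 21: 50}
print(estimate_genome_size(toy_histogram))
```

Here k_num is 19*50 + 20*100 + 21*50 = 4,000 and the peak depth is 20, so the estimate is 200 positions for this toy input.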
2.3 Genome assembly and annotation
2.3.1 Genome assembly
The baboon and mandrill genomes were assembled from the filtered data with the short-read assembler
SOAPdenovo2 [98]. SOAPdenovo was developed for short read assembly based on a de
Bruijn graph algorithm and has been widely applied in genome assembly.
Four major steps were conducted to complete the preliminary assembly:
i) Building the de Bruijn graph
All reads from the small insert size (<1000 bp) libraries were
used to build the de Bruijn graph (DBG). The initial DBG was composed of 57-mers as
nodes, and the edges connecting the nodes were derived from read paths. To
simplify the DBG, erroneous connections were removed and repeats were resolved in the
following four ways.
a) Clipping the short tips
Tips shorter than 114 bp (twice the length of a 57-mer) in the DBG were
clipped.
b) Removing low-coverage links
c) Solving tiny repeats by read path
d) Merging the bubbles. The bubbles were generally caused by repeats or heterozygosity.
ii) Contig construction
On the simplified DBG, the connections were broken at repeat boundaries and
the unambiguous sequence fragments were output as contigs.
iii) Scaffold construction
The reads were realigned onto the contigs and the paired-end information was used to join the
unique contigs into scaffolds.
iv) Gap closure
The intra-scaffold gaps were filled using the mapped reads. Most of the remaining gaps probably
occur in repetitive regions. Paired-end reads with one end mapped on a unique contig and
the other end located in the gap region were extracted for local assembly, so that the
unmapped ends could be used to fill in the gaps within the scaffolds. Gap filling was
performed by GapCloser [98].
Genome assembly with command:
SOAPdenovo all -s config -K 49
GapCloser_v1.10_gz -a scaff.fa -b lib.cfg -o baboon_gapClosed.fill -t 16
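Steps i) and ii) above can be illustrated with a toy de Bruijn graph. The sketch below is a didactic simplification and not SOAPdenovo2's implementation: it uses a hypothetical k of 3 instead of the K-mer sizes used for the real assemblies, ignores reverse complements and sequencing errors, and has no tip clipping or bubble merging.

```python
from collections import defaultdict


def build_dbg(reads, k):
    """Map each k-mer (node) to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def unambiguous_path(graph, start):
    """Extend a contig from `start` while the path has no branching."""
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        # Stop at branch points (repeat boundaries) and at cycles.
        if indegree[nxt] != 1 or nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Two hypothetical overlapping reads drawn from the sequence ATGGCGTAC.
toy_reads = ["ATGGCG", "GCGTAC"]
print(unambiguous_path(build_dbg(toy_reads, 3), "ATG"))
```

Stopping at any node with more than one incoming or outgoing edge mirrors the idea of breaking the graph at repeat boundaries and emitting only unambiguous fragments as contigs.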
2.3.2 Genome annotation
Repeat sequences can be classified into tandem repeats, including microsatellite and
minisatellite sequences, and interspersed repeats, including DNA transposons and retrotransposons
(LTRs, LINEs and SINEs). Repeat elements were first annotated using both homology searching and
de novo prediction, and similarly, genes were annotated by combining homology searching and
prediction based on gene structure (Figure 2.2).
[Figure 2.2: flowchart. Box labels: Genome sequence; Repeat annotation; Gene annotation; ncRNA annotation; homolog; De novo; cDNA/EST; RNA-seq data; GLEAN set; Gene set; Function annotation; UniProt; KEGG; InterPro; miRNA/snRNA; rRNA; tRNA; Statistics results.]
Figure 2.2. Overall process of genome annotation. The genome annotation includes three major
parts: repeat annotation, gene annotation and ncRNA annotation.
2.3.2.1 Repeat annotation
To predict transposable elements (TEs) in the genome, RepeatMasker [99] (version 4.0.5) and
RepeatProteinMask were used to scan the whole genome against the RepBase library [100]
(version 20.04) for known repeats. RepeatMasker was then used again to identify de novo repeats
based on a custom TE library constructed by combining the results of RepeatModeler [101] (version
1.0.8) and LTR_FINDER [102] (version 1.0.6). Tandem repeats were also predicted using Tandem
Repeats Finder [103] (version 4.0.7). Finally, all the repeat prediction results were combined
into the final repeat annotation.
LTR parameter
LTR_FINDER.x86_64-1.0.5/ltr_finder -w 2 -s tRNAdb/dm3-tRNAs.fa
RepeatMasker parameter
RepeatMasker -nolow -no_is -norna -parallel 1 -lib RepBase16.10/RepeatMaskerLib.embl.lib
ProteinMask parameter
RepeatProteinMask -noLowSimple -pvalue 0.0001
2.3.2.2 RNA annotation
To identify transfer ribonucleic acids (tRNAs), tRNAscan [104] was used. For ribosomal
ribonucleic acid (rRNA) identification, 757,441 rRNAs from the public domain were searched
against the genome with blastall (parameters -p blastn -e 1e-5). To identify other non-coding
RNA (ncRNA) genes, the Rfam database [105] was searched against the genome with the Rfam
program rfam_scan.pl (ftp://ftp.hgc.jp/pub/mirror/sanger/Rfam/tools/rfam_scan.pl).
rRNA parameter
blastall -p blastn -e 1e-5 -i Human_rRNA.fa
ncRNA parameter
rfam_scan.pl -d Rfam.fasta.
2.3.2.3 Gene annotation
Genes were predicted using three categories of methods: homology-based, evidence-based
and ab initio prediction. For homology-based annotation, protein sequences of Macaca mulatta, Pan
troglodytes, Nomascus leucogenys, Pongo abelii, Gorilla gorilla and Homo sapiens were
downloaded from the Ensembl database (Release 73) and aligned to the genome using BLAT
[106]. GeneWise [107] (version 2.2.0) was then used for more precise alignment and gene
structure prediction. For evidence-based prediction, EST sequences were downloaded from NCBI
and aligned to the genome using PASA [108] for spliced alignment and assembly to detect
gene structures. For ab initio prediction, AUGUSTUS [109] (version 3.1) was used to predict
gene models in the repeat-masked genome. Finally, these gene prediction results
were combined using GLEAN [110] to obtain the final non-redundant gene set.
Homology parameter
blat -q=prot -t=dnax
genewise -sum -genesf
ab initio prediction parameter
denovo-predict.pl --augustus human
GLEAN parameter
run.Glean.pl --YAML parameter.yaml --genome **.fa --maxintron 100000 --cds 150 --homolog
**.gff --EST **.pasa.gff
2.3.2.4 Gene function annotation
To provide possible gene function information, predicted genes were compared against
protein databases with protein function information. The Blast2GO program [111] was used to assign
gene ontology (GO) terms and enzyme commission (EC) numbers. InterProScan [112], which
searches Pfam domains [112] and several other protein signature databases, was used to predict
protein domains. InterProScan results were finally passed to
Blast2GO for further GO term assignment.
Function parameter
run_iprscan51-55.pl --cpu 100 --cuts 100 --appl ProDom --appl ProSiteProfiles --appl SMART --appl PANTHER --appl PRINTS --appl Pfam --appl PIRSF --appl ProSitePatterns **.pep
blast -b 100 -v 100 -p blastp -e 1e-5 -F F -d database (databases include KEGG, Swiss-Prot and TrEMBL)
2.3.2.5 Completeness of gene content with CEGMA and BUSCO
CEGMA [113] and BUSCO [114] were used to assess the completeness of the genome and the quality
of gene predictions. Both tools search the genome for universal, conserved single-copy genes that
should be present, and thereby estimate the completeness of the
genome and gene annotation. Completeness of the gene sets was assessed with default settings for
both tools, with vertebrate-specific reference profiles in the case of BUSCO.
BUSCO parameter
BUSCO_v1.2.py -o run_glean -m OGS -l vertebrata database -in **.pep -c 16
2.4 Evolutionary analysis
2.4.1 Gene family cluster
Protein sequences of 11 species, Callithrix jacchus, Gorilla gorilla, Homo sapiens,
Macaca mulatta, Microcebus murinus, Nomascus leucogenys, Otolemur garnettii, Pan troglodytes,
Pongo abelii, Tarsius syrichta and Mus musculus, were used together with the predicted genes of the
two species for gene family clustering. Sequences were first filtered as follows: i) coding
sequences shorter than 90 bp were removed; ii) sequences with the first or last amino acid marked as
“X”, which indicates an ambiguous amino acid caused by “N” in the gene sequence, were removed;
iii) only one transcript was retained where multiple transcripts existed. TreeFam
(http://www.treefam.org/) was then used to define gene families in Mandrillus sphinx and Papio
anubis. Firstly, all-vs.-all blastp with an e-value cut-off of 1e-7 was conducted for the protein
sequences of the 13 species, and secondly the possible
blast matches were joined together by an in-house program. Thirdly, genes with an
aligned proportion less than 0.33 were removed and bit scores were converted to percent scores.
Finally, hcluster_sg (version 0.5.0, https://pypi.python.org/pypi/hcluster) was used to cluster
genes into gene families.
2.4.2 Phylogenetic analysis
With the gene family clusters defined, the fourfold degenerate (4D) sites of 5,133 single-copy
orthologs among the 13 species were extracted for phylogenetic tree construction. The PhyML
package [115] was used to build the phylogenetic tree with the maximum-likelihood method and
GTR+gamma as the nucleotide substitution model (1,000 rapid bootstrap replicates). Based on the
phylogenetic tree, divergence times of these species were estimated using MCMCTree
(http://abacus.gene.ucl.ac.uk/software/paml.html) with default parameters. To calibrate the
divergence times in the tree, six fossil dates collected from the TimeTree database
(http://www.timetree.org/) were used, including the divergence time between Mus musculus and
human of 85-93 million years ago (MYA) [116], and the divergence times between human and chimpanzee
and between human and gorilla of 6 MYA (range 5-7) [117] and 9 MYA (range 8-10) [118], respectively.
2.4.3 Positive selection analysis
The selection pressure on protein-coding genes in mandrill and baboon was measured by
comparing nonsynonymous (dN) and synonymous (dS) substitution rates. The dN/dS ratio (ω) equals
1 if the whole coding sequence evolves neutrally. When dN/dS < 1, the gene is under purifying
selection (constraint), and when dN/dS > 1 it is likely under positive selection. I calculated the
dN/dS ratio using models in the program package PAML version 3.14. From the gene family clusters, I
obtained the single-copy genes present in every species. Subsequently, I used neutral (M1 and M7)
and selection (M2 and M8) models to identify the codons that are under positive selection. Models
M1 and M7 assume different distributions of ω values that do not exceed 1, whereas models M2 and M8
allow an additional class of sites with ω larger than 1 (ω2), thereby distinguishing positive
selection from purifying selection (ω < 1) and neutral evolution (ω = 1). The fit of the model
pairs M1-M2 and M7-M8 can be compared using a χ2 distribution with 2 degrees of freedom.
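The model comparison described above is a likelihood ratio test: twice the difference in log-likelihood between the selection model and its nested neutral model is compared with a χ2 distribution with 2 degrees of freedom. A minimal sketch is shown below; the log-likelihood values are hypothetical and would in practice be read from the PAML output. For 2 degrees of freedom the χ2 upper-tail probability has the closed form exp(-x/2), so no statistics library is needed.

```python
import math


def lrt_two_df(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models differing by 2 parameters.

    Returns (test statistic, p-value), using the closed-form chi-square
    survival function for 2 degrees of freedom, exp(-x / 2).
    """
    statistic = 2.0 * (lnl_alt - lnl_null)
    if statistic <= 0:
        # The selection model fits no better than the neutral model.
        return 0.0, 1.0
    return statistic, math.exp(-statistic / 2.0)


# Hypothetical log-likelihoods for one gene: M7 (null) and M8 (alternative).
stat, p_value = lrt_two_df(lnl_null=-1000.0, lnl_alt=-995.0)
print(stat, p_value)
```

When many genes are tested, the resulting p-values would normally be corrected for multiple testing; that step is outside this sketch.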
2.5 Comparative genomics
2.5.1 Synteny analysis of human, macaque, baboon and mandrill
For the comparative genomic analysis, syntenic blocks among primate species were first identified
to enable further identification of genomic rearrangement events such as inversions, insertions and
deletions among these species. Proteins of human, macaque, baboon and mandrill were aligned
against each other using blastp (version 2.2.26), and the blast results were filtered using the
criteria of coverage greater than 85% and identity greater than 85%. Finally, the best match of
every gene was taken as the gene pair in synteny.
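The filtering and best-match selection can be sketched as follows. The text does not specify how coverage was computed or how the best match was ranked, so this sketch assumes coverage is the aligned length divided by the query length and that the best match is the hit with the highest bit score; the tuple layout and gene identifiers are hypothetical.

```python
def best_hits(hits, min_identity=85.0, min_coverage=85.0):
    """Return {query: subject} for the best hit of each query passing both filters.

    Each hit is (query, subject, percent_identity, aligned_length,
    query_length, bit_score).
    """
    best = {}
    for query, subject, identity, aln_len, query_len, bit_score in hits:
        coverage = 100.0 * aln_len / query_len  # assumed definition of coverage
        if identity <= min_identity or coverage <= min_coverage:
            continue
        # Keep the highest-scoring hit per query (assumed ranking criterion).
        if query not in best or bit_score > best[query][1]:
            best[query] = (subject, bit_score)
    return {query: subject for query, (subject, _) in best.items()}


# Hypothetical blastp hits between baboon (g*) and mandrill (m*) proteins.
toy_hits = [
    ("g1", "m1", 90.0, 95, 100, 200.0),
    ("g1", "m2", 99.0, 99, 100, 250.0),
    ("g2", "m3", 80.0, 100, 100, 300.0),  # fails the identity filter
    ("g3", "m4", 95.0, 50, 100, 120.0),   # fails the coverage filter
]
print(best_hits(toy_hits))
```

A reciprocal check (requiring each pair to be the best hit in both directions) is a common stricter variant, but it is not stated in the text and is therefore not applied here.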
2.5.2 Gene family contraction and expansion
With the gene family clustering result, gene family contraction and expansion can be detected to
reveal the dynamic evolutionary changes of gene families along the phylogenetic tree.
Based on the phylogenetic tree and divergence times, CAFE [119] was used for gene family
contraction and expansion analysis. Firstly, a global parameter λ was estimated by maximum
likelihood under a random birth and death model. Then a conditional p-value was calculated for each
family, and families with a p-value less than 0.05 were marked as significantly changed families,
that is, families that underwent contraction or expansion in the course of evolution.
2.5.3 Segmental duplications
Segmental duplications are duplicated blocks of genomic DNA, typically ranging in size from 1 to 200
kb. They often contain high-copy repeats or intron-exon structures. The whole-genome shotgun sequence
detection (WSSD) method was used to identify segmental duplications [120]. Whether a
sequence is duplicated was determined according to its overrepresentation and average
sequence identity. After excluding transposable elements (TEs) from the genome, clean reads were mapped to
the genome using BWA with the parameters “-m 200000 -l 20 -k 2 -t 30”, and samtools was then used to
obtain coverage and depth.
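The idea of detecting duplications by overrepresentation can be sketched as a simple depth filter. The sketch assumes that mean read depth has already been computed per fixed-size window (for example from the samtools output); the cutoff of three standard deviations above the mean and the use of the overall mean as the baseline are illustrative assumptions, not the thresholds used in this study.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def overrepresented_windows(depths: Sequence[float], n_sd: float = 3.0) -> List[int]:
    """Return indices of windows whose depth exceeds mean + n_sd * SD.

    depths: mean read depth of consecutive genomic windows.
    n_sd:   number of standard deviations above the mean used as the cutoff
            (an assumed value chosen for illustration).
    """
    if not depths:
        return []
    cutoff = mean(depths) + n_sd * pstdev(depths)
    return [index for index, depth in enumerate(depths) if depth > cutoff]


# Toy example: 99 windows at ~30x and one window with strongly elevated depth.
window_depths = [30.0] * 99 + [300.0]
print(overrepresented_windows(window_depths))
```

Windows flagged in this way would still need to be checked against the sequence identity criterion before being called segmental duplications.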
2.6 Investigating molecular mechanisms of adaptation/phenotype
2.6.1 Immune character
The major histocompatibility complex (MHC) is a series of genes encoding surface proteins that help cells
to recognize foreign substances; it is closely related to the immune system and has been shown
to be associated with many diseases. The main function of MHC molecules is to bind
peptides derived from pathogens and present them on the cell surface to facilitate T-cell
recognition and trigger a series of immune functions. The MHC has been shown to be highly
polymorphic in most primate species, including macaque. The MHC class I region was therefore identified in
the mandrill genome by searching the human sequence against it with RepeatMasker (Version 4.0.5,
with parameter “-nolow -no_is -norna -engine ncbi”).
2.6.2 Language competence
Language is a special ability for communication within a species, particularly in humans. Previously,
several genes have been found to be involved in language, and exploring the status of these genes in
animals can help to understand the origin of language. FOXP2 was the first
gene found to be related to human language development, and a heterozygous missense mutation
was thought to cause an inherited language disorder, based on a case study of a family known as the KE
family. FOXP2 is expressed in many tissues, including the basal ganglia and inferior frontal cortex
[121], where it is essential for brain maturation and for speech and language development. Here, the protein
sequences of the FOXP2 genes of human, chimpanzee and mouse were downloaded from NCBI. These
FOXP2 protein sequences were mapped to the baboon and mandrill genomes using blat with the
parameters “-q=prot -t=dnax”. The blat results were filtered by removing i) hits other
than the best five hits, ii) hits covering less than 30% of the query protein, and iii) hits with a
difference greater than 20%. After the blat alignment, GeneWise was used for fine mapping with default
parameters.
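The three filtering criteria can be written down as a short sketch. It assumes that each blat hit has already been parsed into a dictionary; the keys `score`, `query_cov` and `diff` are hypothetical names, "best" is assumed to mean the highest alignment score, and the order in which the criteria are applied is also an assumption.

```python
from typing import Dict, List


def filter_blat_hits(hits: List[Dict[str, float]],
                     top_n: int = 5,
                     min_query_cov: float = 0.30,
                     max_diff: float = 0.20) -> List[Dict[str, float]]:
    """Apply the three criteria: best five hits, coverage >= 30%, difference <= 20%."""
    # i) keep only the best top_n hits, ranked by alignment score (assumed ranking).
    ranked = sorted(hits, key=lambda hit: hit["score"], reverse=True)[:top_n]
    # ii) drop hits covering less than 30% of the query protein.
    # iii) drop hits differing from the query by more than 20%.
    return [hit for hit in ranked
            if hit["query_cov"] >= min_query_cov and hit["diff"] <= max_diff]


example = [
    {"id": 1, "score": 100, "query_cov": 0.95, "diff": 0.02},
    {"id": 2, "score": 90, "query_cov": 0.20, "diff": 0.05},  # too little coverage
    {"id": 3, "score": 80, "query_cov": 0.60, "diff": 0.10},
    {"id": 4, "score": 70, "query_cov": 0.80, "diff": 0.30},  # too divergent
    {"id": 5, "score": 60, "query_cov": 0.50, "diff": 0.15},
    {"id": 6, "score": 50, "query_cov": 0.99, "diff": 0.01},  # not among the best five
]
print([hit["id"] for hit in filter_blat_hits(example)])
```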
2.6.3 Olfactory character
Olfaction, or the sense of smell, is one of the important senses of animals. Chemical communication is
least well understood in Old World species, and the olfactory sense is underappreciated [122].
Protein sequences of human olfactory receptor (OR) genes from HGNC
(http://www.genenames.org/genefamilies/OR) were used to search against the genomes of mandrill
and baboon to identify OR genes.
2.6.4 Predicting binding sites of transcription factors
Transcription factors (TFs) are key regulators that bind to specific DNA sequences to activate or
repress gene expression. Each TF has at least one DNA-binding domain (DBD), which is generally
conserved. Based on their DBDs, TFs are classified into 70 families in the AnimalTFDB 2.0
database [123]. In order to identify TFs and explore their functions, BLAST was used to search
the protein sequences against the TFs in the database. The 1,691 human protein sequences in
the AnimalTFDB 2.0 database were selected as the BLAST database, with the thresholds set to
e-value<=1e-5, coverage>=30% and identity>=20%. In total, 68 TF families were predicted, with
3,438, 3,714 and 4,272 genes in Has, Msp and Oba, respectively.
Transcription factor binding sites (TFBS) are sequence motifs; a motif may correspond to the active site
of an enzyme or to a structural unit necessary for proper expression of genes. Thus, sequence motifs are
one of the basic functional units of molecular evolution. Consequently, identifying and understanding these
motifs is fundamental to building models of cellular processes at the molecular scale and to
understanding the mechanisms of human disease. In this study, we used the MEME Suite, which comprises an
integrated set of tools and databases, to perform motif-based sequence analysis.
We used the built-in motifs with the DREME algorithm (e-value<=1e-10) to identify human genomic sequences
that may contain the discovered motifs, or to determine whether the motifs are similar to
previously studied motifs. In the prediction result, 19,024 TFBS were found in Has, and we then determined
whether there were variations near the binding sites, extending each site by 50 bp.
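The last step, checking for variations within 50 bp of a binding site, can be sketched as an interval test. The data layout is an assumption made for illustration: sites are given as (chromosome, start, end) tuples, variants as (chromosome, position) tuples, and the extension is taken to apply on both sides of each site.

```python
from typing import Dict, List, Tuple

Site = Tuple[str, int, int]   # (chromosome, start, end), inclusive coordinates
Variant = Tuple[str, int]     # (chromosome, position)


def variants_near_sites(sites: List[Site],
                        variants: List[Variant],
                        flank: int = 50) -> Dict[Site, List[int]]:
    """For each site, list variant positions inside the site extended by `flank` bp."""
    positions_by_chrom: Dict[str, List[int]] = {}
    for chrom, pos in variants:
        positions_by_chrom.setdefault(chrom, []).append(pos)

    result: Dict[Site, List[int]] = {}
    for chrom, start, end in sites:
        lower = max(0, start - flank)
        upper = end + flank
        result[(chrom, start, end)] = [
            pos for pos in positions_by_chrom.get(chrom, []) if lower <= pos <= upper
        ]
    return result


toy_sites = [("chr1", 100, 120)]
toy_variants = [("chr1", 60), ("chr1", 49), ("chr1", 170), ("chr1", 171), ("chr2", 110)]
print(variants_near_sites(toy_sites, toy_variants))
```

For 19,024 sites a linear scan per site is still workable; an interval index would be the natural refinement for larger inputs.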
3. Results
3.1 Landscapes of baboon and mandrill genomes
3.1.1 Sequencing data
For de novo genome assembly, 12 libraries were constructed and sequenced for each of the two
species, and the sequencing data are summarized in Table 3.1. In total, 512 Gb (~170×, assuming
a genome size of 3 Gb) of raw paired-end and 328 Gb (~109×, assuming a genome size of 3 Gb) of
raw sequencing data were obtained.
Table 3.1 Statistics of baboon and mandrill raw sequencing data.
Species          Paired-end library insert size (bp)   Average read length (bp)   Raw data (Gb)   Sequence depth (×)
Baboon           250                                   150                        109.48          36.49
                 500                                   100                        80.91           26.97
                 800                                   100                        60.11           20.37
                 4,000                                 90                         68.24           22.75
                 10,000                                90                         95.72           31.91
Baboon total     -                                     -                          414.46          138.15
Mandrill         250                                   150                        113.296         37.77
                 500                                   100                        83.054          27.68
                 800                                   100                        65.328          21.78
                 2,000                                 90                         34.561          11.52
                 5,000                                 90                         32.967          10.99
                 10,000                                90                         65.377          21.79
                 20,000                                90                         32.141          10.71
Mandrill total   -                                     -                          426.724         142.2
After data filtering, 284 Gb and 289 Gb of clean data were obtained (Table 8.1).
3.1.2 K-mer analysis
In order to assess the genome features, 17-mers (17 bp sub-sequences) were extracted from the reads and
subjected to K-mer analysis. The reads from the short-insert libraries (baboon: libraries with insert sizes
of 250 bp, 500 bp and 800 bp, ~202 Gb of data in total; mandrill: libraries with insert sizes of
250 bp, 500 bp and 800 bp, ~212 Gb of data in total) were used for this analysis. In the
depth-frequency distribution (Figure 8.1), the peaks were at ~28× and ~31× for baboon and mandrill,
respectively. Thus the genome sizes of olive baboon and mandrill were estimated to be 2.93 Gb
and 2.90 Gb, respectively (Table 3.2).
Table 3.2 The information of 17-mer statistics.
Species    K    Number of K-mers    Depth peak   Genome size (bp)   Sequencing depth (×)
Baboon     17   82,117,298,803      28           2,932,760,671      33
Mandrill   17   89,967,169,490      31           2,902,166,757      37
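The genome sizes in Table 3.2 are consistent with the usual K-mer estimate, in which the total number of K-mers is divided by the depth at the main peak. The sketch below reproduces that arithmetic with the values from the table; the function name is illustrative, and truncation to an integer is an assumption about how the reported values were rounded.

```python
def estimate_genome_size(total_kmers: int, peak_depth: int) -> int:
    """Estimate genome size (bp) as total K-mer count divided by the peak depth."""
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    # Integer division truncates the estimate to whole base pairs.
    return total_kmers // peak_depth


# Values taken from Table 3.2.
baboon_size = estimate_genome_size(82_117_298_803, 28)
mandrill_size = estimate_genome_size(89_967_169_490, 31)
print(f"baboon:   {baboon_size:,} bp")
print(f"mandrill: {mandrill_size:,} bp")
```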
The distribution of K-mer frequencies of reads from a second-generation sequencing dataset can also
reflect the heterozygosity of the genome [124]. For a genome without heterozygosity or
repeats, sequenced without errors, the K-mer frequency distribution should follow a Poisson
distribution. In a real dataset, sequencing errors produce an excess of K-mers with low
frequency. Meanwhile, heterozygous regions give rise to two sets of K-mers at half of
the major sequencing depth/K-mer frequency; thus, the higher the heterozygosity, the more
obvious the secondary peak at half the major frequency. Repeat sequences contribute multiple
copies of the same K-mers, so secondary peaks can be found at twice the major K-mer frequency
or at higher multiples. As indicated in Figure 8.1, for baboon there was
no obvious secondary peak at half of the major K-mer frequency (~28), indicating low
heterozygosity for the sequenced baboon individual. However, for mandrill, an obvious secondary
peak was found at a K-mer frequency of ~16, which is half of the major K-mer frequency
(~31); thus the mandrill individual should have relatively high heterozygosity. For both genomes,
there were also noticeable peaks at twice the major K-mer frequency, indicating high repeat
content. Thus, both sequenced genomes are highly repetitive, and the mandrill genome is in addition
noticeably heterozygous.
3.1.3 Genome assembly
With the genome features estimated, genome assembly was conducted for both species.
The final baboon genome assembly was 3.12 Gb with ~80 Mb of gaps, similar to
the overall genome length estimated in the K-mer analysis (Table 3.3). The contig N50 was 21.7 kb,
with the longest contig being 238.9 kb, indicating good continuity of the genome and sufficient quality
for gene annotation. For scaffolds, the N50 was 1.1 Mb, with the longest scaffold being 8.8 Mb, and the
2,308 longest scaffolds made up more than 80% of the whole genome. Similarly, for mandrill, the total
assembled length was 2.88 Gb, with ~80 Mb of gaps. The contig N50 was 20.5 kb, with the longest contig
being 211 kb. The scaffold N50 was 3.6 Mb, with the longest scaffold being 19.1 Mb, and the 634
longest scaffolds made up more than 80% of the whole genome. The genome assemblies are of
good quality for downstream analysis, with good coverage and continuity.
Table 3.3 Statistics of the genome assemblies.
Species    Statistic                 Contig*1 size (bp)   Contig*1 number   Scaffold size (bp)   Scaffold number
Baboon     N90*2                     2,315                171,662           52,973               4,209
           N80                       7,938                108,096           332,809              2,308
           N70                       12,413               77,767            559,903              1,593
           N60                       16,868               56,789            798,728              1,128
           N50                       21,659               40,868            1,070,645            792
           Longest                   238,945              ----              8,793,459            ----
           Total size                3,044,016,568        ----              3,116,777,842        ----
           Total number (>=100 bp)   ----                 1,831,592         ----                 1,610,583
           Total number (>=2 kb)     ----                 177,569           ----                 11,097
Mandrill   N90                       5,266                141,475           638,217              936
           N80                       9,025                101,618           1,303,160            634
           N70                       12,638               75,505            1,962,294            457
           N60                       16,336               56,061            2,730,696            332
           N50                       20,483               40,751            3,564,730            241
           Longest                   211,017              ----              19,105,867           ----
           Total size                2,798,997,503        ----              2,882,689,325        ----
           Total number (>=100 bp)   ----                 455,069           ----                 215,140
           Total number (>=2 kb)     ----                 194,923           ----                 4,742
*1. Contigs are the first assembled sequences without gaps, while scaffolds are the sequences generated by linking
contigs with gaps filled in.
*2. N90 means the length of the contig/scaffold for which all the contigs/scaffolds longer than it accumulate to 90%
of the total length. Similarly, N(P) in which P ranged from 50 to 90 in this table, indicates the length of
contig/scaffold for which all the contigs/scaffolds longer than it accumulated to P% of the total length.
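The N(P) definition in footnote *2 can be expressed as a short sketch. The function name and the toy lengths are illustrative only; the function returns both the N(P) length and the number of sequences needed to reach P% of the total, which correspond to the "Size" and "Number" columns of Table 3.3.

```python
from typing import Sequence, Tuple


def nx_statistic(lengths: Sequence[int], percent: int) -> Tuple[int, int]:
    """Return (N(P) length, number of sequences) for P = percent.

    Sequences are accumulated from longest to shortest until their summed
    length reaches percent% of the total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        # Integer arithmetic avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length, count
    raise AssertionError("unreachable for percent <= 100")


toy_lengths = [10, 8, 6, 4, 2]  # total length 30
print("N50:", nx_statistic(toy_lengths, 50))
print("N90:", nx_statistic(toy_lengths, 90))
```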
3.1.4 Annotation results
3.1.4.1 Repeat annotation
Repeats are widespread in the genome and may have important functions. Repeats were
annotated and categorized in both genomes (Table 8.2-Table 8.5). For baboon, repeats
made up ~50% of the whole genome, with 47% being transposable elements (TEs). Compared with
human (Table 3.4), Long Interspersed Nuclear Elements (LINEs) were less abundant in the
baboon and mandrill genomes (~17%) than in the human genome (~21%), while Short
Interspersed Nuclear Elements (SINEs) were present at similar levels in these genomes (~11-14%), with
Alu elements in particular showing very similar proportions (10%~11%), reflecting that Alu elements are
conserved within primate genomes, as previously described [125].
Table 3.4 Repeat contents of baboon, mandrill, human and mouse (percentage coverage of genome).
Group               Baboon   Mandrill   Human   Mouse
LINE                16.76    16.61      20.99   19.2
  L1                15.6     15.05      17.37   18.78
  L2                1.05     1.39       3.3     0.38
  LINE/other        0.11     0.17       0.32    0.04
SINE                11.26    12.10      13.64   8.22
  Alu               10.14    10.47      10.74   2.66
  MIR               0.92     1.37       2.9     0.57
  B4                0.14     0.20       --      2.36
  SINE/other        0.07     0.06       --      2.64
LTR                 7.88     8.36       8.55    9.87
  MaLRs             3.12     3.40       3.78    4.82
  Other ERVs        4.68     4.85       4.77    4.4
  LTR/other         0.08     0.11       --      0.65
DNA transposons     2.7      3.27       3.03    0.88
Other               3.70     0.06       0.53    0.74
Total               42.3     40.40      46.74   38.91
3.1.4.2 RNA annotation
Non-coding RNAs (ncRNAs) are RNA molecules that are not translated into protein. Four
types of ncRNAs were annotated in the baboon and mandrill genomes, including transfer RNAs
(tRNAs), ribosomal RNAs (rRNAs) and small nuclear RNAs (snRNAs) (Table 8.6 and Table 8.7).
3.1.4.3 Gene annotation
After masking repeats, protein-coding genes were predicted in the genomes using ESTs, homologous
proteins and ab initio prediction, finally yielding 23,867 (baboon) and 21,906 (mandrill) protein-coding
genes (Table 3.5 and Table 3.6). In the mandrill genome, the average number of exons per gene is
slightly lower than in the baboon genome (7.52 versus 8.20), while the average exon length is longer
(184.95 bp versus 178 bp). In addition, the average intron length is about 800 bp longer than in baboon
(5,785 bp versus 4,972 bp).
Table 3.5 Summary of gene annotation in baboon genome.
Gene set                        Number   Average transcript   Average CDS   Average exons   Average exon   Average intron
                                         length (bp)          length (bp)   per gene        length (bp)    length (bp)
De novo    AUGUSTUS             22,528   45,907               1,371         8.10            169            6,272
Homolog    Nomascus leucogenys  21,278   36,106               1,467         8.31            176            4,741
           Pongo abelii         23,806   32,996               1,341         7.59            176            4,801
           Pan troglodytes      21,245   35,899               1,468         8.22            178            4,771
           Macaca mulatta       25,930   32,844               1,267         7.14            177            5,145
           Gorilla gorilla      24,402   30,755               1,377         7.65            179            4,415
           Homo sapiens         25,350   35,308               1,481         8.18            181            4,710
EST                             39,294   6,538                775           2.21            350            5,763
Final set                       23,867   37,246               1,459         8.20            178            4,972
Table 3.6 Summary of gene annotation in mandrill genome.
Gene set                        Number   Average transcript   Average CDS   Average exons   Average exon   Average intron
                                         length (bp)          length (bp)   per gene        length (bp)    length (bp)
De novo    AUGUSTUS             18,460   54,148               1,429         8.68            164.65         6,863
Homolog    Nomascus leucogenys  20,874   39,863               1,499         8.56            175.07         5,072
           Pongo abelii         23,330   37,371               1,373         7.82            175.53         52,757
           Pan troglodytes      20,866   40,317               1,502         8.46            177.62         5,204
           Macaca mulatta       25,460   38,089               1,294         7.36            175.96         5,787
           Gorilla gorilla      23,791   34,748               1,413         7.92            178.47         4,816
           Homo sapiens         25,161   39,338               1,513         8.42            179.77         5,098
EST                             38,021   7,365                781           2.33            335.00         4,935
Final set                       21,906   39,087               1,390         7.52            184.95         5,785
3.1.4.4 Gene evaluation and function annotation
To evaluate the quality of the annotated protein-coding genes, 3,023 BUSCO (Benchmarking
Universal Single-Copy Orthologs) groups were searched against the predicted gene sets; 97.1% (baboon)
and 98.6% (mandrill) of the groups were found as complete BUSCOs in the final gene
sets (Table 3.7). In addition, 99.24% (baboon) and 98.70% (mandrill) of the predicted genes had a
corresponding biological function supported by at least one of the functional databases (Table 3.8).
Table 3.7 Assessment of gene sets using BUSCO.
Baboon Mandrill
Total BUSCO groups 3,023 3,023
Complete BUSCOs 2,936 2,981
Complete and single-copy BUSCOs 2,772 2,811
Complete and duplicated BUSCOs 164 170
Fragmented BUSCOs 63 28
Missing BUSCOs 24 14
Table 3.8 Function annotation of the final gene sets.
Category                      Baboon gene number   Baboon %   Mandrill gene number   Mandrill %
Total                         23,867               100.00     21,906                 100.00
Annotated     InterPro        20,310               85.10      18,139                 82.80
              GO              15,818               66.27      14,160                 64.64
              KEGG            19,733               82.68      18,022                 82.27
              Swissprot       22,547               94.47      20,547                 93.80
              TrEMBL          23,661               99.14      21,529                 98.28
              All databases   23,686               99.24      21,622                 98.70
Unannotated                   181                  0.76       284                    1.30
3.2 Evolution of baboon and mandrill
3.2.1 Gene families
In order to analyze gene family evolution in baboon and mandrill, gene family clustering was
conducted, identifying 17,947 and 15,368 gene families, respectively, with 668 and 1,387 genes not
clustered (Table 3.9). Compared with human (Homo sapiens), macaque (Macaca mulatta) and
chimpanzee (Pan troglodytes), 489 and 342 gene families, containing 598 and 515 genes, were found to
be unique to baboon and mandrill, respectively (Figure 3.1). These unique gene families were
significantly enriched for the gene ontology (GO) terms GO:0042773, ATP synthesis coupled electron
transport (GO level: biological process, BP, P=1.28e-12), and GO:0016651, oxidoreductase activity,
acting on NADH or NADPH (GO level: molecular function, MF, P=1.68e-09), for baboon (Table
8.8), and GO:0006412, translation (GO level: BP, P=6.29e-33), and GO:0003735, structural
constituent of ribosome (GO level: MF, P=6.29e-33), for mandrill (Table 8.9). On the other hand,
5,133 single-copy orthologous genes were found to be shared among all 13 species (Figure 3.2).
Table 3.9 Gene family clustering in the seven species.
Species              Genes number   Un-clustered genes   Family number   Unique families   Average genes per family
Callithrix jacchus   20,585         445                  16,858          12                1.19
Gorilla gorilla      20,478         313                  17,495          8                 1.15
Homo sapiens         19,513         105                  17,367          2                 1.12
Macaca mulatta       20,627         912                  16,391          38                1.2
Mandrillus sphinx    21,906         1,387                15,368          87                1.34
Microcebus murinus   17,853         310                  15,414          9                 1.14
Mus musculus         22,190         864                  17,778          209               1.2
Note: Un-clustered genes refer to unique genes in the species; Unique families refer to unique gene families of the species.
Figure 3.1 Orthologous gene clusters in the five related species. The Venn diagram of unique
and shared gene families in the human, mandrill, gorilla, macaque and baboon genomes.
Figure 3.2 Comparison of orthologous genes among 13 primates and mouse.
3.2.2 Phylogenetic analysis
In order to analyze species evolution, a phylogenetic tree of baboon, mandrill and the other
sequenced animal genomes was constructed based on single-copy orthologous genes. The molecular
clock (neutral substitution rate per year) was estimated from the 4-fold degenerate sites of the
single-copy orthologous genes, and the divergence times were estimated accordingly. The maximum-likelihood
phylogenetic tree (Figure 3.3) indicates that baboon and mandrill are located in the same clade as
macaque and that this clade diverged from the human clade about 28.5 (27.5–30.4) million years ago (MYA),
while the divergence time between Cercopithecoidea and Hominoidea was previously estimated to be 26.66
(24.29–28.95) MYA using mitochondrial genome sequences [126]. Baboon and mandrill
were estimated to have split from macaque about 7.9 (6.9–9.2) MYA, which differs from the
previous estimate of 6.6 (6.0–8.0) MYA [127]. Baboon and mandrill split from each
other ~5.8 (5.0–6.8) MYA, reflecting the close evolutionary relationship between baboon and
mandrill.
Figure 3.3 Phylogenetic tree based on single copy gene families in the 13 species. The
calibration time marked as red dot is derived from previous publications [89-91].
The demographic history of a species reflects historical population changes and is therefore important
to infer from the genome. We inferred a noticeable population bottleneck in the demographic
history of baboon and mandrill using the pairwise sequentially Markovian coalescent (PSMC)
model (Figure 3.4). The two species went through similar population size changes between 100 and
10,000 thousand years (kyr) ago. Around 28 kyr ago there was a sharp increase, followed by a noticeable
bottleneck in which the effective population size fell from peaks of 61,000 and 47,000 to ~6,500 around
17 kyr ago in both the baboon and mandrill populations. The increase in population size coincided with
the increase of the human population, probably indicating a climate change suitable for the expansion
of mammals, while the recent bottleneck of the baboon and mandrill populations contrasts with the
recent increase of the human population.
Figure 3.4 The demographic change of baboon and mandrill. The population size change over
time was estimated with the PSMC model. The x-axis indicates time, running from recent (left) to
ancient (right), while the y-axis indicates the effective population size.
3.3 Synteny among primates
3.3.1 Synteny analysis of human, macaque, baboon and mandrill
By comparing the genomes of human, macaque, baboon and mandrill, synteny can be identified, and
historical genome rearrangement events such as inversions, insertions and deletions can thereby also be
identified. These events may result in the loss, duplication or functional change of genes. In total, 9,930,
11,418 and 14,318 gene pairs were identified between baboon and mandrill, macaque and baboon, and human
and macaque, respectively. Human retained 24 chromosomes (22+X+Y), while the baboon clade
had only 22 chromosomes (20+X+Y) (Figure 3.5) after ~27.5–30.4 million years of evolution. I
found that chromosomes 13 and 14 of the baboon branch went through chromosome fusion and that
chromosomes 7 and 10 experienced chromosome breaks after the new clade formed. Moreover,
several inversion events, both paracentric (for example on chromosomes 1, 6 and 9) and pericentric
(chromosome 2), occurred in comparison with human. In detail, I analyzed the genes located in
five inversions longer than 37 Mb on chromosomes 1, 2, 3, 4 and 9 and found them to be
significantly enriched for the terms GO:0006412: translation (GO level: BP, P=4.60e-03), GO:0004950:
chemokine receptor activity (GO level: MF, P=8.96E-06) and others (Table 8.10). We also found that the
FOXP2 gene, which is vital for the formation of voice and language, was located on chromosome
3: 135,344,867–135,605,165 (mandrill) (Figure 8.3).
Figure 3.5 Synteny relationship of human, macaque, baboon and mandrill.
3.3.2 Gene family contraction and expansion
In the baboon and mandrill lineage, there were 545 expanded and 618 contracted gene families (Figure
3.6). Expanded gene families were significantly enriched in the functions of
biosynthetic process, structural constituent of ribosome, nucleosomal DNA binding, G-protein
coupled receptor activity, olfactory receptor activity, glucose catabolic process and peptidyl-prolyl
isomerization, as well as in the pathways of carbon fixation in photosynthetic organisms and electron
transport chain. In baboon and mandrill, the peptidylprolyl isomerase A (PPIA) family was significantly
expanded (GO:0003755, P=3.60E-89, Fisher’s exact test; 40 baboon genes and 53 mandrill genes). PPIA
belongs to the peptidyl-prolyl cis-trans isomerase (PPIase) family, which catalyzes cis-trans
isomerization and the folding of newly synthesized proteins, interacts with several transcription factors
and regulates many biological processes including inflammation and apoptosis, and even acts in
cerebral hypoxia-ischemia. Under stress, for example in the presence of reactive oxygen species (ROS),
cells secrete PPIA to induce an inflammatory response and mitigate tissue injury. Baboons have
been used in studies of embryo infections and various bacterial infections, in which rapid
early innate immune responses were observed; this may be related to PPIA functions.
The peroxiredoxin-6 (PRDX6) family, which reduces peroxides and protects against oxidative
injury during metabolism, was also significantly expanded (GO:0051920, P = 0.000641, Fisher’s
exact test; 4 baboon genes and 5 mandrill genes).
Figure 3.6 Gene family contraction and expansion for 12 primates and mouse.
3.3.3 Segmental duplications
Segmental duplications (SDs) are widespread in mammalian genomes and may be functionally
important; SDs were therefore identified in seven species including baboon and mandrill (Figure 3.7).
The amounts of long segmental duplications were similar in baboon and mandrill and lower than in human.
Figure 3.7 Segmental duplications in seven primate species.
3.4 MHC comparison between human and baboon/mandrill
The major histocompatibility complex (MHC) contains a series of genes encoding surface proteins that
help cells to recognize foreign substances; the MHC is therefore important for the immune system and has
been found to be associated with many diseases. The proteins encoded by genes in the MHC region
mainly bind peptides from pathogens and present them on the cell surface to
facilitate T-cell recognition and trigger a series of immune functions. The MHC region is highly
polymorphic in most primate species that have been studied. A previous study of the
MHC region of macaque revealed the diversity of this region, whereas for Old World monkeys
other than macaque the MHC regions remain largely unknown. In the assembled genomes
of baboon and mandrill, a relatively complete assembly of the MHC region was found only in mandrill
and not in baboon, because of the complexity caused by the high repeat content. The MHC region of
mandrill was found on chromosome 4. To confirm that the assembled MHC region of mandrill was of high
quality, reads were mapped back to the assembled genome, showing good coverage and consistent
paired-end/mate-pair relationships (Figure 8.3) and thereby supporting the assembly of the MHC region
in mandrill. Since the MHC region is highly repetitive, a detailed repeat annotation was carried out for
both the mandrill and human MHC class I regions (from gene GABBR1 to gene MICB, in the direction from
the telomere side to the centromere side) with the same parameters, revealing similar repeat content for
the two species in this region (48.27% in mandrill compared with 51.03% in human) (Table 8.11). In
addition to the similar repeat content, the genes of the two species in this region showed good
synteny (Figure 3.8). Only 54 insertions and deletions (indels) longer than 100 bp were found
between the MHC class I regions of the two species, and most of them overlapped with
repetitive elements such as SINEs, LINEs and LTRs, indicating the influence of repeat content on
MHC diversity.
HLA genes are important for immune recognition, so the HLA genes were further examined and
compared with those of human. In the human MHC class I region there were 50 genes in total, including 6
HLA genes, while in the mandrill MHC class I region only 4 HLA genes were identified. Searching the
rest of the genome outside the MHC region identified another 4 HLA genes, bringing the total
number of HLA genes in mandrill to 8. However, further inspection of the 8 HLA genes in
mandrill showed that 5 of them harbored start or stop codon changes, premature termination
or frameshift mutations (Figure 3.9), which may reflect genetic mechanisms underlying differences in
immune response between mandrill and human.
Considering that only one MIC gene was found in chimpanzee [128], compared with the two genes
MICA and MICB in human, which resulted from a genomic duplication that occurred ~33-44 million years
ago [129, 130], MIC genes in mandrill were further identified (Figure 3.10). Both MICA and MICB
genes or gene fragments were found in mandrill, but the gene structure of MICA in
mandrill was incomplete because of the loss of the first exon. Again, this may reflect genetic
mechanisms underlying differences between human and mandrill immune responses.
Figure 3.8 Synteny between human and mandrill MHC regions.
Figure 3.9 Alignment of HLA genes with amino acid sequence for human, baboon and
mandrill.
Figure 3.10 Structure of MICA and MICB gene for human and mandrill. The red, orange,
yellow and purple boxes represent exons, LINEs, SINEs and LTRs, respectively.
3.5 Language related genomic features
Language is a special ability for communication, used particularly by humans. Several genes have been
identified as being involved in language formation in humans. FOXP2 was the first gene identified as
relevant to human language development, and a heterozygous missense mutation in this gene
was shown to cause an inherited language disorder, based on a case study of a family known as the KE
family. FOXP2 is expressed in many tissues, including the basal ganglia and inferior frontal cortex
[121], where it is essential for brain maturation and speech/language development. Two human-specific
amino acid substitutions, relative to chimpanzee, affect the neural functions of FOXP2 and lead to
differential transcriptional regulation in vivo; 111 genes were
found to be significantly differentially expressed as a result of these two substitutions [131]. Similarly,
using ChIP-seq with an antibody designed against a FOXP2 peptide, researchers found 175 target genes [132].
With the baboon and mandrill genomes available, FOXP2 gene evolution in primates was further investigated
here, to shed light on language-related genomic features.
FOXP2 genes in baboon and mandrill were identified and compared with those in human, chimpanzee
and mouse (Figure 3.11). In baboon, the FOXP2 gene (which aligns well to human
ENSP00000386200) was found on scaffold1015 from 492,895 to 753,085 bp, and in mandrill the
FOXP2 gene was found on scaffold103 from 4,983,301 to 5,243,599 bp. Both genes had 18 exons.
In human, the FOXP2 gene has 22 different transcripts and encodes several motifs, including the FOXP
coiled-coil domain and the Fork head domain. The FOXP coiled-coil domain modulates the dimeric
association of FOXP transcription factors, and mutations in this domain may cause diseases such as IPEX
(immunodysregulation polyendocrinopathy enteropathy X-linked) syndrome. The Fork head domain
is found in several different transcription factors and is involved in a variety of biological
processes, including early embryogenesis, organogenesis, tumorigenesis and signal transduction.
Comparing the human and baboon FOXP2 proteins (715 amino acids in full length),
only two amino acid differences were found, both in the seventh exon. Moreover,
no mutation was found in the coiled-coil or Fork head domains of FOXP2, which is concordant
with previous studies. The two amino acid substitutions may affect the function of FOXP2.
Figure 3.11 Amino acid sequence alignment of the FOXP2 gene from human, chimpanzee, mouse,
baboon and mandrill. Dots represent residues identical to the human sequence.
3.6 Olfactory receptor genes analysis
Olfaction, or the sense of smell, is one of the important senses of animals. However, chemical
communication such as olfaction is not well understood in Old World species [122]. Most mammals
possess two distinct sets of chemosensory neurons, located in the main olfactory epithelium (MOE)
and in the vomeronasal organ (VNO), while Old World primates are generally considered to lack
a functional VNO [133]. Previous studies indicated that olfactory communication plays a vital
role in information acquisition during social foraging for both mandrill and baboon [134].
Compared with chimpanzee and macaque, almost all the OR gene families were substantially expanded in
baboon and mandrill (Table 3.10), and several families, including Family 52, were also expanded
compared with human (Figure 3.12). In detail, the number of Family 7, subfamily E member 24 ORs (OR7E24)
is notably elevated in the mandrill and baboon genomes (6 copies in mandrill, distributed on
chromosome 19 (Figure 8.4); 5 in baboon, distributed on chromosomes 14 and 19; 1 in human; 1 in
macaque; and 0 in chimpanzee). Intriguingly, OR7E24 was confirmed to be preferentially and
specifically expressed in human testis cells and was proposed to play an important role in the migratory
phase of the germ cell life cycle [135]. These ORs may be functionally important during the life cycle
of mandrill and baboon, and further research can be conducted to explore the mechanisms.
Table 3.10 Olfactory receptor gene copy number in five species.
Families human macaque mandrill baboon chimpanzee
Family 1 26 12 23 25 18
Family 2 67 47 94 95 49
Family 3 3 1 3 3 3
Family 4 56 35 68 62 33
Family 5 56 23 56 60 39
Family 6 31 24 38 36 23
Family 7 11 2 14 13 9
Family 8 23 11 22 23 15
Family 9 8 4 12 11 7
Family 10 37 23 46 42 23
Family 11 9 4 11 12 4
Family 12 3 1 3 3 0
Family 13 12 5 13 13 10
Family 14 1 0 1 1 1
Family 51 24 18 35 27 21
Family 52 26 16 48 39 22
Family 56 6 3 6 5 5
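The comparison with human can be made explicit with a small sketch that uses the copy numbers transcribed from Table 3.10. The criterion applied here, that a family counts as expanded when both mandrill and baboon have strictly more copies than human, is an assumption chosen for illustration and is not necessarily the criterion behind Figure 3.12.

```python
from typing import Dict, List, Tuple

# Copy numbers from Table 3.10, in the order
# (human, macaque, mandrill, baboon, chimpanzee), keyed by OR family number.
OR_COUNTS: Dict[int, Tuple[int, int, int, int, int]] = {
    1: (26, 12, 23, 25, 18),
    2: (67, 47, 94, 95, 49),
    3: (3, 1, 3, 3, 3),
    4: (56, 35, 68, 62, 33),
    5: (56, 23, 56, 60, 39),
    6: (31, 24, 38, 36, 23),
    7: (11, 2, 14, 13, 9),
    8: (23, 11, 22, 23, 15),
    9: (8, 4, 12, 11, 7),
    10: (37, 23, 46, 42, 23),
    11: (9, 4, 11, 12, 4),
    12: (3, 1, 3, 3, 0),
    13: (12, 5, 13, 13, 10),
    14: (1, 0, 1, 1, 1),
    51: (24, 18, 35, 27, 21),
    52: (26, 16, 48, 39, 22),
    56: (6, 3, 6, 5, 5),
}


def families_larger_than_human(counts: Dict[int, Tuple[int, int, int, int, int]]) -> List[int]:
    """Families in which both mandrill and baboon exceed the human copy number."""
    return [
        family
        for family, (human, _macaque, mandrill, baboon, _chimp) in counts.items()
        if mandrill > human and baboon > human
    ]


expanded = families_larger_than_human(OR_COUNTS)
print("Families larger than in human:", expanded)
```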
Figure 3.12 Expansion of the olfactory receptor gene family in baboon and mandrill. Red,
blue, green and yellow denote olfactory receptor genes in baboon, mandrill, human and chimpanzee,
respectively.
3.7 Positively selected genes
In addition to gene family expansions and contractions, genes under selection during evolution are
also functionally important. I therefore identified positively selected genes (PSGs) to reveal the
evolution of baboon and mandrill and to depict possible functional changes in these species. The
5,133 single-copy orthologous genes shared among 13 species, obtained from the gene family
clustering, were used for detecting PSGs. In total, 657 PSGs were identified, with significant
enrichment in molecular functions including kinase activity, transferase activity and
phosphotransferase activity (Table 8.12). Further investigation of the functions of these PSGs by
searching InnateDB showed that 34 of them are innate immune response genes. Interactions among
these genes were predicted with STRING, a database of functional protein association networks
(http://string-db.org/cgi). As shown in Figure 3.13, STAT1, IL5, IL1R1, ATG5, CREB1, DICER1 and
PIK3R1 may have important roles in the immune system, which is strongly associated with stress
resistance and wound healing. Finally, GO and KEGG pathway enrichment analysis showed that the
PSGs were enriched in GO:0080134, regulation of response to stress (GO level: BP, P=1.59e-05);
GO:0006955, immune response (GO level: BP, P=4.11e-05); and KEGG:04640, hematopoietic cell
lineage (P=6.83e-05).
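The enrichment P values above come from the thesis pipeline, and the statistical test behind them is not stated in this section. As an illustration only, the sketch below assumes a one-sided hypergeometric test, a common choice for GO and KEGG enrichment. The background size (5,133) and the number of PSGs (657) are taken from the text, while the per-term counts (150 annotated genes, 40 of them PSGs) are hypothetical placeholders.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """One-sided enrichment P value, P(X >= k).

    N: genes in the background set
    K: background genes annotated with the term
    n: genes in the test set (here, the PSGs)
    k: test-set genes annotated with the term
    """
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("need 0 <= K <= N and 0 <= n <= N")
    upper = min(K, n)
    # Exact integer arithmetic avoids overflow for large binomial coefficients.
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(max(k, 0), upper + 1)
    )
    return favourable / comb(N, n)


if __name__ == "__main__":
    BACKGROUND = 5133  # single-copy orthologs tested (from the text)
    PSG_COUNT = 657    # positively selected genes (from the text)
    TERM_GENES = 150   # hypothetical: background genes carrying one GO term
    TERM_PSGS = 40     # hypothetical: PSGs carrying that GO term
    p = hypergeom_enrichment_p(TERM_PSGS, PSG_COUNT, TERM_GENES, BACKGROUND)
    print(f"P = {p:.3g}")
```

In practice the P values would also be corrected for multiple testing across all terms; that step is omitted here.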
Figure 3.13 Predicted interactions among positively selected innate immunity genes in mandrill.
3.8 Disease related genomic features
To gain insight into disease-related mutations in baboon and mandrill, we collected the mutations
recorded in the HGMD database and checked the corresponding sites in baboon and mandrill. With this
approach, we found 17 genes carrying a disease-associated amino acid change in the two species
(Table 3.11). Moreover, some of these changes lie within functional domains, where they could
strongly affect the function of the genes (Table 3.12). In human, these mutations are associated
with phenotypes such as lung cancer, cranial volume and atopic asthma.
We also searched for disease-related genes carrying amino acid changes unique to baboon and
mandrill. To our surprise, only one gene (LRRK2) has such a unique change, at position 1210
(Figure 3.14). At this site all other species have a tyrosine, whereas baboon and mandrill have a
cysteine; this change reduces the hydrophobicity of the site, which could affect the function of
the gene.
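The site check described above can be sketched as follows. This is an illustration under stated assumptions, not the code used in the thesis: it assumes that each human protein has already been pairwise aligned to its ortholog (gaps written as "-"), that variants use the notation of Table 3.12 (for example K59N), and the two toy sequences and the function names are hypothetical.

```python
def residues_at(aligned_ref, aligned_query, ref_pos):
    """Return (reference residue, query residue) at a 1-based reference position.

    Positions are counted on the ungapped reference, so gap columns in the
    reference are skipped.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    seen = 0
    for ref_aa, query_aa in zip(aligned_ref, aligned_query):
        if ref_aa == "-":
            continue
        seen += 1
        if seen == ref_pos:
            return ref_aa, query_aa
    raise ValueError("ref_pos lies beyond the reference sequence")


def classify_site(aligned_ref, aligned_query, variant):
    """Classify the query residue at a variant site such as 'K59N'."""
    wild, pos, mutant = variant[0], int(variant[1:-1]), variant[-1]
    ref_aa, query_aa = residues_at(aligned_ref, aligned_query, pos)
    if ref_aa != wild:
        raise ValueError(f"reference has {ref_aa} at {pos}, expected {wild}")
    if query_aa == mutant:
        return "disease_allele"
    if query_aa == wild:
        return "wild_type"
    return "other"


if __name__ == "__main__":
    # Toy alignment (hypothetical sequences, not real LRRK2 data).
    human = "MK-TAYLG"
    query = "MKQTACLG"
    for variant in ("Y5C", "K2N", "Y5H"):
        print(variant, classify_site(human, query, variant))
```

A site reported as "disease_allele" in both baboon and mandrill would correspond to an entry of Table 3.11; checking uniqueness, as for LRRK2, would additionally require the same lookup in every other species of the alignment.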
Table 3.11 Disease-associated genes and their amino acid changes in baboon and mandrill.
Gene name   Position   Wild-type AA   Mutant AA   Disease description
ALAD        59         K              N           Amyotrophic lateral sclerosis
CIITA       500        G              A           Multiple sclerosis
CRB1        959        G              S           Retinitis pigmentosa
IL4R        75         I              V           Asthma, atopic
MCPH1       761        A              V           Cranial volume
NPHS2       192        I              V           Nephrotic syndrome
TP53BP1     353        D              E           Lung cancer
Table 3.12 Mutation effects for some disease-related genes.
PROTEIN   UNIPROT_ID   REF   ALT   POS   VAR     SIFT   Domain
ALAD      P13716       K     N     59    K59N    1      ALAD
CIITA     P33076       G     A     500   G500A   1      NACHT domain
CRB1      P82279       G     S     959   G959S   0.66   PFAM: NO / PROSITE: EGF-like 14
IL4R      J9JII2       I     V     75    I75V    0.82   Interleukin-4 receptor
MCPH1     Q8NEM0       A     V     761   A761V   0.52   BRCT domain
NPHS2     Q9NP85       I     V     192   I192V   1      SPFH domain / Band 7 family
TP53BP1   Q12888       D     E     353   D353E   1      not included
Figure 3.14 The unique Y1210C amino acid change in baboon and mandrill.
4. Discussion
Primates are well-studied mammals because of their evolutionary importance and their close
relationship to humans. Many primate genomes are already available, and the genomic features of
primates have been studied comprehensively. Despite this progress, genomic data from more primate
species are needed to improve our understanding of primates in evolutionary studies and
applications.
Here, applying second generation sequencing technologies, I established draft genomes for baboon
and mandrill, which are valuable resources for primate and disease studies. The contig N50 of both
genomes was longer than 20 kb and the scaffold N50 exceeded 1 Mb (3.56 Mb for mandrill),
indicating good quality of the assembled genomes. To further improve the assemblies, long reads
may be applied to fill gaps and improve contig continuity, while genetic maps are needed to anchor
the scaffolds onto chromosomes. However, the lack of genetic maps has usually impeded the
construction of chromosome-level genome assemblies for primates. With further development of
technologies such as Hi-C sequencing (chromosome conformation capture based on formaldehyde
cross-linking and sequencing), the assembled scaffolds may be anchored to chromosomes even without
genetic maps.
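For readers unfamiliar with the continuity metric used above: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The sketch below is a minimal illustration with made-up lengths; the function name n50 and the example values are assumptions, not thesis data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    toy_scaffolds = [80, 70, 50, 40, 30, 20]  # hypothetical lengths
    print("N50 =", n50(toy_scaffolds))
```

The same function applies to contigs and scaffolds alike; only the input list of lengths differs.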
Secondly, the genomic features of baboon and mandrill were comprehensively explored using the draft
genomes. The repeat content and gene content were similar to those of other primate species. In the
phylogenetic tree constructed from single-copy gene families, baboon and mandrill fell in the same
clade; they diverged from the human clade about 28.5 million years ago (MYA) and split from each
other about 5.8 MYA. Evolutionary changes, including chromosome-level changes, gene family changes
(expanded, contracted and specific gene families) and positively selected genes, were identified to
reflect the genetic differences between these two primates and other species. For example,
chromosome fusion events (fusion of human chromosomes 13 and 14) could be identified even with the
scaffold-level assemblies. With further improvement of the assemblies, especially towards
chromosome-level assemblies, genomic changes can be investigated further to reveal evolutionary
changes comprehensively.
Thirdly, since baboon is commonly used as a model for human disease research and both species have
some specific features, the genetic mechanisms underpinning immunity, language ability and
olfaction were investigated. For immunity, the MHC region was analyzed specifically in the mandrill
genome, because only the mandrill genome assembly was relatively complete. Very good synteny was
found between the mandrill and human MHC regions, with only 54 insertions and deletions longer
than 100 bp. As for homologs of human leukocyte antigen (HLA) genes, I found fewer HLA genes in
both baboon and mandrill than in human (8 genes in total, five of them harboring deleterious
mutations). Unlike chimpanzee, baboon and mandrill carry two MIC genes, although one of them has
probably become a pseudogene. The similarity of the MHC region and the smaller number of HLA genes
are probably related to the success of cross-species transplantation cases. For baboon, improving
the assembly of the MHC region should be valuable for future studies. For language ability, I
examined the FOXP2 gene in the two species and found two amino acid changes relative to human
FOXP2; further validation is needed to clarify the effects of these changes. Substantially
expanded olfactory receptor (OR) genes were found in baboon and mandrill compared to other species,
indicating specific olfactory systems in these two species, which also await further study.
We also found mutations that would cause disease in human but are not accompanied by a disease
phenotype in baboon and mandrill. One example is MCPH1, a gene identified as responsible for the
neurodevelopmental disorder primary microcephaly type 1, which is characterized by a
smaller-than-normal brain size and intellectual disability (Liu, 2016). We found that all primates
except Homo sapiens share the same residue at this site (Figure 4.1). Among the primates compared,
human has the largest brain volume. We infer that this site may have been under positive selection
in human and may have contributed to the evolution of human intelligence. Moreover, some of the
disease mutations carried by baboon and mandrill may make them better medical models: CRISPR
technology could be used to edit the baboon genome and observe the resulting phenotype, and new
drugs could then be tested in these animals to select the best treatments, which may facilitate
drug development.
Figure 4.1 Cranial capacity in primates.
Finally, the assembly and analysis of the two draft genomes of baboon and mandrill also showed that
establishing genomes for more primate species is possible. Primates are an order of mammals with
~16 families and ~500 species, all highly evolved animals with special physiological and
behavioral characteristics. Despite their evolutionary importance and relatively simple genome
content, reference genomes have been established for only ~20 species. Moreover, the existing
genome assemblies differ considerably in quality and continuity, which makes further analyses and
applications more difficult. Establishing draft genomes for all primate species using second
generation sequencing would therefore be invaluable for evolutionary research, conservation and
preservation, and human genetic and disease research and applications. A plan to sequence all
primate species in the near future, using either second generation sequencing combined with 10X or
Hi-C library construction methods, or third generation long-read sequencing, should be feasible.
Once genome sequences are available, repeat and gene annotation as well as comparative genomics
among the primate species can also be carried out.
5. Conclusions
Firstly, draft genomes of baboon and mandrill were established in this study and can serve as
reference datasets for future genome sequencing and comparative genomic studies. With more than
100× second generation sequencing data from different sequencing libraries, whole genome shotgun
(WGS) assemblies of both species were completed, with genome sizes of 3.12 Gb and 2.88 Gb,
respectively. The assemblies reached high continuity, with contig N50 values of more than 20 kb and
scaffold N50 values longer than 1 Mb. The longest scaffold was longer than 8.8 Mb in baboon and
19.1 Mb in mandrill. About 40% of the genome was annotated as repeat sequences, and 23,867 and
21,906 protein-coding genes were annotated, respectively. BUSCO assessment indicated high quality
of both the genome assemblies and the gene annotations, with high coverage (98% and 99%) of the
conserved genes.
Secondly, with the draft genome sequences available, basic genomic features were investigated and
compared with those of related species. Repeat content, protein-coding gene numbers and gene
families in baboon and mandrill were similar to those of other primates. Only 489/342 gene families
containing 598/515 genes were specific to baboon/mandrill, and fewer segmental duplications (SDs)
were found in baboon and mandrill than in human.
Thirdly, the evolution of the two species was comprehensively analyzed to identify demographic
changes, chromosome-level changes, gene family expansions and contractions, and positively
selected genes. Baboon and mandrill were placed in the same clade as macaque and diverged from the
human clade about 28.5 (27.5–30.4) million years ago (MYA), while the divergence time between
Cercopithecoidea and Hominoidea was estimated to be 26.66 (24.29–28.95) MYA. Baboon and mandrill
split from each other ~5.8 (5.0–6.8) MYA. For both baboon and mandrill, the inferred demographic
history showed a sharp increase followed by a noticeable bottleneck ~28 thousand years ago. Synteny
between baboon, mandrill and human was established to identify chromosomal rearrangements (fusion
of chromosomes 13 and 14, and breaks in chromosomes 7 and 10). For gene family evolution, the
lineage of baboon and mandrill had 545 expanded and 618 contracted gene families. Expanded families
with important functions included PPIA, which can induce an inflammatory response and mitigate
tissue injury, and the PRDX6 family, which can reduce peroxides and protect against oxidative
injury during metabolism. In total, 657 positively selected genes were identified for the lineage
of baboon and mandrill, some of which are related to immune responses.
Finally, the genetic mechanisms underlying the immune system, language and olfaction were
investigated, revealing highly consistent MHC regions with fewer HLA genes, two amino acid changes
in the FOXP2 gene, and notably expanded olfactory gene families in baboon and mandrill. Good
synteny was found between the mandrill and human MHC regions, with only 54 insertions and deletions
(indels) longer than 100 bp in MHC region I. Fewer HLA genes were found in baboon and mandrill
than in human (8 in total, 5 of which appear to have become pseudogenes).
6. Future perspectives
i) Further improving the genome assemblies. By applying third generation sequencing and
Hi-C sequencing in particular, chromosome-level genome assemblies with fewer gaps can
be achieved. For highly repetitive regions, including the MHC region, better assembly
would benefit future functional and comparative genomic studies.
ii) Constructing a genome database for these species. A database can be established to
share the genome data effectively.
iii) Further functional/molecular studies of some genomic features. Genomic features
including specific gene families, mutations in functionally important genes (for example,
the FOXP2 gene) and expanded gene families (for example, olfactory receptor genes)
were identified in this study, but further validation through functional studies is
required to clarify the mechanisms underlying these functions.
iv) Large-scale genome sequencing of primates. With the experience gained in this study,
large-scale genome sequencing aimed at establishing draft genomes for all primate species
can be considered.
7. References
1. Wilson DE and Reeder DM. Mammal species of the world: a taxonomic and geographic
reference. JHU Press; 2005.
2. Jolly C. Introduction to the Cercopithecoidea, with notes on their use as laboratory animals.
In: Symp Zool Soc Lond 1966, pp.427-57.
3. Fleagle JG and McGraw WS. Skeletal and dental morphology supports diphyletic origin of
baboons and mandrills. Proceedings of the National Academy of Sciences. 1999;96 3:1157-
61.
4. Perelman P, Johnson WE, Roos C, Seuánez HN, Horvath JE, Moreira MA, et al. A
molecular phylogeny of living primates. PLoS genetics. 2011;7 3:e1001342.
5. Liedigk R, Roos C, Brameier M and Zinner D. Mitogenomics of the Old World monkey
tribe Papionini. BMC evolutionary biology. 2014;14 1:176.
6. Sigg H, Stolba A, Abegglen J-J and Dasser V. Life history of hamadryas baboons: physical
development, infant mortality, reproductive parameters and family relationships. Primates.
1982;23 4:473-87.
7. Groves CP. Primate taxonomy. 2001.
8. Kingdon J. The Kingdon field guide to African mammals. Bloomsbury Publishing; 2015.
9. Zinner D, Groeneveld LF, Keller C and Roos C. Mitochondrial phylogeography of baboons
(Papio spp.)–Indication for introgressive hybridization? BMC evolutionary biology. 2009;9
1:83.
10. Gesquiere LR, Learn NH, Simao MCM, Onyango PO, Alberts SC and Altmann J. Life at
the top: rank and stress in wild male baboons. Science. 2011;333 6040:357-60.
11. Rogers J and Hixson JE. Baboons as an animal model for genetic studies of common human
disease. The American Journal of Human Genetics. 1997;61 3:489-93.
12. Chai D, Cuneo S, Falconer H, Mwenda J and D'Hooghe T. Olive baboon (Papio anubis
anubis) as a model for intrauterine research. Journal of medical primatology. 2007;36 6:365-
9.
13. Cox LA, Comuzzie AG, Havill LM, Karere GM, Spradling KD, Mahaney MC, et al.
Baboons as a model to study genetics and epigenetics of human disease. ILAR journal.
2013;54 2:106-21.
14. Locher CP, Witt SA, Herndier BG, Tenner‐Racz K, Racz P and Levy JA. Baboons as an
animal model for human immunodeficiency virus pathogenesis and vaccine development.
Immunological reviews. 2001;183 1:127-40.
15. Starzl TE, Fung J, Tzakis A, Todo S, Demetris A, Marino I, et al. Baboon-to-human liver
transplantation. The lancet. 1993;341 8837:65-71.
16. Taylor Jr F, Chang A, Esmon C, D'angelo A, Vigano-D'Angelo S and Blick K. Protein C
prevents the coagulopathic and lethal effects of Escherichia coli infusion in the baboon.
Journal of Clinical Investigation. 1987;79 3:918.
17. Hanson SR, Powell JS, Dodson T, Lumsden A, Kelly AB, Anderson JS, et al. Effects of
angiotensin converting enzyme inhibition with cilazapril on intimal hyperplasia in injured
arteries and vascular grafts in the baboon. Hypertension. 1991;18 4 Suppl:II70.
18. Ryabchikova EI, Kolesnikova LV and Luchko SV. An analysis of features of pathogenesis
in two animal models of Ebola virus infection. The Journal of infectious diseases. 1999;179
Supplement_1:S199-S202.
19. Huneke RB, Michaels MG, Kaufman CL and Ildstad ST. Antibody response in baboons
(Papio cynocephalus anubis) to a commercially available encephalomyocarditis virus
vaccine. Comparative Medicine. 1998;48 5:526-8.
20. VanCott TC, Mascola JR, Loomis-Price LD, Sinangil F, Zitomersky N, McNeil J, et al.
Cross-subtype neutralizing antibodies induced in baboons by a subtype E gp120 immunogen
based on an R5 primary human immunodeficiency virus type 1 envelope. Journal of
virology. 1999;73 6:4640-50.
21. Drewe JA, O’Riain MJ, Beamish E, Currie H and Parsons S. Survey of infections
transmissible between baboons and humans, Cape Town, South Africa. Emerging infectious
diseases. 2012;18 2:298.
22. Stearns-Kurosawa DJ, Lupu F, Taylor FB, Kinasewitz G and Kurosawa S. Sepsis and
pathophysiology of anthrax in a nonhuman primate model. The American journal of
pathology. 2006;169 2:433-44.
23. Khlebnikov V, Golovlev I, Zhemchugov V, Chugunov A, Averin S, Afanas' ev S, et al. The
immunological efficacy of Francisella tularensis outer membranes for hamadryas baboons.
Zhurnal mikrobiologii, epidemiologii, i immunobiologii. 1993; 3:61-4.
24. Githure JI, Reid GD, Binhazim AA, Anjili CO, Shatry AM and Hendricks LD. Leishmania
major: the suitability of East African nonhuman primates as animal models for cutaneous
leishmaniasis. Experimental parasitology. 1987;64 3:438-47.
25. Yole D, Pemberton R, Reid G and Wilson R. Protective immunity to Schistosoma mansoni
induced in the olive baboon Papio anubis by the irradiated cercaria vaccine. Parasitology.
1996;112 1:37-46.
26. Nyindo M and Farah I. The baboon as a non-human primate model of human schistosome
infection. Parasitology Today. 1999;15 12:478-82.
27. Mafuyai H, Barshep Y, Audu B, Kumbak D and Ojobe T. Baboons as potential reservoirs of
zoonotic gastrointestinal parasite infections at Yankari National Park, Nigeria. African
health sciences. 2013;13 2:252-4.
28. Prescott M. Primate sensory capabilities and communication signals: implications for care
and use in the laboratory. National Centre for the Replacement, Refinement and Reduction
of Animals in Research; 2006.
29. Boë L-J, Berthommier F, Legou T, Captier G, Kemp C, Sawallis TR, et al. Evidence of a
Vocalic Proto-System in the Baboon (Papio papio) Suggests Pre-Hominin Speech
Precursors. PloS one. 2017;12 1:e0169321.
30. Nishimura T, Mikami A, Suzuki J and Matsuzawa T. Descent of the hyoid in chimpanzees:
evolution of face flattening and speech. Journal of Human Evolution. 2006;51 3:244-54.
31. Kuhl PK and Meltzoff AN. Infant vocalizations in response to speech: Vocal imitation and
developmental change. The journal of the Acoustical Society of America. 1996;100 4:2425-
38.
32. Boë L-J, Badin P, Ménard L, Captier G, Davis B, MacNeilage P, et al. Anatomy and control
of the developing human vocal tract: A response to Lieberman. Journal of Phonetics.
2013;41 5:379-92.
33. Nowak RM. Walker's mammals of the world. JHU Press; 1999.
34. Harrison MJ. The mandrill in Gabon's rain forest—ecology, distribution and status. Oryx.
1988;22 4:218-28.
35. Hoshino J. Feeding ecology of mandrills (Mandrillus sphinx) in Campo animal reserve,
Cameroon. Primates. 1985;26 3:248-73.
36. Leigh SR, Setchell JM, Charpentier M, Knapp LA and Wickings EJ. Canine tooth size and
fitness in male mandrills (Mandrillus sphinx). Journal of Human Evolution. 2008;55 1:75-85.
37. Setchell JM and Dixson AF. Changes in the secondary sexual adornments of male mandrills
(Mandrillus sphinx) are associated with gain and loss of alpha status. Hormones and
Behavior. 2001;39 3:177-84.
38. Setchell JM and Dixson AF. Developmental variables and dominance rank in adolescent
male mandrills (Mandrillus sphinx). American journal of primatology. 2002;56 1:9-25.
39. Setchell JM, Vaglio S, Abbott KM, Moggi-Cecchi J, Boscaro F, Pieraccini G, et al. Odour
signals major histocompatibility complex genotype in an Old World monkey. Proceedings
of the Royal Society of London B: Biological Sciences. 2010:rspb20100571.
40. Setchell JM, Richards SA, Abbott KM and Knapp LA. Mate-guarding by male mandrills
(Mandrillus sphinx) is associated with female MHC genotype. Behavioral Ecology.
2016:arw106.
41. Feistner AT. Scent marking in mandrills, Mandrillus sphinx. Folia Primatologica. 1991;57
1:42-7.
42. Pandrea I, Apetrei C, Dufour J, Dillon N, Barbercheck J, Metzger M, et al. Simian
immunodeficiency virus SIVagm. sab infection of Caribbean African green monkeys: a new
model for the study of SIV pathogenesis in natural hosts. Journal of virology. 2006;80
10:4858-67.
43. Roussel M, Pontier D, Ngoubangoye B, Kazanji M, Verrier D and Fouchet D. Modes of
transmission of Simian T-lymphotropic Virus Type 1 in semi-captive mandrills (Mandrillus
sphinx). Veterinary microbiology. 2015;179 3:155-61.
44. Nerrienet E, Amouretti X, Müller-Trutwin M, Poaty-Mavoungou V, Bedjebaga I, Nguyen
HT, et al. Phylogenetic analysis of SIV and STLV type I in mandrills (Mandrillus sphinx):
indications that intracolony transmissions are predominantly the result of male-to-male
aggressive contacts. AIDS research and human retroviruses. 1998;14 9:785-96.
45. Zwick LS, Walsh TF, Barbiers R, Collins MT, Kinsel MJ and Murnane RD.
Paratuberculosis in a mandrill (Papio sphinx). Journal of Veterinary Diagnostic
Investigation. 2002;14 4:326-8.
46. O'Rourke J, Dixon M, Jack A, Enno A and Lee A. Gastric B‐cell mucosa‐associated
lymphoid tissue (MALT) lymphoma in an animal model of ‘Helicobacter
heilmannii’infection. The Journal of pathology. 2004;203 4:896-903.
47. Setchell JM, Bedjabaga I-B, Goossens B, Reed P, Wickings EJ and Knapp LA. Parasite
prevalence, abundance, and diversity in a semi-free-ranging colony of Mandrillus sphinx.
International Journal of Primatology. 2007;28 6:1345-62.
48. Ungeheuer M, Elissa N, Morelli A, Georges A, Deloron P, Debre P, et al. Cellular responses
to Loa loa experimental infection in mandrills (Mandrillus sphinx) vaccinated with
irradiated infective larvae. Parasite immunology. 2000;22 4:173-84.
49. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human
genome. Nature. 2001;409:860. doi:10.1038/35057062.
50. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature.
2005;437 7055:69-87. doi:10.1038/nature04072.
51. Rogers J and Gibbs RA. Comparative primate genomics: emerging patterns of genome
content and dynamics. Nat Rev Genet. 2014;15 5:347-59. doi:10.1038/nrg3707.
52. Prufer K, Munch K, Hellmann I, Akagi K, Miller JR, Walenz B, et al. The bonobo genome
compared with the chimpanzee and human genomes. Nature. 2012;486 7404:527-31.
doi:10.1038/nature11128.
53. Lindblad-Toh K, Garber M, Zuk O, Lin MF, Parker BJ, Washietl S, et al. A high-resolution
map of human evolutionary constraint using 29 mammals. Nature. 2011;478:476.
doi:10.1038/nature10530.
54. Locke DP, Hillier LW, Warren WC, Worley KC, Nazareth LV, Muzny DM, et al.
Comparative and demographic analysis of orang-utan genomes. Nature. 2011;469 7331:529-
33. doi:10.1038/nature09687.
55. Meyer M, Kircher M, Gansauge MT, Li H, Racimo F, Mallick S, et al. A high-coverage
genome sequence from an archaic Denisovan individual. Science. 2012;338 6104:222-6.
doi:10.1126/science.1224344.
56. The Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome
and comparison with the human genome. Nature. 2005;437 7055:69-87.
doi:10.1038/nature04072.
57. Gibbs RA, Rogers J, Katze MG, Bumgarner R, Weinstock GM, Mardis ER, et al.
Evolutionary and Biomedical Insights from the Rhesus Macaque Genome. Science.
2007;316 5822:222-34. doi:10.1126/science.1139247.
58. Scally A, Dutheil JY, Hillier LW, Jordan GE, Goodhead I, Herrero J, et al. Insights into
hominid evolution from the gorilla genome sequence. Nature. 2012;483 7388:169-75.
doi:10.1038/nature10842.
59. Gordon D, Huddleston J, Chaisson MJP, Hill CM, Kronenberg ZN, Munson KM, et al.
Long-read sequence assembly of the gorilla genome. Science. 2016;352 6281
doi:10.1126/science.aae0344.
60. Johnson ME, Viggiano L, Bailey JA, Abdul-Rauf M, Goodwin G, Rocchi M, et al. Positive
selection of a gene family during the emergence of humans and African apes. Nature.
2001;413 6855:514-9. doi:10.1038/35097067.
61. Gibbs RA, Rogers J, Katze MG, Bumgarner R, Weinstock GM, Mardis ER, et al.
Evolutionary and biomedical insights from the rhesus macaque genome. Science. 2007;316
5822:222-34.
62. Harris RA, Tardif SD, Vinar T, Wildman DE, Rutherford JN, Rogers J, et al. Evolutionary
genetics and implications of small size and twinning in callitrichine primates. Proceedings
of the National Academy of Sciences of the United States of America. 2014;111 4:1467-72.
doi:10.1073/pnas.1316037111.
63. Mailund T, Halager AE, Westergaard M, Dutheil JY, Munch K, Andersen LN, et al. A New
Isolation with Migration Model along Complete Genomes Infers Very Different Divergence
Processes among Closely Related Great Ape Species. PLoS Genetics. 2012;8 12:e1003125.
doi:10.1371/journal.pgen.1003125.
64. Burton JN, Adey A, Patwardhan RP, Qiu R, Kitzman JO and Shendure J. Chromosome-
scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat
Biotechnol. 2013;31 12:1119-25. doi:10.1038/nbt.2727.
65. Kaplan N and Dekker J. High-throughput genome scaffolding from in vivo DNA interaction
frequency. Nat Biotech. 2013;31 12:1143-7. doi:10.1038/nbt.2768.
66. Chaisson MJP, Wilson RK and Eichler EE. Genetic variation and the de novo assembly of
human genomes. Nat Rev Genet. 2015;16 11:627-40. doi:10.1038/nrg3933.
67. Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, et al.
Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the
Human Genome. Science. 2009;326 5950:289-93. doi:10.1126/science.1181369.
68. Pan_troglodytes-2.1.4 assembly. National Center for Biotechnology Information [online],
2011.
69. Prufer K, Munch K, Hellmann I, Akagi K, Miller JR, Walenz B, et al. The bonobo genome
compared with the chimpanzee and human genomes. Nature. 2012;486 7404:527-31.
doi:10.1038/nature11128.
70. Zimin AV, Cornish AS, Maudhoo MD, Gibbs RM, Zhang X, Pandey S, et al. A new rhesus
macaque assembly and annotation for next-generation sequencing analyses. Biol Direct.
2014;9 1:20. doi:10.1186/1745-6150-9-20.
71. Yan G, Zhang G, Fang X, Zhang Y, Li C, Ling F, et al. Genome sequencing and comparison
of two nonhuman primate animal models, the cynomolgus and Chinese rhesus macaques.
Nat Biotech. 2011;29 11:1019-23. doi:10.1038/nbt.1992.
72. Perry GH, Reeves D, Melsted P, Ratan A, Miller W, Michelini K, et al. A Genome
Sequence Resource for the Aye-Aye (Daubentonia madagascariensis), a Nocturnal Lemur
from Madagascar. Genome Biol Evol. 2012;4 2:126-35. doi:10.1093/gbe/evr132.
73. Warren WC, Jasinska AJ, Garcia-perez R, Svardal H, Tomlinson C, Rocchi M, et al. The
genome of the vervet (Chlorocebus aethiops sabaeus). Genome Res. 2015;
doi:10.1101/gr.192922.115.
74. Carbone L, Alan Harris R, Gnerre S, Veeramah KR, Lorente-Galdos B, Huddleston J, et al.
Gibbon genome and the fast karyotype evolution of small apes. Nature. 2014;513 7517:195-
201. doi:10.1038/nature13679.
75. The Marmoset Genome Sequencing and Analysis Consortium. The common marmoset genome provides
insight into primate biology and evolution. Nat Genet. 2014;46 8:850-7. doi:10.1038/ng.3042.
76. Schmitz J, Noll A, Raabe CA, Churakov G, Voss R, Kiefmann M, et al. Genome sequence
of the basal haplorrhine primate Tarsius syrichta reveals unusual insertions. Nature
Communications. 2016;7:12997. doi:10.1038/ncomms12997.
77. Silva JC and Kondrashov AS. Patterns in spontaneous mutation revealed by human–baboon
sequence comparison. TRENDS in Genetics. 2002;18 11:544-7.
78. VandeBerg JL, Williams-Blangero S and Tardif SD. The baboon in biomedical research.
New York: Springer; 2009.
79. Cox LA, Mahaney MC, VandeBerg JL and Rogers J. A second-generation genetic linkage
map of the baboon (Papio hamadryas) genome. Genomics. 2006;88 3:274-81.
doi:https://doi.org/10.1016/j.ygeno.2006.03.020.
80. Rogers J, Mahaney MC, Witte SM, Nair S, Newman D, Wedel S, et al. A genetic linkage
map of the baboon (Papio hamadryas) genome based on human microsatellite
polymorphisms. Genomics. 2000;67 3:237-47.
81. Voruganti VS, Tejero ME, Proffitt JM, Cole SA, Freeland-Graves JH and Comuzzie AG.
Genome-wide Scan of Plasma Cholecystokinin in Baboons Shows Linkage to Human
Chromosome 17. Obesity. 2007;15 8:2043-50. doi:10.1038/oby.2007.243.
82. Tejero ME, Voruganti VS, Proffitt JM, Curran JE, Goring HH, Johnson MP, et al. Cross-
species replication of a resistin mRNA QTL, but not QTLs for circulating levels of resistin,
in human and baboon. Heredity. 2008;101 1:60-6. doi:10.1038/hdy.2008.28.
83. Tejero ME, Cole SA, Cai G, Peebles KW, Freeland-Graves JH, Cox LA, et al. Genome-
wide scan of resistin mRNA expression in omental adipose tissue of baboons. International
Journal Of Obesity. 2004;29:406. doi:10.1038/sj.ijo.0802699.
84. Pandrea I, Onanga R, Souquiere S, Mouinga-Ondéme A, Bourry O, Makuwa M, et al.
Paucity of CD4(+) CCR5(+) T Cells May Prevent Transmission of Simian
Immunodeficiency Virus in Natural Nonhuman Primate Hosts by Breast-Feeding. Journal of
Virology. 2008;82 11:5501-9. doi:10.1128/JVI.02555-07.
85. Charpentier M, Setchell J, Prugnolle F, Knapp L, Wickings E, Peignot P, et al. Genetic
diversity and reproductive success in mandrills (Mandrillus sphinx). Proceedings of the
National Academy of Sciences of the United States of America. 2005;102 46:16723-8.
86. Gokcumen O, Tischler V, Tica J, Zhu Q, Iskow RC, Lee E, et al. Primate genome
architecture influences structural variation mechanisms and functional consequences.
Proceedings of the National Academy of Sciences. 2013;110 39:15764.
87. Cordaux R and Batzer MA. The impact of retrotransposons on human genome evolution.
Nat Rev Genet. 2009;10 10:691-703. doi:10.1038/nrg2640.
88. Marques-Bonet T, Ryder OA and Eichler EE. Sequencing primate genomes: what have we
learned? Annual review of genomics and human genetics. 2009;10:355-86.
doi:10.1146/annurev.genom.9.081307.164420.
89. She X, Horvath JE, Jiang Z, Liu G, Furey TS, Christ L, et al. The structure and evolution of
centromeric transition regions within the human genome. Nature. 2004;430:857.
doi:10.1038/nature02806.
90. Bailey JA and Eichler EE. Primate segmental duplications: crucibles of evolution, diversity
and disease. Nat Rev Genet. 2006;7 7:552-64. doi:10.1038/nrg1895.
91. Eichler EE, Budarf ML, Rocchi M, Deaven LL, Doggett NA, Baldini A, et al.
Interchromosomal duplications of the adrenoleukodystrophy locus: a phenomenon of
pericentromeric plasticity. Human molecular genetics. 1997;6 7:991-1002.
92. Riethman HC, Xiang Z, Paul S, Morse E, Hu XL, Flint J, et al. Integration of telomere
sequences with the draft human genome sequence. Nature. 2001;409 6822:948-51.
93. Linardopoulou EV, Williams EM, Fan Y, Friedman C, Young JM and Trask BJ. Human
subtelomeres are hot spots of interchromosomal recombination and segmental duplication.
Nature. 2005;437 7055:94-100.
94. Antonell A, De LORX and Perez Jurado LA. Evolutionary mechanisms shaping the
genomic structure of the Williams-Beuren syndrome chromosomal region at human 7q11.23.
Genome Research. 2005;15 9:1179.
95. Li R, Fan W, Tian G, Zhu H, He L, Cai J, et al. The sequence and de novo assembly of the
giant panda genome. Nature. 2010;463 7279:311.
96. Minoche AE, Dohm JC and Himmelbauer H. Evaluation of genomic high-throughput
sequencing data generated on Illumina HiSeq and genome analyzer systems. Genome
biology. 2011;12 11:R112.
97. Magoč T and Salzberg SL. FLASH: fast length adjustment of short reads to improve
genome assemblies. Bioinformatics. 2011;27 21:2957-63.
98. Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, et al. SOAPdenovo2: an empirically
improved memory-efficient short-read de novo assembler. Gigascience. 2012;1 1:18.
99. Tarailo-Graovac M and Chen N. Using RepeatMasker to identify repetitive elements in
genomic sequences. Current protocols in bioinformatics. 2009:4.10.1-4.10.14.
100. Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O and Walichiewicz J. Repbase
Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res. 2005;110 1-4:462-7.
101. Smit A and Hubley R. RepeatModeler Open-1.0. Repeat Masker Website. 2010.
102. Xu Z and Wang H. LTR_FINDER: an efficient tool for the prediction of full-length LTR
retrotransposons. Nucleic Acids Res. 2007;35 suppl 2:W265-W8.
103. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids
research. 1999;27 2:573.
104. Lowe TM and Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA
genes in genomic sequence. Nucleic acids research. 1997;25 5:955-64.
105. Gardner PP, Daub J, Tate JG, Nawrocki EP, Kolbe DL, Lindgreen S, et al. Rfam: updates to
the RNA families database. Nucleic acids research. 2008;37 suppl_1:D136-D40.
106. Kent WJ. BLAT—the BLAST-like alignment tool. Genome Res. 2002;12 4:656-64.
107. Birney E, Clamp M and Durbin R. GeneWise and genomewise. Genome Res. 2004;14
5:988-95.
108. Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith Jr RK, Hannick LI, et al. Improving
the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic
Acids Res. 2003;31 19:5654-66.
109. Stanke M, Keller O, Gunduz I, Hayes A, Waack S and Morgenstern B. AUGUSTUS: ab
initio prediction of alternative transcripts. Nucleic Acids Res. 2006;34 suppl 2:W435-W9.
110. Elsik CG, Mackey AJ, Reese JT, Milshina NV, Roos DS and Weinstock GM. Creating a
honey bee consensus gene set. Genome biology. 2007;8 1:R13.
111. Götz S, García-Gómez JM, Terol J, Williams TD, Nagaraj SH, Nueda MJ, et al. High-
throughput functional annotation and data mining with the Blast2GO suite. Nucleic acids
research. 2008;36 10:3420-35.
112. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths‐Jones S, et al. The Pfam
protein families database. Nucleic acids research. 2004;32 suppl_1:D138-D41.
113. Parra G, Bradnam K and Korf I. CEGMA: a pipeline to accurately annotate core genes in
eukaryotic genomes. Bioinformatics. 2007;23 9:1061-7.
114. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV and Zdobnov EM. BUSCO:
assessing genome assembly and annotation completeness with single-copy orthologs.
Bioinformatics. 2015;31 19:3210-2.
115. Guindon S, Delsuc F, Dufayard J-F and Gascuel O. Estimating maximum likelihood
phylogenies with PhyML. Bioinformatics for DNA sequence analysis. 2009:113-37.
116. Huchon D, Chevret P, Jordan U, Kilpatrick CW, Ranwez V, Jenkins PD, et al. Multiple
molecular evidences for a living mammalian fossil. Proceedings of the National Academy of
Sciences. 2007;104 18:7495-9.
117. Glazko GV and Nei M. Estimation of divergence times for major lineages of primate species.
Molecular biology and evolution. 2003;20 3:424-34.
118. Schrago C and Voloch C. The precision of the hominid timescale estimated by relaxed clock
methods. Journal of evolutionary biology. 2013;26 4:746-55.
119. De Bie T, Cristianini N, Demuth JP and Hahn MW. CAFE: a computational tool for the
study of gene family evolution. Bioinformatics. 2006;22 10:1269-71.
120. Bailey JA, Gu Z, Clark RA, Reinert K, Samonte RV, Schwartz S, et al. Recent segmental
duplications in the human genome. Science. 2002;297 5583:1003-7.
121. Takahashi H, Takahashi K and Liu F-C. FOXP genes, neural development, speech and
language disorders. Forkhead Transcription Factors. Springer; 2009. p. 117-29.
122. Heymann EW. The neglected sense–olfaction in primate behavior, ecology, and evolution.
American journal of primatology. 2006;68 6:519-24.
123. Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, et al. MEME SUITE: tools
for motif discovery and searching. Nucleic Acids Res. 2009;37 Web Server issue:W202-W8.
124. Pettersson E, Lundeberg J and Ahmadian A. Generations of sequencing technologies.
Genomics. 2009;93 2:105-11.
125. Kriegs JO, Churakov G, Jurka J, Brosius J and Schmitz J. Evolutionary history of 7SL
RNA-derived SINEs in Supraprimates. Trends Genet. 2007;23 4:158-61.
126. Raaum RL, Sterner KN, Noviello CM, Stewart C-B and Disotell TR. Catarrhine primate
divergence dates estimated from complete mitochondrial genomes: concordance with fossil
and nuclear DNA evidence. Journal of Human Evolution. 2005;48 3:237-57.
127. Steiper ME and Young NM. Primate molecular divergence dates. Molecular phylogenetics
and evolution. 2006;41 2:384-94.
128. Anzai T, Shiina T, Kimura N, Yanagiya K, Kohara S, Shigenari A, et al. Comparative
sequencing of human and chimpanzee MHC class I regions unveils insertions/deletions as
the major path to genomic divergence. Proceedings of the National Academy of Sciences.
2003;100 13:7708-13.
129. Gaudieri S, Giles KM, Kulski JK and Dawkins RL. Duplication and polymorphism in the
MHC: Alu generated diversity and polymorphism within the PERB11 gene family.
Hereditas. 1997;127 1‐2:37-46.
130. Yamazaki M, Tateno Y and Inoko H. Genomic organization around the centromeric end of
the HLA class I region: Large-scale sequence analysis. Journal of molecular evolution.
1999;48 3:317-27.
131. Konopka G, Bomar JM, Winden K, Coppola G, Jonsson ZO, Gao F, et al. Human-specific
transcriptional regulation of CNS development genes by FOXP2. Nature. 2009;462
7270:213-7.
132. Spiteri E, Konopka G, Coppola G, Bomar J, Oldham M, Ou J, et al. Identification of the
transcriptional targets of FOXP2, a gene linked to speech and language, in developing
human brain. The American Journal of Human Genetics. 2007;81 6:1144-57.
133. Burrows AM. Primate Anatomy: An Introduction. JSTOR, 2001.
134. Laidre ME. Informative breath: olfactory cues sought during social foraging among Old
World monkeys (Mandrillus sphinx, M. Leucophaeus, and Papio anubis). Journal of
Comparative Psychology. 2009;123 1:34.
135. Goto T, Salpekar A and Monk M. Expression of a testis-specific member of the olfactory
receptor gene family in human primordial germ cells. Molecular human reproduction.
2001;7 6:553-8.
136. Price AL, Jones NC and Pevzner PA. De novo identification of repeat families in large
genomes. Bioinformatics. 2005;21 suppl_1:i351-i8.
8. Appendix
Table 8.1 Statistics of baboon and mandrill clean/filtered sequencing data.
Species   Paired-end library  Average read  Clean data  Sequencing
          insert size (bp)    length (bp)   (Gb)        depth
Baboon    250                 150           88.37       29.46
          500                 100           62.69       20.9
          800                 100           52.74       17.58
          4000                90            36.79       12.26
          10000               90            43.82       14.61
Total     -                   -             284.41      94.8
Mandrill  250                 150           91.18       30.39
          500                 100           67.38       22.46
          800                 100           54.28       18.09
          2000                90            18.71       6.24
          5000                90            16.3        5.43
          10000               90            31.35       10.45
          20000               90            10.34       3.45
Total     -                   -             289.55      96.52
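The sequencing depth values in Table 8.1 are consistent with dividing the clean data of each library by a nominal genome size of 3.0 Gb. That genome size is inferred from the ratios in the table and is not stated in this appendix, so the following Python sketch treats it as an assumption; the function and variable names are illustrative only.

```python
# Clean data (Gb) per paired-end library, keyed by insert size (bp), from Table 8.1.
CLEAN_DATA_GB = {
    "Baboon": {250: 88.37, 500: 62.69, 800: 52.74, 4000: 36.79, 10000: 43.82},
    "Mandrill": {250: 91.18, 500: 67.38, 800: 54.28, 2000: 18.71,
                 5000: 16.30, 10000: 31.35, 20000: 10.34},
}

# Assumption: nominal genome size inferred from the table, not stated in the text.
ASSUMED_GENOME_SIZE_GB = 3.0


def sequencing_depth(clean_data_gb, genome_size_gb=ASSUMED_GENOME_SIZE_GB):
    """Fold coverage: sequenced bases divided by genome size."""
    return clean_data_gb / genome_size_gb


def depth_by_library(species):
    """Depth of each library of a species, rounded to two decimals."""
    return {insert: round(sequencing_depth(gb), 2)
            for insert, gb in CLEAN_DATA_GB[species].items()}


def total_depth(species):
    """Depth summed over all libraries of a species, rounded to two decimals."""
    return round(sequencing_depth(sum(CLEAN_DATA_GB[species].values())), 2)


if __name__ == "__main__":
    for name in CLEAN_DATA_GB:
        print(name, depth_by_library(name), total_depth(name))
```

Under this assumption the per-library and total depths agree with the table to within rounding.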
Table 8.2 Prediction of the repeats in baboon genome.
Prediction method Repeat size (bp) Percentage in the genome
TRF [103] 88,638,882 2.84
RepeatMasker [99] 1,317,851,115 42.28
RepeatProteinMask [72] 325,375,043 10.43
De novo [136] 1,353,584,808 43.42
Total 1,558,442,757 50.00
Table 8.3 General statistics of repeats in mandrill genome.
Type Repeat Size(bp) Percentage in the genome
TRF 87,221,621 3.03
RepeatMasker 936,130,281 32.47
RepeatProteinMask 281,888,845 9.77
De novo 1,139,310,255 39.52
Total 1,263,424,029 43.83
Table 8.4 Categories of TEs in baboon genome.
Type     RepBase TEs            TE proteins          De novo                Combined TEs
         Length (bp)    %       Length (bp)  %       Length (bp)    %       Length (bp)    %
DNA      85,821,216     2.75    13,073,154   0.42    23,596,137     0.76    102,653,655    3.29
LINE     524,428,336    16.83   267,916,887  8.60    728,970,010    23.39   907,729,496    29.12
SINE     365,541,149    11.73   --           --      488,745,536    15.68   629,613,402    20.20
LTR      246,841,510    7.92    44,428,216   1.42    358,098,539    11.49   522,321,826    16.76
Other    979            --      --           --      --             0       979            0
Unknown  1,296,802      0.04    --           --      495,265        0.02    1,791,826      0.06
Total    1,317,851,115  42.28   325,375,043  10.44   1,281,034,875  41.10   1,465,054,716  47.01
Note: RepBase TEs, RepeatMasker results based on Repbase; TE proteins, RepeatProteinMask results
based on Repbase; De novo, RepeatMasker results using a library built by de novo prediction;
Combined TEs, the combined results of RepBase TEs, TE proteins and de novo.
Table 8.5 Categories of TEs in mandrill genome.
Type     RepBase TEs            TE proteins          De novo                Combined TEs
         Length (bp)    %       Length (bp)  %       Length (bp)    %       Length (bp)    %
DNA      47,923,460     1.66    13,264,158   0.46    27,821,997     0.96    68,516,869     2.37
LINE     401,922,498    13.9    229,014,482  7.94    725,287,701    25.1    815,296,990    28.28
SINE     319,811,862    11.0    --           --      481,314,186    16.6    576,217,301    19.99
LTR      169,184,719    5.87    39,705,383   1.38    80,223,826     2.78    200,629,837    6.96
Other    81             --      --           --      3,210          0       3,291          0
Unknown  --             --      --           --      2,897,396      0.1     2,897,396      0.1
Total    936,130,281    32.4    281,888,845  9.78    1,117,858,141  38.78   121,695,029    42.22
Note: RepBase TEs, RepeatMasker results based on Repbase; TE proteins, RepeatProteinMask results
based on Repbase; De novo, RepeatMasker results using a library built by de novo prediction;
Combined TEs, the combined results of RepBase TEs, TE proteins and de novo.
Table 8.6 Non-coding RNA genes in baboon genome.
Type   Subtype   Copy   Average length (bp)  Total length (bp)  % of genome
tRNA             510    75.26                38,384             0.12
rRNA             1,200  101.38               121,666            0.39
rRNA   18S       136    136.05               18,503             0.06
       28S       288    155.67               44,833             0.14
       5.8S      17     89.94                1,529              0.005
       5S        759    74.84                56,801             0.18
snRNA            2,812  110.58               310,963            0.99
snRNA  CD-box    900    102.03               91,824             0.29
       HACA-box  324    135.44               43,881             0.14
       splicing  1,322  118.04               156,045            0.50
Table 8.7 Non-coding RNA genes in mandrill genome.
Type   Subtype   Copy   Average length (bp)  Total length (bp)  % of genome
tRNA             466    75.36                35,118             0.12
rRNA             982    97.05                95,301             0.33
rRNA   18S       20     252.6                5,052              0.02
       28S       205    160.49               32,902             0.11
       5.8S      8      103.87               831                0.00
       5S        749    75.45                56,516             0.19
snRNA            2,716  110.76               300,830            1.04
snRNA  CD-box    880    101.76               89,547             0.31
       HACA-box  314    136.82               42,963             0.15
       splicing  1,261  118.27               149,146            0.52
Table 8.8 GO enrichment of unique gene families in baboon.
GO ID GO term GO class P value
GO:0090266 regulation of mitotic cell cycle spindle assembly checkpoint BP 1.60E-03
GO:0048478 replication fork protection BP 7.78E-03
GO:0007416 synapse assembly BP 1.38E-02
GO:0006749 glutathione metabolic process BP 2.34E-02
GO:0007018 microtubule-based movement BP 2.37E-02
GO:0006694 steroid biosynthetic process BP 3.92E-02
GO:0042773 ATP synthesis coupled electron transport BP 1.28E-12
GO:0055114 oxidation-reduction process BP 7.42E-06
GO:0005680 anaphase-promoting complex CC 1.05E-03
GO:0004957 prostaglandin E receptor activity MF 1.15E-02
GO:0003840 gamma-glutamyltransferase activity MF 1.55E-02
GO:0003854 3-beta-hydroxy-delta5-steroid dehydrogenase activity MF 2.06E-02
GO:0003777 microtubule motor activity MF 2.09E-02
GO:0016491 oxidoreductase activity MF 1.57E-06
GO:0016651 oxidoreductase activity, acting on NADH or NADPH MF 1.68E-09
Note: BP stands for biological process, CC stands for cellular component, MF stands for molecular function.
Table 8.9 GO enrichment of unique gene families in mandrill.
GO ID GO term GO class P value
GO:0044260 cellular macromolecule metabolic process BP 2.02E-04
GO:0043170 macromolecule metabolic process BP 6.18E-04
GO:0009987 cellular process BP 1.32E-02
GO:0008152 metabolic process BP 1.88E-02
GO:0044238 primary metabolic process BP 2.69E-02
GO:0044237 cellular metabolic process BP 3.97E-02
GO:0034645 cellular macromolecule biosynthetic process BP 1.02E-10
GO:0019538 protein metabolic process BP 1.86E-08
GO:0010467 gene expression BP 3.62E-11
GO:0044267 cellular protein metabolic process BP 4.79E-10
GO:0006412 translation BP 6.29E-33
GO:0007186 G-protein coupled receptor signaling pathway BP 9.02E-06
GO:0043229 intracellular organelle CC 1.49E-04
GO:0005622 intracellular CC 3.44E-04
GO:0044391 ribosomal subunit CC 4.00E-04
GO:0005912 adherens junction CC 2.86E-03
GO:0044464 cell part CC 2.92E-03
GO:0044424 intracellular part CC 5.80E-03
GO:0015934 large ribosomal subunit CC 1.70E-02
GO:0015935 small ribosomal subunit CC 4.19E-02
GO:0005840 ribosome CC 1.57E-35
GO:0005737 cytoplasm CC 1.73E-11
GO:0044444 cytoplasmic part CC 2.34E-15
GO:0032991 macromolecular complex CC 3.29E-10
GO:0043232 intracellular non-membrane-bounded organelle CC 6.11E-19
GO:0004888 transmembrane signaling receptor activity MF 1.11E-04
GO:0004871 signal transducer activity MF 1.49E-04
GO:0004930 G-protein coupled receptor activity MF 1.49E-04
GO:0045296 cadherin binding MF 9.17E-04
GO:0004807 triose-phosphate isomerase activity MF 1.55E-02
GO:0003735 structural constituent of ribosome MF 1.57E-35
GO:0005198 structural molecule activity MF 4.23E-29
GO:0004984 olfactory receptor activity MF 9.21E-08
Note: BP stands for biological process, CC stands for cellular component, MF stands for molecular function.
Table 8.10 GO enrichment result of unique gene families for mandrill.
GO ID GO term GO class P value
GO:0006412 translation BP 4.60E-03
GO:0006935 chemotaxis BP 1.84E-02
GO:0040011 locomotion BP 1.84E-02
GO:0009605 response to external stimulus BP 4.71E-02
GO:0005840 ribosome CC 4.60E-03
GO:0005737 cytoplasm CC 1.12E-02
GO:0044444 cytoplasmic part CC 1.84E-02
GO:0030529 ribonucleoprotein complex CC 2.36E-02
GO:0001594 trace-amine receptor activity MF 2.75E-04
GO:0016493 C-C chemokine receptor activity MF 7.93E-04
GO:0003735 structural constituent of ribosome MF 4.60E-03
GO:0004896 cytokine receptor activity MF 4.60E-03
GO:0008528 G-protein coupled peptide receptor activity MF 4.60E-03
GO:0005198 structural molecule activity MF 1.84E-02
GO:0004950 chemokine receptor activity MF 8.96E-06
Note: BP stands for biological process, CC stands for cellular component, MF stands for molecular function.
Table 8.11 Repeat content of MHC class I region for mandrill and human.
Type              Mandrill                               Human
                  Copy number  Length (bp)  Percent (%)  Copy number  Length (bp)  Percent (%)
DNA/Crypton-V     1            65           0.00         0            0            0.00
DNA/DNA           3            179          0.01         2            126          0.01
DNA/Helitron      1            363          0.02         1            322          0.02
DNA/Maverick      0            0            0.00         1            44           0.00
DNA/Sola          0            0            0.00         1            69           0.00
DNA/MULE-MuDR     2            141          0.01         0            0            0.00
DNA/TcMar-Tc1     1            187          0.01         1            183          0.01
DNA/TcMar-Tigge   12           3,673        0.20         0            0            0.00
DNA/TcMar-Tigger  26           10,177       0.56         27           11,980       0.63
DNA/hAT           2            184          0.01         1            174          0.01
DNA/hAT-Charlie   38           9,586        0.52         46           9,941        0.52
DNA/hAT-Tip100    9            2,016        0.11         6            839          0.04
LINE/CR1          4            772          0.04         4            771          0.04
LINE/Jockey       0            0            0.00         1            57           0.00
LINE/L1           754          340,504      18.62        759          406,157      21.26
LINE/L1-Tx1       1            142          0.01         0            0            0.00
LINE/L2           38           10,351       0.57         32           8,906        0.47
LINE/RTE-X        2            290          0.02         2            302          0.02
LTR/Copia         1            92           0.01         0            0            0.00
LTR/ERV1          146          80,428       4.40         126          77,703       4.07
LTR/ERVK          22           10,368       0.57         34           29,679       1.55
LTR/ERVL          171          91,775       5.02         207          123,209      6.45
LTR/ERVL-MaLR     109          35,671       1.95         81           27,654       1.45
LTR/Gypsy         2            170          0.01         1            67           0.00
LTR/LTR           3            728          0.04         1            170          0.01
SINE/7SL          5            338          0.02         9            399          0.02
SINE/Alu          947          276,532      15.12        806          267,594      14.01
SINE/B4           27           1,505        0.08         24           895          0.05
SINE/MIR          43           5,880        0.32         44           6,508        0.34
SINE/tRNA-7SL     10           637          0.03         8            836          0.04
SINE/tRNA-RTE     1            121          0.01         1            121          0.01
All               2,381        882,875      48.27        2,226        974,706      51.03
Table 8.12 GO and KEGG enrichment of the positively selected genes (PSGs).
GO ID GO Term GO Class Adjusted P-value
GO:0016301 kinase activity MF 6.62E-10
GO:0016772 transferase activity, transferring phosphorus-containing groups MF 1.18E-09
GO:0016773 phosphotransferase activity, alcohol group as acceptor MF 1.18E-09
GO:0003824 catalytic activity MF 2.03E-09
GO:0005524 ATP binding MF 3.42E-09
GO:0004672 protein kinase activity MF 3.42E-09
GO:0032559 adenyl ribonucleotide binding MF 3.80E-09
GO:0030554 adenyl nucleotide binding MF 4.73E-09
GO:0016740 transferase activity MF 1.76E-08
GO:0005515 protein binding MF 3.48E-08
GO:0004713 protein tyrosine kinase activity MF 3.64E-08
GO:0016310 phosphorylation BP 3.85E-08
GO:0006468 protein phosphorylation BP 7.96E-08
GO:0035639 purine ribonucleoside triphosphate binding MF 8.16E-07
GO:0036094 small molecule binding MF 8.22E-07
GO:0032553 ribonucleotide binding MF 9.04E-07
GO:0032555 purine ribonucleotide binding MF 9.04E-07
GO:0017076 purine nucleotide binding MF 1.25E-06
GO:0000166 nucleotide binding MF 1.40E-06
GO:0006793 phosphorus metabolic process BP 1.24E-05
GO:0006796 phosphate-containing compound metabolic process BP 1.24E-05
GO:0009452 RNA capping BP 2.66E-05
GO:0007626 locomotory behavior BP 4.05E-05
GO:0005488 binding MF 5.19E-05
GO:0007155 cell adhesion BP 5.68E-05
GO:0022610 biological adhesion BP 5.68E-05
GO:0008374 O-acyltransferase activity MF 8.68E-05
GO:0043412 macromolecule modification BP 8.73E-05
GO:0006464 protein modification process BP 0.000113
GO:0004525 ribonuclease III activity MF 0.00015
GO:0000123 histone acetyltransferase complex CC 0.000275
GO:0004252 serine-type endopeptidase activity MF 0.000396
GO:0030507 spectrin binding MF 0.000399
GO:0006508 proteolysis BP 0.000485
GO:0070011 peptidase activity, acting on L-amino acid peptides MF 0.000639
GO:0004177 aminopeptidase activity MF 0.000665
GO:0008233 peptidase activity MF 0.000665
GO:0046777 protein autophosphorylation BP 0.000767
GO:0005802 trans-Golgi network CC 0.000848
GO:0005768 endosome CC 0.000848
GO:0005516 calmodulin binding MF 0.001068
GO:0004842 ubiquitin-protein ligase activity MF 0.001317
GO:0017016 Ras GTPase binding MF 0.001317
GO:0031267 small GTPase binding MF 0.001433
GO:0051020 GTPase binding MF 0.001433
GO:0016881 acid-amino acid ligase activity MF 0.002103
GO:0016747 transferase activity, transferring acyl groups other than amino-acyl groups MF 0.003245
GO:0004175 endopeptidase activity MF 0.003245
GO:0008236 serine-type peptidase activity MF 0.003313
GO:0017171 serine hydrolase activity MF 0.003313
GO:0019787 small conjugating protein ligase activity MF 0.003335
GO:0016787 hydrolase activity MF 0.003374
GO:0008238 exopeptidase activity MF 0.003374
GO:0070461 SAGA-type complex CC 0.003374
GO:0070566 adenylyltransferase activity MF 0.003374
GO:0042558 pteridine-containing compound metabolic process BP 0.004351
GO:0050660 flavin adenine dinucleotide binding MF 0.004532
GO:0007610 behavior BP 0.004535
GO:0004402 histone acetyltransferase activity MF 0.00476
GO:0006370 mRNA capping BP 0.005009
GO:0008174 mRNA methyltransferase activity MF 0.005009
GO:0009057 macromolecule catabolic process BP 0.005217
GO:0019199 transmembrane receptor protein kinase activity MF 0.005929
GO:0015291 secondary active transmembrane transporter activity MF 0.006141
GO:0008217 regulation of blood pressure BP 0.006384
GO:0014706 striated muscle tissue development BP 0.006384
GO:0060537 muscle tissue development BP 0.006384
GO:0005887 integral to plasma membrane CC 0.006872
GO:0031226 intrinsic to plasma membrane CC 0.006872
GO:0051345 positive regulation of hydrolase activity BP 0.00701
GO:0000910 cytokinesis BP 0.008254
GO:0004568 chitinase activity MF 0.009118
GO:0006032 chitin catabolic process BP 0.009118
GO:0045335 phagocytic vesicle CC 0.009118
GO:0055037 recycling endosome CC 0.009118
GO:0030318 melanocyte differentiation BP 0.009118
GO:0017049 GTP-Rho binding MF 0.009118
GO:2000114 regulation of establishment of cell polarity BP 0.009118
GO:0008344 adult locomotory behavior BP 0.009118
GO:0043966 histone H3 acetylation BP 0.009118
GO:0017034 Rap guanyl-nucleotide exchange factor activity MF 0.009118
GO:0004534 5'-3' exoribonuclease activity MF 0.009118
GO:0030914 STAGA complex CC 0.009118
GO:0008460 dTDP-glucose 4,6-dehydratase activity MF 0.009118
GO:0004909 interleukin-1, Type I, activating receptor activity MF 0.009118
GO:0004334 fumarylacetoacetase activity MF 0.009118
GO:0004349 glutamate 5-kinase activity MF 0.009118
GO:0004350 glutamate-5-semialdehyde dehydrogenase activity MF 0.009118
GO:0043550 regulation of lipid kinase activity BP 0.009118
GO:0070772 PAS complex CC 0.009118
GO:0003919 FMN adenylyltransferase activity MF 0.009118
GO:0006747 FAD biosynthetic process BP 0.009118
GO:0008609 alkylglycerone-phosphate synthase activity MF 0.009118
GO:0004336 galactosylceramidase activity MF 0.009118
GO:0006683 galactosylceramide catabolic process BP 0.009118
GO:0008611 ether lipid biosynthetic process BP 0.009118
GO:0016287 glycerone-phosphate O-acyltransferase activity MF 0.009118
GO:0006516 glycoprotein catabolic process BP 0.009118
GO:0008705 methionine synthase activity MF 0.009118
GO:0008898 homocysteine S-methyltransferase activity MF 0.009118
GO:0010739 positive regulation of protein kinase A signaling cascade BP 0.009118
GO:0090036 regulation of protein kinase C signaling cascade BP 0.009118
GO:0005137 interleukin-5 receptor binding MF 0.009118
GO:0048280 vesicle fusion with Golgi apparatus BP 0.009118
GO:0008488 gamma-glutamyl carboxylase activity MF 0.009118
GO:0017187 peptidyl-glutamic acid carboxylation BP 0.009118
GO:0006348 chromatin silencing at telomere BP 0.009118
GO:0004375 glycine dehydrogenase (decarboxylating) activity MF 0.009118
GO:0006546 glycine catabolic process BP 0.009118
GO:0004483 mRNA (nucleoside-2'-O-)-methyltransferase activity MF 0.009118
GO:0080009 mRNA methylation BP 0.009118
GO:0050902 leukocyte adhesive activation BP 0.009118
GO:0048066 developmental pigmentation BP 0.009118
GO:0050931 pigment cell differentiation BP 0.009118
GO:0032878 regulation of establishment or maintenance of cell polarity BP 0.009118
GO:0019202 amino acid kinase activity MF 0.009118
GO:0046443 FAD metabolic process BP 0.009118
GO:0072387 flavin adenine dinucleotide metabolic process BP 0.009118
GO:0072388 flavin adenine dinucleotide biosynthetic process BP 0.009118
GO:0006681 galactosylceramide metabolic process BP 0.009118
GO:0019374 galactolipid metabolic process BP 0.009118
GO:0019376 galactolipid catabolic process BP 0.009118
GO:0046485 ether lipid metabolic process BP 0.009118
GO:0016413 O-acetyltransferase activity MF 0.009118
GO:0042084 5-methyltetrahydrofolate-dependent methyltransferase activity MF 0.009118
GO:0070528 protein kinase C signaling cascade BP 0.009118
GO:0018214 protein carboxylation BP 0.009118
GO:0016642 oxidoreductase activity, acting on the CH-NH2 group of donors, disulfide as acceptor MF 0.009118
GO:0009071 serine family amino acid catabolic process BP 0.009118
GO:0016556 mRNA modification BP 0.009118
GO:0045123 cellular extravasation BP 0.009118
GO:0017137 Rab GTPase binding MF 0.009356
GO:0006030 chitin metabolic process BP 0.009521
GO:0016891 endoribonuclease activity, producing 5'-phosphomonoesters MF 0.009521
GO:0015103 inorganic anion transmembrane transporter activity MF 0.010851
GO:0007605 sensory perception of sound BP 0.012081
GO:0003714 transcription corepressor activity MF 0.012081
GO:0050954 sensory perception of mechanical stimulus BP 0.012081
GO:0007067 mitosis BP 0.016642
GO:0000280 nuclear division BP 0.016642
GO:0044431 Golgi apparatus part CC 0.017318
GO:0006725 cellular aromatic compound metabolic process BP 0.017468
GO:0005452 inorganic anion exchanger activity MF 0.018637
GO:0016055 Wnt receptor signaling pathway BP 0.020776
GO:0070588 calcium ion transmembrane transport BP 0.020776
GO:0004540 ribonuclease activity MF 0.020776
GO:0000226 microtubule cytoskeleton organization BP 0.020776
GO:0008271 secondary active sulfate transmembrane transporter activity MF 0.020776
GO:0008272 sulfate transport BP 0.020776
GO:0015116 sulfate transmembrane transporter activity MF 0.020776
GO:0042813 Wnt-activated receptor activity MF 0.020776
GO:0016573 histone acetylation BP 0.020776
GO:0048193 Golgi vesicle transport BP 0.020776
GO:0030574 collagen catabolic process BP 0.020776
GO:0090382 phagosome maturation BP 0.020776
GO:0045670 regulation of osteoclast differentiation BP 0.020776
GO:0046920 alpha-(1->3)-fucosyltransferase activity MF 0.020776
GO:0034450 ubiquitin-ubiquitin ligase activity MF 0.020776
GO:0008124 4-alpha-hydroxytetrahydrobiopterin dehydratase activity MF 0.020776
GO:0034435 cholesterol esterification BP 0.020776
GO:0034736 cholesterol O-acyltransferase activity MF 0.020776
GO:0006919 activation of cysteine-type endopeptidase activity involved in apoptotic process BP 0.020776
GO:0032963 collagen metabolic process BP 0.020776
GO:0044236 multicellular organismal metabolic process BP 0.020776
GO:0044243 multicellular organismal catabolic process BP 0.020776
GO:0044259 multicellular organismal macromolecule metabolic process BP 0.020776
GO:0002761 regulation of myeloid leukocyte differentiation BP 0.020776
GO:0030316 osteoclast differentiation BP 0.020776
GO:0045637 regulation of myeloid cell differentiation BP 0.020776
GO:0034433 steroid esterification BP 0.020776
GO:0034434 sterol esterification BP 0.020776
GO:0004772 sterol O-acyltransferase activity MF 0.020776
GO:0010950 positive regulation of endopeptidase activity BP 0.020776
GO:0010952 positive regulation of peptidase activity BP 0.020776
GO:0043280 positive regulation of cysteine-type endopeptidase activity involved in apoptotic process BP 0.020776
GO:0097202 activation of cysteine-type endopeptidase activity BP 0.020776
GO:2001056 positive regulation of cysteine-type endopeptidase activity BP 0.020776
GO:0008305 integrin complex CC 0.020935
GO:0007167 enzyme linked receptor protein signaling pathway BP 0.021323
GO:0004675 transmembrane receptor protein serine/threonine kinase activity MF 0.022843
GO:0016050 vesicle organization BP 0.022843
GO:0016337 cell-cell adhesion BP 0.023783
GO:0000087 M phase of mitotic cell cycle BP 0.023909
GO:0051301 cell division BP 0.025184
GO:0004553 hydrolase activity, hydrolyzing O-glycosyl compounds MF 0.025409
GO:0048037 cofactor binding MF 0.026148
GO:0048856 anatomical structure development BP 0.026148
GO:0030097 hemopoiesis BP 0.026288
GO:0006475 internal protein amino acid acetylation BP 0.026288
GO:0018393 internal peptidyl-lysine acetylation BP 0.026288
GO:0018394 peptidyl-lysine acetylation BP 0.026288
GO:0008237 metallopeptidase activity MF 0.028813
GO:0048285 organelle fission BP 0.028832
GO:0015301 anion:anion antiporter activity MF 0.030496
GO:0043085 positive regulation of catalytic activity BP 0.030496
GO:0001510 RNA methylation BP 0.030496
GO:0048534 hemopoietic or lymphoid organ development BP 0.030496
GO:0006473 protein acetylation BP 0.030496
GO:0004521 endoribonuclease activity MF 0.030496
GO:0004712 protein serine/threonine/tyrosine kinase activity MF 0.030496
GO:0043473 pigmentation BP 0.030496
GO:0017080 sodium channel regulator activity MF 0.030496
GO:0004948 calcitonin receptor activity MF 0.030496
GO:0046373 L-arabinose metabolic process BP 0.030496
GO:0046556 alpha-N-arabinofuranosidase activity MF 0.030496
GO:0004962 endothelin receptor activity MF 0.030496
GO:0048484 enteric nervous system development BP 0.030496
GO:0070776 MOZ/MORF histone acetyltransferase complex CC 0.030496
GO:0042577 lipid phosphatase activity MF 0.030496
GO:0004822 isoleucine-tRNA ligase activity MF 0.030496
GO:0006428 isoleucyl-tRNA aminoacylation BP 0.030496
GO:0019236 response to pheromone BP 0.030496
GO:0080025 phosphatidylinositol-3,5-bisphosphate binding MF 0.030496
GO:0032777 Piccolo NuA4 histone acetyltransferase complex CC 0.030496
GO:0000103 sulfate assimilation BP 0.030496
GO:0004020 adenylylsulfate kinase activity MF 0.030496
GO:0004781 sulfate adenylyltransferase (ATP) activity MF 0.030496
GO:0051018 protein kinase A binding MF 0.030496
GO:0017025 TBP-class protein binding MF 0.030496
GO:0034454 microtubule anchoring at centrosome BP 0.030496
GO:0008250 oligosaccharyltransferase complex CC 0.030496
GO:0005315 inorganic phosphate transmembrane transporter activity MF 0.030496
GO:0034599 cellular response to oxidative stress BP 0.030496
GO:0090307 spindle assembly involved in mitosis BP 0.030496
GO:0004666 prostaglandin-endoperoxide synthase activity MF 0.030496
GO:0019371 cyclooxygenase pathway BP 0.030496
GO:0043141 ATP-dependent 5'-3' DNA helicase activity MF 0.030496
GO:0030139 endocytic vesicle CC 0.030496
GO:0002573 myeloid leukocyte differentiation BP 0.030496
GO:0030010 establishment of cell polarity BP 0.030496
GO:0030534 adult behavior BP 0.030496
GO:0019566 arabinose metabolic process BP 0.030496
GO:0048483 autonomic nervous system development BP 0.030496
GO:0070775 H3 histone acetyltransferase complex CC 0.030496
GO:0004779 sulfate adenylyltransferase activity MF 0.030496
GO:0006677 glycosylceramide metabolic process BP 0.030496
GO:0046477 glycosylceramide catabolic process BP 0.030496
GO:0046514 ceramide catabolic process BP 0.030496
GO:0046521 sphingoid catabolic process BP 0.030496
GO:0010737 protein kinase A signaling cascade BP 0.030496
GO:0010738 regulation of protein kinase A signaling cascade BP 0.030496
GO:0072393 microtubule anchoring at microtubule organizing center BP 0.030496
GO:0019369 arachidonic acid metabolic process BP 0.030496
GO:0006342 chromatin silencing BP 0.030496
GO:0045814 negative regulation of gene expression, epigenetic BP 0.030496
GO:0003684 damaged DNA binding MF 0.032579
GO:0030163 protein catabolic process BP 0.0328
GO:0007596 blood coagulation BP 0.03529
GO:0007599 hemostasis BP 0.03529
GO:0006629 lipid metabolic process BP 0.036449
GO:0016569 covalent chromatin modification BP 0.037639
GO:0016570 histone modification BP 0.037639
GO:0043547 positive regulation of GTPase activity BP 0.039103
GO:0050817 coagulation BP 0.041062
GO:0002520 immune system development BP 0.043864
GO:0006026 aminoglycan catabolic process BP 0.043864
GO:0004702 receptor signaling protein serine/threonine kinase activity MF 0.043864
GO:0008235 metalloexopeptidase activity MF 0.043864
GO:0043235 receptor complex CC 0.043864
GO:0016407 acetyltransferase activity MF 0.043864
GO:0043414 macromolecule methylation BP 0.043937
GO:0048731 system development BP 0.044356
GO:0006895 Golgi to endosome transport BP 0.044356
GO:0000186 activation of MAPKK activity BP 0.044356
GO:0006729 tetrahydrobiopterin biosynthetic process BP 0.044356
GO:0030099 myeloid cell differentiation BP 0.044356
GO:0016822 hydrolase activity, acting on acid carbon-carbon bonds MF 0.044356
GO:0016823 hydrolase activity, acting on acid carbon-carbon bonds, in ketonic substances MF 0.044356
GO:0046146 tetrahydrobiopterin metabolic process BP 0.044356
GO:0006687 glycosphingolipid metabolic process BP 0.044356
GO:0019377 glycolipid catabolic process BP 0.044356
GO:0046479 glycosphingolipid catabolic process BP 0.044356
GO:0046504 glycerol ether biosynthetic process BP 0.044356
GO:0006906 vesicle fusion BP 0.044356
GO:0043281 regulation of cysteine-type endopeptidase activity involved in apoptotic process BP 0.044356
GO:2000116 regulation of cysteine-type endopeptidase activity BP 0.044356
GO:0005975 carbohydrate metabolic process BP 0.046346
Map ID    Map Title                                     Adjusted P-value  Gene IDs
map00630  Glyoxylate and dicarboxylate metabolism       0.025008          Masph06057 Paanu10824 Masph03104 Paanu04268 Paanu04371 Masph01783 Paanu09927 Masph17981 Paanu10721 Masph08262
map00525  NA                                            0.025008          Masph19208 Paanu12892
map01055  Biosynthesis of vancomycin group antibiotics  0.025008          Masph19208 Paanu12892
map00523  Polyketide sugar unit biosynthesis            0.025008          Masph19208 Paanu12892
map04113  Meiosis - yeast                               0.025008          Masph15120 Paanu15224
map04141  NA                                            0.025008          Paanu10424 Masph02116 Paanu18656 Masph17762 Masph17390 Paanu15832
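Table 8.12 reports adjusted P-values, but this appendix does not name the correction method. As a minimal sketch, assuming a Benjamini-Hochberg step-up adjustment (an assumption, not confirmed by the text), the following Python function shows how raw P-values could be adjusted and how runs of identical adjusted values, such as the many repeated entries in the table, can arise. The function name is illustrative.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order.

    Assumption: this is one common correction; the table does not state
    which method was actually used.
    """
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted values are capped at 1
    # Walk from the largest P-value down so each adjusted value is the
    # minimum of p * m / rank over all equal or larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

With the four raw values above, the adjusted values are approximately 0.02, 0.04, 0.04 and 0.02: the monotonicity step makes neighbouring terms share one adjusted value.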
Figure 8.1 The distribution of 17-mer frequency of baboon and mandrill. Major and secondary
peaks of the frequency distribution are indicated by arrows for baboon (a) and mandrill (b).
(Panels a and b. x-axis: K-mer frequency, 0 to 100; y-axis: Percentage (%), 0 to 4.)
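A distribution like the one in Figure 8.1 can be illustrated in miniature with a short Python sketch. This is not the tool used in the thesis: it counts k-mers naively, does not merge reverse complements, and assumes that the y-axis is the percentage of distinct k-mers observed at each frequency. The genome size estimate (total k-mer occurrences divided by the depth of the main peak) is the usual textbook approximation, and all function names are illustrative.

```python
from collections import Counter


def kmer_counts(reads, k=17):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def frequency_distribution(counts):
    """Map each frequency to the percentage of distinct k-mers seen that often."""
    histogram = Counter(counts.values())
    total = sum(histogram.values())
    return {freq: 100.0 * n / total for freq, n in sorted(histogram.items())}


def estimate_genome_size(counts, peak_depth):
    """Approximate genome size: total k-mer occurrences / main-peak depth."""
    return sum(counts.values()) / peak_depth


if __name__ == "__main__":
    toy_counts = kmer_counts(["ACGTACGT"], k=4)
    print(frequency_distribution(toy_counts))
    print(estimate_genome_size(toy_counts, peak_depth=1))
```

In real data the main peak depth would be read off the plotted distribution, and a secondary peak at about half that depth is commonly attributed to heterozygous sites.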
Figure 8.2 Collinearity analysis of chromosome 3 for mandrill. The orange lines represent gene pairs.
Figure 8.3 Sequencing depth and positional relationships of paired-end reads in the MHC class I
region of mandrill.
Figure 8.4 OR7E24 genes on chromosome 19 in mandrill.