www.sciencemag.org/cgi/content/full/344/6188/1168/DC1
Supplementary Materials for
The sheep genome illuminates biology of the rumen and lipid metabolism
Yu Jiang, Min Xie, Wenbin Chen, Richard Talbot, Jillian F. Maddox, Thomas Faraut, Chunhua Wu, Donna M. Muzny, Yuxiang Li, Wenguang Zhang, Jo-Ann Stanton,
Rudiger Brauning, Wesley C. Barris, Thibaut Hourlier, Bronwen L. Aken, Stephen M. J. Searle, David L. Adelson, Chao Bian, Graham R. Cam, Yulin Chen,
Shifeng Cheng, Udaya DeSilva, Karen Dixen, Yang Dong, Guangyi Fan, Ian R. Franklin, Shaoyin Fu, Pablo Fuentes-Utrilla, Rui Guan, Margaret A. Highland, Michael E. Holder, Guodong Huang, Aaron B. Ingham, Shalini N. Jhangiani, Divya Kalra, Christie L. Kovar,
Sandra L. Lee, Weiqing Liu, Xin Liu, Changxin Lu, Tian Lv, Tittu Mathew, Sean McWilliam, Moira Menzies, Shengkai Pan, David Robelin, Bertrand Servin,
David Townley, Wenliang Wang, Bin Wei, Stephen N. White, Xinhua Yang, Chen Ye, Yaojing Yue, Peng Zeng, Qing Zhou, Jacob B. Hansen, Karsten Kristiansen,
Richard A. Gibbs, Paul Flicek, Christopher C. Warkup, Huw E. Jones, V. Hutton Oddy, Frank W. Nicholas, John C. McEwan, James W. Kijas, Jun Wang, Kim C. Worley,* Alan L. Archibald,* Noelle Cockett,* Xun Xu,* Wen Wang,* Brian P. Dalrymple*
*To whom correspondence should be addressed. E-mail: [email protected] (B.P.D.); [email protected] (W.W.); [email protected] (X.X.); [email protected]
(A.L.A.); [email protected] (K.C.W.); [email protected] (N.C.)
Published 6 June 2014, Science 344, 1168 (2014)
DOI: 10.1126/science.1252806
This PDF file includes:
Materials and Methods
Supplementary Text
Figs. S1 to S10, S12 to S33
Tables S1, S4, S8, S12 to S24, S27
Other Supplementary Material for this manuscript includes the following: (available at www.sciencemag.org/cgi/content/full/344/6188/1168/DC1)
Fig. S11
Tables S2, S3, S5 to S7, S9 to S11, S25, S26
Materials and Methods
1.1 DNA sample preparation and sequencing
The data for the sheep reference genome were generated at two sequencing facilities
in late 2009. The Texel breed was selected because Texel is a popular terminal sire breed
in several countries. In addition, the Texel breed served as one of the paternal grandsire
breeds of the sheep International Mapping Flock (IMF) (26).
Together with BGI-Shenzhen, the Kunming Institute of Zoology commissioned the
whole-genome shotgun sequencing and de novo assembly for a 6-month-old Texel ewe,
which was provided by Dr. Jacob B. Hansen of the University of Copenhagen. The
genomic DNA was isolated from liver tissue by standard molecular biology techniques,
and then sequenced using the Genome Analyzer II sequencer (Illumina, CA, USA). For
small insert size libraries, 161 Gb of 101 bp paired-end reads from 7 libraries with insert
sizes ranging from 180 to 800 bp were generated (table S1). In addition, 60 Gb of 45 bp
mate-paired reads were generated from libraries with average insert sizes of 2 kb, 5 kb, 9
kb and 17 kb (table S1) with several additional steps of DNA circularization, digestion of
linear DNA, fragmentation of circularized DNA, and purification of biotinylated DNA.
To read through higher GC content, we also generated 2 Gb MeDIP-seq (see section 1.3).
Simultaneously, the ARK-Genomics Center for Comparative and Functional
Genomics at The Roslin Institute sequenced a single inbred Texel ram (about 14%
inbreeding over the last 5 generations of mating), which was used previously as the DNA
source for the CHORI-243 BAC library and the virtual sheep genome (27). A total of 149
Gb of 101 bp Illumina paired-end sequences from two libraries with insert sizes of 200
bp and 500 bp were obtained (table S1). In addition, 5 Gb of large insert size (8 kb and
20 kb) mate-paired reads from the same male Texel were generated at the Baylor College
of Medicine Human Genome Sequence Center (BCM-HGSC) using 454 technology
(table S1). An additional ~21-fold coverage of mate-paired read data from The Roslin
Institute for the male Texel was also used to fill gaps (table S1), especially those located
in GC-rich regions. The mate-paired library was prepared using a combination of protocols:
the 8 kb Roche mate-paired kit primers were used following the Roche instructions until
the circularized DNA was sheared and the biotinylated DNA fragments were captured, then
Illumina-specific primers were added and the library was processed for Illumina
sequencing. The new Illumina TruSeq SBS v3 reagents and TruSeq PE v3 cluster kits for
the HiSeq 2000 (Illumina, CA, USA) were used, which Illumina claims lead to less GC bias
than the earlier kits used on the Illumina GA IIx instrument. Another 9 Gb of 454 reads and 0.3
Gb of BAC end sequences (sequenced by Sanger technology) (table S1), which were
used to generate the sheep draft genome Oar v1.0, were also integrated for gap filling
and assembly checking.
1.2 Sample collection, RNA isolation, and RNA-Seq method
RNA was purified using Trizol (Invitrogen, CA, USA) for sequencing from seven
fresh tissues (Heart, Liver, Ovary, Kidney, Brain, Lung, White adipose) from the same
single Texel ewe used for the genome sequencing (table S7). RNA sequencing libraries
were constructed using the Illumina mRNA-Seq Prep Kit. Briefly, first strand cDNA
synthesis was performed with oligo-T primer and Superscript II reverse transcriptase
(Invitrogen, USA). The second strand was synthesized with E. coli DNA PolI
(Invitrogen, CA, USA). Double stranded cDNA was purified with a QIAquick PCR
purification kit (Qiagen, Germantown, MD), and sheared with a nebulizer (Invitrogen,
CA, USA) to 100-500 bp fragments. After end repair and addition of a 3’-dA overhang,
the cDNA was ligated to Illumina PE adapter oligo mix, and 200±20 bp fragments were
selected by gel purification. After 15 cycles of PCR amplification the 200 bp paired-end
libraries were sequenced using the Illumina HiSeq 2000 platform.
For wool related studies, RNA was extracted from a skin sample collected from the
flank of one super-fine strain (fiber diameter=16.5 micron) of Gansu alpine fine wool
sheep (adult ewe, two years old) in June 2010. Gansu alpine fine wool sheep were
developed in the Huang Cheng District of Gansu Province, China, by crossing Mongolian
or Tibetan sheep with Xinjiang fine wool sheep and then with several fine wool sheep
breeds from the Union of Soviet Socialist Republics, such as Caucasian and Salsk. RNA
libraries were constructed and sequenced as described above.
A large tissue survey of 83 tissue samples from another 4 Texels (Texel breed fiber
diameter 28-33 micron) was sequenced (150 bp paired-end RNA-Seq reads) using the
Illumina HiSeq 2500 at The Roslin Institute in 2013 (table S7). The RNA sequencing
libraries were prepared using a minor modification of the Illumina TruSeq stranded total
RNA sequencing kit protocol. The rRNA was removed using the Epicentre Ribo-Zero
(Human/mouse/rat) kit. The rRNA-depleted RNA was fragmented as indicated in the kit
protocol except that the fragmentation time was reduced to 15 seconds at 94°C using the
suggested modification in the protocol. The rest of the library preparation was as
indicated in the protocol except that the number of cycles for library enrichment was
reduced to 12 cycles to reduce the chance of duplicate products. The libraries were
quantified by qPCR before pooling for sequencing.
We also downloaded goat and cattle RNA-Seq data from public databases: liver
RNA-Seq data from Yunling Black Goat (28), skin RNA-Seq data from Shanbei White
Cashmere Goat (29), cattle skin RNA-Seq data (30), and cattle liver RNA-Seq data
(GSE37544) from the NCBI GEO database.
1.3 DNA sample preparation and MeDIP sequencing
We isolated the genomic DNA from liver tissue of the sequenced female Texel by
standard molecular biology techniques. MeDIP DNA libraries were prepared following
the protocol as previously described (31). Each MeDIP library was subjected to
paired-end sequencing using Illumina HiSeq 2000. Five micrograms of DNA was isolated using
E.Z.N.A. HP Tissue DNA Midi Kit (Omega) and was sonicated to ~100-500 bp
fragments with a Bioruptor sonicator (Diagenode). Then libraries were constructed by
adopting the Illumina Paired-End protocol consisting of end repair, A-base addition and
adaptor ligation steps, which were performed using Illumina’s Paired-End DNA Sample
Prep kit following the manufacturer’s instructions. Adaptor-ligated DNA was
immunoprecipitated by anti-methylcytidine monoclonal antibodies. The specificity of the
enrichment was confirmed by qPCR using SYBR® Green master mix (Applied
Biosystems) and the primers for positive and negative internal control DNA of non-human
samples supplied in the Magnetic Methylated DNA Immunoprecipitation kit
(Diagenode). Cycling for qPCR validation consisted of 95°C for 5 min, followed by 40 cycles
of 95°C for 15 s and 60°C for 1 min. The enriched fragments with methylation and 10% input
DNA were purified on ZYMO DNA Clean & Concentrator-5 columns (ZYMO)
following the manufacturer’s instructions. DNA was eluted in 30 μl buffer EB and its
concentration was measured. Enriched fragments were amplified by adaptor-mediated
PCR in a final reaction volume of 50 μl consisting of 23 μl purified DNA, 25 μl Phusion
DNA polymerase mix (NEB) and 2 μl PCR primers. Amplification consisted of 94°C 30
s, 10 cycles of 94°C 30 s, 60°C 30 s, 72°C 30 s, followed by prolonged extension for 5
min at 72°C, and the products were held at 4°C. Amplification quality and quantity were
evaluated on Agilent 2100 Analyzer DNA 1000 chips, and the products were purified on a
2% agarose gel and eluted in 15 μl buffer EB. Ultra-high-throughput 50 bp paired-end sequencing was carried
out using the Illumina HiSeq 2000 according to manufacturer’s instructions. Raw
sequencing data were processed by the Illumina base-calling pipeline. Then the
sequenced reads were mapped onto Oar v3.1 using SOAP2.20
(http://soap.genomics.org.cn) (32).
1.4 Scaffold assembly and filling intra-scaffold gaps
We assembled the ~75-fold coverage of Texel ewe reads into contigs and scaffolds
using SOAPdenovo (33). The gaps between the constructed scaffolds were mainly
composed of repeats that were masked during scaffold construction. To close these gaps,
we used the paired-end information to retrieve read pairs with one end mapped to a
unique contig while the other located in the gap region using GapCloser for SOAPdenovo
software (http://soap.genomics.org.cn). The ~120-fold coverage Illumina sequence from
both the ewe and the ram was used for the first round of gap filling. Then, ~144×
Illumina reads (including 21× GC-unbiased data and 2 Gb of MeDIP-seq data) from both the
Texel ewe and ram were applied for gap filling to improve the assembly. We
subsequently conducted a local assembly for these collected reads. We also used 3× 454
and 0.1× BAC reads to try to cross the remaining gaps, by uniquely mapping both of
the 45 bp tips of reads onto the flanking sequence of the same gap using
SOAPaligner/soap2.20 software (http://soap.genomics.org.cn) (32).
1.5 Construction of high density sheep RH (radiation hybrid) maps
A specific method for genotype calling of the two sheep RH panels (34, 35) was
used. The genotyping experiment undertaken with the Ovine SNP50 BeadChip (Illumina,
San Diego, CA) provides two measures of fluorescence intensity (one for each SNP
allele) for each RH clone and each SNP. The maximum intensity over the two alleles
(Imax) was used as a measure of the retention of the marker in the clone. We modeled the
distribution of Imax for non-retained SNPs (Dnull) within a clone using either a Gumbel
or a Normal distribution. The parameters of Dnull were estimated empirically using the
first 10% of the observed Imax, choosing for each clone the distribution family that
best fit the data. Given Dnull, a p-value was computed for each SNP based on its
observed Imax. Finally, the FDR approach of Storey and Tibshirani (36) was used to
perform the genotype calling across all clones and all SNPs. Specifically, data points with
Imax corresponding to an FDR of < 1% were called present, those with Imax
corresponding to an FDR of 1% to 10% were called missing, and those with Imax
corresponding to an FDR greater than 10% were called absent. This procedure (i)
controls the false positive rate, (ii) controls the false negative rate and (iii) estimates the
retention fraction. The genotype calling procedure was applied independently for each
panel. For both panels, the false positive rate was expected to be 1%. For the INRA
panel, the estimate of the retention fraction was 35%, 4.3% of false negatives were
expected and the missing data proportion was 6.5%. The selectable marker gene for the
INRA RH panel was HPRT which maps to the nonPAR (non pseudoautosomal region) of
the X chromosome meaning that this panel is biased towards retaining the X chromosome
nonPAR. This in conjunction with the sex of the INRA RH panel animal being male
(single copy of chromosome X nonPAR) means that the INRA RH panel has very low
resolving power for the X chromosome nonPAR. For the Utah panel, which was selected
for retention of the TK1 gene, which maps to OAR11 ~53 Mb, the estimate of the
retention fraction was 31%, 11% of false negatives were expected and the missing data
proportion was 4.1%. This panel also appears to have been derived from a male animal,
meaning that it would have poorer resolving power for the chromosome X nonPAR
(single chromosome copy) and for the region on OAR11 around TK1; in fact, the calling
procedure failed to call the SNPs on OAR11 from 45 Mb to the end of the chromosome
(62 Mb).
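The three-way calling rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the null distribution is taken to be Normal with parameters supplied by the caller (the Gumbel alternative and the fitting on the lowest 10% of Imax are not shown), and Benjamini-Hochberg q-values stand in for the Storey and Tibshirani estimator used in the study. The function names bh_qvalues, classify and call_markers are hypothetical.

```python
from statistics import NormalDist


def bh_qvalues(pvals):
    """Benjamini-Hochberg q-values (a stand-in for Storey-Tibshirani)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        qvals[idx] = running_min
    return qvals


def classify(q):
    """Map an FDR level to a call using the 1% and 10% cutoffs from the text."""
    if q < 0.01:
        return "present"
    if q <= 0.10:
        return "missing"
    return "absent"


def call_markers(imax_values, null_mean, null_sd):
    """Call marker retention in one clone from Imax, assuming a Normal Dnull."""
    null = NormalDist(null_mean, null_sd)
    # Upper-tail p-value: probability that a non-retained SNP is this bright.
    pvals = [1.0 - null.cdf(x) for x in imax_values]
    return [classify(q) for q in bh_qvalues(pvals)]


# Toy values only; real Imax intensities come from the BeadChip.
calls = call_markers([0.0, 0.1, -0.1, 10.0], null_mean=0.0, null_sd=1.0)
```

In this toy input only the last marker is far enough above the null to be called present.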
Out of the 49,035 SNP markers, 41,999 could be reliably called on both RH panels
using the method described above. RH maps were constructed for each chromosome
using the comparative approach (37) implemented in the CarthaGene software (38). The
principle of using the two panels to create the RH maps in the context of a comparative
approach is the following: the likelihood of a given order was computed independently
for the dataset corresponding to each panel, and the product of the likelihoods was used as the
likelihood of the data and combined with the prior probability of the order to produce the
objective criterion to maximize. The problem was reduced to an instance of the Travelling
Salesman Problem (TSP), enabling the use of an efficient TSP solver in the
optimization step. These comprehensive maps comprise 38,202 markers. Because high
density maps are likely to contain regions where local ordering is poorly supported by the
data, we constructed robust maps for each sheep chromosome. A robust map consists of a
subset of markers whose order was associated with a strong posterior probability (39, 40).
A total of 33,386 markers were included in the robust maps (table S3). The alignment of
the robust map with the final Oar v3.1 assembly was calculated for each chromosome
(fig. S11).
1.6 Construction of a sheep linkage map
Ovine SNP50 BeadChip genotyping data was obtained for three genetic mapping
populations: the International Mapping Flock (IMF, (26)), the SheepGENOMICS flock
(41), and the Louisiana State University (LSU) flock (42). The animals whose
genotypes were used comprised 117 from the 3-generation full-sibling IMF; 20 sires and
3,831 progeny from the 2-generation half-sibling SheepGENOMICS flock; and 449 from
the complex F2-type 3-generation LSU pedigree.
SNPs were initially assigned to chromosomes using a dataset that included both the
IMF SNP data and the IMF genotype data used for IMF sheep map version 5 (table S2)
and the find-all-linkage-groups option of a version of MultiMap (43) that incorporated
lispcri version 2.503 (adapted by Jill Maddox and Ian Evans,
http://www.animalgenome.org/tools/share/crimap/). The SNP chromosome groupings
were then used to assign additional SNPs to chromosomes with find-all-linkage-groups in
the FMFS and LSU populations. All SNPs that could be assigned to a chromosome were
assumed to derive from single copy sequence, and the single copy nature of these SNPs
was checked in the sequence assembly. Putative single copy SNPs with multiple
sequence assembly locations were investigated and the correct chromosome identified.
Genetic maps, for comparing alternate putative sequence assembly, radiation hybrid
(RH) SNP and bovine orders, were constructed using the chrompic option of CRI-MAP
2.503, a modified version of CRI-MAP (44). The three populations were used separately
for autosomes and the pseudoautosomal region of the X chromosome, while only the IMF
and LSU datasets were used for the non-pseudoautosomal region as the
SheepGENOMICS data set contained only male informative meioses. In addition, low
density de novo lod6 (autosomes, SheepGENOMICS dataset) and lod3 X chromosome
maps (IMF, LSU datasets) were constructed as a further check on gross chromosomal
SNP ordering. Comparisons between two different map types (sequence assembly versus
RH map, sheep order versus bovine order) only used SNPs that were present in both of
the compared map types. Genetic maps were investigated for possible map expansions
due to incorrect ordering and double recombinants. This approach identified a number of
erroneously positioned SNPs on early versions of the sequence assembly. The positions
of these SNPs were further investigated and most major discrepancies (with large
log10 likelihood differences for the genetic map order relative to the sequence assembly
order) were resolved, with a number of changes made to the sequence assembly.
1.7 Construction of Super-scaffolds and anchoring super-scaffolds to chromosomes
Scaffolds that were clearly chimeric were identified by remapping the female Texel
long insert size paired-end reads (17 kb and 9 kb) to the draft assembly using
SOAPaligner/soap2.20 software (http://soap.genomics.org.cn) (32), then confirmed by
comparison with the bovine UMD3.1 genome assembly (45), goat genome assembly
(28), and antelope scaffold sets (46). Chimeric scaffolds were then manually split in the
gap between adjacent contigs mapped to two different bovine chromosomes.
Super-scaffolds were built from the set of scaffolds using the BAC-end sequences
derived from the male Texel BAC library (CHORI-243) and the predicted locations on
Oar v1.0 of SNPs included on the Illumina Ovine SNP50 BeadChip. This was undertaken
as a single integrated process and non-congruent BACs and out of position SNPs were
minimized. Several rounds of manual checking and final error correction were carried out
using the BAC-end sequences of Ovine CHORI-243 library and 454 mate-paired reads
derived from 8 kb and 20 kb insert libraries of the male Texel.
Unmapped scaffolds with a length of less than 2 kb were discarded. Super-scaffolds
were initially ordered and oriented into chromosomes using the locations of the SNPs in
the sheep RH map, with BLASTN, using unique BLASTN hits with E-values < 1 × 10^-10
and hit length > 100 bp. The positions of the SNPs in the sheep linkage map were used to
identify remaining errors and to refine the assembly. Components of the genome
assembly (i.e. scaffolds, and corresponding quality files) and the component assembly
instruction file (i.e. the agp format file) were generated and are available at
(http://www.livestockgenomics.csiro.au/sheep/oar2.0.php) and
(http://www.livestockgenomics.csiro.au/sheep/oar3.1.php).
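As an illustration of the SNP placement filter described above (E-value < 1 × 10^-10, hit length > 100 bp, unique hits only), the following Python sketch keeps a SNP only when exactly one of its hits passes both thresholds. The tuple layout of a hit and the function name unique_hits are assumptions made for this example, and reading "unique" as "exactly one passing hit" is one plausible interpretation rather than a documented rule.

```python
from collections import defaultdict


def unique_hits(hits, max_evalue=1e-10, min_length=100):
    """Return {snp_id: (scaffold, position)} for SNPs with one passing hit.

    hits: iterable of (snp_id, scaffold, position, evalue, hit_length),
    a hypothetical layout chosen for this sketch.
    """
    passing = defaultdict(list)
    for snp_id, scaffold, position, evalue, hit_length in hits:
        if evalue < max_evalue and hit_length > min_length:
            passing[snp_id].append((scaffold, position))
    # Discard SNPs that pass the thresholds at more than one location.
    return {snp: locs[0] for snp, locs in passing.items() if len(locs) == 1}


example_hits = [
    ("snpA", "scaffold1", 1200, 1e-50, 151),   # single good hit: kept
    ("snpB", "scaffold2", 300, 1e-40, 151),    # two good hits: dropped
    ("snpB", "scaffold9", 7700, 1e-35, 140),
    ("snpC", "scaffold3", 50, 1e-5, 151),      # E-value too high: dropped
    ("snpD", "scaffold4", 900, 1e-30, 80),     # hit too short: dropped
]
placed = unique_hits(example_hits)
```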
1.8 Removal of artificial sequence duplications
We systematically identified all of the duplicates by Whole Genome Assembly
Comparison (WGAC) (47). After mapping the 17 kb, 9 kb, 5 kb and 2 kb insert size
mate-paired reads to the repeat-masked genome using SOAPaligner/soap2.20, we estimated the
length of every duplicated region. If the estimated length was >1 kb shorter than the
assembled length, a local comparison of the region with itself and with orthologous regions
in the goat, antelope, and cattle genome assemblies using LASTZ (48) was used to confirm the
boundaries of the duplicated sequence. Duplicates that failed both of the two following
criteria were removed: (1) no reads could link it with its flanking sequences; (2) read
depth after the GC content adjustment was < (whole genome average depth – (3×
STDEV)) of read coverage (49).
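The depth criterion can be made concrete with a short Python sketch. It assumes one reading of the rule above, namely that a duplicate was removed when it had no linking reads and its GC-adjusted depth fell below the genome-wide mean minus three standard deviations. The use of the sample standard deviation and the names depth_threshold and should_remove are assumptions of this example.

```python
from statistics import mean, stdev


def depth_threshold(genome_depths):
    """Lower depth bound: whole-genome average depth minus 3 x STDEV."""
    return mean(genome_depths) - 3 * stdev(genome_depths)


def should_remove(linking_reads, adjusted_depth, threshold):
    """Flag a duplicate with no linking reads and abnormally low depth."""
    return linking_reads == 0 and adjusted_depth < threshold


# Toy depths only; real values would come from the mapped mate-paired reads.
threshold = depth_threshold([38, 40, 42, 40, 40])
```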
1.9 Calling SNPs and validating SNP calling
All of the paired-end reads from the Texel ewe and Texel ram were mapped back to
the assembled genome by SOAPaligner/soap2.20 with an average depth of ~41-fold for
the Texel ewe and ~40-fold for the Texel ram. Then, SNPs were called by SOAPsnp (50)
separately for the ewe and the ram. Next, four steps were used to filter out unreliable
SNPs: (1) a Q20 quality cutoff was used; (2) at least 10 supporting reads were required;
(3) the overall depth, including randomly placed repetitive hits, had to be less than 100;
(4) the approximate copy number of flanking sequences had to be less than 2 (to avoid
false positives caused by the alignment of similar reads from duplicates).
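The four filtering steps translate directly into a predicate. The sketch below is a minimal Python illustration; the record fields and the name passes_filters are hypothetical, and treating the Q20 cutoff as "quality of at least 20" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SnpCall:
    quality: int              # Phred-scaled quality of the call
    supporting_reads: int     # reads supporting the variant allele
    total_depth: int          # depth including randomly placed repetitive hits
    flank_copy_number: float  # approximate copy number of flanking sequence


def passes_filters(snp):
    """Apply the four reliability filters listed in the text."""
    return (
        snp.quality >= 20
        and snp.supporting_reads >= 10
        and snp.total_depth < 100
        and snp.flank_copy_number < 2
    )


good = SnpCall(quality=35, supporting_reads=18, total_depth=42, flank_copy_number=1.0)
repetitive = SnpCall(quality=35, supporting_reads=18, total_depth=240, flank_copy_number=3.1)
```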
After filtering, the Texel ewe consensus sequence was defined as the reference
genome, and the single nucleotide differences between the Texel ewe genome and Texel ram
genome were called as homozygous SNPs.
Both the male and female Texel DNA samples were genotyped using the Ovine
SNP50 BeadChip (Illumina, San Diego, CA). Raw signal intensities were converted into
genotype calls using the Genome Studio software (Illumina, San Diego, CA). We then
compared the SNP calls from the sequencing platform and the SNP50 BeadChip.
1.10 RNA-Seq data processing
To obtain high quality reads and precise analysis results, an in-house C++ program
was used to filter out raw reads which might negatively affect the subsequent analysis.
We removed:
Reads that contained > 10% ambiguous base calls (Ns)
Reads that contained > 40% low quality base calls (quality score ≤ 5)
Reads that contained adapter contamination (with >10 bp aligned length and ≤ 2 bp
mismatches)
Read pairs with read 1 and read 2 overlapping by ≥ 10 bp
100% identical read pairs
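Three of the criteria above (ambiguous bases, low-quality bases, identical pairs) can be sketched in Python as follows. This is not the in-house C++ program, which is not described further; the adapter and read-overlap checks are omitted, qualities are assumed to be already decoded to integers, and keeping the first copy of each set of identical pairs is an assumption. The names keep_read and filter_pairs are hypothetical.

```python
def keep_read(seq, quals, max_n_frac=0.10, max_lowq_frac=0.40, lowq=5):
    """Keep a read unless it exceeds the N or low-quality base fractions."""
    if not seq:
        return False
    n_frac = seq.upper().count("N") / len(seq)
    lowq_frac = sum(q <= lowq for q in quals) / len(seq)
    return n_frac <= max_n_frac and lowq_frac <= max_lowq_frac


def filter_pairs(pairs):
    """Filter read pairs given as ((seq1, quals1), (seq2, quals2)) tuples."""
    seen = set()
    kept = []
    for (seq1, quals1), (seq2, quals2) in pairs:
        if not (keep_read(seq1, quals1) and keep_read(seq2, quals2)):
            continue
        key = (seq1, seq2)
        if key in seen:  # 100% identical pair already kept once
            continue
        seen.add(key)
        kept.append(((seq1, quals1), (seq2, quals2)))
    return kept


good_pair = (("ACGTACGTAC", [30] * 10), ("TTGCAATGCA", [30] * 10))
n_rich_pair = (("ANNTACGTAC", [30] * 10), ("TTGCAATGCA", [30] * 10))
low_quality_pair = (("ACGTACGTAC", [2] * 5 + [30] * 5), ("TTGCAATGCA", [30] * 10))
clean = filter_pairs([good_pair, good_pair, n_rich_pair, low_quality_pair])
```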
1.11 Alignment and de novo assembly of the transcriptomes
All of the clean RNA-Seq data were mapped onto the reference genome with
TopHat v.2.0.4 (51). Transcripts were constructed from the mapped reads using Cufflinks
v.2.0.0 (52).
For de novo assembly of the transcriptomes of the seven tissues from the Texel
ewe, SOAPdenovo was used to assemble the filtered reads from each tissue separately
into contigs and scaffolds, with the parameters set to be “-K 23 -M 0 -F -R -D 1 -d 1”.
The contigs and scaffolds were then passed to ABySS (53)
(http://www.bcgsc.ca/platform/bioinfo/software/abyss) to assemble them into longer
sequences, with the k-mer size set to 43. In total, 52,821 non-redundant sequences (ESTs) longer
than 300 bp, with an average length of 920 bp, were obtained.
Skin RNA-Seq reads which mapped onto the MOGAT3 region and rumen RNA-Seq
reads which mapped onto TCHHL2 were also de novo assembled using the above
method.
1.12 Evaluation of the sheep assembly with Ovine SNP50 BeadChip sequences, de novo
assembled ESTs, and BACs
The 59,042 verified Ovine SNP50 BeadChip sequences with a SNP and 150 bp
flanking sequences were used to check the sheep assembly. All of them were mapped
against the genome with BLAST with an identity > 95% and a hit length > 100 bp. All of
the de novo assembled ESTs from RNA-Seq reads from the seven tissues were mapped
against the genome with BLAT to estimate the gene space completeness of the genome,
with an identity > 95%.
The sequences of 16 fully sequenced BACs and the sheep Major Histocompatibility
Complex region (assembled from 26 overlapping BACs) (54) were downloaded from
GenBank. The BACs were aligned against the chromosomes using Mummer (version
3.22) (55) with default parameters. The alignment blocks were then chained along the
BACs by in-house Perl scripts and also with manual confirmation. Ewe paired-end reads
with short insert size (insert size < 1000 bp) were mapped to the BACs using SOAP
(version 2.21) (32), and SOAPcoverage (SOAP software package) was used to calculate
sequencing depth for each non-overlapping 100 bp window along each BAC sequence.
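The per-window depth summary can be illustrated with a short Python sketch. It assumes per-base depths are already available as a list (the SOAPcoverage output format is not reproduced here) and that a final partial window is averaged over its actual length; the function name window_mean_depth is hypothetical.

```python
def window_mean_depth(per_base_depth, window=100):
    """Mean depth for each non-overlapping window along one BAC sequence."""
    means = []
    for start in range(0, len(per_base_depth), window):
        chunk = per_base_depth[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


# Toy BAC of 150 bp: one full window at depth 10, one partial window at depth 20.
depths = [10] * 100 + [20] * 50
window_means = window_mean_depth(depths)
```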
1.13 CEGMA evaluation of sheep genome assembly
CEGMA (Core Eukaryotic Genes Mapping Approach) (version 2.3, with parameter
“--mam”) (56) was also used to assess the sheep genome assembly. A set of 248 CEGs (Core
Eukaryotic Genes) downloaded from the webpage of the CEGMA software
(http://korflab.ucdavis.edu/datasets/cegma/) was mapped to the sheep genome and
orthologous or paralogous genes of these CEGs were recovered. The CEGs are conserved
and readily identifiable across a broad range of eukaryotic species. The number of
recovered CEGs and the completeness of the recovered gene models indicate the gene space
completeness of our genome assembly (table S12).
1.14 Identification of repetitive elements
We annotated repetitive sequences and transposable elements (TEs) using a
combination of homology to RepBase sequences and de novo prediction approaches.
We de novo constructed a sheep repeat library using RepeatModeler with the default
parameters. The generated results were consensus sequences and classification
information for each repeat family. TEs were classified according to Wicker et al. (57).
Then RepeatMasker v3.2.6 (58) was run on the genome sequences, using the
RepeatModeler consensus sequences as its source library, with parameters “-nolow -no_is
-norna -lib repeat”.
RepeatMasker was also run against the RepBase TE library v2009-06-04 with the
parameters “-nolow -no_is -norna -lib repbase” and “-noLowSimple -pvalue 1e-4”. At the
protein level, RepeatProteinMask was applied (58).
We identified non-interspersed repeat sequences using RepeatMasker with the “-noint”
option, including simple repeats, satellites, and low complexity repeats. We also
predicted tandem repeats using Tandem Repeats Finder, with parameters set to
“Match=2, Mismatch=7, Delta=7, PM=80, PI=10, Minscore=50, and MaxPeriod=2000”.
Finally, we integrated all of the repeat annotation results with an in-house program.
In total, 42.67% of the assembled genome was identified as TEs (table S13). A considerable
proportion of the 2.76% of the genome assembly comprised of gaps is also likely to be
repeat sequences.
The sequence divergence rate was also calculated for each family of TEs (fig. S12A)
and their distribution across the genome was plotted (fig. S12B).
1.15 Gene prediction
The sheep genome assembly was annotated with the Ensembl gene annotation
system (59) (Ensembl release 74, December 2013). Protein-coding gene models were
annotated by combining alignments of UniProt (60) mammal and other vertebrate protein
sequences and RNA-Seq models generated from different individuals and different tissue
types (table S7) and gap filling with human and cow translations from Ensembl (release
69). Short non-coding genes were also annotated to provide the final gene set (tables S14-
16).
The genome was repeat-masked with RepeatMasker, using the RepBase library
(-species sheep) and a custom library generated with RepeatModeler, and with Dust (61).
Additional low complexity regions were identified using TRF (62).
Protein-coding models were generated by aligning sheep and other vertebrate
protein sequences from UniProt to the repeat-masked genome using Genewise (63).
Protein-coding models were also generated using our in-house RNA-Seq pipeline (64).
The RNA-Seq data set used for generating the gene models consists of different tissue
types from a trio of Texel sheep (ram, ewe and lamb) plus an embryo from the same
ram-ewe pairing, provided by The Roslin Institute; 7 tissue types from the reference female
Texel and 1 sample of Gansu alpine fine wool sheep skin, provided by BGI; and whole
blood samples from 1 Polypay sheep and 2 Rambouillet sheep, provided by
USDA-ARS-ADRU (table S7). These data were aligned to the genome using BWA (65), resulting in
the alignment of 736 billion of 819 billion reads. The alignments were processed by
collapsing the transcribed regions into a set of potential exons. Partially aligned reads
were re-mapped using Exonerate (66) and this step identified 367 million spliced reads or
introns. These introns together with the set of transcribed exons were combined to
produce transcript models, one set for each of 94 individual tissues and one set produced
by merging data from all of the above mentioned tissues. The longest open reading frame
in each of these models was BLASTed (67) against the set of UniProt protein existence
(PE) levels 1 (existence at protein level) and 2 (existence at transcript level) protein
sequences in order to classify the models according to their protein-coding potential.
Data from the above two pipelines were filtered to remove poorly supported
models. Untranslated regions were added to the coding models using sheep cDNA and
RNA-Seq models. The preliminary sets of coding models were combined, prioritizing
well supported models built from UniProt proteins and the merged set of RNA-Seq data
and redundant models were removed. Human and cow Ensembl translations (release 69)
were used to fill in gaps. The resulting unique set of transcript models was clustered into
multi-transcript genes where each transcript in a gene has at least one coding exon that
overlaps a coding exon from another transcript within the same gene. The set of protein-
coding gene models was screened for pseudogenes. Short non-coding RNA genes were
predicted using annotation from RFAM (68) and miRBase (69). The sheep gene
annotation is available on the Ensembl website (http://www.ensembl.org/Ovis_aries/),
including orthologues, gene trees, and whole-genome alignments against human, mouse
and other mammals. Also included are the tissue-specific RNA-Seq transcript models,
indexed BAM files, and the complete set of splice junctions identified by our pipeline.
Further information about the annotation process can be found in a PDF document here:
(http://www.ensembl.org/Ovis_aries/Info/Annotation#assembly).
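The rule used above to group transcripts into multi-transcript genes (each transcript shares at least one overlapping coding exon with another transcript of the same gene) amounts to finding connected components. The Python sketch below illustrates that idea with a small union-find; it is not the Ensembl pipeline code, it assumes all transcripts lie on the same chromosome and strand, and the name cluster_transcripts and the input layout are hypothetical.

```python
def cluster_transcripts(transcripts):
    """Group transcripts that share an overlapping coding exon.

    transcripts: dict mapping transcript id to a list of (start, end)
    coding exons with inclusive coordinates (hypothetical layout).
    """
    parent = {tid: tid for tid in transcripts}

    def find(x):
        # Follow parent links to the cluster representative, compressing paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def exons_overlap(exons_a, exons_b):
        return any(s1 <= e2 and s2 <= e1
                   for s1, e1 in exons_a for s2, e2 in exons_b)

    ids = list(transcripts)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if exons_overlap(transcripts[a], transcripts[b]):
                parent[find(a)] = find(b)

    genes = {}
    for tid in ids:
        genes.setdefault(find(tid), []).append(tid)
    return sorted(sorted(members) for members in genes.values())


models = {
    "t1": [(100, 200), (300, 400)],
    "t2": [(350, 450)],      # overlaps the second exon of t1
    "t3": [(1000, 1100)],    # no overlap: its own gene
}
genes = cluster_transcripts(models)
```

The pairwise comparison is quadratic in the number of transcripts, which is acceptable for a sketch but would need an interval index at genome scale.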
1.16 Functional Annotation and gene ontology (GO) assignment
InterProScan (version 4.8) was used to assign GO terms to each protein-coding
gene. The member databases Pfam (release 27.0), PRINTS (release 42.0), PROSITE (release
20.96), ProDom (release 2010.1), SMART (release 7.0), and PANTHER (release 8.1) were
searched. The KEGG (release 58) and UniProt (Swiss-Prot/TrEMBL release 2012.3) databases
were also searched for homology-based gene function assignment.
1.17 Gene family clusters
Protein-coding genes for cattle, pig, horse, dog, human, mouse, and opossum were
downloaded from Ensembl release 64. The gene sets for yak and goat were obtained from
the BGI-Shenzhen internal database and the camel gene sets were downloaded from
NCBI. For gene loci with alternative splicing isoforms, only the transcript with the
longest translation product was retained. We performed an “all versus all” alignment
using BLASTP with E-value < 1E-7, and conjoined fragmental alignments using Solar
(70). Then a simplified version of the Treefam methodology (71) was used to cluster
genes from different species into gene families which contain genes that descended from
the same gene in the last common ancestor. The numbers of orthologous genes across the
eleven species were calculated and plotted in a Venn diagram (fig. S13).
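The isoform selection step described above, keeping only the transcript with the longest translation product per gene locus, can be sketched as follows. The input layout and the name longest_isoforms are hypothetical, and keeping the first transcript encountered when lengths tie is an assumption.

```python
def longest_isoforms(records):
    """Keep, for each gene, the transcript with the longest protein.

    records: iterable of (gene_id, transcript_id, protein_sequence).
    Returns {gene_id: (transcript_id, protein_sequence)}.
    """
    best = {}
    for gene_id, transcript_id, protein in records:
        current = best.get(gene_id)
        if current is None or len(protein) > len(current[1]):
            best[gene_id] = (transcript_id, protein)
    return best


# Toy records with made-up identifiers and sequences.
records = [
    ("geneA", "tA1", "MKT"),
    ("geneA", "tA2", "MKTLLV"),
    ("geneB", "tB1", "MSS"),
]
representatives = longest_isoforms(records)
```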
1.18 Phylogenetic tree reconstruction for mammalian species
After gene family clustering, single copy genes, which were determined to contain
only orthologous genes from each species according to the Treefam methodology (71),
were selected to reconstruct the phylogenetic relationships of these mammalian species.
Multiple sequence alignment for each gene family was performed by MUSCLE (72)
(version 3.8.31) and four-fold degenerate sites were extracted and concatenated to
generate super alignments. We built phylogenetic trees using MrBayes (73) which takes
advantage of both codon-based and amino acid-based algorithms and adjusts them to the
topology of the species tree, to form a more accurate consensus tree according to four-
fold degenerate sites. To estimate the divergence time of the selected species, we used a
molecular clock model implemented in PAML mcmctree (74). The divergence times
were constrained according to the fossil calibration times (124.6-134.8 million years ago
(Mya) between human-opossum, 95.3-113 Mya between human-cow, 61.5-100.5 Mya
between human-mouse, 48.3-53.5 Mya between cow-pig, 18.3-28.5 Mya between cow-
sheep) (75). The different molecular clocks (divergence rate) might be explained by the
body size hypothesis or the generation-time hypothesis, which propose that the larger the
body size or the longer the generation time, the slower the molecular clock. In
addition, weak or strong selection on each lineage can be identified from its dN/dS
ratio.
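As an illustration of the four-fold degenerate site extraction mentioned above, the following sketch collects third-codon positions from an in-frame codon alignment. It assumes the standard genetic code and, as a simplification of our own, keeps a column only when all species share the same four-fold degenerate codon prefix; the function name and input layout are hypothetical and the exact criteria used in the analysis may differ.

```python
# Codon prefixes whose third position is four-fold degenerate in the
# standard genetic code (Leu CTN, Val GTN, Ser TCN, Pro CCN, Thr ACN,
# Ala GCN, Arg CGN, Gly GGN).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def fourfold_sites(aligned_cds):
    """Return, per species, the concatenated third-codon bases at columns
    where every species carries the same four-fold degenerate codon prefix.

    `aligned_cds` maps species name -> in-frame codon alignment string.
    """
    lengths = {len(s) for s in aligned_cds.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    n = lengths.pop()
    out = {name: [] for name in aligned_cds}
    for i in range(0, n - n % 3, 3):
        codons = {name: seq[i:i + 3].upper() for name, seq in aligned_cds.items()}
        prefixes = {c[:2] for c in codons.values()}
        third_ok = all(c[2] in "ACGT" for c in codons.values())
        # Gapped or ambiguous codons fail one of the two conditions below.
        if len(prefixes) == 1 and prefixes <= FOURFOLD_PREFIXES and third_ok:
            for name, c in codons.items():
                out[name].append(c[2])
    return {name: "".join(bases) for name, bases in out.items()}


# Toy alignment: codons 2 (GCN) and 3 (CTN) are four-fold degenerate.
sites = fourfold_sites({"sheep": "ATGGCTCTAAAA", "cattle": "ATGGCCCTGAAG"})
```

Concatenating the per-gene outputs across all single-copy families would give a super alignment of the kind described in the text.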
1.19 Expansion and contraction of gene families
Gene family expansion and contraction changes were detected by CAFE (76)
(Computational Analysis of gene Family Evolution) based on the phylogenetic tree
reconstructed in section 1.18. The P-value cutoff was set to 0.05, the number of randomizations
was set to 10,000, and the λ value was estimated by search. By manually checking the functional
annotation of each gene in each gene family, false positive families with discrepant
functional descriptions were filtered out. The expansion and contraction of orthologous
gene clusters in the nine mammalian species analyzed was calculated (fig. S14).
1.20 Detection of positively selected genes
As described above in section 1.17, BLASTP and Treefam methodologies were used
to define orthologs among the goat, cow and sheep. In total, 14,407 orthologous pairs
were analyzed for positive selection. The coding sequence of orthologs was aligned using
Prank software (77) with default parameters. Ka and Ks were calculated for the aligned
orthologs using Ka/Ks calculator software with default parameters
(http://code.google.com/p/kaks-calculator/wiki/KaKs_Calculator). The branch-site model
of positive selection from PAML (74) was used to identify sheep-specific, sheep-goat
branch, and cattle-specific fast evolving genes. Gene Ontology enrichment of the fast
evolving genes between sheep and cattle identified significant enrichment of a number of
GO terms relating to the immune response (fig. S15).
1.21 BAC sequencing and de novo assembly
Using mapping information for BAC-end sequences from the sheep CHORI-243
BAC library clones (27), one BAC located in the MOGAT3 gene region (CH243-423F23)
was picked and sequenced using HiSeq 2500 to generate 250 bp paired-end reads.
Adaptor sequences were trimmed off and sequences matching either the BAC vector
(pTARBAC2.1) or E. coli DH10B were removed prior to assembling the data. De
novo assembly was carried out using CLC Genomics Workbench 6 (CLC Bio) with the
following parameters: similarity = 0.99, length fraction = 0.9, insertion cost = 3, deletion
cost = 3, mismatch cost = 2 and minimum size = 2000 bp. Then the de novo assembled
contigs were mapped onto the sheep reference genome, Oar v3.1, and linked into one
scaffold manually, based on the integrated information of mate-paired reads (17 kb, 9 kb
and 5 kb libraries) and the reference genome contig mapping order. Then, all of the BAC
sequence reads were mapped onto the de novo BAC assembly scaffold, and the local
mapped depth was checked. All of the gap regions and high depth regions were re-
assembled using CAP3 (78).
1.22 In situ hybridization
In situ hybridization images were generated using EST clones in a large experiment
described in detail in Adelson et al. (79).
1.23 Identification of sheep segmental duplications (SD) and copy number variations
(CNV)
A whole genome alignment comparison (WGAC) pipeline was used for calling SDs
with the same cutoffs as previously described (>= 1 kb in length, >= 90% sequence identity) (47).
We also used the whole-genome shotgun sequence detection (WSSD) strategy to detect SDs in the sheep genome. We used BWA
(65) to align paired-end reads of female and male Texel sheep with default parameters
onto unmasked Oar v3.1. The maximum edit distance (as ~5% for the default divergence
cutoff in BWA) was automatically chosen for different read lengths. Approximately 95%
of reads could be mapped onto the reference genome. We then counted the aligned read
numbers within 200 bp sliding windows and 100 bp steps using custom Perl scripts. The
GC bias of the Illumina GAII platform was corrected using LOESS smoothing toward a
pattern of uniform coverage at all GC percentages as previously described (80). All
gapped and repeat-containing 200 bp windows were filtered out. Since the male ChrX is
not diploid, except for the recombining region (PAR) (ChrX:1-7,050,204), the ChrX non-
PAR region was analyzed separately to calculate average sequence depth and standard
deviation (STDEV).
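The window counting step (200 bp windows moved in 100 bp steps) can be sketched as below. This is not the custom Perl script used in the study; assigning each read to windows by its start coordinate, the 0-based coordinates and the function name are all assumptions made for illustration.

```python
def window_read_counts(read_starts, chrom_length, window=200, step=100):
    """Count reads whose start position falls in each sliding window.

    Positions are 0-based; windows are [start, start + window).
    Assigning a read by its start coordinate is a simplifying assumption.
    """
    starts = list(range(0, max(chrom_length - window, 0) + 1, step))
    counts = [0] * len(starts)
    for pos in read_starts:
        # Windows containing `pos` have index k with k*step <= pos < k*step + window.
        first = max(0, (pos - window) // step + 1)
        last = min(len(starts) - 1, pos // step)
        for w in range(first, last + 1):
            counts[w] += 1
    return list(zip(starts, counts))


# Toy example: four read starts on a 500 bp sequence.
depth = window_read_counts([50, 150, 200, 450], 500)
```

In a fuller version, GC correction and the filtering of gapped or repeat-containing windows would follow before depth statistics are calculated.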
SD/CNV calls were initially selected if five out of seven or more sequential 200
bp overlapping windows had read depth values that significantly differed from the
average (duplications > mean + (2 × STDEV)). We adjusted Bickhart’s WSSD pipeline
(49) by using long alignment reads and a small window size to make the depth calling
more sensitive.
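The initial selection rule above can be illustrated with the following sketch, which reads the rule as "at least five of seven consecutive windows exceed mean + 2 × STDEV". That reading, the use of the population standard deviation over all supplied windows, and the function name are assumptions; the published pipeline may compute the threshold differently.

```python
from statistics import mean, pstdev


def call_duplicated_runs(depths, run=7, min_hits=5, n_sd=2.0):
    """Return (first, last) window-index ranges in which at least `min_hits`
    of `run` consecutive windows exceed mean + n_sd * SD of all depths."""
    threshold = mean(depths) + n_sd * pstdev(depths)
    high = [d > threshold for d in depths]
    calls = []
    for i in range(len(high) - run + 1):
        if sum(high[i:i + run]) >= min_hits:
            # Merge with the previous call when the runs overlap or touch.
            if calls and i <= calls[-1][1] + 1:
                calls[-1] = (calls[-1][0], i + run - 1)
            else:
                calls.append((i, i + run - 1))
    return calls


# Toy depth profile: six high-depth windows in a flat background.
toy_depths = [10] * 40 + [100] * 6 + [10] * 40
toy_calls = call_duplicated_runs(toy_depths)
```

Because a call spans every qualifying seven-window run, the reported range is slightly wider than the six high windows themselves.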
If two or more duplicated regions are fully assembled, their read depth will tend
to the average depth. To get the full SD/CNV dataset, we combined the ≥ 95% sequence
identity WGAC dataset (to keep the same identity for WSSD) and the WSSD results
using custom Perl scripts. Only SDs/CNVs > 1 kb in length were kept in the final data
set.
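Combining the two call sets can be sketched as an interval union followed by a length filter. The tuple layouts, the half-open coordinates and the function name are hypothetical; the thresholds (≥ 95% identity for WGAC, > 1 kb final length) are those stated above.

```python
def merge_sd_calls(wgac, wssd, min_identity=0.95, min_length=1000):
    """Union WGAC calls (filtered by identity) with WSSD calls per chromosome,
    merge overlapping intervals, and keep merged regions longer than min_length.

    wgac: iterable of (chrom, start, end, identity); wssd: (chrom, start, end).
    Coordinates are assumed to be 0-based and half-open.
    """
    intervals = [(c, s, e) for c, s, e, ident in wgac if ident >= min_identity]
    intervals.extend((c, s, e) for c, s, e in wssd)
    intervals.sort()
    merged = []
    for chrom, start, end in intervals:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return [iv for iv in merged if iv[2] - iv[1] > min_length]


# Toy input: one WGAC call passes the identity filter and overlaps a WSSD call.
toy_wgac = [("chr1", 0, 800, 0.97), ("chr1", 5000, 5500, 0.92)]
toy_wssd = [("chr1", 600, 1500), ("chr2", 100, 900)]
toy_sd = merge_sd_calls(toy_wgac, toy_wssd)
```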
1.24 Genome-wide and transcriptome-wide identification of allele specific expressed
SNPs and genes
By combining DNA and RNA sequencing reads from the same individual, genes with
allele-specific expression can be accurately identified. We surveyed all of the expressed
alleles for the 5.5 million SNPs identified in the reference female Texel sheep with 15 Gb
of RNA-Seq data from seven tissues from the sequenced individual. A 90:10 cutoff for
the ratio of expression from the two alleles was taken as the boundary between allele-
specific expression and random allelic expression. If > 90% of total expression is from
one allele at 20-fold sequence coverage, the statistical test shows strong power (> 97%
correct) (81). Considering that SNPs located in SD or CNV regions may strongly
interfere with the prediction of allele-specific expression, we checked all the SNPs which
had ≥ 20 expressed reads and filtered out SNPs located in SD-CNV regions. We also
removed potentially segmentally duplicated SNPs, which have duplicated syntenic
regions in any of three allied genomes, using BLAST searches with 301 bp sequences
(the SNPs and their 150 bp flanking sequences) against the goat (≥ 95% identity and 150
bp), Tibetan antelope (≥ 93% identity and 100 bp) and cattle (≥ 85% identity and 100 bp)
genome assemblies.
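The 90:10 rule with a minimum of 20 expressed reads can be written as a small classifier. This is a minimal sketch under the thresholds stated above; the function name and the category labels are our own, and the SD/CNV and cross-species duplicate filters are not included.

```python
def classify_allelic_expression(ref_reads, alt_reads, min_reads=20, cutoff=0.90):
    """Classify a heterozygous SNP from RNA-Seq allele counts.

    Returns "low_coverage", "allele_specific" or "biallelic".
    """
    total = ref_reads + alt_reads
    if total < min_reads:
        return "low_coverage"
    major_fraction = max(ref_reads, alt_reads) / total
    return "allele_specific" if major_fraction > cutoff else "biallelic"


# Toy counts: 19 of 20 reads from one allele passes the 90:10 cutoff.
example_call = classify_allelic_expression(19, 1)
```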
1.25 Manual annotation of the genomic complement of tandem duplicated regions, such
as the MOGAT3 and EDC regions.
Reference protein sequences from related species were obtained by searching the
NCBI and Ensembl databases (website, release 73). Then these protein sequences were
mapped to the sheep genome assembly, or newly assembled BACs, using TBLASTN
(version 2.2.23, parameters “-e 1e-5 -F F”). Since the results of TBLASTN (which were
shown as HSPs, high-scoring pairs) were fragmented, genblastA (82) (version 1.0,
parameters “-e 1e-5 -g T -f F -a 0.5 -d 100000 -r 100 -c 0.01 -s -100”) was used to
group the adjacent HSPs derived from the same query protein into a representative
homologous hit of the query protein. Redundant hits were filtered and only the best hit
was retained for each gene locus. Genomic regions matched with reference proteins were
extended upstream and downstream by 2000 bp to make sure that intact potential gene
structures were included. GeneWise software (63) (parameters “-genesf -gff -sum”) was
used to predict gene structures for each protein coding region. Pseudogenes were
identified by the presence of premature stop codons.
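The pseudogene criterion (a premature stop codon) can be illustrated with a simple in-frame scan of a predicted coding sequence. This sketch assumes the standard nuclear genetic code and a CDS that starts in frame; the function name is hypothetical and frameshifts reported by GeneWise are not handled.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_premature_stop(cds):
    """True if an in-frame stop codon occurs before the final codon of `cds`."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # The last codon is allowed to be the normal terminator.
    return any(codon in STOP_CODONS for codon in codons[:-1])


# Toy sequences: an intact ORF and one with an internal TAG.
intact = has_premature_stop("ATGAAATGA")
broken = has_premature_stop("ATGTAGAAATGA")
```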
1.26 Membrane anchor prediction.
To identify the presence or absence of potential membrane anchor regions, amino
acid sequences of proteins were submitted to the TMHMM 2.0 (83) server
(http://www.cbs.dtu.dk/services/TMHMM/).
Supplementary text
The reference sheep genome assembly
Two unrelated Texel sheep, a ewe and a ram, were sequenced by the International
Sheep Genomics Consortium (84) using Illumina sequencing (table S1). The sequencing
reads were assembled into the genome as described below (fig. S16) (10). The 75-fold
coverage Illumina reads of the Texel ewe were de novo assembled into contigs and
scaffolds using SOAPdenovo. The 120-fold coverage Illumina sequence from both
animals was used for gap filling. This preliminary 2.71 Gb assembly, with an N50 length
of contigs and scaffolds of 17.4 kb and 1.1 Mb respectively (table S17) was pre-released
as Oar v2.0 and is available from GigaDB (85). To fill high-GC gaps an additional ~21-
fold coverage of Illumina sequencing data from the male Texel, using a protocol with less
bias against high-GC sequences, and 2 Gb of MeDIP-seq for high GC content sequence
from the female Texel were generated (fig. S17) (10). The coverage of the 5’ ends of
genes was significantly improved over Oar v2.1 (fig. S18). The final assembly has a very
similar distribution of GC content to the bovine and other mammalian genome assemblies
(fig. S19). SOAPdenovo is prone to creating artificial segmental duplications (86) and
multiple gap filling steps can also lead to incorrect elongation at the ends of contigs,
generating artificial dispersed duplicates. We systematically identified 12,008 candidate
artificial tandem duplicates and 5,508 artificial dispersed duplicates by checking their
read depth and the relationship with their flanking sequences (10). Removal of these extra
tandem copies excluded 28 Mb of sequence with an average length of duplicates of 2.3
kb.
The final sheep scaffold set has a contig and scaffold N50 length of 40 kb and 2.2
Mb respectively, achieving a total assembled length of 2.61 Gb (table S4). Using the
CHORI-243 sheep BAC library sequences (27) and long insert mate-pair reads we
constructed 349 super-scaffolds with an N50 of 37.1 Mb (table S18). The Ovine SNP50
BeadChip (87) and microsatellite and other markers were used to generate linkage data
and a high density RH map with 39,042 SNP markers (tables S2, S3 and fig. S11) (10) to
anchor scaffolds and super-scaffolds to the 26 autosomes and the X chromosome to
construct the Oar v3.1 assembly. The ~5,700 unmapped scaffolds have a total length of
32 Mb (1.2%). To check the integrity of the assembly, 15 Gb of expressed sequence data
generated from seven tissues from the sequenced Texel ewe were de novo assembled into
52,821 model mRNAs (average length of 920 bp) (10). 99.3% of the model mRNAs
mapped to the Oar v3.1 assembly with an average coverage of 98.4% (table S19). Of the
54,590 Ovine SNP50 BeadChip (87) oligonucleotides only 375 probes did not have a hit,
indicating the coverage of single copy regions is about 99.3%. Comparison of 16
complete CHORI-243 BAC sequences determined using the Sanger methodology with
the genome assembly identified, on average, 1.98% of nucleotides missing from the genome
assembly (table S20 and fig. S20). 89% of the gap regions were in multiple copy repeats,
especially the newly evolved LINE RTE/BovB elements (table S21). Furthermore, on
comparison with the assembly of the MHC region from a Chinese Merino sheep, which
was also derived from BAC-based Sanger sequencing (54), Oar v3.1 contains no long
gaps or large rearrangements, but does contain some additional sequence (fig. S21).
Ovine SNP50 BeadChip (87) genotypes of the two Texels revealed >97.8% concordance
with sequencing and a low false positive rate for heterozygous SNPs (<0.33%). As a
consequence of the extensive checking and manual curation of the genome assembly, the
contig N50 is twice as long as those of the recently sequenced ruminants, yak (88), Tibetan
antelope (46) and goat (28) (table S22).
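For readers unfamiliar with the N50 statistic quoted for contigs, scaffolds and super-scaffolds, the following sketch shows one common way to compute it from a list of sequence lengths. The function name and the toy lengths are illustrative only and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of length >= L
    together cover at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


# Toy example: total length 32, so half (16) is reached after the two 8s.
toy_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```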
All scaffolds and chromosomes of Oar v3.1 have been submitted to the NCBI under
bioproject accession number PRJNA169880, and the assembly for Oar v3.1 has been
assigned accession number GCA_000298735.1. Oar v3.1 is the representative reference
genome for small ruminants, and with the cattle genome (89) a co-reference genome for
all ruminants.
Sheep genome architecture
To investigate the rate of recent segmental duplication (>95% identity, >1 kb length)
in the sheep genome, we used whole-genome shotgun sequencing with GC content
adjustment (49) for the two Texel individuals, and whole genome alignment
comparison (47) (10). In total 7,912 candidate duplicated regions with a total length of
25.8 Mb were identified (Fig. 1, tables S5, S23, S24). We successfully detected the
previously described 4 kb duplication of the growth hormone gene (90).
Two continuous and very similar mitochondrial DNA insertions were identified in
the X chromosome (56.33 Mb, length 14 kb) and chromosome 2 (55.2 Mb, length 9 kb)
(fig. S22) and verified by PCR and sequencing across the junctions (fig. S23).
The heterozygosity rate of approximately 0.2% is 1 to 2 times higher than reported
for individual humans, pigs, cattle, dogs and horses (91-94) (table S22). It is also
approximately 70% higher than reported for the reference goat individual (28).
Approximately 75% of modern sheep breeds have retained an effective population size in
excess of 300, higher than cattle and much higher than most breeds of dog (95),
suggesting domestic sheep arose from a highly heterogeneous pre-domestication gene
pool, and that the genetic bottleneck during domestication was not as severe as for other
domestic animals.
The sheep and cattle genomes have been assembled independently using their own
physical maps, allowing for comparison to identify rearrangements, while the goat has
only pseudo-chromosomes which were assembled based on conserved synteny with the
cattle genome using long super-scaffolds (28). Sheep and cattle have 90% DNA sequence
identity and have similar karyotypes (96) (figs. S1, S24). We identified 141 breakpoints between
the sheep Oar v3.1 (GCA_000298735.1) and cattle UMD3.1 (GCA_000003055.3)
genome assemblies (table S6), including four known Robertsonian
translocations involving the autosomes (97, 98). We also identified a large, 7 Mb,
inversion in sheep chromosome 13 relative to cattle chromosome 13 (figs. S1, S24).
Comparison of the sheep and goat genome assemblies confirms that three of the four
Robertsonian translocations occurred on the sheep lineage. The Robertsonian
translocation in sheep chromosome 9 and the inversion on sheep chromosome 13 are
present in both the sheep and goat branches (figs. S1, S24). The conservation of synteny
relationship of chromosome X is more complicated than the autosomes, with eight
inversions and translocations and centromere loss and acquisition after the divergence of
cattle and sheep (figs. S25, S26). The mammalian X and Y chromosomes maintain a
short region of homology (pseudoautosomal region, PAR), allowing pairing and
recombination. In contrast to a previous study suggesting that cattle and sheep share the
same PAR boundary in the vicinity of GPR143 (99), our data places the sheep PAR
boundary downstream of SHROOM2 at about 7 Mb (fig. S25B). We note that the
centromere of sheep chromosome X is now located at the PAR boundary, so that the PAR
appears to represent the entire short arm of the X chromosome in sheep (fig. S25A).
To further investigate breakpoints, cattle (CHORI-240 BAC library) and sheep
(CHORI-243 BAC library) BAC end sequences (27) were used. Fifty-two out of the 141
breakpoints were confirmed using BACs, including 30 inversions and 13 translocation
events (table S6). We also noted that there are 58 potential genome assembly errors in
the cattle genome assembly UMD3.1, and 38 potential >100 kb gaps in Oar v3.1, most of which
contain tandemly duplicated sequences and/or known CNV regions, such as the
multidrug transporter ABCC4 cluster on sheep chromosome 10, and multiple clusters of
olfactory receptor genes. The genome rearrangements described above are likely to
represent recent genetic divergence between sheep and cattle.
Sheep transcriptome
~1.2 Tb of RNA-Seq data was generated from the 94 individual tissue samples
(table S7). The data were used to annotate the sheep genome assembly by the Ensembl
pipeline (10). In total, 20,921 protein coding genes (with 22,823 transcripts), 291 pseudogenes,
3,961 short ncRNA genes and 24 mitochondrial ncRNA genes were annotated (tables S15,
S16). A table linking Ensembl ids to the gene names used in the text is included in the
supplementary material (table S25). The distribution of mRNA length, CDS length, exon
length and intron length of the sheep genes was plotted against the equivalent parameters
of the human, mouse and cow genes. Distributions of the lengths of these features were
very similar to the other genomes (fig. S27).
Imprinted genes exhibit allele-specific expression in a parent-of-origin dependent
manner due to epigenetic modification (100). However, to date only a few candidate
imprinted genes have been identified in sheep (101). We used the
heterozygous SNPs and the RNA-Seq data from the same Texel ewe to investigate mono-
allelic gene expression (10). Of the 41,000 SNPs located in expressed sequences and with
>20 fold coverage, 1,788 SNPs exhibited more than 90% of reads from one allele. Thus
802 protein coding SNPs and 986 non-coding SNPs show allele-specific expression
(table S26). The allele-specific expression was generally conserved across adult tissues,
with 93.2% of the SNPs mono-allelically expressed in all of the studied tissues. The
longest candidate imprinted region was the BEGAIN-DIO3 region on chromosome 18
with 60 continuous mono-allelically expressed SNPs. This region contains the well
documented sheep polar overdominance mutation for muscle development, the Callipyge
phenotype (OMIA 001354-9940), and possibly also the Carwell (or rib-eye muscling)
QTL (102) (OMIA 001355-9940). However, many of the genes in the BEGAIN-DIO3
region show high and mono-allelic expression in the brain, suggesting that imprinting in
the region is also correlated with neural regulation (103).
Ruminant gene family expansions
As expected, confirming that our methodology for the detection of gene family
expansions was appropriate, we found expansions in both the LYZ (7 genes on
chromosome 3) and RNASE1 (11 genes on chromosome 7) families, both important in
the digestive system of ruminants (89, 104), on the sheep branch. Lysozyme C (a member
of the LYZ family) is an antibacterial protein in the innate immune system, but in
ruminants it is thought that some members of the lysozyme C family are major digestive
enzymes playing a role in the digestion of bacteria entering the abomasum from the
rumen (105). Two family members, LYZ4 and LYZ5, are extremely highly expressed in
the abomasum, contributing ~10% of total expressed mRNA, whilst LYZ3 is mainly
expressed in the intestine (fig. S28). Interestingly, LYZ1 is one of the top ten most highly
expressed protein coding genes in the rumen. It is also expressed, albeit at a much lower
level, in many other tissues, including the rectum, caecum, colon, duodenum, skin and
the kidney. 445 cattle LYZ1 ESTs have been deposited in GenBank (UniGene cluster,
Bt.67194); the majority of sequences are from the rumen. The role of LYZ1 in the rumen
is unknown, although on the basis of its sequence and tissue expression pattern, it is
predicted to exhibit antibacterial activity and is likely to contribute to the protection of
the rumen epithelium from the activity of infective/pathogenic bacteria. The predicted
3D structure of the abomasal lysozymes is different from LYZ1 (fig. S29). Interestingly,
the one amino-acid deletion in the abomasal lysozymes is located within the equivalent
region to an antimicrobial peptide identified in chicken lysozyme (106).
We also identified the Agouti locus (OMIA 000201-9940) (Fig. 1A), which is a
large ~190 kb copy number variation contributing to the variability of coat color in sheep
(107) (table S5).
Two large ruminant specific gene families encode proteins probably involved with
the development of the placenta, especially the mechanisms of apoptosis of ruminant
endometrium, and other processes during pregnancy and lactation (108). The pregnancy-
associated glycoprotein genes (PAGs), which are most closely related to pepsins, have
more than 42 tandem duplicated members in sheep (fig. S30A), and the prolactin related
genes (PRPs) have 12 members (figs. S31, S32). Within the PAG locus, one gene, PGA5
encoding pepsinogen A, is highly expressed in the abomasum and duodenum (fig. S30B)
and presumably plays a role in digestion. However, most PAGs are expressed in a
specialized subset of trophoblasts called binucleate cells (BNC) in the placenta, and their
secreted proteins are used as an early pregnancy diagnosis signal in ruminants (109). The
BNC also secrete prolactin-like proteins (110) which exhibit the strongest positive
selection signal between sheep and cattle. All these results suggest that the reproductive
system of ruminants has undergone rapid evolutionary change.
TCHHL2 expression in cattle and other mammals
At the time of submission of this article 22 ESTs and mRNAs were mapped to the
bovine genome assembly overlapping the predicted bovine TCHHL2 gene
(LOC101909330) on the NCBI Map viewer. Seventeen of these ESTs, mainly expressed
in the rumen, were included in the UniGene cluster Bt.14362.
No ESTs expressed from TCHHL2 from other mammals were present in
GenBank.
PRD-SPRRII genes and expression in cattle and other mammals
At the time of submission of this article our analysis of the PRD-SPRRII locus in the
EDC region in cattle identified eight genes, all annotated as loci of unknown function at
the NCBI, with several also annotated as non-coding RNAs (table S27). However, our
analysis predicts that all eight genes encode PRD-SPRRII family proteins (fig. S33). No
ESTs expressed from PRD-SPRRII family genes from other mammals were present in
GenBank.
LCE7A expression in sheep, cattle and other mammals
At the time of submission of the article 37 EST sequences derived from the sheep
LCE7A gene were deposited in GenBank: one from Merino 103-105 day old fetal sheep
skin, 10 from Romney Marsh 130 day old fetal sheep skin (in top 200 most expressed
genes) and 26 from adult Romney Marsh sheep skin (in top 100 most expressed genes).
The sequences are included in the NCBI UniGene cluster Oar.17500. One EST derived
from the cattle LCE7A gene (LOC786364) from Hereford 6 month old fetal skin was
deposited in GenBank in NCBI UniGene cluster Bt.75903.
No ESTs expressed from orthologues of LCE7A from any other mammals, except
the mouse (M. musculus), were present in GenBank. A single mouse stomach cDNA
(gi|74203369|dbj|BAE20850.1) has been deposited in GenBank. After the removal of the
first 96 amino acids of the predicted product of the mouse cDNA the protein is
homologous to sheep LCE7A and the genes have conserved synteny.
MOGAT2 expression in cattle
At the time of submission of this article three cattle MOGAT2 genes were
annotated in RefSeq: NM_001099136.1 (UniGene cluster Bt.61395, 5 ESTs all expressed
in the skin), NM_001104970.1 (UniGene cluster Bt.42344, 8 ESTs all expressed in the
skin) and NM_001001154.1 (UniGene cluster Bt.32553, 4 ESTs, three expressed in the
intestine).
MOGAT3 expression in cattle
At the time of submission of this article one cattle MOGAT3 EST (GenBank
accession CF762562.1) was deposited in GenBank. It was sequenced from a cattle skin
cDNA library.
Non-membrane anchored form of MOGAT3
Searches of the EST section of GenBank identified one sheep EST (EE862887.1)
encoding the alternate amino-terminal sequence of MOGAT3 in frame with the common
amino-acid sequence and consistent with the RNA-Seq data (fig. S9).
Despite extensive searches of GenBank no evidence for the expression of an
equivalent unanchored form of MOGAT3 was identified from any other mammal,
including cattle.
FABP9 expression in sheep and other mammals
The fatty acid binding protein encoding gene, FABP9, which was expressed highly
in sheep skin (in the top 0.5% of all genes), but not in any other sheep tissue studied,
including the testis, which is the major site of expression of FABP9 in the mouse (111),
may play a role in the transport of MAG (112). At the time of submission of this
manuscript 27 sheep FABP9 ESTs had been deposited in GenBank (UniGene cluster
Oar.6318), with expression in the skin. Expression of FABP9 has also been observed in cattle
skin (UniGene cluster Bt.23479, containing 1 EST). FABP9 is, however, also expressed in
pig skin (15 ESTs in GenBank, UniGene cluster Ssc.100103). In contrast, no EST or
mRNA sequences of human FABP9 had been deposited in GenBank (UniGene cluster
Hs.653176).
Acknowledgements
We thank the laboratory division of BGI Shenzhen for their assistance in
sequencing the DNA and RNA samples. We thank the ARK-Genomics (now Edinburgh
Genomics) sequencing and informatics teams including S. Smith, K. Troup, F. Turner
and J. Loecherbach. We thank H.A. Finlayson as well as staff in The Roslin Institute’s
Animal Services Division for their help with the gene expression study. We thank the
BCM-HGSC sequencing team including Y. Wu, I. Newsham, R. Thornton, P. Aqrawi, R.
Goodspeed, L. Jackson, C. Mandapat, Y. Zhu, N. Saada, L.-L. Pu, S. Gross, G. Fowler, J.
Deng, W. Hale and J. Santibanez. We thank the Otago University/AgResearch
sequencing team, T. Van Stijn, G. Payne, C.J. Rand and C. Mason. We thank B. Zhang
and Q. Chen (BGI-Shenzhen) for RNA-Seq analysis and X. Li (State Key Laboratory of
Genetic Resources and Evolution, Kunming Institute of Zoology) and G. Yin and T.
Deng (BGI-Shenzhen) for analysis of MeDIP-seq data. We thank the following for
contributing to the project by sending unpublished gene expression data to the annotation
project that were not used in the final gene models: R.L. Tellam and T. Vuocolo (CSIRO
Animal, Food and Health Sciences); P.K. Dearden and E.J. Duncan (Genetics Otago,
Otago University); H. Blair, P.R. Kenyon and S.J. Pain (Institute of Veterinary, Animal
and Biomedical Sciences, Massey University); I.E. Lindquist, C.W. Beattie and E.J.
Retzel (National Center for Genome Resources (NCGR)); C.A. Bidwell (Department of
Animal Sciences, Purdue University); C. Couldrey and P. Maclean (AgResearch); L. Yu
and D. Burt (The Roslin Institute); O.M. Keane (Animal & Bioscience Department,
Teagasc); J. Kantanen, K. Pokharel, M. Li and J. Peippo (Biotechnology and Food
Research, MTT Agrifood Research Finland); M. de Veer (BRL Innate Immunity
Laboratory, Dept. Physiology, Monash University); A. Bonnet and Gwenola Tosser-
Klopp (INRA, Laboratoire de Génétique Cellulaire); D.R. Herndon (USDA-ARS). We
thank M. Colgrave and H. Goswami (CSIRO Animal, Food and Health Sciences), for
proteomics data not included in the final manuscript. We thank the anonymous reviewers
and R. Xiang for their comments on the manuscript.
Fig. S1
Comparison of the sheep, goat and cattle genomes. Three dimensional plot of the
relationship between the sheep (Oar v3.1), goat (CHI1.0) and cattle (UMD3.1) genome
assemblies. Blown up regions highlight the Robertsonian translocation between cattle
chromosome 14 and the telomeric end of cattle chromosome 9 to form sheep
chromosome 9 or goat chromosome 14, and the inversion on cattle chromosome 13
relative to the sheep and goat chromosome 13.
Fig. S2
Cross species gene analyses. Phylogenetic tree built from the four-fold degenerate sites of
4,850 single-copy orthologous genes using the MrBayes program.
Fig. S3
The consensus sequence logos of the core 15 amino acid repeat of the predicted TCHHL2
protein. A: Platypus. B: Tasmanian devil. C: Shrew. D: Pig. Images were generated using
WebLogo.
Fig. S4
The consensus sequence logos of the core 15 amino acid repeat of the predicted TCHHL2
protein in sheep and cattle. Images were generated using WebLogo.
Fig. S5
The encoded protein sequences of LCE gene family members. A: LCE7A is very
different from the proteins encoded by the other LCE genes in the sheep genome, with
seven tandem three amino acid repeats "PQX", starting at position 26 of LCE7A. B: The
protein coding sequence alignment of LCE7A and the predicted products of its
orthologous genes in other mammals. The three amino acid repeat "PQX" is only present
in the ruminant group (sheep, goat and cattle) proteins.
Candidate cross-linking sites for transglutaminase
Fig. S6
Localization of LCE7A expression in adult Merino skin. Expression was localized by in
situ hybridization with an EST, fs827.z1 (GenBank accession CF118671.1), using the
methodology described by Adelson et al. (79); the images have not been published
previously. Hair/wool follicle (hf) and inner root sheath (irs) are indicated. A:
Original magnification 50x, negative 75_99_15. B: Original magnification 200x, negative
75_99_16.
Fig. S7
MOGAT2 expression level (log read number) in different tissues of sheep and goat.
Fig. S8
Sequence alignment between sheep and cattle MOGAT3 regions, and the expression
level of cattle MOGAT3 genes. A: Syntenic relationship between sheep and cattle for
the MOGAT3 locus. The pseudogenes are in black. The expressed genes MOGAT3-2, -3 and
-6 in sheep have almost identical sequences. On the genome lines the light blue regions
have a one to one relationship between the sheep and cattle genomes and the dark blue
regions have a one to many or many to many relationship between the sheep and cattle
genomes. B: MOGAT3 expression level (read number) in different tissues in cattle.
The apparent expression in the skin of MOGAT3-1, predicted to be a pseudogene,
indicates that some assembly errors may exist in this region of the Btau4.7 cattle
assembly, which is also very different from the gapped MOGAT3 region in the
UMD3.1 cattle genome assembly.
A
>MOGAT3 variant 1 with predicted membrane anchor
MAEGEHLGVSSTLPPTPSMKTLKKQWLEVLSTYQYVLCFCFLGLFFSLAGFLLLFTSLWY
LSVLYLVWLFLDWDTPQQGGRRYQWLRNWTAWKHLSDYFPLKLVKTAELPPDRNYVLGSH
PHGIMAVGTICNFATEGTGLSQVFPGLRFSLAVLNCLLYLPGHREYFLSCGACSVNRQSL
DYVLSQSQLGRAVVIVLGGANEALYAVPGEHCLTLRNRKGFVRLALRHGASLVPVYSFGE
NDIFRVKAFAPDSWQHLLQVTSKKLLSFCPCIFWGRGLFSAKSWGLLPLARPITTVVGRP
IPVPQCPQPTEEQVDHYHTLYMKALEQLFEEHKESCGLPASTRLTFI
>MOGAT3 variant 2 with alternate start codon no membrane anchor
MVFQTGPSCLWPHLILSSQDLGGEGRPFSWALAFSVFLPSSCPLPPQLVKTAELPPDRNY
VLGSHPHGIIAVGTICNFATEGTGLSQVFPGLRFSLAVLNCLLYLPGHREYFLSCGACSV
NRQSLDYVLSQSQLGRAVVIVLGGANEALYAVPGEHCLTLRNRKGFVRLALRHGASLVPV
YSFGENDIFRVKAFAPDSWQHLLQVTSKKLLSFCPCIFWGRGLFSAKSWGLLPLARPITT
VVGRPIPVPQCPQPTEEQVDHYHTLYMKALEQLFEEHKESCGLPASTRLTFI
B
Fig. S9
Alternatively spliced sheep MOGAT3 transcripts. A: Open reading frames predicted from
RNA-Seq and EST data; amino acid sequences are translated from the sequenced
MOGAT3 BAC CH243-423F23. The amino acid sequences predicted to be specific to
each isoform are highlighted in yellow. B: Expression of the two splice variants of
MOGAT3-4 determined from RNA-Seq data from Gansu alpine fine wool sheep skin.
Fig. S10
Transmembrane domain predictions for MOGAT3 splice variants. A: MOGAT3-
Contig26, membrane anchored form. B: MOGAT3-Contig47, alternate start codon, no
membrane anchor. Transmembrane domains were predicted as described in
Supplementary methods section 1.26.
Fig. S12
Distribution of divergence of each type of transposable element (TE) in the sheep
genome. A: The divergence rate was calculated between the TEs identified in the
genome and the consensus sequence in the TE library used (Repbase release 16.02). B:
The distribution of repeats and gene regions on sheep chromosomes.
Fig. S13
Cross species gene analyses. Venn diagram showing shared orthologous gene groups
among species of O. aries, C. hircus, B. mutus, B. taurus, S. scrofa, C. bactrianus, C.
familiaris, E. caballus, H. sapiens, M. musculus and M. domestica.
Fig. S14
Dynamic evolution of orthologous gene clusters in nine mammalian genomes. The
estimated number of orthologous groups (18,897) in the most recent common ancestral
species (MRCA) is shown at the root node. The numbers of orthologous groups that
expanded or contracted in each lineage are shown on the corresponding branch; +,
expansion; −, contraction.
Fig. S15
Gene Ontology term enrichment for fast-evolving genes (high ka/ks ratio)
between sheep and cattle, computed with GOrilla (http://cbl-gorilla.cs.technion.ac.il/GOrilla/).
The 12,680 orthologous gene pairs associated with a GO term were ranked by
their ka/ks ratio. Enriched GO terms with p-value <1e-7 are reported here.
Most of the enriched GO terms are related to immune response.
Fig. S16
Flowchart for the sheep genome assembly
Fig. S17
Distribution of the GC content of the gap sequences filled between the Oar v2.0 and Oar
v3.1 assemblies. To fill gaps with high GC content, additional sequence data
(~21-fold coverage of high GC biased data from the male Texel and ~1-fold coverage of
MeDIP-seq data from the female Texel) were used in the construction of Oar v3.1.
Fig. S18
Identification of missing sequences at the start regions of sheep genes. The start, middle
and end 60 bp of the CDS regions from 22,915 cattle protein-coding genes were mapped
onto the Oar v2.0, Oar v3.1 and goat (Chi v1.0) genome assemblies using BLAST. The
number of hits in each assembly was counted separately for the start, middle and
end regions. Missing sequence at the start region of ~1700 genes was recovered between
the Oar v2.0 and Oar v3.1 assemblies.
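The probe extraction described in the Fig. S18 caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `cds_probes` is hypothetical, and taking the middle segment as the 60 bp centered on the CDS is an assumption, since the caption does not say how the middle region was chosen. The resulting segments would then be used as BLAST queries against each assembly.

```python
def cds_probes(cds, size=60):
    """Return the start, middle and end segments of a coding sequence.

    The middle segment is centered on the CDS (an assumption; the caption
    only says "middle 60 bp").
    """
    if len(cds) < size:
        raise ValueError("CDS is shorter than the probe size")
    mid_start = (len(cds) - size) // 2
    return {
        "start": cds[:size],
        "middle": cds[mid_start:mid_start + size],
        "end": cds[-size:],
    }


# Toy CDS made of three distinguishable 60 bp blocks.
toy_cds = "A" * 60 + "C" * 60 + "G" * 60
probes = cds_probes(toy_cds)
for region, segment in probes.items():
    print(region, len(segment), segment[:10])
```

Counting, per assembly, how many genes have a hit for each of the three probes gives the kind of start/middle/end comparison summarized in the figure.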
Fig. S19
Distribution of GC content for five mammalian genome assemblies. The X-axis
represents GC content and the Y-axis represents the proportion of sliding windows with
a given GC content. Sliding windows are 500 bp, with an overlap of 250 bp between two
adjacent windows. Differences in GC content distribution among relevant species can be
inferred from this graph. In general, species that are closely related are expected to
possess similar distribution curves. As predicted, the sheep (O. aries) and cattle (B.
taurus) have similar GC distributions.
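The windowing described in the Fig. S19 caption (500 bp windows, 250 bp overlap between adjacent windows, hence a 250 bp step) can be sketched as below. The function name `gc_windows` is hypothetical, and the handling of ambiguous bases is an assumption: here every non-G/C character, including N, simply counts as non-GC, which may differ from what was done for the figure.

```python
def gc_windows(seq, window=500, step=250):
    """Return the GC fraction of each full-length sliding window.

    A 500 bp window with a 250 bp step gives a 250 bp overlap between
    adjacent windows. Trailing sequence shorter than one window is ignored.
    """
    seq = seq.upper()
    fractions = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        fractions.append(gc / window)
    return fractions


# Toy sequence: 500 bp of pure GC followed by 500 bp of pure AT.
toy = "GC" * 250 + "AT" * 250
print(gc_windows(toy))  # three windows starting at 0, 250 and 500
```

A histogram of these fractions over a whole assembly gives a distribution curve of the kind plotted in the figure.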
Fig. S20
Comparison between Oar v3.1 and 16 known BAC sequences deposited in NCBI
GenBank. The white regions indicate gaps in Oar v3.1, most of which are located in
repeat regions, especially in recently evolved repeats with high sequencing depth.
Fig. S21
Comparison of the two contigs of the BAC-based assembly of the Chinese Merino MHC
region with the two equivalent loci in the OARv3.1 assembly. A: MHC contig 1 vs
OARv3.1 chr20. B: MHC contig 2 vs OARv3.1 chr20. Based on the comparison, there is
no significant deletion on OARv3.1 chr20, but an extra sequence of ~200 kb is assembled
around OAR20 27.4 Mb. This sequence is also present in the cattle genome (UMD3.1 chr23) and
the goat genome (CHI 1.0 chr23), implying that Merino MHC contig 1 may have a gap around
position 414 kb. The full MHC sequences in OARv3.1 are syntenic to the sequences in
the equivalent part of the cattle MHC region. The Chinese Merino MHC region BAC
sequences were obtained from GenBank, accession numbers FJ986852 - FJ985877.
Fig. S22
Identification of partial mitochondrial genome insertions at two genomic loci. Two
continuous and highly similar mitochondrial DNA segments were identified in sheep
genomic DNA, located on OARX 56.3Mb (with a length of 14kb) and OAR2 55.2Mb
(with a length of 9kb).
Primer: Forward: 5’-TAGGAAAAGCCTTAAAAGTCCA-3’ Reverse: 5’-GAATATTATGCTCCGTTGCTTC-3’
Primer: Forward: 5’-ATTTGGCCTACATCCATGACT-3’ Reverse: 5’-TGAGCACCTACTATATGTCAG-3’
Fig. S23
Identification of partial mitochondrial genome insertions at two genomic loci. A:
PCR check of the two mitochondrial DNA insertions; the lengths of the
PCR products matched the lengths expected from the scaffold sequences. B:
Sequence of the PCR products from sheep genomic DNA confirming the mitochondrial
DNA insertion around 56.33 Mb on OARX. Black: genomic sequence; White:
mitochondrial sequence.
Fig. S24
Alignment grids showing the synteny-based chromosomal comparisons between Oar v3.1
and UMD3.1. The colinear relationship was identified for every chromosome by
NUCmer (http://mummer.sourceforge.net, NUCleotide MUMmer, version 3.06) with
parameter “-c 500”. Red segments represent homologous regions with consistent orders.
Fig. S25
The organization of the sheep and cattle X chromosomes. A: The sequences of the sheep
and cattle X chromosomes were aligned with the human ChrX sequence. Lines connect
regions of orthologous sequences. Different colors represent major blocks involved in
rearrangements. The PAR region is indicated in black and the centromere by a
constriction in the width of the chromosome. B: Analysis of the junction between the PAR
and the non-PAR region in sheep. The high GC content and high SNP level for the
male Texel from 0 to 7 Mb on chrX suggest that this interval contains the PAR. The large
number of satellite II sequences (~2000 copies) and the high methylation level (1458 mapped
MeDIP-seq reads) indicate that the centromere is located around 7.07 Mb, marked by
the purple dashed line. BAC end pairs from sheep (CHORI-243, marked as blue
lines) and cattle (CHORI-240, marked as yellow lines) were mapped onto this region.
Cattle BACs could cross this boundary region, whereas sheep BACs could not, indicating
that a new chrX centromere originated on the ovine or bovine branch after the separation
of sheep and cattle.
Fig. S26
Sheep lineage-specific evolutionary breakpoint regions (EBRs) on chromosome X.
Results are shown at a 300 kb resolution of homologous synteny block (HSB) detection;
the size of each "X" is proportional to the length of each HSB. The red lines indicate the
positions of the EBRs detected in sheep. A complete set of HSBs defined in our analysis
is available from the Evolution Highway comparative chromosome browser
(http://evolutionhighway.ncsa.uiuc.edu).
Fig. S27
The distributions in mRNA, CDS, exon and intron lengths of Ensembl annotated protein
coding genes in four genomes.
Fig. S28
Gene tree for artiodactyl lysozyme C genes. The phylogenetic ML tree of lysozyme C
genes calculated on the basis of the coding sequences from sheep (OAR), cattle (BTA)
and pig (SSC); the sheep lysozymes are named after their gene order on chromosome 3.
A secondary structure CCCCHHHHHHHHHHCCCCCCCCCCHHHHHHHHHHHHCCCCCCEEEECCCCEEEECCCCEECCCCC
1........10........20........30........40........50........60....
OARLYZ4 KVFERCELARTLKELGLDGYKGVSLANWLCLTKWESSYNTKATNYNPGSESTDYGIFQINSKWWC
OARLYZ5 KVFERCELARTLKKLGLDDYKGVSLANWLCLTKWESGYNTKATNYNPGSESTDYGIFQINSKWWC
BTA_ENSBTAG00000026088 KVFERCELARTLKKLGLDGYKGVSLANWLCLTKWESSYNTKATNYNPSSESTDYGIFQINSKWWC
BTA_ENSBTAG00000046628 KVFERCELARTLKKLGLDGYKGVSLANWLCLTKWESSYNTKATNYNPSSESTDYGIFQINSKWWC
BTA_ENSBTAG00000046511 KVFERCELARTLKKLGLDGYKGVSLANWLCLTKWESSYNTKATNYNPGSESTDYGIFQINSKWWC
OARLYZ6 KKFQRCELARTLKKLGLDGYKGVSLANWLCLTKWESGYNTKATNYNPSSENTDYGIFQINSKWWC
BTA_ENSBTAG00000026322 KVFERCELARTLKKLGLDGYKGVSLSKRLCLTKWESSYNTKATNYNPSNESTDYGIYQINSKWWC
BTA_ENSBTAG00000026323 KKFEKCELARTLRRYGLDGYKGVSLANWMCLTYGESRYNTRVTNYNPGSKSTDYGIFQINSKWWC
OARLYZ3 KKFERCELARTLRRFGLDGYNGVSLANWMCLIYGESRYNTQVTNYNPGSKSTDYGIFQINSKWWC
OARLYZ2 KKFERCELARTLKKFGLAGYKGVSLANWMCLAYGESRYNTQAINYNPGSKSTDYGIFQINSKWWC
BTA_ENSBTAG00000000198 KTFKRCELAKTLKNLGLAGYKGVSLANWMCLAEGESSYNTQAKNYNPGSKSTDYGIFQINSKWWC
BTA_ENSBTAG00000022971 KTFERCELARTLKNLGLAGYKGVSLANWMCLAKGESGYNTQAKNYSPGFKSTDYGIFQINSKWWC
BTA_ENSBTAG00000020564 KTFKRCELARTLKNLGLAGYKGVSLADWMCLAKGESSYNTQAKNFNRGSQSTDYGIFQINSKWWC
BTA_ENSBTAG00000039170 KKFQKCELARTLKRLGLDGYKGISLAKWVCLASWERSYNTCATNYNRGDKSSDYGIFQINSRRWC
OARLYZ1 KKFERCELARTLKRLGLDGYRGVSLANWMCLARWESNYNTRATNYNHGDKSTDYGIFQINSRWWC
BTA_ENSBTAG00000011941 KKFQRCELARTLKKLGLDGYRGVSLANWVCLARWESNYNTRATNYNRGDKSTDYGIFQINSRWWC
OARLYZ7 KVFERCELARTLKRFGMDGFRGISLANWMCLARWESSYNTQATNYNSGDRSTDYGIFQINSHWWC
BTA_ENSBTAG00000026779 KVFERCELARSLKRFGMDNFRGISLANWMCLARWESNYNTQATNYNAGDQSTDYGIFQINSHWWC
SSC_ENSSSCG00000000492 KVYDRCEFARILKKSGMDGYRGVSLANWVCLAKWESDFNTKAINHNVG--STDYGIFQINSRYWC
secondary structure CCCCCCCCCCCCCCCHHHHHCCCCHHHHHHHHHHHCCCCHHHHCHHHHHHCCCCCCHHHCCCCCC
....70.......80.........90........100........110.......120.......
OARLYZ4 NDGKTPNAVDGCHVSCSELMENNIAKAVACAKHIVSE-QGITAWVAWKSHCRDHDVSSYVEGCSL
OARLYZ5 NDGKTPNAVDGCHVSCSALMENDIEKAVACAKHIVSE-QGITAWVAWKSHCRDHDVSSYVEGCTL
BTA_ENSBTAG00000026088 NDGKTPNAVDGCHVSCSELMENDIAKAVACAKHIVSE-QGITAWVAWKSHCRDHDVSSYVEGCTL
BTA_ENSBTAG00000046628 NDGKTPNAVDGCHVSCSELMENDIAKAVACAKHIVSE-QGITAWVAWKSHCRDHDVSSYVQGCTL
BTA_ENSBTAG00000046511 NDGKTPNAVDGCHVSCSELMENDIAKAVACAKQIVSE-QGITAWVAWKSHCRDHDVSSYVEGCTL
OARLYZ6 NDGKTPKAVDGCHVSRSELMENDIAKAVTCAKKIVSE-QGITVWVAGKSHCRDHDISSYVEGCTL
BTA_ENSBTAG00000026322 ---KTPKAVDGCPVSHSKLMGNDIAKAVACAKKIVSE-QGITAWVAWKSHCRDHDVSSYVEGCTL
BTA_ENSBTAG00000026323 NDGKTPKAVNGCGVSCSAMLKDDITQAVACAKTIVSR-QGITAWVAWKNKCRNRDVSSYIRGCKL
OARLYZ3 NDGKTPRAVNGCGVSCSALLKDDITQAVACAKKIVSR-QGITAWVAWKNNCRNRNVSSYIQGCKL
OARLYZ2 NDGKTPKAVNGCGVSCSALLKDDITQAVACAKKIVSQ-QGITAWVAWKNNCQNRDVTSYVKGCGV
BTA_ENSBTAG00000000198 NDGKTPKAVNGCGVSCSALLKDDITQAVACAKKIVSQ-QGITAWVAWTNKCRNRDLTSYVKGCGV
BTA_ENSBTAG00000022971 NDGKTPKAVNGCGVSCSALLKDDITQAVACAKKIVSQ-LGLTAWVAWKNKCQNRDLTSYVQGCRV
BTA_ENSBTAG00000020564 NDGKTPNAVNGCGVSCSALLKDDITQAVACAKKIVSQ-QGLTAWVAWKNNCRNRDLTSYVQGCGV
BTA_ENSBTAG00000039170 NDGKTPRAVNACRIPCSALLKDDITQAVASAKK-VSDPQGVRAWVVWRNKCQNQDLRSYVQDCGV
OARLYZ1 NDGKTPRAVNACRIPCSALLKDDITQAVECAKRVVRDPQGIKAWVAWRNKCQNKDLRSYVKGCRV
BTA_ENSBTAG00000011941 NDGKTPKAVNACRIPCSALLKDDITQAVACAKRVVRDPQGIKAWVAWRNKCQNRDLRSYVQGCRV
OARLYZ7 NDGKTPGAVNACHIPCSALLQDDITQAVACAKRVVSDPQGIRAWVAWRSHCQNQDLTSYIQGCGV
BTA_ENSBTAG00000026779 NDGKTPGAVNACHLPCGALLQDDITQAVACAKRVVSDPQGIRAWVAWRSHCQNQDLTSYIQGCGV
SSC_ENSSSCG00000000492 NDGKTPKAVNACHISCKVLLDDDLSQDIECAKRVVRDPLGVKAWVAWRAHCQNKDVSQYIRGCKL
B: Predicted 3D structures. Panel titles: LYZ4 expressed in abomasum; LYZ1 expressed in rumen.
Fig. S29
Amino acid sequence alignment and predicted 3D structure of sheep rumen- and
abomasum-expressed lysozymes. A: Protein sequence alignment. The proline at position
103, an amino acid commonly deleted in the digestive lysozymes (mainly expressed in
stomach and intestine), is indicated in yellow. B: 3D structure prediction by
3D-JIGSAW (version 2.0) shows a longer alpha helix around position 103 in
digestive lysozymes than in other antibacterial lysozymes, which is in accord with the
role of proline as a structural disruptor of alpha helices.
Fig. S30
A: The phylogenetic Maximum Composite Likelihood (ML) tree of PAG genes
calculated on the basis of the coding sequences. The red line labels pepsinogen A
(PGA5). B: Expression levels (log read number) in different tissues of Texel sheep.
The red triangle labels PGA5, whereas the green triangles show the PAGs.
Fig. S31
The expression level of genes encoding prolactin related proteins. Generated using log
read number from placenta and uterus tissues in Texel ewe sheep.
Fig. S32
Gene tree for ovine and bovine prolactin genes. Ovine genes are shown in purple and
bovine genes in blue. The numbers on the branches are Ka/Ks ratio calculated by the
PAML branch model. Placental prolactin-related protein (PRP); prolactin (PRL);
Chorionic somatomammotropin hormone (CSH). The cattle genes are from Ensembl
annotation of UMD3.1, and the sheep genes are from Oar v3.1.
>LOC100300593
MSHHPHPHPHPHPHPHPHPHPHPHQHQHHHQCKVPCHPPPKVCPPKCHEPCPPHPCPSP
PSQKKCPPGPPCPPCEQKCPPKWK
>LOC100848030
MSQQQHPHPHQHQHQHHHQCKEPCHPPPKVCPPKCHEPCPPHPCPPPLGQKKCPPGPPC
PPCEQKCPPKWK
>LOC100848041
MSHPPHPHPHPHPHPHPHQHHHQCKEPCHPPPKVCPPKCHEPCPPHPCPSPPSQKKCPP
GPPCPPCKQKCPPKWK
>LOC100848051
MSHHQHPHPHQNQHQHHHQCKEPCHPPPKVCPPKCHEPCPPHPCPPPLSQKKCPPGPPC
PPCEHKCPPKWK
>LOC100848464
MSHHPHPHPHLHPHQHHHQCKEPCHPPPIVCPRKCQEPCPPHPCPPPLGQKKCPPVPPC
PPCEHKCPPKWK
>LOC100297713
MSNHPHPHPHPHQHQHHHHHQCKEPCHPPPKVCPPKCHEPCPPHPCPSPPSQKKCPPGP
PCPPCEQKCPPKWK
>LOC100848091
MSHHPHPHPHPHPHPHQHHHQCKEPCHPPPKVCPPKCHKPCPPHPCPPPLGQKKCPPGP
PCPPCEQKCPPKWK
>LOC100301420
MSHHPHPHPHPHPHPHPHPHPHQCKEPCHPPPKVCPPKCHEPCPPHPCPPPLGQKKCPP
GPPCPPCEQKCPPKWK
Fig. S33
Predicted amino acid sequences of NCBI cattle RefSeq PRD-SPRRII-related loci entries
in the EDC region.
Table S1
Summary of sequencing datasets used for the assembly of sheep genome.
* MeDIP-seq for high GC content sequence
**new Illumina protocol for GC content unbiased sequence
***Six females of different breeds were sequenced at 0.5 fold coverage, Merino, Texel,
Awassi, Poll Dorset, Romney and Scottish Blackface, which was used to generate the early
draft sheep genome, Oar v1.0.
Sample    Purpose   Sequence   Paired-end   Libraries  GA lanes  Total length  Read length  Coverage
                    method     insert size                       (Gb)          (bp)         (fold)
Female    assembly  Illumina   180 bp       1          4         23.8          101          7.93
Female    assembly  Illumina   350 bp       4          21        105.0         101          35.0
Female    assembly  Illumina   800 bp       2          6         32.0          101          10.7
Female    assembly  Illumina   2 kb         2          11        35.7          45           11.9
Female    assembly  Illumina   5 kb         2          6         18.5          45           6.17
Female    assembly  Illumina   9 kb         1          3         8.3           45           2.77
Female    assembly  Illumina   17 kb        1          1         1.8           45           0.60
Female*   fill gap  Illumina   200 bp       1          1         2.0           45           0.67
Male      fill gap  Illumina   200 bp       1          16        77            101          24.0
Male      fill gap  Illumina   500 bp       1          24        72            101          25.5
Male**    fill gap  Illumina   554 bp       8          1         36            101          12.0
Male**    fill gap  Illumina   1.3 kb       1          1         27            101          9.00
Male      check     Roche 454  8 kb                              3.3                        1.10
Male      check     Roche 454  20 kb                             1.5                        0.50
Male      check     Sanger     184 kb                            0.3           687          0.09
Six***    fill gap  Roche 454  ---                               9.0                        3.00
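Most rows of Table S1 are consistent with fold coverage being the total sequence length divided by a genome size of about 3 Gb. That divisor is an inference from the table values (for example 105.0 Gb and 35.0 fold), not a figure stated in this section, so the constant `ASSUMED_GENOME_SIZE_GB` and the function name below are hypothetical. A minimal sketch of the arithmetic:

```python
# Assumption: ~3 Gb genome size, inferred from the ratio of the
# "Total length" and "Coverage" columns; not stated in the table.
ASSUMED_GENOME_SIZE_GB = 3.0


def fold_coverage(total_length_gb, genome_size_gb=ASSUMED_GENOME_SIZE_GB):
    """Fold coverage = sequenced bases / genome size (both in Gb)."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_length_gb / genome_size_gb


# (insert size, total length in Gb, coverage reported in Table S1)
rows = [
    ("180 bp", 23.8, 7.93),
    ("350 bp", 105.0, 35.0),
    ("800 bp", 32.0, 10.7),
    ("2 kb", 35.7, 11.9),
]
for insert_size, total_gb, reported in rows:
    computed = fold_coverage(total_gb)
    print(f"{insert_size}: computed {computed:.2f}, reported {reported}")
```

Running the same check over every row is a quick way to spot entries whose total length and coverage do not agree under this assumption.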
Table S4
De novo assembly result for Oar v3.1 scaffolds.
                        Contig Size     Contig Number   Scaffold Size   Scaffold Number
N90                     10,562          65,304          474,708         1301
N80                     17,604          46,985          883,008         913
N70                     24,477          34,817          1,261,951       666
N60                     31,783          25,732          1,673,097       486
N50                     39,959          18,614          2,231,873       350
N40                     49,777          12,913          2,717,352       244
N30                     61,672          8,325           3,371,060       158
N20                     77,817          4,653           4,205,983       89
N10                     105,369         1,817           5,600,573       36
maximum length          383,429                         11,902,472
Total Size              2,534,293,732                   2,606,199,298
Total Number(>100 bp)                   131,971                         8,261
Total Number(>2 kb)                     106,156                         6,767
Table S8
Comparative gene cluster statistics.
Species         Number     Genes in clustered  Unclustered  Family   Unique    Average genes
                of genes   families            genes        number   families  per family
O. aries        20,908     16,703              581          16,122   28        1.26
C. hircus       22,175     18,221              2,315        15,906   22        1.25
B. mutus        22,282     16,950              629          16,321   33        1.33
S. scrofa       17,433     14,012              934          13,078   32        1.26
B. taurus       19,970     15,988              104          15,884   0         1.25
C. familiaris   19,258     16,092              389          15,703   11        1.20
E. caballus     20,419     15,807              222          15,585   20        1.30
H. sapiens      21,375     16,852              841          16,011   75        1.28
M. musculus     22,927     17,010              1,049        15,961   88        1.37
C. bactrianus   20,251     17,102              4,014        13,088   106       1.24
M. domestica    19,439     19,439              1,150        16,037   183       1.14
Table S12
Statistics of the completeness of assembled sheep genome based on 248 CEGs¹.
              Number of CEGs  Mapped proteins  Completeness (%)
Complete²     248             246 (246)³       99.19 (99.19)
  Group 1     66              64 (66)          96.97 (100.00)
  Group 2     56              56 (56)          100.00 (100.00)
  Group 3     61              61 (59)          100.00 (96.72)
  Group 4     65              65 (65)          100.00 (100.00)
Partial²      248             247 (247)        99.60 (99.60)
  Group 1     66              65 (66)          98.48 (100.00)
  Group 2     56              56 (56)          100.00 (100.00)
  Group 3     61              61 (60)          100.00 (98.36)
  Group 4     65              65 (65)          100.00 (100.00)
¹ The CEGs database contains groups of genes from the following species: Homo sapiens,
Drosophila melanogaster, Arabidopsis thaliana, Caenorhabditis elegans, Saccharomyces
cerevisiae and Schizosaccharomyces pombe.
² The CEGs were classed into 4 groups based on conservation.
³ The results are for Oar v3.1 (results in parentheses are for the cattle genome assembly
UMD3.1).
Table S13
Transposable element content of Oar v3.1.
Repeat Class¹               Length (bp)      Coverage (%)
LINE                        728,855,852      27.83
  L1                        319,721,973      12.21
  RTE (BovB)                359,281,823      13.72
  L2                        26,886,976       1.03
  CR1                       2,069,573        0.08
  other                     20,895,507       0.80
SINEs                       177,574,932      6.78
  BOV-A                     86,706,307       3.31
  tRNA                      60,289,254       2.30
  MIR                       29,844,110       1.14
  Other                     735,261          0.03
LTR                         124,398,733      4.75
  ERVs                      121,689,503      4.65
  LTR other                 2,709,230        0.10
DNA transposon              59,832,665       2.28
Other                       26,753,185       1.02
Interspersed repeat total   1,117,415,367    42.67
¹ Classes were assigned according to homology to known transposable elements in the
Repbase database (http://www.girinst.org/repbase/, Release 16.02).
Table S14
Comparative gene statistics.
Gene set       Number   Average transcript  Average CDS   Average exons  Average exon  Average intron
                        length (bp)         length (bp)   per gene       length (bp)   length (bp)
Ovis aries     20,921   35,331.80           1,559.02      9.61           162.15        3,918.68
Bos taurus     19,970   35,399.92           1,610.89      9.65           167.01        3,908.19
Homo sapiens   21,375   47,027.29           1,660.16      9.56           173.71        5,300.82
Mus musculus   22,927   35,443.23           1,550.56      8.65           179.28        4,430.11
Table S15
Annotated genes in the Ensembl sheep gene list.
Number Percent (%)
Total genes 20921 100.00
Total annotated genes 20782 99.34
Swissprot 20444 97.72
TrEMBL 20757 99.22
InterPro 18028 86.17
KEGG 15841 75.72
GO 15068 72.02
Unannotated 139 0.66
Table S16
Classes of non-coding RNAs in the Ensembl sheep annotation.
Class Number Length (bp)
rRNA 305 34,441
snoRNA 756 86,792
misc_RNA 361 70,855
miRNA 1,305 113,558
snRNA 1,234 136,240
total 3,961 441,886
Table S17
De novo assembly results for Oar v2.0 scaffolds.
                        Contig Size     Contig Number   Scaffold Size   Scaffold Number
N90                     3,593           158,224         202,439         2,823
N80                     6,988           109,238         406,941         1,900
N70                     10,219          79,515          618,490         1,358
N60                     13,604          58,124          841,456         983
N50*                    17,369          41,706          1,079,158       696
N40                     21,802          28,718          1,379,732       474
N30                     27,487          18,387          1,760,652       300
N20                     35,087          10,218          2,326,727       166
N10                     48,038          3,994           3,251,999       68
maximum length          271,172                         7,616,799
Total Size              2,523,285,863                   2,709,936,754
Total Number(>100 bp)                   794,490                         490,776
Total Number(>2 kb)                     194,049                         8,115
*N50 is the size of the contig/scaffold such that 50% of the total assembly bases are
in contigs/scaffolds of this length or longer. 6.9% of the scaffold sequence is “N” (gap).
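The N50 definition in the footnote generalizes to the other rows of the table (N90, N80, and so on) by changing the fraction of bases. A minimal sketch follows, with a hypothetical function name `nxx` and a toy list of lengths; it is a generic implementation of the stated definition, not the code used for the assembly statistics.

```python
def nxx(lengths, fraction=0.5):
    """Return the Nxx length for a list of contig or scaffold lengths.

    Sequences are sorted from longest to shortest and accumulated until
    they hold at least `fraction` of the total bases; the length of the
    sequence that reaches this threshold is returned (N50 for 0.5).
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    threshold = fraction * sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


# Toy assembly of five contigs (total 300 bases).
toy_lengths = [100, 80, 60, 40, 20]
print("N50:", nxx(toy_lengths, 0.5))
print("N90:", nxx(toy_lengths, 0.9))
```

The "Number" columns in the table correspond to how many sequences had been accumulated when the threshold was reached.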
Table S18
Super-scaffold assemblies used for the chromosome assembly.
Super scaffold Size (bp) Number
N50 37,056,980 23
N60 28,759,665 30
N70 21,637,475 41
N80 14,866,675 55
N90 6,068,238 83
Total Number 349
Total Size 2,569,509,652
Table S19
Assessing genome assembly using 52,821 de novo assembled model mRNAs from the
RNA-seq data from the 7 tissues from the female Texel sheep.
Length coverage status              Oar v2.0                               Oar v3.1
when mapping ESTs                   Mapped   % of total  Average coverage  Mapped   % of total  Average coverage
                                    mRNAs    mRNAs       (%) by length     mRNAs    mRNAs       (%) by length
Single hit with 90% coverage        45357    85.9        99.5              47221    89.4        99.6
Single hit with 20~90% coverage     3171     6           84                2410     4.6         85.4
Best hit of multiple hits           4088     7.7         99                2833     5.4         96
In total for matched mRNAs          52616    99.6        98                52464    99.3        98.4
Unmapped mRNAs                      205      -           -                 357      -           -
Table S20
Genome assembly QC using BAC sequences – alignment statistics.
BAC ID       BAC length  Coverage  No. of alignment  Matched   Gap   Gap length  Gap
             (bp)                  blocks            Oar v3.1  num   (bp)        ratio
AC148038.3   198,973     100%      23                OAR4      4     2,505       1.26%
AC147928.3   190,282     100%      30                OAR4      12    3,672       1.93%
AC147927.3   187,152     100%      17                OAR4      8     1,197       0.64%
AC147929.3   186,963     100%      32                OAR4      7     5,943       3.18%
AC147930.2   176,451     100%      22                OAR4      4     709         0.40%
AC152892.3   162,012     100%      14                OAR4      4     3,684       2.27%
AC147844.3   156,221     100%      31                OAR4      10    5,663       3.62%
AC148245.3   153,498     100%      32                OAR4      7     5,063       3.30%
AC147843.3   151,538     100%      17                OAR4      3     1,149       0.76%
AC148039.3   143,308     100%      12                OAR4      4     4,579       3.20%
AC147842.3   139,569     100%      10                OAR4      1     186         0.13%
AC147841.3   131,638     100%      13                OAR4      4     2,536       1.93%
AC148115.3   95,794      100%      13                OAR4      5     6,281       6.56%
AC162117.3   80,882      100%      3                 OAR4      0     0           0.00%
HM355886.3   78,116      100%      10                OAR17     2     1,661       2.13%
AC159152.3   30,087      100%      2                 OAR4      0     0           0.00%
Total        2,262,484   100%      281                         75    44,828      1.98%
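The gap ratio column in Table S20 matches gap length divided by BAC length, expressed as a percentage. The sketch below reproduces a few of the table values; the function name `gap_ratio_percent` is hypothetical, and the three rows are copied from the table only for illustration.

```python
def gap_ratio_percent(gap_length_bp, bac_length_bp):
    """Gap ratio as a percentage of the BAC length."""
    if bac_length_bp <= 0:
        raise ValueError("BAC length must be positive")
    return 100.0 * gap_length_bp / bac_length_bp


# (BAC ID, BAC length in bp, gap length in bp), taken from Table S20.
rows = [
    ("AC148038.3", 198_973, 2_505),
    ("AC147930.2", 176_451, 709),
    ("all 16 BACs", 2_262_484, 44_828),
]
for bac_id, bac_bp, gap_bp in rows:
    print(f"{bac_id}: {gap_ratio_percent(gap_bp, bac_bp):.2f}%")
```

Note that the total in the last table row is the ratio of summed gap length to summed BAC length, not the mean of the per-BAC ratios.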
Table S21
Genome assembly QC using BAC sequences – repeat sequences.
Repeat Class Ratio of gap sequences
LTR/Copia 0.03%
TandemRepeat 0.85%
LTR/ERVL 0.08%
SINE/MIR 0.10%
SINE/tRNA-Glu 0.17%
DNA/MULE-MuDR 0.01%
LINE/L2 0.02%
DNA/Sola 0.07%
LTR/ERVK 4.04%
LINE/L1 8.12%
LTR/ERV1 0.01%
DNA/PIF-Harbinger 0.04%
LINE/CR1 0.01%
DNA/hAT-Charlie 0.41%
DNA/CMC-EnSpm 0.09%
DNA/CMC-Transib 0.09%
LTR/Gypsy 0.04%
DNA/DNA 0.04%
LTR 0.72%
LINE/RTE-BovB 82.49%
SINE/BovA 2.58%
Table S22
Comparison of Oar v3.1 with other mammalian reference genome assemblies.
                        Horse         Pig              Cattle        Yak              Goat                 Sheep
                        EquCab 2.0    Sscrofa 10.2     UMD 3.1       BosGru 2.0       Chi 1.0              Oar v3.1
sequence clones         Fosmid, BAC   Fosmid, shotgun  BAC, shotgun  150-500 bp,      150-800 bp,          150-800 bp,
                                                                     2, 5, 10, 20 kb  2, 5, 10, 20, 40 kb  2, 5, 10, 20 kb,
                                                                                                           BAC
Coverage                6.8×          25×              7.1×          65×              66×                  150×
Total scaffold length   2.47 Gb       2.8 Gb           2.67 Gb       2.65 Gb          2.64 Gb              2.61 Gb
N50 contig              112 kb        73 kb            97 kb         20 kb            19 kb                40 kb
Heterozygosity Rate     1/2000        ---              1/1700        1/1200           1/800                1/500
Anchor to chromosome    96%           93%              99%           ---              95%                  99%
Repeat ratio            49.5%         48.2%            46.5%         41.8%            42.2%                42.7%
Table S23
Sheep segmental duplications from WGAC (>90% identity) analysis.
Cutoff blocks Length (Mb)
>1 kb 16,249 33.83
>5 kb 721 5.26
>10 kb 42 0.83
Table S24
Sheep segmental duplications from WSSD combined with WGAC (>95% identity)
analysis.
Cutoff blocks length (Mb)
>1 kb 7,912 25.77
>5 kb 1,097 13.39
>10 kb 434 8.85
Table S27
Bovine PRD-SPRRII family genes in the EDC region.
NCBI Gene      UniGene cluster  Number of sequences  Site(s) of expression
LOC100297713   Bt.98911         11                   predominantly reticulum
LOC100300593   Bt.23535         142                  predominantly rumen
               Bt.99623         2                    rumen
               Bt.67198         30                   predominantly rumen
               Bt.99621         1                    rumen
               Bt.109358        46                   rumen
LOC100301420   Bt.102924        1                    rumen
LOC100848030   Bt.9678          504                  predominantly rumen
LOC100848041   Bt.92420         31                   predominantly rumen
LOC100848051   Bt.92418         47                   predominantly rumen
               Bt.99625         6                    rumen
LOC100848091   Bt.20608         87                   predominantly rumen
LOC100848464   not in UniGene
Captions for additional figures and tables
Fig. S11
Robust sheep RH map versus the Oar v3.1 assembly.
Table S2
Sheep linkage map SM5.
Table S3
Robust sheep RH map.
Table S5
List of segmental duplications.
Table S6
Breakpoints between the sheep Oar v3.1 and cattle UMD3 assemblies defined by BACs.
Table S7
List of tissue samples used to generate the RNA-Seq data used in the Ensembl
annotation.
Table S9
Gene family expansions in ruminants.
Table S10
EDC region gene annotation sheep.
Table S11
Lipid gene expression in sheep skin.
Table S25
Gene name to Ensembl gene id translation.
Table S26
Genes with allele specific gene expression.
References
1. R. R. Hofmann, Evolutionary steps of ecophysiological adaptation and diversification of ruminants: A comparative view of their digestive system. Oecologia 78, 443–457 (1989). doi:10.1007/BF00378733
2. M. J. Wolin, Fermentation in the rumen and human large intestine. Science 213, 1463–1468 (1981). doi:10.1126/science.7280665 Medline
3. T. J. Hackmann, J. N. Spain, Invited review: Ruminant ecology and evolution: perspectives useful to ruminant livestock research and production. J. Dairy Sci. 93, 1320–1334 (2010). doi:10.3168/jds.2009-2071 Medline
4. C. A. E. Strömberg, Evolution of grasses and grassland ecosystems. Annu. Rev. Earth Planet. Sci. 39, 517–544 (2011). doi:10.1146/annurev-earth-040809-152402
5. E. J. Edwards, C. P. Osborne, C. A. Strömberg, S. A. Smith, W. J. Bond, P. A. Christin, A. B. Cousins, M. R. Duvall, D. L. Fox, R. P. Freckleton, O. Ghannoum, J. Hartwell, Y. Huang, C. M. Janis, J. E. Keeley, E. A. Kellogg, A. K. Knapp, A. D. Leakey, D. M. Nelson, J. M. Saarela, R. F. Sage, O. E. Sala, N. Salamin, C. J. Still, B. Tipple; C4 Grasses Consortium, The origins of C4 grasslands: Integrating evolutionary and ecosystem science. Science 328, 587–591 (2010). doi:10.1126/science.1177216 Medline
6. E. N. Bergman, Energy contributions of volatile fatty acids from the gastrointestinal tract in various species. Physiol. Rev. 70, 567–590 (1990). Medline
7. K. A. Johnson, D. E. Johnson, Methane emissions from cattle. J. Anim. Sci. 73, 2483–2492 (1995). Medline
8. M. E. Stewart, in Biology of the Integument, J. Bereiter-Hahn, A. G. Matoltsy, K. S. Richards, Eds. (Springer, Berlin Heidelberg, 1986), vol. 2, chap. 43, pp. 824–832.
9. M. L. Schlossman, J. P. McCarthy, Lanolin and its derivatives. J. Am. Oil Chem. Soc. 55, 447–450 (1978). doi:10.1007/BF02911911
10. Materials and methods are available as supplementary material on Science Online
11. A. Clop, F. Marcq, H. Takeda, D. Pirottin, X. Tordoir, B. Bibé, J. Bouix, F. Caiment, J. M. Elsen, F. Eychenne, C. Larzul, E. Laville, F. Meish, D. Milenkovic, J. Tobin, C. Charlier, M. Georges, A mutation creating a potential illegitimate microRNA target site in the myostatin gene affects muscularity in sheep. Nat. Genet. 38, 813–818 (2006). doi:10.1038/ng1810 Medline
12. M. Kypriotou, M. Huber, D. Hohl, The human epidermal differentiation complex: Cornified envelope precursors, S100 proteins and the ‘fused genes’ family. Exp. Dermatol. 21, 643–649 (2012). doi:10.1111/j.1600-0625.2012.01472.x Medline
13. L. Wang, R. L. Baldwin, 6th, B. W. Jesse, Identification of two cDNA clones encoding small proline-rich proteins expressed in sheep ruminal epithelium. Biochem. J. 317, 225–233 (1996). Medline
14. H. J. Song, G. Poy, N. Darwiche, U. Lichti, T. Kuroki, P. M. Steinert, T. Kartasova, Mouse Sprr2 genes: A clustered family of genes showing differential expression in epithelial tissues. Genomics 55, 28–42 (1999). doi:10.1006/geno.1998.5607 Medline
15. J. Deng, R. Pan, R. Wu, Distinct roles for amino- and carboxyl-terminal sequences of SPRR1 protein in the formation of cross-linked envelopes of conducting airway epithelial cells. J. Biol. Chem. 275, 5739–5747 (2000). doi:10.1074/jbc.275.8.5739 Medline
16. P. M. Steinert, E. Candi, T. Kartasova, L. Marekov, Small proline-rich proteins are cross-bridging proteins in the cornified cell envelopes of stratified squamous epithelia. J. Struct. Biol. 122, 76–85 (1998). doi:10.1006/jsbi.1998.3957 Medline
17. K. R. Feingold, Thematic review series: Skin lipids. The role of epidermal lipids in cutaneous permeability barrier homeostasis. J. Lipid Res. 48, 2531–2546 (2007). doi:10.1194/jlr.R700013-JLR200 Medline
18. D. Marshall, M. J. Hardman, K. M. Nield, C. Byrne, Differentially expressed late constituents of the epidermal cornified envelope. Proc. Natl. Acad. Sci. U.S.A. 98, 13031–13036 (2001). doi:10.1073/pnas.231489198 Medline
19. F. P. W. Radner, S. Grond, G. Haemmerle, A. Lass, R. Zechner, Fat in the skin: Triacylglycerol metabolism in keratinocytes and its role in the development of neutral lipid storage disease. Dermatoendocrinology 3, 77–83 (2011). doi:10.4161/derm.3.2.15472 Medline
20. D. Cheng, T. C. Nelson, J. Chen, S. G. Walker, J. Wardwell-Swanson, R. Meegalla, R. Taub, J. T. Billheimer, M. Ramaker, J. N. Feder, Identification of acyl coenzyme A:monoacylglycerol acyltransferase 3, an intestinal specific enzyme implicated in dietary fat absorption. J. Biol. Chem. 278, 13611–13614 (2003). doi:10.1074/jbc.C300042200 Medline
21. A. M. Hall, K. Kou, Z. Chen, T. A. Pietka, M. Kumar, K. M. Korenblat, K. Lee, K. Ahn, E. Fabbrini, S. Klein, B. Goodwin, B. N. Finck, Evidence for regulated monoacylglycerol acyltransferase expression and activity in human liver. J. Lipid Res. 53, 990–999 (2012). doi:10.1194/jlr.P025536 Medline
22. A. Kazantseva, A. Goltsov, R. Zinchenko, A. P. Grigorenko, A. V. Abrukova, Y. K. Moliaka, A. G. Kirillov, Z. Guo, S. Lyle, E. K. Ginter, E. I. Rogaev, Human hair growth deficiency is linked to a genetic defect in the phospholipase gene LIPH. Science 314, 982–985 (2006). doi:10.1126/science.1133276 Medline
23. A. Inoue, N. Arima, J. Ishiguro, G. D. Prestwich, H. Arai, J. Aoki, LPA-producing enzyme PA-PLA₁α regulates hair follicle development by modulating EGFR signalling. EMBO J. 30, 4248–4260 (2011). doi:10.1038/emboj.2011.296 Medline
24. G. Bobe, J. W. Young, D. C. Beitz, Invited review: Pathology, etiology, prevention, and treatment of fatty liver in dairy cows. J. Dairy Sci. 87, 3105–3124 (2004). doi:10.3168/jds.S0022-0302(04)73446-3 Medline
25. D. L. Ingle, D. E. Bauman, U. S. Garrigus, Lipogenesis in the ruminant: In vivo site of fatty acid synthesis in sheep. J. Nutr. 102, 617–623 (1972). Medline
26. A. M. Crawford, K. G. Dodds, A. J. Ede, C. A. Pierson, G. W. Montgomery, H. G. Garmonsway, A. E. Beattie, K. Davies, J. F. Maddox, S. W. Kappes, R. T. Stone, T. C. Nguyen, J. M. Penty, E. A. Lord, J. E. Broom, J. Buitkamp, W. Schwaiger, J. T. Epplen, P. Matthew, M. E. Matthews, D. J. Hulme, K. J. Beh, R. A. McGraw, C. W. Beattie, An autosomal genetic linkage map of the sheep genome. Genetics 140, 703–724 (1995). Medline
27. B. P. Dalrymple, E. F. Kirkness, M. Nefedov, S. McWilliam, A. Ratnakumar, W. Barris, S. Zhao, J. Shetty, J. F. Maddox, M. O’Grady, F. Nicholas, A. M. Crawford, T. Smith, P. J. de Jong, J. McEwan, V. H. Oddy, N. E. Cockett; International Sheep Genomics Consortium, Using comparative genomics to reorder the human genome sequence into a virtual sheep genome. Genome Biol. 8, R152 (2007). doi:10.1186/gb-2007-8-7-r152 Medline
28. Y. Dong, M. Xie, Y. Jiang, N. Xiao, X. Du, W. Zhang, G. Tosser-Klopp, J. Wang, S. Yang, J. Liang, W. Chen, J. Chen, P. Zeng, Y. Hou, C. Bian, S. Pan, Y. Li, X. Liu, W. Wang, B. Servin, B. Sayre, B. Zhu, D. Sweeney, R. Moore, W. Nie, Y. Shen, R. Zhao, G. Zhang, J. Li, T. Faraut, J. Womack, Y. Zhang, J. Kijas, N. Cockett, X. Xu, S. Zhao, J. Wang, W. Wang, Sequencing and automated whole-genome optical mapping of the genome of a domestic goat (Capra hircus). Nat. Biotechnol. 31, 135–141 (2013). doi:10.1038/nbt.2478 Medline
29. R. Geng, C. Yuan, Y. Chen, Exploring differentially expressed genes by RNA-Seq in cashmere goat (Capra hircus) skin during hair follicle development and cycling. PLOS ONE 8, e62704 (2013). doi:10.1371/journal.pone.0062704 Medline
30. R. A. Scholey, N. J. Evans, R. W. Blowey, J. P. Massey, R. D. Murray, R. F. Smith, W. E. Ollier, S. D. Carter, Identifying host pathogenic pathways in bovine digital dermatitis by RNA-Seq analysis. Vet. J. 197, 699–706 (2013). doi:10.1016/j.tvjl.2013.03.008 Medline
31. D. Takai, P. A. Jones, Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proc. Natl. Acad. Sci. U.S.A. 99, 3740–3745 (2002). doi:10.1073/pnas.052410099 Medline
32. R. Li, C. Yu, Y. Li, T. W. Lam, S. M. Yiu, K. Kristiansen, J. Wang, SOAP2: An improved ultrafast tool for short read alignment. Bioinformatics 25, 1966–1967 (2009). doi:10.1093/bioinformatics/btp336 Medline
33. R. Li, W. Fan, G. Tian, H. Zhu, L. He, J. Cai, Q. Huang, Q. Cai, B. Li, Y. Bai, Z. Zhang, Y. Zhang, W. Wang, J. Li, F. Wei, H. Li, M. Jian, J. Li, Z. Zhang, R. Nielsen, D. Li, W. Gu, Z. Yang, Z. Xuan, O. A. Ryder, F. C. Leung, Y. Zhou, J. Cao, X. Sun, Y. Fu, X. Fang, X. Guo, B. Wang, R. Hou, F. Shen, B. Mu, P. Ni, R. Lin, W. Qian, G. Wang, C. Yu, W. Nie, J. Wang, Z. Wu, H. Liang, J. Min, Q. Wu, S. Cheng, J. Ruan, M. Wang, Z. Shi, M. Wen, B. Liu, X. Ren, H. Zheng, D. Dong, K. Cook, G. Shan, H. Zhang, C. Kosiol, X. Xie, Z. Lu, H. Zheng, Y. Li, C. C. Steiner, T. T. Lam, S. Lin, Q. Zhang, G. Li, J. Tian, T. Gong, H. Liu, D. Zhang, L. Fang, C. Ye, J. Zhang, W. Hu, A. Xu, Y. Ren, G. Zhang, M. W. Bruford, Q. Li, L. Ma, Y. Guo, N. An, Y. Hu, Y. Zheng, Y. Shi, Z. Li, Q. Liu, Y. Chen, J. Zhao, N. Qu, S. Zhao, F. Tian, X. Wang, H. Wang, L. Xu, X. Liu, T. Vinar, Y. Wang, T. W. Lam, S. M. Yiu, S. Liu, H. Zhang, D. Li, Y. Huang, X. Wang, G. Yang, Z. Jiang, J. Wang, N. Qin, L. Li, J. Li, L. Bolund, K. Kristiansen, G. K. Wong, M. Olson, X. Zhang, S. Li, H. Yang, J. Wang, J. Wang, The sequence and de novo assembly
of the giant panda genome. Nature 463, 311–317 (2010). doi:10.1038/nature08696 Medline
34. P. Laurent, L. Schibler, A. Vaiman, J. Laubier, C. Delcros, G. Cosseddu, D. Vaiman, E. P. Cribiu, M. Yerle, A 12 000-rad whole-genome radiation hybrid panel in sheep: Application to the study of the ovine chromosome 18 region containing a QTL for scrapie susceptibility. Anim. Genet. 38, 358–363 (2007). doi:10.1111/j.1365-2052.2007.01607.x Medline
35. C. H. Wu, W. Jin, K. Nomura, T. Goldammer, T. Hadfield, B. P. Dalrymple, S. McWilliam, J. F. Maddox, N. E. Cockett, A radiation hybrid comparative map of ovine chromosome 1 aligned to the virtual sheep genome. Anim. Genet. 40, 435–455 (2009). doi:10.1111/j.1365-2052.2009.01857.x Medline
36. J. D. Storey, R. Tibshirani, Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U.S.A. 100, 9440–9445 (2003). doi:10.1073/pnas.1530509100 Medline
37. T. Faraut, S. de Givry, P. Chabrier, T. Derrien, F. Galibert, C. Hitte, T. Schiex, A comparative genome approach to marker ordering. Bioinformatics 23, e50–e56 (2007). doi:10.1093/bioinformatics/btl321 Medline
38. S. de Givry, M. Bouchez, P. Chabrier, D. Milan, T. Schiex, CARHTA GENE: Multipopulation integrated genetic and radiation hybrid mapping. Bioinformatics 21, 1703–1704 (2005). doi:10.1093/bioinformatics/bti222 Medline
39. B. Servin, T. Faraut, N. Iannuccelli, D. Zelenika, D. Milan, High-resolution autosomal radiation hybrid maps of the pig genome and their contribution to the genome sequence assembly. BMC Genomics 13, 585 (2012). doi:10.1186/1471-2164-13-585 Medline
40. B. Servin, S. de Givry, T. Faraut, Statistical confidence measures for genome maps: Application to the validation of genome assemblies. Bioinformatics 26, 3035–3042 (2010). doi:10.1093/bioinformatics/btq598 Medline
41. J. D. White, P. G. Allingham, C. M. Gorman, D. L. Emery, P. Hynd, J. Owens, A. Bell, J. Siddell, G. Harper, B. J. Hayes, H. D. Daetwyler, J. Usmar, M. E. Goddard, J. M. Henshall, S. Dominik, H. Brewer, J. H. J. van der Werf, F. W. Nicholas, R. Warner, C. Hofmyer, T. Longhurst, T. Fisher, P. Swan, R. Forage, V. H. Oddy, Design and phenotyping procedures for recording wool, skin, parasite resistance, growth, carcass yield and quality traits of the SheepGENOMICS mapping flock. Anim. Prod. Sci. 52, 157–171 (2012). doi:10.1071/AN11085
42. J. E. Miller, S. C. Bishop, N. E. Cockett, R. A. McGraw, Segregation of natural and experimental gastrointestinal nematode infection in F2 progeny of susceptible Suffolk and resistant Gulf Coast Native sheep and its usefulness in assessment of genetic variation. Vet. Parasitol. 140, 83–89 (2006). doi:10.1016/j.vetpar.2006.02.043 Medline
43. T. C. Matise, M. Perlin, A. Chakravarti, Automated construction of genetic linkage maps using an expert system (MultiMap): A human genome linkage map. Nat. Genet. 6, 384–390 (1994). doi:10.1038/ng0494-384 Medline
44. P. Green, Construction and comparison of chromosome 21 radiation hybrid and linkage maps using CRI-MAP. Cytogenet. Cell Genet. 59, 122–124 (1992). doi:10.1159/000133221 Medline
45. A. V. Zimin, A. L. Delcher, L. Florea, D. R. Kelley, M. C. Schatz, D. Puiu, F. Hanrahan, G. Pertea, C. P. Van Tassell, T. S. Sonstegard, G. Marçais, M. Roberts, P. Subramanian, J. A. Yorke, S. L. Salzberg, A whole-genome assembly of the domestic cow, Bos taurus. Genome Biol. 10, R42 (2009). doi:10.1186/gb-2009-10-4-r42
46. R. L. Ge, Q. Cai, Y. Y. Shen, A. San, L. Ma, Y. Zhang, X. Yi, Y. Chen, L. Yang, Y. Huang, R. He, Y. Hui, M. Hao, Y. Li, B. Wang, X. Ou, J. Xu, Y. Zhang, K. Wu, C. Geng, W. Zhou, T. Zhou, D. M. Irwin, Y. Yang, L. Ying, H. Bao, J. Kim, D. M. Larkin, J. Ma, H. A. Lewin, J. Xing, R. N. Platt, 2nd, D. A. Ray, L. Auvil, B. Capitanu, X. Zhang, G. Zhang, R. W. Murphy, J. Wang, Y. P. Zhang, J. Wang, Draft genome sequence of the Tibetan antelope. Nat. Commun. 4, 1858 (2013). doi:10.1038/ncomms2860 Medline
47. G. E. Liu, M. Ventura, A. Cellamare, L. Chen, Z. Cheng, B. Zhu, C. Li, J. Song, E. E. Eichler, Analysis of recent segmental duplications in the bovine genome. BMC Genomics 10, 571 (2009). doi:10.1186/1471-2164-10-571 Medline
48. R. S. Harris, thesis, The Pennsylvania State University, University Park (2007).
49. D. M. Bickhart, Y. Hou, S. G. Schroeder, C. Alkan, M. F. Cardone, L. K. Matukumalli, J. Song, R. D. Schnabel, M. Ventura, J. F. Taylor, J. F. Garcia, C. P. Van Tassell, T. S. Sonstegard, E. E. Eichler, G. E. Liu, Copy number variation of individual cattle genomes using next-generation sequencing. Genome Res. 22, 778–790 (2012). doi:10.1101/gr.133967.111 Medline
50. E. T. Wang, R. Sandberg, S. Luo, I. Khrebtukova, L. Zhang, C. Mayr, S. F. Kingsmore, G. P. Schroth, C. B. Burge, Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476 (2008). doi:10.1038/nature07509 Medline
51. D. Kim, G. Pertea, C. Trapnell, H. Pimentel, R. Kelley, S. L. Salzberg, TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013). doi:10.1186/gb-2013-14-4-r36 Medline
52. C. Trapnell, B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. J. van Baren, S. L. Salzberg, B. J. Wold, L. Pachter, Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010). doi:10.1038/nbt.1621 Medline
53. I. Birol, S. D. Jackman, C. B. Nielsen, J. Q. Qian, R. Varhol, G. Stazyk, R. D. Morin, Y. Zhao, M. Hirst, J. E. Schein, D. E. Horsman, J. M. Connors, R. D. Gascoyne, M. A. Marra, S. J. Jones, De novo transcriptome assembly with ABySS. Bioinformatics 25, 2872–2877 (2009). doi:10.1093/bioinformatics/btp367 Medline
54. J. Gao, K. Liu, H. Liu, H. T. Blair, G. Li, C. Chen, P. Tan, R. Z. Ma, A complete DNA sequence map of the ovine major histocompatibility complex. BMC Genomics 11, 466 (2010). doi:10.1186/1471-2164-11-466 Medline
55. S. Kurtz, A. Phillippy, A. L. Delcher, M. Smoot, M. Shumway, C. Antonescu, S. L. Salzberg, Versatile and open software for comparing large genomes. Genome Biol. 5, R12 (2004). doi:10.1186/gb-2004-5-2-r12 Medline
56. G. Parra, K. Bradnam, I. Korf, CEGMA: A pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061–1067 (2007). doi:10.1093/bioinformatics/btm071 Medline
57. T. Wicker, F. Sabot, A. Hua-Van, J. L. Bennetzen, P. Capy, B. Chalhoub, A. Flavell, P. Leroy, M. Morgante, O. Panaud, E. Paux, P. SanMiguel, A. H. Schulman, A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet. 8, 973–982 (2007). doi:10.1038/nrg2165 Medline
58. N. Chen, Curr. Protoc. Bioinformatics, Chapter 4, Unit 4.10 (2004).
59. V. Curwen, E. Eyras, T. D. Andrews, L. Clarke, E. Mongin, S. M. Searle, M. Clamp, The Ensembl automatic gene annotation system. Genome Res. 14, 942–950 (2004). doi:10.1101/gr.1858004 Medline
60. UniProt Consortium, Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res. 41, (D1), D43–D47 (2013). doi:10.1093/nar/gks1068 Medline
61. A. Morgulis, E. M. Gertz, A. A. Schäffer, R. Agarwala, A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J. Comput. Biol. 13, 1028–1040 (2006). doi:10.1089/cmb.2006.13.1028 Medline
62. G. Benson, Tandem repeats finder: A program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999). doi:10.1093/nar/27.2.573 Medline
63. E. Birney, R. Durbin, Using GeneWise in the Drosophila annotation experiment. Genome Res. 10, 547–548 (2000). doi:10.1101/gr.10.4.547 Medline
64. J. E. Collins, S. White, S. M. J. Searle, D. L. Stemple, Incorporating RNA-seq data into the zebrafish Ensembl genebuild. Genome Res. 22, 2067–2078 (2012). doi:10.1101/gr.137901.112 Medline
65. H. Li, R. Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009). doi:10.1093/bioinformatics/btp324 Medline
66. G. S. Slater, E. Birney, Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31 (2005). doi:10.1186/1471-2105-6-31 Medline
67. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman, Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990). doi:10.1016/S0022-2836(05)80360-2 Medline
68. S. W. Burge, J. Daub, R. Eberhardt, J. Tate, L. Barquist, E. P. Nawrocki, S. R. Eddy, P. P. Gardner, A. Bateman, Rfam 11.0: 10 years of RNA families. Nucleic Acids Res. 41, (D1), D226–D232 (2013). doi:10.1093/nar/gks1005 Medline
69. A. Kozomara, S. Griffiths-Jones, miRBase: Integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res. 39, (Database), D152–D157 (2011). doi:10.1093/nar/gkq1027 Medline
70. X. J. Yu, H. K. Zheng, J. Wang, W. Wang, B. Su, Detecting lineage-specific adaptive evolution of brain-expressed genes in human using rhesus macaque as outgroup. Genomics 88, 745–751 (2006). doi:10.1016/j.ygeno.2006.05.008 Medline
71. J. Ruan, H. Li, Z. Chen, A. Coghlan, L. J. Coin, Y. Guo, J. K. Hériché, Y. Hu, K. Kristiansen, R. Li, T. Liu, A. Moses, J. Qin, S. Vang, A. J. Vilella, A. Ureta-Vidal, L. Bolund, J. Wang, R. Durbin, TreeFam: 2008 Update. Nucleic Acids Res. 36, (Database), D735–D740 (2008). doi:10.1093/nar/gkm1005 Medline
72. R. C. Edgar, MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004). doi:10.1093/nar/gkh340 Medline
73. J. P. Huelsenbeck, F. Ronquist, MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17, 754–755 (2001). doi:10.1093/bioinformatics/17.8.754 Medline
74. Z. Yang, PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13, 555–556 (1997). Medline
75. M. J. Benton, P. C. J. Donoghue, Paleontological evidence to date the tree of life. Mol. Biol. Evol. 24, 26–53 (2007). doi:10.1093/molbev/msl150 Medline
76. T. De Bie, N. Cristianini, J. P. Demuth, M. W. Hahn, CAFE: A computational tool for the study of gene family evolution. Bioinformatics 22, 1269–1271 (2006). doi:10.1093/bioinformatics/btl097 Medline
77. A. M. Szalkowski, Fast and robust multiple sequence alignment with phylogeny-aware gap placement. BMC Bioinformatics 13, 129 (2012). doi:10.1186/1471-2105-13-129
78. X. Huang, A. Madan, CAP3: A DNA sequence assembly program. Genome Res. 9, 868–877 (1999). doi:10.1101/gr.9.9.868 Medline
79. D. L. Adelson, G. R. Cam, U. DeSilva, I. R. Franklin, Gene expression in sheep skin and wool (hair). Genomics 83, 95–105 (2004). doi:10.1016/S0888-7543(03)00210-6 Medline
80. Y. Benjamini, T. P. Speed, Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 40, e72 (2012). doi:10.1093/nar/gks001
81. M. Nothnagel, A. Wolf, A. Herrmann, K. Szafranski, I. Vater, M. Brosch, K. Huse, R. Siebert, M. Platzer, J. Hampe, M. Krawczak, Statistical inference of allelic imbalance from transcriptome data. Hum. Mutat. 32, 98–106 (2011). doi:10.1002/humu.21396 Medline
82. R. She, J. S. Chu, K. Wang, J. Pei, N. Chen, GenBlastA: Enabling BLAST to identify homologous gene sequences. Genome Res. 19, 143–149 (2009). doi:10.1101/gr.082081.108 Medline
83. E. L. Sonnhammer, G. von Heijne, A. Krogh, A hidden Markov model for predicting transmembrane helices in protein sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol. 6, 175–182 (1998). Medline
84. A. L. Archibald, N. E. Cockett, B. P. Dalrymple, T. Faraut, J. W. Kijas, J. F. Maddox, J. C. McEwan, V. Hutton Oddy, H. W. Raadsma, C. Wade, J. Wang, W. Wang, X. Xun; International Sheep Genomics Consortium, The sheep genome reference sequence: A
work in progress. Anim. Genet. 41, 449–453 (2010). doi:10.1111/j.1365-2052.2010.02100.x Medline
85. X. Xu, W. Chen, R. Talbot, K. Worley, Y. Jiang, W. Barris, B. Dalrymple, J. Maddox, T. Faraut, R. Brauning, M. Xie, W. Zhang, A. Archibald, J. Kijas, N. Cockett, J. McEwan, H. Oddy, F. Nicholas, K. Kristiansen, J. Wang, W. Wang, Genome data from the sheep. GigaScience (2011); http://dx.doi.org/10.5524/100023
86. S. L. Salzberg, A. M. Phillippy, A. Zimin, D. Puiu, T. Magoc, S. Koren, T. J. Treangen, M. C. Schatz, A. L. Delcher, M. Roberts, G. Marçais, M. Pop, J. A. Yorke, GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22, 557–567 (2012). doi:10.1101/gr.131383.111 Medline
87. J. W. Kijas, J. A. Lenstra, B. Hayes, S. Boitard, L. R. Porto Neto, M. San Cristobal, B. Servin, R. McCulloch, V. Whan, K. Gietzen, S. Paiva, W. Barendse, E. Ciani, H. Raadsma, J. McEwan, B. Dalrymple; International Sheep Genomics Consortium Members, Genome-wide analysis of the world’s sheep breeds reveals high levels of historic mixture and strong recent selection. PLOS Biol. 10, e1001258 (2012). doi:10.1371/journal.pbio.1001258 Medline
88. Q. Qiu, G. Zhang, T. Ma, W. Qian, J. Wang, Z. Ye, C. Cao, Q. Hu, J. Kim, D. M. Larkin, L. Auvil, B. Capitanu, J. Ma, H. A. Lewin, X. Qian, Y. Lang, R. Zhou, L. Wang, K. Wang, J. Xia, S. Liao, S. Pan, X. Lu, H. Hou, Y. Wang, X. Zang, Y. Yin, H. Ma, J. Zhang, Z. Wang, Y. Zhang, D. Zhang, T. Yonezawa, M. Hasegawa, Y. Zhong, W. Liu, Y. Zhang, Z. Huang, S. Zhang, R. Long, H. Yang, J. Wang, J. A. Lenstra, D. N. Cooper, Y. Wu, J. Wang, P. Shi, J. Wang, J. Liu, The yak genome and adaptation to life at high altitude. Nat. Genet. 44, 946–949 (2012). doi:10.1038/ng.2343 Medline
89. C. G. Elsik, R. L. Tellam, K. C. Worley, R. A. Gibbs, D. M. Muzny, G. M. Weinstock, D. L. Adelson, E. E. Eichler, L. Elnitski, R. Guigó, D. L. Hamernik, S. M. Kappes, H. A. Lewin, D. J. Lynn, F. W. Nicholas, A. Reymond, M. Rijnkels, L. C. Skow, E. M. Zdobnov, L. Schook, J. Womack, T. Alioto, S. E. Antonarakis, A. Astashyn, C. E. Chapple, H. C. Chen, J. Chrast, F. Câmara, O. Ermolaeva, C. N. Henrichsen, W. Hlavina, Y. Kapustin, B. Kiryutin, P. Kitts, F. Kokocinski, M. Landrum, D. Maglott, K. Pruitt, V. Sapojnikov, S. M. Searle, V. Solovyev, A. Souvorov, C. Ucla, C. Wyss, J. M. Anzola, D. Gerlach, E. Elhaik, D. Graur, J. T. Reese, R. C. Edgar, J. C. McEwan, G. M. Payne, J. M. Raison, T. Junier, E. V. Kriventseva, E. Eyras, M. Plass, R. Donthu, D. M. Larkin, J. Reecy, M. Q. Yang, L. Chen, Z. Cheng, C. G. Chitko-McKown, G. E. Liu, L. K. Matukumalli, J. Song, B. Zhu, D. G. Bradley, F. S. Brinkman, L. P. Lau, M. D. Whiteside, A. Walker, T. T. Wheeler, T. Casey, J. B. German, D. G. Lemay, N. J. Maqbool, A. J. Molenaar, S. Seo, P. Stothard, C. L. Baldwin, R. Baxter, C. L. Brinkmeyer-Langford, W. C. Brown, C. P. Childers, T. Connelley, S. A. Ellis, K. Fritz, E. J. Glass, C. T. Herzig, A. Iivanainen, K. K. Lahmers, A. K. Bennett, C. M. Dickens, J. G. Gilbert, D. E. Hagen, H. Salih, J. Aerts, A. R. Caetano, B. Dalrymple, J. F. Garcia, C. A. Gill, S. G. Hiendleder, E. Memili, D. Spurlock, J. L. Williams, L. Alexander, M. J. Brownstein, L. Guan, R. A. Holt, S. J. Jones, M. A. Marra, R. Moore, S. S. Moore, A. Roberts, M. Taniguchi, R. C. Waterman, J. Chacko, M. M. Chandrabose, A. Cree, M. D. Dao, H. H. Dinh, R. A. Gabisi, S. Hines, J. Hume, S. N. Jhangiani, V. Joshi, C. L. Kovar, L. R. Lewis, Y. S.
Liu, J. Lopez, M. B. Morgan, N. B. Nguyen, G. O. Okwuonu, S. J. Ruiz, J. Santibanez, R. A. Wright, C. Buhay, Y. Ding, S. Dugan-Rocha, J. Herdandez, M. Holder, A. Sabo, A. Egan, J. Goodell, K. Wilczek-Boney, G. R. Fowler, M. E. Hitchens, R. J. Lozado, C. Moen, D. Steffen, J. T. Warren, J. Zhang, R. Chiu, J. E. Schein, K. J. Durbin, P. Havlak, H. Jiang, Y. Liu, X. Qin, Y. Ren, Y. Shen, H. Song, S. N. Bell, C. Davis, A. J. Johnson, S. Lee, L. V. Nazareth, B. M. Patel, L. L. Pu, S. Vattathil, R. L. Williams, Jr., S. Curry, C. Hamilton, E. Sodergren, D. A. Wheeler, W. Barris, G. L. Bennett, A. Eggen, R. D. Green, G. P. Harhay, M. Hobbs, O. Jann, J. W. Keele, M. P. Kent, S. Lien, S. D. McKay, S. McWilliam, A. Ratnakumar, R. D. Schnabel, T. Smith, W. M. Snelling, T. S. Sonstegard, R. T. Stone, Y. Sugimoto, A. Takasuga, J. F. Taylor, C. P. Van Tassell, M. D. Macneil, A. R. Abatepaulo, C. A. Abbey, V. Ahola, I. G. Almeida, A. F. Amadio, E. Anatriello, S. M. Bahadue, F. H. Biase, C. R. Boldt, J. A. Carroll, W. A. Carvalho, E. P. Cervelatti, E. Chacko, J. E. Chapin, Y. Cheng, J. Choi, A. J. Colley, T. A. de Campos, M. De Donato, I. K. Santos, C. J. de Oliveira, H. Deobald, E. Devinoy, K. E. Donohue, P. Dovc, A. Eberlein, C. J. Fitzsimmons, A. M. Franzin, G. R. Garcia, S. Genini, C. J. Gladney, J. R. Grant, M. L. Greaser, J. A. Green, D. L. Hadsell, H. A. Hakimov, R. Halgren, J. L. Harrow, E. A. Hart, N. Hastings, M. Hernandez, Z. L. Hu, A. Ingham, T. Iso-Touru, C. Jamis, K. Jensen, D. Kapetis, T. Kerr, S. S. Khalil, H. Khatib, D. Kolbehdari, C. G. Kumar, D. Kumar, R. Leach, J. C. Lee, C. Li, K. M. Logan, R. Malinverni, E. Marques, W. F. Martin, N. F. Martins, S. R. Maruyama, R. Mazza, K. L. McLean, J. F. Medrano, B. T. Moreno, D. D. Moré, C. T. Muntean, H. P. Nandakumar, M. F. Nogueira, I. Olsaker, S. D. Pant, F. Panzitta, R. C. Pastor, M. A. Poli, N. Poslusny, S. Rachagani, S. Ranganathan, A. Razpet, P. K. Riggs, G. Rincon, N. Rodriguez-Osorio, S. L. Rodriguez-Zas, N. E. 
Romero, A. Rosenwald, L. Sando, S. M. Schmutz, L. Shen, L. Sherman, B. R. Southey, Y. S. Lutzow, J. V. Sweedler, I. Tammen, B. P. Telugu, J. M. Urbanski, Y. T. Utsunomiya, C. P. Verschoor, A. J. Waardenberg, Z. Wang, R. Ward, R. Weikard, T. H. Welsh, Jr., S. N. White, L. G. Wilming, K. R. Wunderlich, J. Yang, F. Q. Zhao; Bovine Genome Sequencing and Analysis Consortium, The genome sequence of taurine cattle: A window to ruminant biology and evolution. Science 324, 522–528 (2009). doi:10.1126/science.1169588 Medline
90. E. Gootwine, Placental hormones and fetal-placental development. Anim. Reprod. Sci. 82-83, 551–566 (2004). doi:10.1016/j.anireprosci.2004.04.008 Medline
91. C. M. Wade, E. Giulotto, S. Sigurdsson, M. Zoli, S. Gnerre, F. Imsland, T. L. Lear, D. L. Adelson, E. Bailey, R. R. Bellone, H. Blöcker, O. Distl, R. C. Edgar, M. Garber, T. Leeb, E. Mauceli, J. N. MacLeod, M. C. Penedo, J. M. Raison, T. Sharpe, J. Vogel, L. Andersson, D. F. Antczak, T. Biagi, M. M. Binns, B. P. Chowdhary, S. J. Coleman, G. Della Valle, S. Fryc, G. Guérin, T. Hasegawa, E. W. Hill, J. Jurka, A. Kiialainen, G. Lindgren, J. Liu, E. Magnani, J. R. Mickelson, J. Murray, S. G. Nergadze, R. Onofrio, S. Pedroni, M. F. Piras, T. Raudsepp, M. Rocchi, K. H. Røed, O. A. Ryder, S. Searle, L. Skow, J. E. Swinburne, A. C. Syvänen, T. Tozaki, S. J. Valberg, M. Vaudin, J. R. White, M. C. Zody, E. S. Lander, K. Lindblad-Toh, Broad Institute Genome Sequencing Platform, Broad Institute Whole Genome Assembly Team, Genome sequence, comparative analysis, and population genetics of the domestic horse. Science 326, 865–867 (2009). doi:10.1126/science.1178158 Medline
92. J. M. Kidd, S. Gravel, J. Byrnes, A. Moreno-Estrada, S. Musharoff, K. Bryc, J. D. Degenhardt, A. Brisbin, V. Sheth, R. Chen, S. F. McLaughlin, H. E. Peckham, L. Omberg, C. A. Bormann Chung, S. Stanley, K. Pearlstein, E. Levandowsky, S. Acevedo-Acevedo, A. Auton, A. Keinan, V. Acuña-Alonzo, R. Barquera-Lozano, S. Canizales-Quinteros, C. Eng, E. G. Burchard, A. Russell, A. Reynolds, A. G. Clark, M. G. Reese, S. E. Lincoln, A. J. Butte, F. M. De La Vega, C. D. Bustamante, Population genetic inference from personal genome data: Impact of ancestry and admixture on human genomic variation. Am. J. Hum. Genet. 91, 660–671 (2012). doi:10.1016/j.ajhg.2012.08.025 Medline
93. M. A. M. Groenen, A. L. Archibald, H. Uenishi, C. K. Tuggle, Y. Takeuchi, M. F. Rothschild, C. Rogel-Gaillard, C. Park, D. Milan, H. J. Megens, S. Li, D. M. Larkin, H. Kim, L. A. Frantz, M. Caccamo, H. Ahn, B. L. Aken, A. Anselmo, C. Anthon, L. Auvil, B. Badaoui, C. W. Beattie, C. Bendixen, D. Berman, F. Blecha, J. Blomberg, L. Bolund, M. Bosse, S. Botti, Z. Bujie, M. Bystrom, B. Capitanu, D. Carvalho-Silva, P. Chardon, C. Chen, R. Cheng, S. H. Choi, W. Chow, R. C. Clark, C. Clee, R. P. Crooijmans, H. D. Dawson, P. Dehais, F. De Sapio, B. Dibbits, N. Drou, Z. Q. Du, K. Eversole, J. Fadista, S. Fairley, T. Faraut, G. J. Faulkner, K. E. Fowler, M. Fredholm, E. Fritz, J. G. Gilbert, E. Giuffra, J. Gorodkin, D. K. Griffin, J. L. Harrow, A. Hayward, K. Howe, Z. L. Hu, S. J. Humphray, T. Hunt, H. Hornshøj, J. T. Jeon, P. Jern, M. Jones, J. Jurka, H. Kanamori, R. Kapetanovic, J. Kim, J. H. Kim, K. W. Kim, T. H. Kim, G. Larson, K. Lee, K. T. Lee, R. Leggett, H. A. Lewin, Y. Li, W. Liu, J. E. Loveland, Y. Lu, J. K. Lunney, J. Ma, O. Madsen, K. Mann, L. Matthews, S. McLaren, T. Morozumi, M. P. Murtaugh, J. Narayan, D. T. Nguyen, P. Ni, S. J. Oh, S. Onteru, F. Panitz, E. W. Park, H. S. Park, G. Pascal, Y. Paudel, M. Perez-Enciso, R. Ramirez-Gonzalez, J. M. Reecy, S. Rodriguez-Zas, G. A. Rohrer, L. Rund, Y. Sang, K. Schachtschneider, J. G. Schraiber, J. Schwartz, L. Scobie, C. Scott, S. Searle, B. Servin, B. R. Southey, G. Sperber, P. Stadler, J. V. Sweedler, H. Tafer, B. Thomsen, R. Wali, J. Wang, J. Wang, S. White, X. Xu, M. Yerle, G. Zhang, J. Zhang, J. Zhang, S. Zhao, J. Rogers, C. Churcher, L. B. Schook, Analyses of pig genomes provide insight into porcine demography and evolution. Nature 491, 393–398 (2012). doi:10.1038/nature11622 Medline
94. R. N. Kim, D. S. Kim, S. H. Choi, B. H. Yoon, A. Kang, S. H. Nam, D. W. Kim, J. J. Kim, J. H. Ha, A. Toyoda, A. Fujiyama, A. Kim, M. Y. Kim, K. H. Park, K. S. Lee, H. S. Park, Genome analysis of the domestic dog (Korean Jindo) by massively parallel sequencing. DNA Res. 19, 275–288 (2012). doi:10.1093/dnares/dss011 Medline
95. J. W. Kijas, D. Townley, B. P. Dalrymple, M. P. Heaton, J. F. Maddox, A. McGrath, P. Wilson, R. G. Ingersoll, R. McCulloch, S. McWilliam, D. Tang, J. McEwan, N. Cockett, V. H. Oddy, F. W. Nicholas, H. Raadsma; International Sheep Genomics Consortium, A genome wide survey of SNP variation reveals the genetic structure of sheep breeds. PLOS ONE 4, e4668 (2009). doi:10.1371/journal.pone.0004668 Medline
96. L. Iannuzzi, G. P. Di Meo, Chromosomal evolution in bovids: A comparison of cattle, sheep and goat G- and R-banded chromosomes and cytogenetic divergences among cattle, goat and river buffalo sex chromosomes. Chromosome Res. 3, 291–299 (1995). doi:10.1007/BF00713067 Medline
97. J. F. Maddox, A presentation of the differences between the sheep and goat genetic maps. Genet. Sel. Evol. 37, (Suppl 1), S1–S10 (2005). doi:10.1186/1297-9686-37-S1-S1
98. T. Goldammer, R. M. Brunner, A. Rebl, C. H. Wu, K. Nomura, T. Hadfield, J. F. Maddox, N. E. Cockett, Cytogenetic anchoring of radiation hybrid and virtual maps of sheep chromosome X and comparison of X chromosomes in sheep, cattle, and human. Chromosome Res. 17, 497–506 (2009). doi:10.1007/s10577-009-9047-9 Medline
99. A. S. Van Laere, W. Coppieters, M. Georges, Characterization of the bovine pseudoautosomal boundary: Documenting the evolutionary history of mammalian sex chromosomes. Genome Res. 18, 1884–1895 (2008). doi:10.1101/gr.082487.108 Medline
100. R. L. Jirtle, J. R. Weidman, Imprinted and more equal. Am. Sci. 95, 143–149 (2007). doi:10.1511/2007.64.1019
101. I. M. Morison, J. P. Ramsay, H. G. Spencer, A census of mammalian imprinting. Trends Genet. 21, 457–465 (2005). doi:10.1016/j.tig.2005.06.008 Medline
102. N. E. Cockett, M. A. Smit, C. A. Bidwell, K. Segers, T. L. Hadfield, G. D. Snowder, M. Georges, C. Charlier, The callipyge mutation and other genes that affect muscle hypertrophy in sheep. Genet. Sel. Evol. 37, (Suppl 1), S65–S81 (2005). doi:10.1186/1297-9686-37-S1-S65 Medline
103. E. A. Glazov, S. McWilliam, W. C. Barris, B. P. Dalrymple, Origin, evolution, and biological role of miRNA cluster in DLK-DIO3 genomic region in placental mammals. Mol. Biol. Evol. 25, 939–948 (2008). doi:10.1093/molbev/msn045 Medline
104. T. M. Jermann, J. G. Opitz, J. Stackhouse, S. A. Benner, Reconstructing the evolutionary history of the artiodactyl ribonuclease superfamily. Nature 374, 57–59 (1995). doi:10.1038/374057a0 Medline
105. D. E. Dobson, E. M. Prager, A. C. Wilson, Stomach lysozymes of ruminants. I. Distribution and catalytic properties. J. Biol. Chem. 259, 11607–11616 (1984). Medline
106. H. R. Ibrahim, U. Thomas, A. Pellegrini, A helix-loop-helix peptide at the upper lip of the active site cleft of lysozyme confers potent antimicrobial activity with membrane permeabilization action. J. Biol. Chem. 276, 43767–43774 (2001). doi:10.1074/jbc.M106317200 Medline
107. B. J. Norris, V. A. Whan, A gene duplication affecting expression of the ovine ASIP gene is responsible for white and black sheep. Genome Res. 18, 1282–1293 (2008). doi:10.1101/gr.072090.107 Medline
108. B. P. Telugu, A. M. Walker, J. A. Green, Characterization of the bovine pregnancy-associated glycoprotein gene family: Analysis of gene sequences, regulatory regions within the promoter and expression of selected genes. BMC Genomics 10, 185 (2009). doi:10.1186/1471-2164-10-185 Medline
109. J. A. Green, S. Xie, X. Quan, B. Bao, X. Gan, N. Mathialagan, J. F. Beckers, R. M. Roberts, Pregnancy-associated bovine and ovine glycoproteins exhibit spatially and temporally
distinct expression patterns during pregnancy. Biol. Reprod. 62, 1624–1631 (2000). doi:10.1095/biolreprod62.6.1624 Medline
110. K. Koshi, K. Ushizawa, K. Kizaki, T. Takahashi, K. Hashizume, Expression of endogenous retrovirus-like transcripts in bovine trophoblastic cells. Placenta 32, 493–499 (2011). doi:10.1016/j.placenta.2011.04.002 Medline
111. R. Oko, C. R. Morales, A novel testicular protein, with sequence similarities to a family of lipid binding proteins, is a major component of the rat sperm perinuclear theca. Dev. Biol. 166, 235–245 (1994). doi:10.1006/dbio.1994.1310 Medline
112. W. S. Lagakos, X. Guan, S. Y. Ho, L. R. Sawicki, B. Corsico, S. Kodukula, K. Murota, R. E. Stark, J. Storch, Liver fatty acid-binding protein binds monoacylglycerol in vitro and in mouse liver cytosol. J. Biol. Chem. 288, 19805–19815 (2013). doi:10.1074/jbc.M113.473579 Medline