www.sciencemag.org/cgi/content/full/344/6188/1168/DC1

Supplementary Materials for

The sheep genome illuminates biology of the rumen and lipid metabolism

Yu Jiang, Min Xie, Wenbin Chen, Richard Talbot, Jillian F. Maddox, Thomas Faraut, Chunhua Wu, Donna M. Muzny, Yuxiang Li, Wenguang Zhang, Jo-Ann Stanton,

Rudiger Brauning, Wesley C. Barris, Thibaut Hourlier, Bronwen L. Aken, Stephen M. J. Searle, David L. Adelson, Chao Bian, Graham R. Cam, Yulin Chen,

Shifeng Cheng, Udaya DeSilva, Karen Dixen, Yang Dong, Guangyi Fan, Ian R. Franklin, Shaoyin Fu, Pablo Fuentes-Utrilla, Rui Guan, Margaret A. Highland, Michael E. Holder, Guodong Huang, Aaron B. Ingham, Shalini N. Jhangiani, Divya Kalra, Christie L. Kovar,

Sandra L. Lee, Weiqing Liu, Xin Liu, Changxin Lu, Tian Lv, Tittu Mathew, Sean McWilliam, Moira Menzies, Shengkai Pan, David Robelin, Bertrand Servin,

David Townley, Wenliang Wang, Bin Wei, Stephen N. White, Xinhua Yang, Chen Ye, Yaojing Yue, Peng Zeng, Qing Zhou, Jacob B. Hansen, Karsten Kristiansen,

Richard A. Gibbs, Paul Flicek, Christopher C. Warkup, Huw E. Jones, V. Hutton Oddy, Frank W. Nicholas, John C. McEwan, James W. Kijas, Jun Wang, Kim C. Worley,* Alan L. Archibald,* Noelle Cockett,* Xun Xu,* Wen Wang,* Brian P. Dalrymple*

*To whom correspondence should be addressed. E-mail: [email protected] (B.P.D.); [email protected] (W.W); [email protected] (X.X.); [email protected]

(A.L.A.); [email protected] (K.C.W.); [email protected] (N.C.)

Published 6 June 2014, Science 344, 1168 (2014)

DOI: 10.1126/science.1252806

This PDF file includes:

Materials and Methods

Supplementary Text

Figs. S1 to S10, S12 to S33

Tables S1, S4, S8, S12 to S24, S27

Other Supplementary Material for this manuscript includes the following: (available at www.sciencemag.org/cgi/content/full/344/6188/1168/DC1)

Fig. S11

Tables S2, S3, S5 to S7, S9 to S11, S25, S26

Materials and Methods

1.1 DNA sample preparation and sequencing

The data for the sheep reference genome were generated at two sequencing facilities

in late 2009. The Texel breed was selected because Texel is a popular terminal sire breed

in several countries. In addition, the Texel breed served as one of the paternal grandsire

breeds of the sheep International Mapping Flock (IMF) (26).

Together with BGI-Shenzhen, the Kunming Institute of Zoology commissioned the

whole-genome shotgun sequencing and de novo assembly for a 6-month-old Texel ewe,

which was provided by Dr. Jacob B. Hansen of the University of Copenhagen. The

genomic DNA was isolated from liver tissue by standard molecular biology techniques,

and then sequenced using the Genome Analyzer II sequencer (Illumina, CA, USA). For

small insert size libraries, 161 Gb of 101 bp paired-end reads from 7 libraries with insert

sizes ranging from 180 to 800 bp were generated (table S1). In addition, 60 Gb of 45 bp

mate-paired reads were generated from libraries with average insert sizes of 2 kb, 5 kb, 9

kb and 17 kb (table S1) with several additional steps of DNA circularization, digestion of

linear DNA, fragmentation of circularized DNA, and purification of biotinylated DNA.

To read through regions of higher GC content, we also generated 2 Gb of MeDIP-seq data (see section 1.3).

Simultaneously, the ARK-Genomics Center for Comparative and Functional

Genomics at The Roslin Institute sequenced a single inbred Texel ram (about 14%

inbreeding over the last 5 generations of mating), which was used previously as the DNA

source for the CHORI-243 BAC library and the virtual sheep genome (27). A total of 149

Gb of 101 bp Illumina paired-end sequences from two libraries with insert sizes of 200

bp and 500 bp were obtained (table S1). In addition, 5 Gb of large insert size (8 kb and

20 kb) mate-paired reads from the same male Texel were generated at the Baylor College

of Medicine Human Genome Sequence Center (BCM-HGSC) using 454 technology

(table S1). An additional ~21 fold coverage of mate-paired reads data from The Roslin

Institute for the male Texel was also used to fill gaps (table S1), especially those located

in GC-rich regions. The mate-paired library was prepared with the 8 kb Roche mate-paired

kit primers, following the Roche instructions until the circularized DNA had been sheared

and the biotinylated DNA fragments captured; Illumina-specific primers were then added

and the library was processed for Illumina sequencing. The new Illumina TruSeq

SBS v3 reagents and TruSeq PE v3 cluster kits for the HiSeq 2000 (Illumina, CA, USA)

were used, which Illumina claims lead to less GC bias than the earlier kits

used on the Illumina GA IIx instrument. Another 9 Gb of 454 reads and 0.3

Gb of BAC end sequences (sequenced by Sanger technology) (table S1), which were

used to generate the sheep draft genome Oar v1.0, were also integrated for gap filling

and assembly checking.

1.2 Sample collection, RNA isolation, and RNA-Seq method

RNA was purified using Trizol (Invitrogen, CA, USA) for sequencing from seven

fresh tissues (Heart, Liver, Ovary, Kidney, Brain, Lung, White adipose) from the same

single Texel ewe used for the genome sequencing (table S7). RNA sequencing libraries

were constructed using the Illumina mRNA-Seq Prep Kit. Briefly, first strand cDNA

synthesis was performed with oligo-T primer and Superscript II reverse transcriptase

(Invitrogen, USA). The second strand was synthesized with E. coli DNA PolI

(Invitrogen, CA, USA). Double stranded cDNA was purified with a QIAquick PCR

purification kit (Qiagen, Germantown, MD), and sheared with a nebulizer (Invitrogen,

CA, USA) to 100-500 bp fragments. After end repair and addition of a 3’-dA overhang,

the cDNA was ligated to Illumina PE adapter oligo mix, and 200±20 bp fragments were

selected by gel purification. After 15 cycles of PCR amplification the 200 bp paired-end

libraries were sequenced using the Illumina HiSeq 2000 platform.

For wool related studies, RNA was extracted from a skin sample collected from the

flank of one super-fine strain (fiber diameter=16.5 micron) of Gansu alpine fine wool

sheep (adult ewe, two years old) in June 2010. Gansu alpine fine wool sheep were

developed in the Huang Cheng District of Gansu Province, China, by crossing Mongolian

or Tibetan sheep with Xinjiang fine wool sheep and then with several fine wool sheep

breeds from the Union of Soviet Socialist Republics, such as Caucasian and Salsk. RNA

libraries were constructed and sequenced as described above.

A large tissue survey of 83 tissue samples from another 4 Texels (Texel breed fiber

diameter 28-33 micron) was sequenced (150 bp paired-end RNA-Seq reads) using the

Illumina HiSeq 2500 at The Roslin Institute in 2013 (table S7). The RNA sequencing

libraries were prepared using a minor modification of the Illumina TruSeq stranded total

RNA sequencing kit. The rRNA was removed using the Epicentre Ribo-Zero

(Human/mouse/rat) kit. The rRNA-depleted RNA was fragmented as indicated in the kit

protocol except that the fragmentation time was reduced to 15 seconds at 94°C using the

suggested modification in the protocol. The rest of the library preparation was as

indicated in the protocol except that the number of cycles for library enrichment was

reduced to 12 cycles to reduce the chance of duplicate products. The libraries were

quantified by qPCR before being pooled for sequencing.

We also downloaded goat and cattle RNA-Seq data from public databases: liver

RNA-Seq data from Yunling Black Goat (28), skin RNA-Seq data from Shanbei White

Cashmere Goat (29), cattle skin RNA-Seq data (30), and cattle liver RNA-Seq data

(GSE37544) from the NCBI GEO database.

1.3 DNA sample preparation and MeDIP sequencing

We isolated the genomic DNA from liver tissue of the sequenced female Texel by

standard molecular biology techniques. MeDIP DNA libraries were prepared following

the protocol described previously (31). Each MeDIP library was subjected to paired-end

sequencing using the Illumina HiSeq 2000. Five micrograms of DNA were isolated using

E.Z.N.A. HP Tissue DNA Midi Kit (Omega) and was sonicated to ~100-500 bp

fragments with a Bioruptor sonicator (Diagenode). Then libraries were constructed by

adopting the Illumina Paired-End protocol consisting of end repair, A-base addition and

adaptor ligation steps, which were performed using Illumina’s Paired-End DNA Sample

Prep kit following the manufacturer’s instructions. Adaptor-ligated DNA was

immunoprecipitated by anti-methylcytidine monoclonal antibodies. The specificity of the

enrichment was confirmed by qPCR using SYBR® Green master mix (Applied

Biosystems) and the primers for positive and negative internal control DNA of non-human

samples supplied in the Magnetic Methylated DNA Immunoprecipitation kit

(Diagenode). Cycling of qPCR validation consisted of 95°C 5 min, followed by 40 cycles

95°C 15 s and 60°C 1 min. The enriched fragments with methylation and 10% input

DNA were purified on ZYMO DNA Clean & Concentrator-5 columns (ZYMO)

following the manufacturer’s instructions. DNA was eluted in 30 μl buffer EB and its

concentration was measured. Enriched fragments were amplified by adaptor-mediated

PCR in a final reaction volume of 50 μl consisting of 23 μl purified DNA, 25 μl Phusion

DNA polymerase mix (NEB) and 2 μl PCR primers. Amplification consisted of 94°C 30

s, 10 cycles of 94°C 30 s, 60°C 30 s, 72°C 30 s, followed by prolonged extension for 5

min at 72°C, and the products were held at 4°C. Amplification quality and quantity were

evaluated using Agilent 2100 Analyzer DNA 1000 chips, and the products were purified on a 2% agarose gel and

eluted in 15 μl buffer EB. Ultra-high-throughput 50 bp paired-end sequencing was carried

out using the Illumina HiSeq 2000 according to manufacturer’s instructions. Raw

sequencing data were processed by the Illumina base-calling pipeline. Then the

sequenced reads were mapped onto Oar v3.1 using SOAP2.20

(http://soap.genomics.org.cn) (32).

1.4 Scaffold assembly and filling intra-scaffold gaps

We assembled the ~75-fold coverage of Texel ewe reads into contigs and scaffolds

using SOAPdenovo (33). The gaps between the constructed scaffolds were mainly

composed of repeats that were masked during scaffold construction. To close these gaps,

we used the paired-end information to retrieve read pairs with one end mapped to a

unique contig and the other end located in the gap region, using the GapCloser for SOAPdenovo

software (http://soap.genomics.org.cn). The ~120-fold coverage Illumina sequence from

both the ewe and the ram was used for the first round of gap filling. Then, ~144 ×

Illumina reads (including 21 × GC unbiased data and 2 Gb MeDIP-seq data) from both the

Texel ewe and ram were applied for gap filling to improve the assembly. We

subsequently conducted a local assembly for these collected reads. We also used 3 × 454

and 0.1 × BAC reads to try to cross the remaining gaps, by uniquely mapping both of

the 45 bp tips of reads onto the flanking sequence of the same gap using

SOAPaligner/soap2.20 software (http://soap.genomics.org.cn) (32).

1.5 Construction of high density sheep RH (radiation hybrid) maps

A specific method for genotype calling of the two sheep RH panels (34, 35) was

used. The genotyping experiment undertaken with the Ovine SNP50 BeadChip (Illumina,

San Diego, CA) provides two measures of fluorescence intensity (one for each SNP

allele) for each RH clone and each SNP. The maximum intensity over the two alleles

(Imax) was used as a measure of the retention of the marker in the clone. We modeled the

distribution of Imax for non-retained SNPs (Dnull) within a clone using either a Gumbel

or a Normal distribution. The parameters of Dnull were estimated empirically using the

first 10% of the observed Imax, choosing for each clone the distribution family that was a

best fit for the data. Given Dnull, a p-value was computed for each SNP based on its

observed Imax. Finally the FDR approach of Storey and Tibshirani (36) was used to

perform the genotype calling across all clones and all SNPs. Specifically, data points with

Imax corresponding to an FDR of < 1% were called present, those with Imax

corresponding to an FDR of 1% to 10% were called missing and those with Imax

corresponding to an FDR greater than 10% were called absent. This procedure (i)

controls the false positive rate, (ii) controls the false negative rate and (iii) estimates the

retention fraction. The genotype calling procedure was applied independently for each

panel. For both panels, the false positive rate was expected to be 1%. For the INRA

panel, the estimate of the retention fraction was 35%, 4.3% of false negatives were

expected and the missing data proportion was 6.5%. The selectable marker gene for the

INRA RH panel was HPRT which maps to the nonPAR (non pseudoautosomal region) of

the X chromosome meaning that this panel is biased towards retaining the X chromosome

nonPAR. This in conjunction with the sex of the INRA RH panel animal being male

(single copy of chromosome X nonPAR) means that the INRA RH panel has very low

resolving power for the X chromosome nonPAR. For the Utah panel, which was selected

for retention of the TK1 gene, which maps to OAR11 ~53 Mb, the estimate of the

retention fraction was 31%, 11% of false negatives were expected and the missing data

proportion was 4.1%. This panel also appears to have been derived from a male animal

meaning that it would have poorer resolving power for the chromosome X nonPAR

(single chromosome copy) and the region on OAR11 around TK1; in fact, the calling

procedure failed to call the SNPs on OAR11 from 45 Mb to the end of the chromosome

(62 Mb).
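
The thresholding step of the genotype calling procedure described above can be sketched as follows. This is an illustration only, under stated assumptions: the null distribution is taken to be Normal with parameters mu and sigma supplied by the caller (the Gumbel alternative and the empirical fit to the lowest 10% of Imax are not shown), and Benjamini-Hochberg adjusted p-values stand in for the Storey and Tibshirani q-values (36), which additionally estimate the proportion of true nulls. All function names are hypothetical.

```python
import math

def normal_upper_tail(x, mu, sigma):
    """P(X >= x) for X ~ Normal(mu, sigma): p-value of an Imax under the null."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (a stand-in for Storey q-values)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def call_from_fdr(fdr):
    """Thresholds from the text: <1% present, 1% to 10% missing, >10% absent."""
    if fdr < 0.01:
        return "present"
    if fdr <= 0.10:
        return "missing"
    return "absent"

def call_retention(imax_values, mu, sigma):
    """Call marker retention for one clone from its Imax values."""
    pvalues = [normal_upper_tail(x, mu, sigma) for x in imax_values]
    return [call_from_fdr(q) for q in bh_adjust(pvalues)]
```

In this sketch a marker whose intensity lies far above the null distribution receives a very small adjusted p-value and is called present, while intensities close to the null mean are called absent.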

Out of the 49,035 SNP markers, 41,999 could be reliably called on both RH panels

using the method described above. RH maps were constructed for each chromosome

using the comparative approach (37) implemented in the CarthaGene software (38). The

principle of using the two panels to create the RH maps in the context of a comparative

approach is the following: the likelihood of a given order was computed independently

for the dataset corresponding to each panel and the product of likelihoods was used as the

likelihood of the data and combined with the prior probability of the order to produce the

objective criterion to maximize. The problem was reduced to an instance of the Travelling Salesman

Problem (TSP), enabling the use of an efficient TSP solver in the

optimization step. These comprehensive maps comprise 38,202 markers. Because high

density maps are likely to contain regions where local ordering is poorly supported by the

data, we constructed robust maps for each sheep chromosome. A robust map consists of a

subset of markers whose order was associated with a strong posterior probability (39, 40).

A total of 33,386 markers were included in the robust maps (table S3). The alignment of

the robust map with the final Oar v3.1 assembly was calculated for each chromosome

(fig. S11).
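
The objective used to compare marker orders can be sketched as below: multiplying the per-panel likelihoods and the prior is equivalent to summing their logarithms. This is a minimal illustration, not the CarthaGene implementation; the function names and the numbers in the usage note are hypothetical, and an exhaustive maximum over a few candidate orders replaces the TSP solver used in practice.

```python
def order_score(panel_logliks, log_prior):
    """Log of (product of per-panel likelihoods) x (prior probability of the order)."""
    return sum(panel_logliks) + log_prior

def best_order(candidates):
    """Pick the highest-scoring order.

    candidates maps an order (tuple of marker names) to a pair
    (list of per-panel log-likelihoods, log prior of the order).
    """
    return max(candidates, key=lambda order: order_score(*candidates[order]))
```

For example, with two panels, an order scored ([-10.0, -12.0], -1.0) has objective -23.0 and is preferred over one scored ([-9.0, -15.0], -1.0), whose objective is -25.0, even though the second order fits the first panel better.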

1.6 Construction of a sheep linkage map

Ovine SNP50 BeadChip genotyping data was obtained for three genetic mapping

populations: the International Mapping Flock (IMF, (26)), the SheepGENOMICS flock

(41), and the Louisiana State University (LSU) flock (42). The number of animals whose

genotypes were used comprised: 117 from the 3-generation full-sibling IMF; 20 sires and

3,831 progeny from the 2-generation half-sibling SheepGENOMICS flock; and 449 from

the complex F2-type 3-generation LSU pedigree.

SNPs were initially assigned to chromosomes using a dataset that included both the

IMF SNP data and the IMF genotype data used for IMF sheep map version 5 (table S2)

and the find-all-linkage-groups option of a version of MultiMap (43) that incorporated

lispcri version 2.503 (adapted by Jill Maddox and Ian Evans,

http://www.animalgenome.org/tools/share/crimap/). The SNP chromosome groupings

were then used to assign additional SNPs to chromosomes with find-all-linkage-groups in

the FMFS and LSU populations. All SNPs that could be assigned to a chromosome were

assumed to derive from single copy sequence, and the single copy nature of these SNPs

was checked in the sequence assembly. Putative single copy SNPs with multiple

sequence assembly locations were investigated and the correct chromosome identified.

Genetic maps, for comparing alternate putative sequence assembly, radiation hybrid

(RH) SNP and bovine orders, were constructed using the chrompic option of CRI-MAP

2.503, a modified version of CRI-MAP (44). The three populations were used separately

for autosomes and the pseudoautosomal region of the X chromosome, while only the IMF

and LSU datasets were used for the non-pseudoautosomal region as the

SheepGENOMICS data set contained only male informative meioses. In addition, low

density de novo lod6 (autosomes, SheepGENOMICS dataset) and lod3 X chromosome

maps (IMF, LSU datasets) were constructed as a further check on gross chromosomal

SNP ordering. Comparisons between two different map types (sequence assembly versus

RH map, sheep order versus bovine order) only used SNPs that were present in both of

the compared map types. Genetic maps were investigated for possible map expansions

due to incorrect ordering and double recombinants. This approach identified a number of

erroneously positioned SNPs on early versions of the sequence assembly. The positions

of these SNPs were further investigated and most major discrepancies (with large

log10 likelihood differences for the genetic map order relative to the sequence assembly

order) were resolved, with a number of changes made to the sequence assembly.

1.7 Construction of Super-scaffolds and anchoring super-scaffolds to chromosomes

Scaffolds that were clearly chimeric were identified by remapping the female Texel

long insert size paired-end reads (17 kb and 9 kb) to the draft assembly using

SOAPaligner/soap2.20 software (http://soap.genomics.org.cn) (32), then confirmed by

comparison with the bovine UMD3.1 genome assembly (45), goat genome assembly

(28), and antelope scaffold sets (46). Chimeric scaffolds were then manually split in the

gap between adjacent contigs mapped to two different bovine chromosomes.

Super-scaffolds were built from the set of scaffolds using the BAC-end sequences

derived from the male Texel BAC library (CHORI-243) and the predicted locations on

Oar v1.0 of SNPs included on the Illumina Ovine SNP50 BeadChip. This was undertaken

as a single integrated process and non-congruent BACs and out of position SNPs were

minimized. Several rounds of manual checking and final error correction were carried out

using the BAC-end sequences of Ovine CHORI-243 library and 454 mate-paired reads

derived from 8 kb and 20 kb insert libraries of the male Texel.

Unmapped scaffolds with a length of less than 2 kb were discarded. Super-scaffolds

were initially ordered and oriented into chromosomes using the locations of the SNPs in

the sheep RH map, which were placed with BLASTN using unique hits with E-values < 1 × 10⁻¹⁰

and hit length > 100 bp. The positions of the SNPs in the sheep linkage map were used to

identify remaining errors and to refine the assembly. Components of the genome

assembly (i.e. scaffolds, and corresponding quality files) and the component assembly

instruction file (i.e. the agp format file) were generated and are available at

(http://www.livestockgenomics.csiro.au/sheep/oar2.0.php) and

(http://www.livestockgenomics.csiro.au/sheep/oar3.1.php).
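
The BLASTN hit selection used for anchoring (unique hits with E-value < 1 × 10⁻¹⁰ and hit length > 100 bp) can be sketched as follows. The sketch assumes the hits are available in the standard 12-column tabular BLAST output (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score); the input format actually used is not stated in the text, and the function name is hypothetical.

```python
from collections import defaultdict

def unique_hits(tabular_lines, max_evalue=1e-10, min_length=100):
    """Return {query: (subject, subject_start)} for queries with exactly one passing hit."""
    passing = defaultdict(list)
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        length = int(fields[3])
        subject_start = int(fields[8])
        evalue = float(fields[10])
        if evalue < max_evalue and length > min_length:
            passing[query].append((subject, subject_start))
    # Keep only markers that hit the assembly at a single location.
    return {q: hits[0] for q, hits in passing.items() if len(hits) == 1}
```

Markers with several passing hits are dropped rather than resolved, which mirrors the requirement for unique hits.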

1.8 Removal of artificial sequence duplications

We systematically identified all of the duplicates by Whole Genome Assembly

Comparison (WGAC) (47). After mapping the 17 kb, 9 kb, 5 kb and 2 kb insert size mate-paired

reads to the repeat-masked genome using SOAPaligner/soap2.20, we estimated the

length of every duplicated region. If the estimated length was >1 kb shorter than the

assembled length, a local comparison with itself and with orthologous regions in the goat,

antelope, and cattle genome assemblies using LASTZ (48) was used to confirm the

boundaries of the duplicated sequence. Duplicates that met both of the following

criteria were removed: (1) no reads could link the duplicate with its flanking sequences; (2) read

depth after the GC content adjustment was < (whole genome average depth – (3×

STDEV)) of read coverage (49).
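
The removal rule for a candidate duplicate can be sketched as below. This is a minimal illustration under assumptions: the GC-adjusted depths are taken as already computed, the sample standard deviation is used (the text does not say whether sample or population), and the function names are hypothetical.

```python
from statistics import mean, stdev

def low_depth_cutoff(genome_depths):
    """Whole-genome average depth minus three standard deviations."""
    return mean(genome_depths) - 3 * stdev(genome_depths)

def should_remove(linking_reads, adjusted_depth, cutoff):
    """Remove a duplicate only if no reads link it to its flanks AND its depth is low."""
    return linking_reads == 0 and adjusted_depth < cutoff
```

A duplicate with linking reads, or with read depth at or above the cutoff, is retained in this sketch.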

1.9 Calling SNPs and validating SNP calling

All of the paired-end reads from the Texel ewe and Texel ram were mapped back to

the assembled genome by SOAPaligner/soap 2.20 with an average depth of ~41-fold for

the Texel ewe and ~40-fold for the Texel ram. Then, SNPs were called by SOAPsnp (50)

separately for the ewe and the ram. Next, four steps were used to filter out unreliable

SNPs: (1) a Q20 quality cutoff was used; (2) at least 10 supporting reads were required;

(3) the overall depth, including randomly placed repetitive hits, had to be less than 100;

(4) the approximate copy number of flanking sequences had to be less than 2 (to avoid

false positives caused by the alignment of similar reads from duplicates).
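
The four filters can be expressed as a single predicate, sketched below. The record type and field names are hypothetical and do not correspond to the SOAPsnp output format; the Q20 cutoff is interpreted here as quality >= 20, which is an assumption.

```python
from dataclasses import dataclass

@dataclass
class SnpCall:
    quality: int              # consensus quality score of the call
    supporting_reads: int     # reads supporting the called allele
    total_depth: int          # overall depth, including repetitive hits
    flank_copy_number: float  # approximate copy number of flanking sequence

def passes_filters(call):
    """Apply the four reliability filters listed in the text."""
    return (
        call.quality >= 20
        and call.supporting_reads >= 10
        and call.total_depth < 100
        and call.flank_copy_number < 2
    )
```

A call is kept only when all four conditions hold, so a site with high quality but excessive depth is still discarded.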

After filtering, the Texel ewe consensus sequence was defined as the reference

genome, and the single nucleotide differences between the Texel ewe genome and Texel ram

genome were called as homozygous SNPs.

Both the male and female Texel DNA samples were genotyped using the Ovine

SNP50 BeadChip (Illumina, San Diego, CA). Raw signal intensities were converted into

genotype calls using the Genome Studio software (Illumina, San Diego, CA). We then

compared the SNP calls from the sequencing platform and the SNP50 BeadChip.

1.10 RNA-Seq data processing

To obtain high quality reads and precise analysis results, an in-house C++ program

was used to filter out raw reads which might negatively affect the subsequent analysis.

We removed:

Reads that contained > 10% ambiguous base calls (Ns)

Reads that contained > 40% low quality base calls (quality score ≤ 5)

Reads that contained adapter contamination (with >10 bp aligned length and ≤ 2 bp mismatches)

Read pairs with read 1 and read 2 overlapping by ≥ 10 bp

100% identical read pairs
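
Three of the five filters above can be sketched in a few lines. This is not the in-house C++ program, only an illustration under assumptions: quality scores are given as lists of integers, a pair is dropped if either mate fails, the first copy of an identical pair is kept, and the adapter and mate-overlap checks are omitted because they require an aligner. All names are hypothetical.

```python
def n_fraction(seq):
    """Fraction of ambiguous base calls (N) in a read."""
    return seq.upper().count("N") / len(seq)

def low_quality_fraction(quals, threshold=5):
    """Fraction of bases with quality score <= threshold."""
    return sum(1 for q in quals if q <= threshold) / len(quals)

def read_ok(seq, quals):
    """True if the read passes the N-content and low-quality filters."""
    if not seq or len(seq) != len(quals):
        return False
    return n_fraction(seq) <= 0.10 and low_quality_fraction(quals) <= 0.40

def filter_pairs(pairs):
    """Keep pairs whose mates both pass, dropping repeated identical pairs."""
    seen = set()
    kept = []
    for read1, read2 in pairs:
        (seq1, qual1), (seq2, qual2) = read1, read2
        if not (read_ok(seq1, qual1) and read_ok(seq2, qual2)):
            continue
        key = (seq1, seq2)
        if key in seen:
            continue  # 100% identical read pair already kept once
        seen.add(key)
        kept.append((read1, read2))
    return kept
```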

1.11 Alignment and de novo assembly of the transcriptomes

All of the clean RNA-Seq data were mapped onto the reference genome with

TopHat v.2.0.4 (51). Transcripts were constructed from the mapped reads using Cufflinks

v.2.0.0 (52).

For de novo assembly of the transcriptomes of the seven tissues from the Texel

ewe, SOAPdenovo was used to assemble the filtered reads from each tissue separately

into contigs and scaffolds, with the parameters set to be “-K 23 -M 0 -F -R -D 1 -d 1”.

The contigs and scaffolds were then passed to ABySS (53)

(http://www.bcgsc.ca/platform/bioinfo/software/abyss) to assemble them into longer

sequences, with the k-mer size set to 43. In total, 52,821 non-redundant sequences (ESTs) longer

than 300 bp, with an average length of 920 bp, were obtained.

Skin RNA-Seq reads that mapped onto the MOGAT3 region and rumen RNA-Seq

reads that mapped onto TCHHL2 were also de novo assembled using the above

method.

1.12 Evaluation of the sheep assembly with Ovine SNP50 BeadChip sequences, de novo

assembled ESTs, and BACs

The 59,042 verified Ovine SNP50 BeadChip sequences with a SNP and 150 bp

flanking sequences were used to check the sheep assembly. All of them were mapped

against the genome with BLAST with an identity > 95% and a hit length > 100 bp. All of

the de novo assembled ESTs from RNA-Seq reads from the seven tissues were mapped

against the genome with BLAT to estimate the gene space completeness of the genome,

with an identity > 95%.

The sequences of 16 fully sequenced BACs and the sheep Major Histocompatibility

Complex region (assembled from 26 overlapping BACs) (54) were downloaded from

GenBank. The BACs were aligned against the chromosomes using Mummer (version

3.22) (55) with default parameters. The alignment blocks were then chained along the

BACs by in-house Perl scripts and also with manual confirmation. Ewe paired-end reads

with short insert size (insert size < 1000 bp) were mapped to the BACs using SOAP

(version 2.21) (32), and SOAPcoverage (SOAP software package) was used to calculate

sequencing depth for each non-overlapping 100 bp window along each BAC sequence.
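
The windowed depth calculation can be sketched as follows, assuming a per-base depth list for one BAC is already available. This is an illustration rather than the SOAPcoverage implementation; the function name is hypothetical, and a final window shorter than 100 bp is averaged over its actual length, which is an assumption.

```python
def window_mean_depth(per_base_depth, window=100):
    """Mean depth in consecutive non-overlapping windows along a sequence."""
    means = []
    for start in range(0, len(per_base_depth), window):
        chunk = per_base_depth[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means
```

Plotting these window means along a BAC makes regions of collapsed or missing sequence visible as sharp rises or drops in depth.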

1.13 CEGMA evaluation of sheep genome assembly

CEGMA (Core Eukaryotic Genes Mapping Approach) (version 2.3, with parameter

“--mam”) (56) was also used to assess the sheep genome assembly. 248 CEGs (Core

Eukaryotic Genes) downloaded from the webpage of CEGMA software

(http://korflab.ucdavis.edu/datasets/cegma/) were mapped to the sheep genome and

orthologous or paralogous genes of these CEGs were recovered. The CEGs are conserved

and readily identifiable across a broad range of eukaryotic species. The recovered CEG

numbers and the completeness of the recovered gene models indicate the gene space

completeness of our genome assembly (table S12).

1.14 Identification of repetitive elements

We annotated repetitive sequences and transposable elements (TEs) using a

combination of homology to RepBase sequences and de novo prediction approaches.

We de novo constructed a sheep repeat library using RepeatModeler with the default

parameters. The generated results were consensus sequences and classification

information for each repeat family. TEs were classified according to Wicker et al., (57).

Then RepeatMasker v3.2.6 (58) was run on the genome sequences, using the

RepeatModeler consensus sequences as its source library with parameters “-nolow -no_is

-norna -lib repeat”.

RepeatMasker was also run against the RepBase TE library v2009-06-04 with the

parameter “-nolow -no_is -norna -lib repbase” and “-noLowSimple -pvalue 1e-4”. At the

protein level, RepeatProteinMask was applied (58).

We identified non-interspersed repeat sequences by RepeatMasker with the “-noint”

option, including simple repeats, satellites, and low complexity repeats. We also

predicted tandem repeats using Tandem Repeats Finder, with parameters set to

“Match=2, Mismatch=7, Delta=7, PM=80, PI=10, Minscore=50, and MaxPeriod=2000”.

Finally, we integrated all of the repeat annotation results with an in-house program.

42.67% of the total assembled genome was identified as TEs (table S13). A considerable

proportion of the 2.76% of the genome assembly comprised of gaps is also likely to be

repeat sequences.

The sequence divergence rate was also calculated for each family of TEs (fig. S12A)

and their distribution across the genome was plotted (fig. S12B).

1.15 Gene prediction

The sheep genome assembly was annotated with the Ensembl gene annotation

system (59) (Ensembl release 74, December 2013). Protein-coding gene models were

annotated by combining alignments of UniProt (60) mammal and other vertebrate protein

sequences and RNA-Seq models generated from different individuals and different tissue

types (table S7) and gap filling with human and cow translations from Ensembl (release

69). Short non-coding genes were also annotated to provide the final gene set (tables S14-16).

The genome was repeat-masked with RepeatMasker, using the RepBase library

(-species sheep) and a custom library generated with RepeatModeler, and with Dust (61).

Additional low complexity regions were identified using TRF (62).

Protein-coding models were generated by aligning sheep and other vertebrate

protein sequences from UniProt to the repeat-masked genome using Genewise (63).

Protein-coding models were also generated using our in-house RNA-Seq pipeline (64).

The RNA-Seq data set used for generating the gene models consists of different tissue

types from a trio of Texel sheep (ram, ewe and lamb) plus an embryo from the same

ram-ewe pairing, provided by The Roslin Institute; 7 tissue types from the reference female

Texel and 1 sample of Gansu alpine fine wool sheep skin, provided by BGI; and whole

blood samples from 1 Polypay sheep and 2 Rambouillet sheep, provided by USDA-ARS-ADRU

(table S7). These data were aligned to the genome using BWA (65), resulting in

736 billion reads aligning from 819 billion reads. The alignments were processed by

collapsing the transcribed regions into a set of potential exons. Partially aligned reads

were re-mapped using Exonerate (66) and this step identified 367 million spliced reads or

introns. These introns together with the set of transcribed exons were combined to

produce transcript models, one set for each of 94 individual tissues and one set produced

by merging data from all of the above mentioned tissues. The longest open reading frame

in each of these models was BLASTed (67) against the set of UniProt protein existence

(PE) levels 1 (existence at protein level) and 2 (existence at transcript level) protein

sequences in order to classify the models according to their protein-coding potential.
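
The open-reading-frame step described above can be illustrated with a minimal sketch. It is not the Ensembl RNA-Seq pipeline code: it assumes an ORF runs from an ATG to the first in-frame stop codon, scans only the three forward frames, and uses a hypothetical function name (longest_orf).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Return (start, end, orf) for the longest ATG..stop ORF.

    Illustrative assumptions: forward strand only, ORF starts at ATG,
    coordinates are 0-based with an exclusive end.
    """
    seq = seq.upper()
    best = (0, 0, "")
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3, seq[start:i + 3])
                start = None                   # close it and keep scanning
    return best

model = "GGATGAAACCCGGGTTTTAACATGTGA"
start, end, orf = longest_orf(model)
```

In a real workflow the translated ORF would then be compared with the UniProt PE level 1 and 2 sequences; that comparison is outside the scope of this sketch.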

Data from the above two pipelines were filtered to remove poorly supported

models. Untranslated regions were added to the coding models using sheep cDNA and

RNA-Seq models. The preliminary sets of coding models were combined, prioritizing

well supported models built from UniProt proteins and the merged set of RNA-Seq data

and redundant models were removed. Human and cow Ensembl translations (release 69)

were used to fill in gaps. The resulting unique set of transcript models were clustered into

multi-transcript genes where each transcript in a gene has at least one coding exon that

overlaps a coding exon from another transcript within the same gene. The set of protein-

coding gene models was screened for pseudogenes. Short non-coding RNA genes were

predicted using annotation from RFAM (68) and miRBase (69). The sheep gene

annotation is available on the Ensembl website (http://www.ensembl.org/Ovis_aries/),

including orthologues, gene trees, and whole-genome alignments against human, mouse

and other mammals. Also included are the tissue-specific RNA-Seq transcript models,

indexed BAM files, and the complete set of splice junctions identified by our pipeline.

Further information about the annotation process can be found in a PDF document here:

(http://www.ensembl.org/Ovis_aries/Info/Annotation#assembly).
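
The rule used to group transcripts into multi-transcript genes (each transcript shares at least one overlapping coding exon with another transcript of the same gene) amounts to single-linkage clustering. The sketch below is an illustration only, not the Ensembl implementation; it assumes all transcripts lie on the same chromosome and strand, uses inclusive coordinates, and the function name and data layout are hypothetical.

```python
def cluster_transcripts(transcripts):
    """Group transcripts whose coding exons overlap (single linkage).

    transcripts: dict of name -> list of (start, end) coding exons,
    inclusive coordinates, assumed to share chromosome and strand.
    Returns a sorted list of sorted transcript-name lists.
    """
    names = sorted(transcripts)
    parent = {n: n for n in names}

    def find(n):
        # union-find root lookup with path halving
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    def overlaps(exons_a, exons_b):
        return any(s1 <= e2 and s2 <= e1
                   for s1, e1 in exons_a for s2, e2 in exons_b)

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if overlaps(transcripts[a], transcripts[b]):
                parent[find(a)] = find(b)

    genes = {}
    for n in names:
        genes.setdefault(find(n), []).append(n)
    return sorted(genes.values())

example = {
    "t1": [(100, 200), (300, 400)],
    "t2": [(350, 450)],
    "t3": [(440, 500)],
    "t4": [(1000, 1100)],
}
genes = cluster_transcripts(example)
```

Here t1 and t3 do not overlap directly but end up in the same gene because both overlap t2.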

1.16 Functional Annotation and gene ontology (GO) assignment

InterProScan (version 4.8) was used to assign GO terms to each protein-coding

gene. The member databases Pfam (release 27.0), PRINTS (release 42.0), PROSITE (release

20.96), ProDom (release 2010.1), SMART (release 7.0) and PANTHER (release 8.1) were

searched. KEGG (release 58) and UniProt database (Swissprot/TrEMBL release 2012.3)

were also searched for homology-based gene function assignment.

1.17 Gene family clusters

Protein-coding genes for cattle, pig, horse, dog, human, mouse and opossum were

downloaded from Ensembl release 64. The gene sets for yak and goat were obtained from

the BGI-Shenzhen internal database and the camel gene sets were downloaded from

NCBI. For gene loci with alternative splicing isoforms, only the transcript with the

longest translation product was retained. We performed an “all versus all” alignment

using BLASTP with E-value < 1E-7, and conjoined fragmental alignments using Solar

(70). Then a simplified version of the Treefam methodology (71) was used to cluster

genes from different species into gene families which contain genes that descended from

the same gene in the last common ancestor. The numbers of orthologous genes across the

eleven species were calculated and plotted in a Venn diagram (fig. S13).
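
Retaining only the transcript with the longest translation product for each gene locus, as described above, can be sketched as follows. The record layout (gene id, transcript id, protein sequence) and the function name are assumptions made for illustration; ties are resolved by keeping the first transcript encountered.

```python
def longest_isoform_per_gene(records):
    """Keep the transcript with the longest protein for each gene.

    records: iterable of (gene_id, transcript_id, protein_sequence).
    Returns dict of gene_id -> (transcript_id, protein_sequence).
    """
    best = {}
    for gene_id, transcript_id, protein in records:
        current = best.get(gene_id)
        if current is None or len(protein) > len(current[1]):
            best[gene_id] = (transcript_id, protein)
    return best

records = [
    ("geneA", "geneA.t1", "MKT"),
    ("geneA", "geneA.t2", "MKTLLV"),
    ("geneB", "geneB.t1", "MA"),
]
representatives = longest_isoform_per_gene(records)
```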

1.18 Phylogenetic tree reconstruction for mammalian species

After gene family clustering, single copy genes, which were determined to contain

only orthologous genes from each species according to the Treefam methodology (71),

were selected to reconstruct the phylogenetic relationships of these mammalian species.

Multiple sequence alignment for each gene family was performed by MUSCLE (72)

(version 3.8.31) and four-fold degenerate sites were extracted and concatenated to

generate super alignments. We built phylogenetic trees using MrBayes (73) which takes

advantage of both codon-based and amino acid-based algorithms and adjusts them to the

topology of the species tree, to form a more accurate consensus tree according to four-

fold degenerate sites. To estimate the divergence time of the selected species, we used a

molecular clock model implemented in PAML mcmctree (74). The divergence times

were constrained according to the fossil calibration times (124.6-134.8 million years ago

(Mya) between human-opossum, 95.3-113 Mya between human-cow, 61.5-100.5 Mya

between human-mouse, 48.3-53.5 Mya between cow-pig, 18.3-28.5 Mya between cow-

sheep) (75). The different molecular clocks (divergence rate) might be explained by the

body size hypothesis or the generation-time hypothesis, which propose that the larger the

body size is or the longer the generation-time is, the slower the molecular clock. In

addition we can identify weak or strong selection for each lineage from their dN/dS

ratios.
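
The extraction of four-fold degenerate sites mentioned above can be sketched as below. This is a simplified illustration, not the script used in the study: it assumes codon-aligned coding sequences in frame, the standard genetic code, and a conservative rule that a codon column is used only when all species share the same first two bases from a four-fold degenerate family and none has a gap or ambiguous base at the third position.

```python
# First two bases of codons whose third position is four-fold degenerate
# in the standard genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}

def fourfold_sites(aligned_cds):
    """Extract third-position bases at four-fold degenerate codons.

    aligned_cds: dict of species -> codon-aligned CDS of equal length.
    Returns dict of species -> concatenated four-fold degenerate bases.
    """
    species = list(aligned_cds)
    length = len(aligned_cds[species[0]])
    out = {s: [] for s in species}
    for i in range(0, length - 2, 3):
        codons = [aligned_cds[s][i:i + 3].upper() for s in species]
        prefixes = {c[:2] for c in codons}
        if (len(prefixes) == 1 and prefixes <= FOURFOLD_PREFIXES
                and all(c[2] in "ACGT" for c in codons)):
            for s, c in zip(species, codons):
                out[s].append(c[2])
    return {s: "".join(bases) for s, bases in out.items()}

alignment = {
    "sheep": "CTAGGGATGGCT",
    "cow":   "CTGGGAATGGCC",
    "goat":  "CTTGGTATGGC-",
}
sites = fourfold_sites(alignment)
```

Concatenating the output across all single-copy gene families would give a super alignment of the kind described in the text.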

1.19 Expansion and contraction of gene families

Gene family expansion and contraction changes were detected by CAFE (76)

(Computational Analysis of gene Family Evolution) based on the phylogenetic tree

reconstructed in 1.18. The P-value cutoff was set to 0.05, the number of randomizations

was set to be 10000, and λ value was searched. By manually checking the functional

annotation of each gene in each gene family, false positive families with discrepant

functional descriptions were filtered out. The expansion and contraction of orthologous

gene clusters in the nine mammalian species analyzed was calculated (fig. S14).

1.20 Detection of positively selected genes

As described above in section 1.17, BLASTP and Treefam methodologies were used

to define orthologs among the goat, cow and sheep. In total, 14,407 orthologous pairs

were analyzed for positive selection. The coding sequence of orthologs was aligned using

Prank software (77) with default parameters. Ka and Ks were calculated for the aligned

orthologs using Ka/Ks calculator software with default parameters

(http://code.google.com/p/kaks-calculator/wiki/KaKs_Calculator). The branch-site model

of positive selection from PAML (74) was used to identify sheep-specific, sheep-goat

branch, and cattle-specific fast evolving genes. Gene Ontology enrichment of the fast

evolving genes between sheep and cattle identified significant enrichment of a number of

GO terms relating to the immune response (fig. S15).

1.21 BAC sequencing and de novo assembly

Using mapping information for BAC-end sequences from the sheep CHORI-243

BAC library clones (27), one BAC located in the MOGAT3 gene region (CH243-423F23)

was picked and sequenced using HiSeq 2500 to generate 250 bp paired-end reads.

Adaptor sequences were trimmed off and sequences matching either the BAC vector

(pTARBAC2.1) or E. coli DH10B were removed prior to assembling the data. De

novo assembly was carried out using CLC Genomics Workbench 6 (CLC Bio) with the

following parameters: similarity = 0.99, length fraction = 0.9, insertion cost = 3, deletion

cost = 3, mismatch cost = 2 and minimum size = 2000 bp. Then the de novo assembled

contigs were mapped onto the sheep reference genome, Oar v3.1, and linked into one

scaffold manually, based on the integrated information of mate-paired reads (17 kb, 9 kb

and 5 kb libraries) and the reference genome contig mapping order. Then, all of the BAC

sequence reads were mapped onto the de novo BAC assembly scaffold, and the local

mapped depth was checked. All of the gap regions and high depth regions were re-

assembled using CAP3 (78).

1.22 In situ hybridization

In situ hybridization images were generated using EST clones in a large experiment

described in detail in Adelson et al., (79).

1.23 Identification of sheep segmental duplications (SD) and copy number variations

(CNV)

A whole genome alignment comparison pipeline (WGAC) was used for calling SDs

with the same cutoff (>= 1 kb in length, >= 90% sequence identity) (47).

We also used the WSSD strategy to detect SDs in the sheep genome. We used BWA

(65) to align paired-end reads of female and male Texel sheep with default parameters

onto unmasked Oar v3.1. The maximum edit distance (as ~5% for the default divergence

cutoff in BWA) was automatically chosen for different read lengths. Approximately 95%

of reads could be mapped onto the reference genome. We then counted the aligned read

numbers within 200 bp sliding windows and 100 bp steps using custom Perl scripts. The

GC bias of the Illumina GAII platform was corrected using LOESS smoothing toward a

pattern of uniform coverage at all GC percentages as previously described (80). All

gapped and repeat-containing 200 bp windows were filtered out. Since the male ChrX is

not diploid, except for the recombining region (PAR) (ChrX:1-7,050,204), the ChrX non-

PAR region was analyzed separately to calculate average sequence depth and standard

deviation (STDEV).
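
The counting of aligned reads in 200 bp sliding windows with 100 bp steps can be sketched as follows. The study used custom Perl scripts that are not reproduced here; this Python sketch assumes that a read is assigned to every window containing its 0-based start position, and the function name is hypothetical. GC correction and filtering of gapped or repeat-containing windows are not shown.

```python
def window_read_counts(read_starts, chrom_length, window=200, step=100):
    """Count reads per overlapping window along one chromosome.

    Window k covers positions [k * step, k * step + window).
    A read is counted in each window that contains its start position.
    """
    n_windows = max(0, (chrom_length - window) // step + 1)
    counts = [0] * n_windows
    for pos in read_starts:
        first = max(0, (pos - window) // step + 1)   # earliest window holding pos
        last = min(n_windows - 1, pos // step)       # latest window holding pos
        for k in range(first, last + 1):
            counts[k] += 1
    return counts

counts = window_read_counts([0, 150, 250, 499], chrom_length=500)
```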

SD/CNV calls were initially selected if five out of seven or more sequential 200

bp overlapping windows had read depth values that significantly differed from the

average (duplications > mean + (2 x STDEV)). We adjusted Bickhart’s WSSD pipeline

(49), by using long alignment reads and small window size to make the depth calling

more sensitive.
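
The initial selection rule (at least five of seven sequential windows with read depth above the mean plus two standard deviations) can be sketched as below. This is an illustration under stated assumptions rather than the adjusted WSSD pipeline itself: the mean and standard deviation are supplied by the caller, each call is trimmed to the first and last high-depth window of a qualifying block, and the function name is hypothetical.

```python
def call_duplications(depths, mean_depth, sd_depth,
                      block=7, min_hits=5, n_sd=2.0):
    """Return (first_window, last_window) intervals of candidate duplications.

    A block of `block` sequential windows qualifies when at least
    `min_hits` of them exceed mean_depth + n_sd * sd_depth.
    """
    threshold = mean_depth + n_sd * sd_depth
    high = [d > threshold for d in depths]
    flagged = set()
    for start in range(len(depths) - block + 1):
        hits = [i for i in range(start, start + block) if high[i]]
        if len(hits) >= min_hits:
            flagged.update(range(hits[0], hits[-1] + 1))
    # merge consecutive flagged window indices into intervals
    intervals = []
    for i in sorted(flagged):
        if intervals and i == intervals[-1][1] + 1:
            intervals[-1] = (intervals[-1][0], i)
        else:
            intervals.append((i, i))
    return intervals

depths = [30] * 10 + [80, 85, 30, 90, 82, 88, 30] + [30] * 10
calls = call_duplications(depths, mean_depth=30, sd_depth=10)
```

With 200 bp windows and 100 bp steps, window index i would correspond to positions i × 100 to i × 100 + 199 on the chromosome.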

If two or more duplicated regions are fully assembled, their read depth will tend

to the average depth. To get the full SD/CNV dataset, we combined the ≥ 95% sequence

identity WGAC dataset (to keep the same identity for WSSD) and the WSSD results

using custom Perl scripts. Only SDs/CNVs > 1 kb in length were kept in the final data

set.

1.24 Genome-wide and transcriptome-wide identification of allele specific expressed

SNPs and genes

By combining DNA and RNA sequencing reads from the same individual, allelic-

specific expressed genes can be accurately identified. We surveyed all of the expressed

alleles for the 5.5 million SNPs identified in the reference female Texel sheep with 15 Gb

of RNA-Seq data from seven tissues from the sequenced individual. A 90:10 cutoff for

the ratio of expression from the two alleles was taken as the boundary between allelic

specific expression and random allelic expression. If > 90% of total expression is from

one allele at 20-fold sequence coverage the statistical test shows strong power (> 97%

correct) (81). Considering that SNPs located in SD or CNV regions may strongly

interfere with the prediction of allelic specific expression, we checked all the SNPs which

had ≥ 20 expressed reads and filtered out SNPs located in SD-CNV regions. We also

removed the potential segmental duplicated SNPs, which have duplicated syntenic

regions in any of three allied genomes, using BLAST searches with 301 bp sequences

(the SNPs and their 150 bp flanking sequences) against the goat (≥ 95% identity and 150

bp), Tibetan antelope (≥ 93% identity and 100 bp) and cattle (≥ 85% identity and 100 bp)

genome assemblies.
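
The 90:10 cutoff and the minimum of 20 expressed reads described above can be expressed as a small classification rule. The sketch below is illustrative only; the function name and category labels are hypothetical, and the filtering of SNPs in SD/CNV regions is not shown.

```python
def classify_snp(allele1_reads, allele2_reads, min_reads=20, cutoff=0.90):
    """Classify a heterozygous SNP from its RNA-Seq allele counts.

    A SNP is called allele-specific when more than `cutoff` of the
    expressed reads come from one allele and at least `min_reads`
    reads cover the site.
    """
    total = allele1_reads + allele2_reads
    if total < min_reads:
        return "insufficient coverage"
    major_fraction = max(allele1_reads, allele2_reads) / total
    if major_fraction > cutoff:
        return "allele-specific"
    return "not allele-specific"

label = classify_snp(19, 1)
```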

1.25 Manual annotation of the genomic complement of tandem duplicated regions, such

as MOGAT3 and EDC region.

Reference protein sequences from related species were obtained by searching the

NCBI and Ensembl databases (website, release 73). Then these protein sequences were

mapped to the sheep genome assembly, or newly assembled BACs, using TBLASTN

(version 2.2.23, parameters “-e 1e-5 –F F”). Since the results of TBLASTN (which were

shown as HSPs, high-scoring pairs) were fragmented, genblastA (82) (version 1.0,

parameters “-e 1e-5 -g T -f F -a 0.5 –d 100000 -r 100 –c 0.01 -s -100 ”) was used to

group the adjacent HSPs derived from the same query protein into a representative

homologous hit of the query protein. Redundant hits were filtered and only the best hit

was retained for each gene locus. Genomic regions matched with reference proteins were

extended upstream and downstream by 2000 bp to make sure that intact potential gene

structures were included. GeneWise software (63) (parameters “-genesf -gff -sum”) was

used to predict gene structures for each protein coding region. Pseudogenes were

identified by the presence of premature stop codons.
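
The pseudogene criterion above (presence of premature stop codons) can be illustrated with a minimal check on a predicted coding sequence. This sketch assumes the sequence is in frame from its first base and uses the standard stop codons; it does not reproduce how GeneWise reports frameshifts or stops, and the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def premature_stops(cds):
    """Return 0-based codon indices of in-frame stops before the last codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]

intact = premature_stops("ATGAAATAA")
disrupted = premature_stops("ATGTGAAAATAA")
```

A predicted gene for which this list is non-empty would be a pseudogene candidate under the stated criterion.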

1.26 Membrane anchor prediction.

To identify the presence or absence of potential membrane anchor regions, amino

acid sequences of proteins were submitted to the TMHMM 2.0 (83) server

(http://www.cbs.dtu.dk/services/TMHMM/).

Supplementary text

The reference sheep genome assembly

Two unrelated Texel sheep, a ewe and a ram, were sequenced by the International

Sheep Genomics Consortium (84) using Illumina sequencing (table S1). The sequencing

reads were assembled into the genome as described below (fig. S16) (10). The 75-fold

coverage Illumina reads of the Texel ewe were de novo assembled into contigs and

scaffolds using SOAPdenovo. The 120-fold coverage Illumina sequence from both

animals was used for gap filling. This preliminary 2.71 Gb assembly, with an N50 length

of contigs and scaffolds of 17.4 kb and 1.1 Mb respectively (table S17) was pre-released

as Oar v2.0 and is available from GigaDB (85). To fill high-GC gaps an additional ~21-

fold coverage of Illumina sequencing data from the male Texel, using a protocol with less

bias against high-GC sequences, and 2 Gb of MeDIP-seq for high GC content sequence

from the female Texel were generated (fig. S17) (10). The coverage of the 5’ ends of

genes was significantly improved over Oar v2.1 (fig. S18). The final assembly has a very

similar distribution of GC content to the bovine and other mammalian genome assemblies

(fig. S19). SOAPdenovo is prone to creating artificial segmental duplications (86) and

multiple gap filling steps can also lead to incorrect elongation at the ends of contigs,

generating artificial dispersed duplicates. We systematically identified 12,008 candidate

artificial tandem duplicates and 5,508 artificial dispersed duplicates by checking their

read depth and the relationship with their flanking sequences (10). Removal of these extra

tandem copies excluded 28 Mb of sequence with an average length of duplicates of 2.3

kb.

The final sheep scaffold set has a contig and scaffold N50 length of 40 kb and 2.2

Mb respectively, achieving a total assembled length of 2.61 Gb (table S4). Using the

CHORI-243 sheep BAC library sequences (27) and long insert mate-pair reads we

constructed 349 super-scaffolds with an N50 of 37.1 Mb (table S18). The Ovine SNP50

BeadChip (87) and microsatellite and other markers were used to generate linkage data

and a high density RH map with 39,042 SNP markers (tables S2, S3 and fig. S11) (10) to

anchor scaffolds and super-scaffolds to the 26 autosomes and the X chromosome to

construct the Oar v3.1 assembly. The ~5,700 unmapped scaffolds have a total length of

32 Mb (1.2%). To check the integrity of the assembly, 15 Gb of expressed sequence data

generated from seven tissues from the sequenced Texel ewe were de novo assembled into

52,821 model mRNAs (average length of 920 bp) (10). 99.3% of the model mRNAs

mapped to the Oar v3.1 assembly with an average coverage of 98.4% (table S19). Of the

54,590 Ovine SNP50 BeadChip (87) oligonucleotides only 375 probes did not have a hit,

indicating the coverage of single copy regions is about 99.3%. Comparison of 16

complete CHORI-243 BAC sequences determined using the Sanger methodology with

the genome assembly identified on average 1.98% nucleotides missing from the genome

assembly (table S20 and fig. S20). 89% of the gap regions were in multiple copy repeats,

especially the newly evolved LINE RTE/BovB elements (table S21). Furthermore, on

comparison with the assembly of the MHC region from a Chinese Merino sheep, which

was also derived from BAC-based Sanger sequencing (54), Oar v3.1 contains no long

gaps or large rearrangements, but does contain some additional sequence (fig. S21).

Ovine SNP50 BeadChip (87) genotypes of the two Texels revealed >97.8% concordance

with sequencing and a low false positive rate for heterozygous SNPs (<0.33%). As a

consequence of the extensive checking and manual curation of the genome assembly, the

contig N50 is twice as long as the recently sequenced ruminants, yak (88), Tibetan

antelope (46) and goat (28) (table S22).
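
The contig and scaffold N50 values quoted above follow the conventional definition, which the text does not spell out: the length of the sequence at which the cumulative length, taken from longest to shortest, first reaches half of the total assembly length. A minimal sketch under that assumption, with a hypothetical function name:

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:     # reached at least half of the total
            return length
    return 0                         # empty input

value = n50([80, 70, 50, 40, 30, 20, 10])
```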

All scaffolds and chromosomes of Oar v3.1 have been submitted to the NCBI under

bioproject accession number PRJNA169880, and the assembly for Oar v3.1 has been

assigned accession number GCA_000298735.1. Oar v3.1 is the representative reference

genome for small ruminants, and with the cattle genome (89) a co-reference genome for

all ruminants.

Sheep genome architecture

To investigate the rate of recent segmental duplication (>95% identity, >1 kb length)

in the sheep genome, we used whole-genome shotgun sequencing with GC content

adjustment (49) for the two Texel individuals, and whole genome alignment

comparison (47) (10). In total 7,912 candidate duplicated regions with a total length of

25.8 Mb were identified (Fig. 1, tables S5, S23, S24). We successfully detected the

previously described 4 kb duplication of the growth hormone gene (90).

Two continuous and very similar mitochondrial DNA insertions were identified in

the X chromosome (56.33 Mb, length 14 kb) and chromosome 2 (55.2 Mb, length 9 kb)

(fig. S22) and verified by PCR and sequencing across the junctions (fig. S23).

The heterozygosity rate of approximately 0.2% is 1 to 2 times higher than reported

for individual humans, pigs, cattle, dogs and horses (91-94) (table S22). It is also

approximately 70% higher than reported for the reference goat individual (28).

Approximately 75% of modern sheep breeds have retained an effective population size in

excess of 300, higher than cattle and much higher than most breeds of dog (95),

suggesting domestic sheep arose from a highly heterogeneous pre-domestication gene

pool, and that the genetic bottleneck during domestication was not as severe as for other

domestic animals.

The sheep and cattle genomes have been assembled independently using their own

physical maps, allowing for comparison to identify rearrangements, while the goat has

only pseudo-chromosomes which were assembled based on conserved synteny with the

cattle genome using long super-scaffolds (28). Sheep and cattle have 90% DNA sequence

identity and have similar karyotypes (96) (figs. S1, S24). We identified 141 breakpoints between

the sheep Oar v3.1 (GCA_000298735.1) and cattle UMD3.1 (GCA_000003055.3)

genome assemblies (table S6), including four known Robertsonian

translocations involving the autosomes (97, 98). We also identified a large, 7 Mb,

inversion in sheep chromosome 13 relative to cattle chromosome 13 (figs. S1, S24).

Comparison of the sheep and goat genome assemblies confirms that three of the four

Robertsonian translocations occurred on the sheep lineage. The Robertsonian

translocation in sheep chromosome 9 and the inversion on sheep chromosome 13 are

present in both the sheep and goat branches (figs. S1, S24). The conservation of synteny

relationship of chromosome X is more complicated than the autosomes, with eight

inversions and translocations and centromere loss and acquisition after the divergence of

cattle and sheep (figs. S25, S26). The mammalian X and Y chromosomes maintain a

short region of homology (pseudoautosomal region, PAR), allowing pairing and

recombination. In contrast to a previous study suggesting that cattle and sheep share the

same PAR boundary in the vicinity of GPR143 (99), our data places the sheep PAR

boundary downstream of SHROOM2 at about 7 Mb (fig. S25B). We note that the

centromere of sheep chromosome X is now located at the PAR boundary, so that the PAR

appears to represent the entire short arm of the X chromosome in sheep (fig. S25A).

To further investigate breakpoints, cattle (CHORI-240 BAC library) and sheep

(CHORI-243 BAC library) BAC end sequences (27) were used. Fifty-two out of the 141

breakpoints were confirmed using BACs, including 30 inversions and 13 translocation

events (table S6). We also noted that there are 58 potential genome assembly errors on

cattle genome UMD3.1, and 38 potential >100 kb gaps in Oar v3.1, most of which

contain tandemly duplicated sequences and/or known CNV regions, such as the

multidrug transporter ABCC4 cluster on sheep chromosome 10, and multiple clusters of

olfactory receptor genes. The genome rearrangements described above are likely to

represent recent genetic divergence between sheep and cattle.

Sheep transcriptome

~1.2 Tb of RNA-Seq data was generated from the 94 individual tissue samples

(table S7). The data were used to annotate the sheep genome assembly by the Ensembl

pipeline (10). 20,921 protein coding genes (with 22,823 transcripts), 291 pseudogenes,

and 3,961 short ncRNA and 24 mitochondrial ncRNA genes were annotated (tables S15,

S16). A table linking Ensembl ids to the gene names used in the text is included in the

supplementary material (table S25). The distribution of mRNA length, CDS length, exon

length and intron length of the sheep genes was plotted against the equivalent parameters

of the human, mouse and cow genes. Distributions of the lengths of these features were

very similar to the other genomes (fig. S27).

Imprinted genes exhibit allele-specific expression in a parent-of-origin dependent

manner due to epigenetic modification (100). However, to date only a few candidate

imprinted genes have been identified in sheep (101). We used the

heterozygous SNPs and the RNA-Seq data from the same Texel ewe to investigate mono-

allelic gene expression (10). Of the 41,000 SNPs located in expressed sequences and with

>20 fold coverage, 1,788 SNPs exhibited more than 90% of reads from one allele. Thus

802 protein coding SNPs and 986 non-coding SNPs show allele-specific expression

(table S26). The allele-specific expression was generally conserved across adult tissues,

with 93.2% of the SNPs mono-allelically expressed in all of the studied tissues. The

longest candidate imprinted region was the BEGAIN-DIO3 region on chromosome 18

with 60 continuous mono-allelically expressed SNPs. This region contains the well

documented sheep polar overdominance mutation for muscle development, the Callipyge

phenotype (OMIA 001354-9940), and possibly also the Carwell (or rib-eye muscling)

QTL (102) (OMIA 001355-9940). However, many of the genes in the BEGAIN-DIO3

region show high and mono-allelic expression in the brain, suggesting that imprinting in

the region is also correlated with neural regulation (103).

Ruminant gene family expansions

As expected, confirming that our methodology for the detection of gene family

expansions was appropriate, we found expansions in both the LYZ (7 genes on

chromosome 3) and RNASE1 (11 genes on chromosome 7) families, both important in

the digestive system of ruminants (89, 104), on the sheep branch. Lysozyme C (a member

of the LYZ family) is an antibacterial protein in the innate immune system, but in

ruminants it is thought that some members of the lysozyme C family are major digestive

enzymes playing a role in the digestion of bacteria entering the abomasum from the

rumen (105). Two family members, LYZ4 and LYZ5, are extremely highly expressed in

the abomasum, contributing ~10% of total expressed mRNA, whilst LYZ3 is mainly

expressed in the intestine (fig. S28). Interestingly, LYZ1 is one of the top ten most highly

expressed protein coding genes in the rumen. It is also expressed, albeit at a much lower

level, in many other tissues, including the rectum, caecum, colon, duodenum, skin and

the kidney. 445 cattle LYZ1 ESTs have been deposited in GenBank (UniGene cluster,

Bt.67194); the majority of sequences are from the rumen. The role of LYZ1 in the rumen

is unknown, although on the basis of its sequence and tissue expression pattern, it is

predicted to exhibit antibacterial activity and is likely to contribute to the protection of

the rumen epithelium from the activity of the infective/pathogenic bacteria. The predicted

3D structure of the abomasal lysozymes is different from LYZ1 (fig. S29). Interestingly,

the one amino-acid deletion in the abomasal lysozymes is located within the equivalent

region to an antimicrobial peptide identified in chicken lysozyme (106).

We also identified the Agouti locus (OMIA 000201-9940) (Fig. 1A), which is a

large ~190 kb copy number variation contributing to the variability of coat color in sheep

(107) (table S5).

Two large ruminant specific gene families encode proteins probably involved with

the development of the placenta, especially the mechanisms of apoptosis of ruminant

endometrium, and other processes during pregnancy and lactation (108). The pregnancy-

associated glycoprotein genes (PAGs), which are most closely related to pepsins, have

more than 42 tandem duplicated members in sheep (fig. S30A), and the prolactin related

genes (PRPs) have 12 members (figs. S31, S32). Within the PAG locus, one gene, PGA5

encoding pepsinogen A, is highly expressed in the abomasum and duodenum (fig. S30B)

and presumably plays a role in digestion. However, most PAGs are expressed in a

specialized subset of trophoblasts called binucleate cells (BNC) in the placenta, and their

secreted proteins are used as an early pregnancy diagnosis signal in ruminants (109). The

BNC also secrete prolactin-like proteins (110) which exhibit the strongest positive

selection signal between sheep and cattle. All these results suggest the reproduction

system of ruminants has undergone rapid evolutionary change.

TCHHL2 expression in cattle and other mammals

At the time of submission of this article 22 ESTs and mRNAs were mapped to the

bovine genome assembly overlapping the predicted bovine TCHHL2 gene

(LOC101909330) on the NCBI Map viewer. Seventeen of these ESTs, mainly expressed

in the rumen, were included in the UniGene cluster Bt.14362.

No ESTs expressed from TCHHL2 from other mammals were present in

GenBank.

PRD-SPRRII genes and expression in cattle and other mammals

At the time of submission of this article our analysis of the PRD-SPRRII locus in the

EDC region in cattle identified eight genes, all annotated as loci of unknown function at

the NCBI, with several also annotated as non-coding RNAs (table S27). However, our

analysis predicts that all eight genes encode PRD-SPRRII family proteins (fig. S33). No

ESTs expressed from PRD-SPRRII family genes from other mammals were present in

GenBank.

LCE7A expression in sheep, cattle and other mammals

At the time of submission of the article 37 EST sequences derived from the sheep

LCE7A gene were deposited in GenBank: one from Merino 103-105 day old fetal sheep

skin, 10 from Romney Marsh 130 day old fetal sheep skin (in top 200 most expressed

genes) and 26 from adult Romney Marsh sheep skin (in top 100 most expressed genes).

The sequences are included in the NCBI, UniGene cluster Oar.17500. One EST derived

from the cattle LCE7A gene (LOC786364) from Hereford 6 month old fetal skin was

deposited in GenBank in NCBI, UniGene cluster Bt.75903.

No ESTs expressed from orthologues of LCE7A from any other mammals, except

the mouse (M. musculus) were present in GenBank. A single mouse stomach cDNA

(gi|74203369|dbj|BAE20850.1) has been deposited in GenBank. After the removal of the

first 96 amino acids of the predicted product of the mouse cDNA the protein is

homologous to sheep LCE7A and the genes have conserved synteny.

MOGAT2 expression in cattle

At the time of submission of this article three cattle MOGAT2 genes were

annotated in RefSeq: NM_001099136.1 (UniGene cluster Bt.61395, 5 ESTs all expressed

in the skin), NM_001104970.1 (UniGene cluster Bt.42344, 8 ESTs all expressed in the

skin) and NM_001001154.1 (UniGene cluster Bt.32553, 4 ESTs, three expressed in the

intestine).

MOGAT3 expression in cattle

At the time of submission of this article one cattle MOGAT3 EST (GenBank

accession CF762562.1) was deposited in GenBank. It was sequenced from a cattle skin

cDNA library.

Non-membrane anchored form of MOGAT3

Searches of the EST section of GenBank identified one sheep EST (EE862887.1)

encoding the alternate amino-terminal sequence of MOGAT3 in frame with the common

amino-acid sequence and consistent with the RNA-Seq data (fig. S9).

Despite extensive searches of GenBank no evidence for the expression of an

equivalent unanchored form of MOGAT3 was identified from any other mammal,

including cattle.

FABP9 expression in sheep and other mammals

The fatty acid binding protein encoding gene, FABP9, which was expressed highly

in sheep skin (in the top 0.5% of all genes), but not in any other sheep tissue studied,

including the testis, which is the major site of expression of FABP9 in the mouse (111),

may play a role in the transport of MAG (112). At the time of submission of this

manuscript 27 sheep FABP9 ESTs had been deposited in GenBank (UniGene cluster

Oar.6318), with expression in the skin. Expression of FABP9 has also been observed in cattle

skin, UniGene cluster Bt.23479, containing 1 EST. However, FABP9 is also expressed in

pig skin, 15 ESTs in GenBank, UniGene cluster Ssc.100103. In contrast, no EST or

mRNA sequences of human FABP9 had been deposited in GenBank, UniGene cluster

Hs.653176.

Acknowledgements

We thank the laboratory division of BGI Shenzhen for their assistance in

sequencing the DNA and RNA samples. We thank the ARK-Genomics (now Edinburgh

Genomics) sequencing and informatics teams including S. Smith, K. Troup, F. Turner

and J. Loecherbach. We thank H.A. Finlayson as well as staff in The Roslin Institute’s

Animal Services Division for their help with the gene expression study. We thank the

BCM-HGSC sequencing team including Y. Wu, I. Newsham, R. Thornton, P. Aqrawi, R.

Goodspeed, L. Jackson, C. Mandapat, Y. Zhu, N. Saada, L.-L. Pu, S. Gross, G. Fowler, J.

Deng, W. Hale and J. Santibanez. We thank the Otago University/AgResearch

sequencing team, T. Van Stijn, G. Payne, C.J. Rand and C. Mason. We thank B. Zhang

and Q. Chen (BGI-Shenzhen) for RNA-Seq analysis and X. Li (State Key Laboratory of

Genetic Resources and Evolution, Kunming Institute of Zoology) and G. Yin and T.

Deng (BGI-Shenzhen) for analysis of MeDIP-seq data. We thank the following for

contributing to the project by sending unpublished gene expression data to the annotation

project that were not used in the final gene models: R.L. Tellam and T. Vuocolo (CSIRO

Animal, Food and Health Sciences); P.K. Dearden and E.J. Duncan (Genetics Otago,

Otago University); H. Blair, P.R. Kenyon and S.J. Pain (Institute of Veterinary, Animal

and Biomedical Sciences, Massey University); I.E. Lindquist, C.W. Beattie and E.J.

Retzel (National Center for Genome Resources (NCGR)); C.A. Bidwell (Department of

Animal Sciences, Purdue University); C. Couldrey and P. Maclean (AgResearch); L. Yu

and D. Burt (The Roslin Institute); O.M. Keane (Animal & Bioscience Department,

Teagasc); J. Kantanen, K. Pokharel, M. Li and J. Peippo (Biotechnology and Food

Research, MTT Agrifood Research Finland); M. de Veer (BRL Innate Immunity

Laboratory, Dept. Physiology, Monash University); A. Bonnet and Gwenola Tosser-Klopp

(INRA, Laboratoire de Génétique Cellulaire); D.R. Herndon (USDA-ARS). We

thank M. Colgrave and H. Goswami (CSIRO Animal, Food and Health Sciences) for

proteomics data not included in the final manuscript. We thank the anonymous reviewers

and R. Xiang for their comments on the manuscript.

Fig. S1

Comparison of the sheep, goat and cattle genomes. Three dimensional plot of the

relationship between the sheep (Oar v3.1), goat (CHI1.0) and cattle (UMD3.1) genome

assemblies. Enlarged regions highlight the Robertsonian translocation between cattle

chromosome 14 and the telomeric end of cattle chromosome 9 to form sheep

chromosome 9 or goat chromosome 14, and the inversion on cattle chromosome 13

relative to the sheep and goat chromosome 13.

Fig. S2

Cross species gene analyses. Phylogenetic tree constructed from the 4-fold degenerate

sites of the 4,850 single-copy orthologous genes using the MrBayes program.

Fig. S3

The consensus sequence logos of the core 15 amino acid repeat of the predicted TCHHL2

protein. A: Platypus. B: Tasmanian devil. C: Shrew. D: Pig. Images were generated using

WebLogo.

Fig. S4

The consensus sequence logos of the core 15 amino acid repeat of the predicted TCHHL2

protein in sheep and cattle. Images were generated using WebLogo.

Fig. S5

The encoded protein sequences of LCE gene family members. A: LCE7A is very

different from the proteins encoded by the other LCE genes in the sheep genome, with

seven tandem three amino acid repeats "PQX", starting at position 26 of LCE7A. B: The

protein coding sequence alignment of LCE7A and the predicted products of its

orthologous genes in other mammals. The three amino acid repeat "PQX" is only present

in the ruminant group (sheep, goat and cattle) proteins.

Candidate cross-linking sites for transglutaminase

Fig. S6

Localization of LCE7A expression in adult Merino skin. Expression was localized by in situ hybridization with an EST, fs827.z1 (GenBank accession CF118671.1), using the methodology described by Adelson et al. (74); the images have not been published previously. Hair/wool follicle (hf) and inner root sheath (irs) are indicated. A: Original magnification 50x, negative 75_99_15. B: Original magnification 200x, negative 75_99_16.

Fig. S7

MOGAT2 expression level (log read number) in different tissues of sheep and goat.

Fig. S8

Sequence alignment between sheep and cattle MOGAT3 regions, and the expression level of cattle MOGAT3 genes. A: Syntenic relationship between sheep and cattle for the MOGAT3 locus. The pseudogenes are in black. The expressed MOGAT3 genes -2, -3 and -6 in sheep have almost identical sequences. On the genome lines, the light blue regions have a one-to-one relationship between the sheep and cattle genomes and the dark blue regions have a one-to-many or many-to-many relationship between the sheep and cattle genomes. B: MOGAT3 expression level (read number) for different tissues in cattle. The apparent expression in the skin of MOGAT3-1, predicted to be a pseudogene, indicates that some assembly errors may exist in this region of the Btau4.7 cattle assembly, which is also very different from the gapped MOGAT3 region in the UMD3.1 cattle genome assembly.

A

>MOGAT3 variant 1 with predicted membrane anchor

MAEGEHLGVSSTLPPTPSMKTLKKQWLEVLSTYQYVLCFCFLGLFFSLAGFLLLFTSLWY

LSVLYLVWLFLDWDTPQQGGRRYQWLRNWTAWKHLSDYFPLKLVKTAELPPDRNYVLGSH

PHGIMAVGTICNFATEGTGLSQVFPGLRFSLAVLNCLLYLPGHREYFLSCGACSVNRQSL

DYVLSQSQLGRAVVIVLGGANEALYAVPGEHCLTLRNRKGFVRLALRHGASLVPVYSFGE

NDIFRVKAFAPDSWQHLLQVTSKKLLSFCPCIFWGRGLFSAKSWGLLPLARPITTVVGRP

IPVPQCPQPTEEQVDHYHTLYMKALEQLFEEHKESCGLPASTRLTFI

>MOGAT3 variant 2 with alternate start codon no membrane anchor

MVFQTGPSCLWPHLILSSQDLGGEGRPFSWALAFSVFLPSSCPLPPQLVKTAELPPDRNY

VLGSHPHGIIAVGTICNFATEGTGLSQVFPGLRFSLAVLNCLLYLPGHREYFLSCGACSV

NRQSLDYVLSQSQLGRAVVIVLGGANEALYAVPGEHCLTLRNRKGFVRLALRHGASLVPV

YSFGENDIFRVKAFAPDSWQHLLQVTSKKLLSFCPCIFWGRGLFSAKSWGLLPLARPITT

VVGRPIPVPQCPQPTEEQVDHYHTLYMKALEQLFEEHKESCGLPASTRLTFI

B

Fig. S9

Alternatively spliced sheep MOGAT3 transcripts. A: Open reading frames predicted from RNA-Seq and EST data; the amino acid sequences are translated from the sequenced MOGAT3 BAC CH243-423F23. The amino acid sequences predicted to be specific to each isoform are highlighted in yellow. B: Expression of the two splice variants of MOGAT3-4 determined from RNA-Seq data from Gansu alpine fine wool sheep skin.

Fig. S10

Transmembrane domain predictions for MOGAT3 splice variants. A: MOGAT3-Contig26, membrane-anchored form. B: MOGAT3-Contig47, alternate start codon, no membrane anchor. Transmembrane domains were predicted as described in Supplementary methods section 2.26.

Fig. S12

Distribution of divergence of each type of transposable element (TE) in the sheep genome. A: The divergence rate was calculated between the TEs identified in the genome and the consensus sequence in the TE library used (Repbase release 16.02). B: The distribution of repeats and gene regions on sheep chromosomes.

Fig. S13

Cross species gene analyses. Venn diagram showing shared orthologous gene groups

among species of O. aries, C. hircus, B. mutus, B. taurus, S. scrofa, C. bactrianus, C.

familiaris, E. caballus, H. sapiens, M. musculus and M. domestica.

Fig. S14

Dynamic evolution of orthologous gene clusters in nine mammalian genomes. The

estimated numbers of orthologous groups (18,897) in the most recent common ancestral

species (MRCA) are shown at the root node. The numbers of orthologous groups that

expanded or contracted in each lineage are shown on the corresponding branch; +,

expansion; −, contraction.

Fig. S15

Gene Ontology term enrichment for the rapidly evolving genes (high Ka/Ks ratio) between sheep and cattle, using GOrilla (http://cbl-gorilla.cs.technion.ac.il/GOrilla/). The 12,680 orthologous gene pairs associated with a GO term were ranked by their Ka/Ks ratio. Enriched GO terms with p-value <1e-7 are reported here. Most of the enriched GO terms are related to immune response.

Fig. S16

Flowchart for the sheep genome assembly

Fig. S17

Distribution of the GC content of the gap sequences filled between the Oar v2.0 and Oar v3.1 assemblies. To enrich for gaps with high GC content, additional sequence data (~21-fold coverage of high-GC-biased data from the male Texel and ~1-fold coverage of MeDIP-seq data from the female Texel) were used in the construction of Oar v3.1.

Fig. S18

Identification of missing sequences at the start regions of sheep genes. The start, middle and end 60 bp of the CDS regions of 22,915 cattle protein-coding genes were mapped onto the Oar v2.0, Oar v3.1 and goat (Chi v1.0) genome assemblies using BLAST. The number of hits in each assembly was counted separately for the start, middle and end regions. Missing sequence at the start region of ~1700 genes was recovered between the Oar v2.0 and Oar v3.1 assemblies.
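
The extraction of the three 60 bp segments described in this legend can be sketched as follows. This is a minimal illustration rather than the pipeline used for the figure: the function name cds_probes is hypothetical, and taking the middle segment from the exact centre of the CDS is an assumption, because the legend does not state how the middle 60 bp were chosen. The subsequent BLAST search against each assembly is not shown.

```python
def cds_probes(cds, size=60):
    """Return the first, central and last `size` bases of a coding sequence.

    Assumption: the "middle" probe is centred on the CDS midpoint.
    """
    if len(cds) < size:
        raise ValueError("CDS is shorter than the requested probe size")
    mid_start = (len(cds) - size) // 2
    return {
        "start": cds[:size],
        "middle": cds[mid_start:mid_start + size],
        "end": cds[-size:],
    }


# Toy 180 bp sequence (not a real gene): three distinct 60 bp blocks.
toy_cds = "A" * 60 + "C" * 60 + "G" * 60
probes = cds_probes(toy_cds)
```

Counting, for each assembly, how many genes return a hit for each of the three probes gives the comparison shown in the figure.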

Fig. S19

Distribution of GC content for five mammalian genome assemblies. The X-axis

represents GC content and the Y-axis represents the proportion of sliding windows with

a given GC content. Sliding windows are 500 bp, with an overlap of 250 bp between two

adjacent windows. Differences in GC content distribution among relevant species can be

inferred from this graph. In general, species that are closely related are expected to

possess similar distribution curves. As predicted, the sheep (O. aries) and cattle (B.

taurus) have similar GC distributions.
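
The windowing scheme in this legend (500 bp windows, 250 bp overlap between adjacent windows) can be sketched in Python as below. This is an illustrative sketch only: the function names are hypothetical, and two details are assumptions not stated in the legend, namely that only full-length windows are scored and that ambiguous bases such as N count towards the window length.

```python
from collections import Counter


def gc_windows(seq, window=500, step=250):
    """GC fraction of each full-length sliding window along seq."""
    seq = seq.upper()
    fractions = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        fractions.append((chunk.count("G") + chunk.count("C")) / window)
    return fractions


def gc_distribution(fractions):
    """Proportion of windows at each GC percentage (rounded to whole percent)."""
    counts = Counter(round(f * 100) for f in fractions)
    return {gc: n / len(fractions) for gc, n in sorted(counts.items())}


# Toy sequence: 500 bp of G followed by 500 bp of A gives three windows.
windows = gc_windows("G" * 500 + "A" * 500)
distribution = gc_distribution(windows)
```

A step of 250 bp with a 500 bp window reproduces the 250 bp overlap; plotting the distribution for each genome gives curves of the kind compared in the figure.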

Fig. S20

Comparison between Oar v3.1 and 16 known BAC sequences deposited in NCBI

GenBank. The white regions indicate gaps in Oar v3.1, most of which are located in

repeat regions, especially in recently evolved repeats with high sequencing depth.

Fig. S21

Comparison of the two contigs of the BAC-based assembly of the Chinese Merino MHC

region with the two equivalent loci in the OARv3.1 assembly. A: MHC contig 1 vs

OARv3.1 chr20. B: MHC contig 2 vs OARv3.1 chr20. Based on the comparison, there is

no significant deletion on OARv3.1 chr20, but an extra sequence of ~200 kb is assembled

around OAR20 27.4Mb, which was also observed in cattle genome UMD3.1 chr23 and

goat CHI 1.0 chr 23, implying that Merino MHC contig 1 may have a gap around

position 414 Kb. The full MHC sequences in OARv3.1 are syntenic to the sequences in

the equivalent part of the cattle MHC region. The Chinese Merino MHC region BAC

sequences were obtained from GenBank, accession numbers FJ986852 - FJ985877.

Fig. S22

Identification of partial mitochondrial genome insertions at two genomic loci. Two

continuous and highly similar mitochondrial DNA segments were identified in sheep

genomic DNA, located on OARX 56.3Mb (with a length of 14kb) and OAR2 55.2Mb

(with a length of 9kb).

Primer: Forward: 5’-TAGGAAAAGCCTTAAAAGTCCA-3’ Reverse: 5’-GAATATTATGCTCCGTTGCTTC-3’

Primer: Forward: 5’-ATTTGGCCTACATCCATGACT-3’ Reverse: 5’-TGAGCACCTACTATATGTCAG-3’

Fig. S23

Identification of partial mitochondrial genome insertions at two genomic loci. A:

The two mitochondrial DNA insertions were checked using PCR; the length of the

PCR products matched the expected length based on the scaffold sequences. B:

Sequence of the PCR products from sheep genomic DNA confirming the mitochondrial

DNA insertion around 56.33 Mb on OARX. Black: genomic sequence; White:

mitochondrial sequence.
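
The expected product length for a primer pair can be derived from a scaffold sequence as sketched below. This is a hedged illustration, not the procedure used by the authors: the function names are hypothetical, only exact primer matches on the forward strand are considered, and the template here is a toy sequence built around the first primer pair listed for this figure rather than the real scaffold.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (non-ACGT symbols are kept as-is)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def expected_product_length(template, forward, reverse):
    """Length of the amplicon predicted from exact primer matches, or None."""
    template = template.upper()
    fwd_pos = template.find(forward.upper())
    if fwd_pos == -1:
        return None
    # The reverse primer anneals to the opposite strand, so search for its
    # reverse complement downstream of the forward primer.
    rev_site = reverse_complement(reverse)
    rev_pos = template.find(rev_site, fwd_pos + len(forward))
    if rev_pos == -1:
        return None
    return rev_pos + len(rev_site) - fwd_pos


forward = "TAGGAAAAGCCTTAAAAGTCCA"
reverse = "GAATATTATGCTCCGTTGCTTC"
# Toy template: the two primer sites separated by 100 unknown bases.
toy_template = forward + "N" * 100 + reverse_complement(reverse)
length = expected_product_length(toy_template, forward, reverse)
```

Comparing such a predicted length with the band size observed on a gel is the kind of check described for panel A.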

Fig. S24

Alignment grids showing the synteny-based chromosomal comparisons between Oar v3.1

and UMD3.1. The colinear relationship was identified for every chromosome by

NUCmer (http://mummer.sourceforge.net, NUCleotide MUMmer, version 3.06) with

parameter “-c 500”. Red segments represent homologous regions with consistent orders.

Fig. S25

The organization of the sheep and cattle X chromosomes. A: The sequences of the sheep

and cattle X chromosomes were aligned with the human ChrX sequence. Lines connect

regions of orthologous sequences. Different colors represent major blocks involved in

rearrangements. The PAR region is indicated in black and the centromere by a

constriction in the width of the chromosome. B: Analysis of junction between the PAR

region and the non-PAR region in sheep. The high GC content and high SNP level for the

male Texel from 0 to 7 Mb on chrX suggest that it contains the PAR region. The large

number of satellite II sequences (~2000 copies) and high methylation level (1458 mapped

MeDIP-seq reads) indicate that the centromere position is around 7.07 Mb, marked by

the purple dashed line. The paired BAC ends from sheep (CHORI-243, marked as blue

lines) and cattle (CHORI-240, marked as yellow lines) were mapped onto this region.

Cattle BACs could cross this boundary region, whereas sheep BACs could not, indicating

that a new chrX centromere originated on the ovine or bovine branch, after the separation

between sheep and cattle.

Fig. S26

Sheep lineage-specific evolutionary breakpoint regions (EBRs) on chromosome X.

Results are shown at a 300 kb resolution of homologous synteny block (HSB) detection,

the size of each "X" is proportional to the length of each HSB. The red lines indicate the

positions of the EBRs detected in sheep. A complete set of HSBs defined in our analysis

is available from the Evolution Highway comparative chromosome browser

(http://evolutionhighway.ncsa.uiuc.edu).

Fig. S27

The distributions of mRNA, CDS, exon and intron lengths of Ensembl annotated protein

coding genes in four genomes.

Fig. S28

Gene tree for artiodactyl lysozyme C genes. The phylogenetic ML tree of lysozyme C

genes calculated on the basis of the coding sequences from sheep (OAR), cattle (BTA)

and pig (SSC); the sheep lysozymes are named after their gene order on chromosome 3.

A secondary structure CCCCHHHHHHHHHHCCCCCCCCCCHHHHHHHHHHHHCCCCCCEEEECCCCEEEECCCCEECCCCC

1........10........20........30........40........50........60....

OARLYZ4 KVFERCELARTLKELGLDGYKGVSLANWLCLTKWESSYNTKATNYNPGSESTDYGIFQINSKWWC

OARLYZ5 KVFERCELARTLKKLGLDDYKGVSLANWLCLTKWESGYNTKATNYNPGSESTDYGIFQINSKWWC

BTA_ENSBTAG00000026088 KVFERCELARTLKKLGLDGYKGVSLANWLCLTKWESSYNTKATNYNPSSESTDYGIFQINSKWWC

BTA_ENSBTAG00000046628 KVFERCELARTLKKLGLDGYKGVSLANWLCLTKWESSYNTKATNYNPSSESTDYGIFQINSKWWC

BTA_ENSBTAG00000046511 KVFERCELARTLKKLGLDGYKGVSLANWLCLTKWESSYNTKATNYNPGSESTDYGIFQINSKWWC

OARLYZ6 KKFQRCELARTLKKLGLDGYKGVSLANWLCLTKWESGYNTKATNYNPSSENTDYGIFQINSKWWC

BTA_ENSBTAG00000026322 KVFERCELARTLKKLGLDGYKGVSLSKRLCLTKWESSYNTKATNYNPSNESTDYGIYQINSKWWC

BTA_ENSBTAG00000026323 KKFEKCELARTLRRYGLDGYKGVSLANWMCLTYGESRYNTRVTNYNPGSKSTDYGIFQINSKWWC

OARLYZ3 KKFERCELARTLRRFGLDGYNGVSLANWMCLIYGESRYNTQVTNYNPGSKSTDYGIFQINSKWWC

OARLYZ2 KKFERCELARTLKKFGLAGYKGVSLANWMCLAYGESRYNTQAINYNPGSKSTDYGIFQINSKWWC

BTA_ENSBTAG00000000198 KTFKRCELAKTLKNLGLAGYKGVSLANWMCLAEGESSYNTQAKNYNPGSKSTDYGIFQINSKWWC

BTA_ENSBTAG00000022971 KTFERCELARTLKNLGLAGYKGVSLANWMCLAKGESGYNTQAKNYSPGFKSTDYGIFQINSKWWC

BTA_ENSBTAG00000020564 KTFKRCELARTLKNLGLAGYKGVSLADWMCLAKGESSYNTQAKNFNRGSQSTDYGIFQINSKWWC

BTA_ENSBTAG00000039170 KKFQKCELARTLKRLGLDGYKGISLAKWVCLASWERSYNTCATNYNRGDKSSDYGIFQINSRRWC

OARLYZ1 KKFERCELARTLKRLGLDGYRGVSLANWMCLARWESNYNTRATNYNHGDKSTDYGIFQINSRWWC

BTA_ENSBTAG00000011941 KKFQRCELARTLKKLGLDGYRGVSLANWVCLARWESNYNTRATNYNRGDKSTDYGIFQINSRWWC

OARLYZ7 KVFERCELARTLKRFGMDGFRGISLANWMCLARWESSYNTQATNYNSGDRSTDYGIFQINSHWWC

BTA_ENSBTAG00000026779 KVFERCELARSLKRFGMDNFRGISLANWMCLARWESNYNTQATNYNAGDQSTDYGIFQINSHWWC

SSC_ENSSSCG00000000492 KVYDRCEFARILKKSGMDGYRGVSLANWVCLAKWESDFNTKAINHNVG--STDYGIFQINSRYWC

secondary structure CCCCCCCCCCCCCCCHHHHHCCCCHHHHHHHHHHHCCCCHHHHCHHHHHHCCCCCCHHHCCCCCC

....70.......80.........90........100........110.......120.......

OARLYZ4 NDGKTPNAVDGCHVSCSELMENNIAKAVACAKHIVSE-QGITAWVAWKSHCRDHDVSSYVEGCSL

OARLYZ5 NDGKTPNAVDGCHVSCSALMENDIEKAVACAKHIVSE-QGITAWVAWKSHCRDHDVSSYVEGCTL

BTA_ENSBTAG00000026088 NDGKTPNAVDGCHVSCSELMENDIAKAVACAKHIVSE-QGITAWVAWKSHCRDHDVSSYVEGCTL

BTA_ENSBTAG00000046628 NDGKTPNAVDGCHVSCSELMENDIAKAVACAKHIVSE-QGITAWVAWKSHCRDHDVSSYVQGCTL

BTA_ENSBTAG00000046511 NDGKTPNAVDGCHVSCSELMENDIAKAVACAKQIVSE-QGITAWVAWKSHCRDHDVSSYVEGCTL

OARLYZ6 NDGKTPKAVDGCHVSRSELMENDIAKAVTCAKKIVSE-QGITVWVAGKSHCRDHDISSYVEGCTL

BTA_ENSBTAG00000026322 ---KTPKAVDGCPVSHSKLMGNDIAKAVACAKKIVSE-QGITAWVAWKSHCRDHDVSSYVEGCTL

BTA_ENSBTAG00000026323 NDGKTPKAVNGCGVSCSAMLKDDITQAVACAKTIVSR-QGITAWVAWKNKCRNRDVSSYIRGCKL

OARLYZ3 NDGKTPRAVNGCGVSCSALLKDDITQAVACAKKIVSR-QGITAWVAWKNNCRNRNVSSYIQGCKL

OARLYZ2 NDGKTPKAVNGCGVSCSALLKDDITQAVACAKKIVSQ-QGITAWVAWKNNCQNRDVTSYVKGCGV

BTA_ENSBTAG00000000198 NDGKTPKAVNGCGVSCSALLKDDITQAVACAKKIVSQ-QGITAWVAWTNKCRNRDLTSYVKGCGV

BTA_ENSBTAG00000022971 NDGKTPKAVNGCGVSCSALLKDDITQAVACAKKIVSQ-LGLTAWVAWKNKCQNRDLTSYVQGCRV

BTA_ENSBTAG00000020564 NDGKTPNAVNGCGVSCSALLKDDITQAVACAKKIVSQ-QGLTAWVAWKNNCRNRDLTSYVQGCGV

BTA_ENSBTAG00000039170 NDGKTPRAVNACRIPCSALLKDDITQAVASAKK-VSDPQGVRAWVVWRNKCQNQDLRSYVQDCGV

OARLYZ1 NDGKTPRAVNACRIPCSALLKDDITQAVECAKRVVRDPQGIKAWVAWRNKCQNKDLRSYVKGCRV

BTA_ENSBTAG00000011941 NDGKTPKAVNACRIPCSALLKDDITQAVACAKRVVRDPQGIKAWVAWRNKCQNRDLRSYVQGCRV

OARLYZ7 NDGKTPGAVNACHIPCSALLQDDITQAVACAKRVVSDPQGIRAWVAWRSHCQNQDLTSYIQGCGV

BTA_ENSBTAG00000026779 NDGKTPGAVNACHLPCGALLQDDITQAVACAKRVVSDPQGIRAWVAWRSHCQNQDLTSYIQGCGV

SSC_ENSSSCG00000000492 NDGKTPKAVNACHISCKVLLDDDLSQDIECAKRVVRDPLGVKAWVAWRAHCQNKDVSQYIRGCKL

B LYZ4 expressed in abomasum LYZ1 expressed in rumen

Fig. S29

Amino acid sequence alignment and predicted 3D structure of sheep rumen and abomasum expressed lysozymes. A: Protein sequence alignment. A commonly deleted amino acid in the digestive lysozymes (mainly expressed in stomach and intestine), the proline at position 103, is indicated in yellow. B: Based on 3D structure prediction by 3D-JIGSAW (version 2.0), we noticed a longer alpha helix around position 103 in digestive lysozymes than in other antibacterial lysozymes, which accords with the role of proline as an alpha-helix structural disruptor.

Fig. S30

A: The phylogenetic Maximum Composite Likelihood (ML) tree of PAG genes, calculated on the basis of the coding sequences. The red line labels pepsinogen A (PGA5). B: Expression level (log read number) for different tissues in Texel sheep. The red triangle labels PGA5, whereas the green triangles show the PAGs.

Fig. S31

The expression level of genes encoding prolactin-related proteins, generated using log

read number from Texel ewe placenta and uterus tissues.

Fig. S32

Gene tree for ovine and bovine prolactin genes. Ovine genes are shown in purple and

bovine genes in blue. The numbers on the branches are Ka/Ks ratio calculated by the

PAML branch model. Placental prolactin-related protein (PRP); prolactin (PRL);

Chorionic somatomammotropin hormone (CSH). The cattle genes are from Ensembl

annotation of UMD3.1, and the sheep genes are from Oar v3.1.

>LOC100300593

MSHHPHPHPHPHPHPHPHPHPHPHQHQHHHQCKVPCHPPPKVCPPKCHEPCPPHPCPSP

PSQKKCPPGPPCPPCEQKCPPKWK

>LOC100848030

MSQQQHPHPHQHQHQHHHQCKEPCHPPPKVCPPKCHEPCPPHPCPPPLGQKKCPPGPPC

PPCEQKCPPKWK

>LOC100848041

MSHPPHPHPHPHPHPHPHQHHHQCKEPCHPPPKVCPPKCHEPCPPHPCPSPPSQKKCPP

GPPCPPCKQKCPPKWK

>LOC100848051

MSHHQHPHPHQNQHQHHHQCKEPCHPPPKVCPPKCHEPCPPHPCPPPLSQKKCPPGPPC

PPCEHKCPPKWK

>LOC100848464

MSHHPHPHPHLHPHQHHHQCKEPCHPPPIVCPRKCQEPCPPHPCPPPLGQKKCPPVPPC

PPCEHKCPPKWK

>LOC100297713

MSNHPHPHPHPHQHQHHHHHQCKEPCHPPPKVCPPKCHEPCPPHPCPSPPSQKKCPPGP

PCPPCEQKCPPKWK

>LOC100848091

MSHHPHPHPHPHPHPHQHHHQCKEPCHPPPKVCPPKCHKPCPPHPCPPPLGQKKCPPGP

PCPPCEQKCPPKWK

>LOC100301420

MSHHPHPHPHPHPHPHPHPHPHQCKEPCHPPPKVCPPKCHEPCPPHPCPPPLGQKKCPP

GPPCPPCEQKCPPKWK

Fig. S33

Predicted amino acid sequences of NCBI cattle RefSeq PRD-SPRRII-related loci entries

in the EDC region.

Table S1

Summary of sequencing datasets used for the assembly of the sheep genome.

Sample    Purpose   Sequence   Paired-end   Libraries  GA     Total        Read         Coverage
                    method     insert size             lanes  length (Gb)  length (bp)  (fold)
Female    assembly  Illumina   180 bp       1          4      23.8         101          7.93
Female    assembly  Illumina   350 bp       4          21     105.0        101          35.0
Female    assembly  Illumina   800 bp       2          6      32.0         101          10.7
Female    assembly  Illumina   2 kb         2          11     35.7         45           11.9
Female    assembly  Illumina   5 kb         2          6      18.5         45           6.17
Female    assembly  Illumina   9 kb         1          3      8.3          45           2.77
Female    assembly  Illumina   17 kb        1          1      1.8          45           0.60
Female*   fill gap  Illumina   200 bp       1          1      2.0          45           0.67
Male      fill gap  Illumina   200 bp       1          16     77           101          24.0
Male      fill gap  Illumina   500 bp       1          24     72           101          25.5
Male**    fill gap  Illumina   554 bp       8          1      36           101          12.0
Male**    fill gap  Illumina   1.3 kb       1          1      27           101          9.00
Male      check     Roche 454  8 kb                           3.3                       1.10
Male      check     Roche 454  20 kb                          1.5                       0.50
Male      check     Sanger     184 kb                         0.3          687          0.09
Six***    fill gap  Roche 454  ---                            9.0                       3.00

* MeDIP-seq for high GC content sequence.

** New Illumina protocol for GC content unbiased sequence.

*** Six females of different breeds (Merino, Texel, Awassi, Poll Dorset, Romney and Scottish Blackface) were sequenced at 0.5 fold coverage each; these data were used to generate the early draft sheep genome, Oar v1.0.

Table S4

De novo assembly results for Oar v3.1 scaffolds.

                         Contig Size     Contig Number   Scaffold Size   Scaffold Number
N90                      10,562          65,304          474,708         1,301
N80                      17,604          46,985          883,008         913
N70                      24,477          34,817          1,261,951       666
N60                      31,783          25,732          1,673,097       486
N50                      39,959          18,614          2,231,873       350
N40                      49,777          12,913          2,717,352       244
N30                      61,672          8,325           3,371,060       158
N20                      77,817          4,653           4,205,983       89
N10                      105,369         1,817           5,600,573       36
Maximum length           383,429                         11,902,472
Total Size               2,534,293,732                   2,606,199,298
Total Number (>100 bp)                   131,971                         8,261
Total Number (>2 kb)                     106,156                         6,767

Table S8

Comparative gene cluster statistics.

Species         Number     Genes in    Unclustered  Family   Unique    Average
                of genes   clustered   genes        number   families  genes per
                           families                                    family
O. aries        20,908     16,703      581          16,122   28        1.26
C. hircus       22,175     18,221      2,315        15,906   22        1.25
B. mutus        22,282     16,950      629          16,321   33        1.33
S. scrofa       17,433     14,012      934          13,078   32        1.26
B. taurus       19,970     15,988      104          15,884   0         1.25
C. familiaris   19,258     16,092      389          15,703   11        1.20
E. caballus     20,419     15,807      222          15,585   20        1.30
H. sapiens      21,375     16,852      841          16,011   75        1.28
M. musculus     22,927     17,010      1,049        15,961   88        1.37
C. bactrianus   20,251     17,102      4,014        13,088   106       1.24
M. domestica    19,439     19,439      1,150        16,037   183       1.14

Table S12

Statistics of the completeness of the assembled sheep genome based on 248 CEGs¹.

              Number of CEGs   Mapped proteins   Completeness (%)
Complete²     248              246 (246)³        99.19 (99.19)
  Group 1     66               64 (66)           96.97 (100.00)
  Group 2     56               56 (56)           100.00 (100.00)
  Group 3     61               61 (59)           100.00 (96.72)
  Group 4     65               65 (65)           100.00 (100.00)
Partial²      248              247 (247)         99.60 (99.60)
  Group 1     66               65 (66)           98.48 (100.00)
  Group 2     56               56 (56)           100.00 (100.00)
  Group 3     61               61 (60)           100.00 (98.36)
  Group 4     65               65 (65)           100.00 (100.00)

¹ The CEGs database contains groups of genes from the following species: Homo sapiens, Drosophila melanogaster, Arabidopsis thaliana, Caenorhabditis elegans, Saccharomyces cerevisiae and Schizosaccharomyces pombe.

² The CEGs were classed into 4 groups based on their conservation.

³ The results are for Oar v3.1 (results in parentheses are for the cattle genome assembly UMD3.1).

Table S13

Transposable element content of Oar v3.1.

Repeat Class¹               Length (bp)      Coverage (%)
LINE                        728,855,852      27.83
  L1                        319,721,973      12.21
  RTE (BovB)                359,281,823      13.72
  L2                        26,886,976       1.03
  CR1                       2,069,573        0.08
  other                     20,895,507       0.80
SINEs                       177,574,932      6.78
  BOV-A                     86,706,307       3.31
  tRNA                      60,289,254       2.30
  MIR                       29,844,110       1.14
  Other                     735,261          0.03
LTR                         124,398,733      4.75
  ERVs                      121,689,503      4.65
  LTR other                 2,709,230        0.10
DNA transposon              59,832,665       2.28
Other                       26,753,185       1.02
Interspersed repeat total   1,117,415,367    42.67

¹ Classes were assigned according to homology to known transposable elements in the Repbase database (http://www.girinst.org/repbase/, Release 16.02).
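
The internal consistency of the lengths in this table can be checked with a short script. The sketch below is illustrative only: the variable and function names are hypothetical, and the grouping of subclasses under LINE, SINEs and LTR is our reading of the table layout. The last function backs out the sequence length implied by a length and its rounded coverage percentage, so its result is approximate.

```python
# Lengths in bp copied from Table S13: (class total, {subclass: length}).
TE_LENGTHS = {
    "LINE": (728_855_852, {
        "L1": 319_721_973, "RTE (BovB)": 359_281_823, "L2": 26_886_976,
        "CR1": 2_069_573, "other": 20_895_507,
    }),
    "SINEs": (177_574_932, {
        "BOV-A": 86_706_307, "tRNA": 60_289_254, "MIR": 29_844_110,
        "Other": 735_261,
    }),
    "LTR": (124_398_733, {"ERVs": 121_689_503, "LTR other": 2_709_230}),
    "DNA transposon": (59_832_665, {}),
    "Other": (26_753_185, {}),
}
INTERSPERSED_TOTAL = 1_117_415_367


def totals_consistent(table, grand_total):
    """True if subclasses sum to each class total and classes to the grand total."""
    for total, parts in table.values():
        if parts and sum(parts.values()) != total:
            return False
    return sum(total for total, _ in table.values()) == grand_total


def implied_sequence_length(length_bp, coverage_percent):
    """Sequence length implied by a repeat length and its coverage percentage."""
    return length_bp / (coverage_percent / 100)


consistent = totals_consistent(TE_LENGTHS, INTERSPERSED_TOTAL)
implied = implied_sequence_length(INTERSPERSED_TOTAL, 42.67)
```

With the values as printed, the subclass and class lengths sum exactly to their totals, and the coverage column implies a reference length of roughly 2.6 Gb.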

Table S14

Comparative gene statistics.

Gene set       Number   Average      Average   Average    Average   Average
                        transcript   CDS       exons      exon      intron
                        length       length    per gene   length    length
                        (bp)         (bp)                 (bp)      (bp)
Ovis aries     20,921   35,331.80    1,559.02  9.61       162.15    3,918.68
Bos taurus     19,970   35,399.92    1,610.89  9.65       167.01    3,908.19
Homo sapiens   21,375   47,027.29    1,660.16  9.56       173.71    5,300.82
Mus musculus   22,927   35,443.23    1,550.56  8.65       179.28    4,430.11

Table S15

Annotated genes in the Ensembl sheep gene list.

                        Number   Percent (%)
Total genes             20921    100.00
Total annotated genes   20782    99.34
  Swissprot             20444    97.72
  TrEMBL                20757    99.22
  InterPro              18028    86.17
  KEGG                  15841    75.72
  GO                    15068    72.02
Unannotated             139      0.66

Table S16

Classes of non-coding RNAs in the Ensembl sheep annotation.

Class      Number   Length (bp)
rRNA       305      34,441
snoRNA     756      86,792
misc_RNA   361      70,855
miRNA      1,305    113,558
snRNA      1,234    136,240
Total      3,961    441,886

Table S17

De novo assembly results for Oar v2.0 scaffolds.

        Contig size (bp)   Contig number   Scaffold size (bp)   Scaffold number
N90     3,593              158,224         202,439              2,823
N80     6,988              109,238         406,941              1,900
N70     10,219             79,515          618,490              1,358
N60     13,604             58,124          841,456              983
N50*    17,369             41,706          1,079,158            696
N40     21,802             28,718          1,379,732            474
N30     27,487             18,387          1,760,652            300
N20     35,087             10,218          2,326,727            166
N10     48,038             3,994           3,251,999            68

                         Contig          Scaffold
Maximum length (bp)      271,172         7,616,799
Total size (bp)          2,523,285,863   2,709,936,754
Total number (>100 bp)   794,490         490,776
Total number (>2 kb)     194,049         8,115

*N50 is the size of the contig/scaffold such that 50% of the total assembly bases are in contigs/scaffolds of this length or longer. 6.9% of the scaffold sequence is “N” (gap).
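The N50 definition in the footnote generalises to any Nx value in the table. The following is a minimal sketch of that calculation, not the assembler's own code: the function name `nx` and the toy contig lengths are hypothetical, and the returned count is assumed to correspond to the "number" columns of Table S17 (how many contigs or scaffolds are needed to reach the given share of bases).

```python
from typing import Iterable, Tuple


def nx(lengths: Iterable[int], x: int = 50) -> Tuple[int, int]:
    """Return (Nx length, count) for a collection of sequence lengths.

    Nx length: the length L such that sequences of length >= L hold at
    least x% of all bases. count: how many sequences were needed to
    reach that share, going from the longest to the shortest.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one sequence length")
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length, count
    raise AssertionError("unreachable for 0 <= x <= 100")


# Hypothetical toy assembly (not data from the paper): nine contigs, 54 bp in total.
toy_contigs = [10, 9, 8, 7, 6, 5, 4, 3, 2]
n50_length, n50_count = nx(toy_contigs, 50)
n90_length, n90_count = nx(toy_contigs, 90)
print(n50_length, n50_count)  # 8 3
print(n90_length, n90_count)  # 4 7
```

In the toy example the three longest contigs (10 + 9 + 8 = 27 bp) hold exactly half of the 54 bp, so N50 is 8 with a count of 3. N90 is reached only after seven contigs, which mirrors the way the counts in Table S17 grow from N10 to N90.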

Table S18

Super-scaffold assemblies used for the chromosome assembly.

Super scaffold   Size (bp)       Number
N50              37,056,980      23
N60              28,759,665      30
N70              21,637,475      41
N80              14,866,675      55
N90              6,068,238       83
Total number                     349
Total size       2,569,509,652

Table S19

Assessing genome assembly using 52,821 de novo assembled model mRNAs from the RNA-seq data from the 7 tissues from the female Texel sheep.

                                   Oar v2.0                                   Oar v3.1
Length coverage status             Mapped mRNAs   % of total   Average        Mapped mRNAs   % of total   Average
when mapping ESTs                  (number)       mRNAs        coverage (%)   (number)       mRNAs        coverage (%)
                                                               by length                                  by length
Single hit with 90% coverage       45357          85.9         99.5           47221          89.4         99.6
Single hit with 20~90% coverage    3171           6            84             2410           4.6          85.4
Best hit of multiple hits          4088           7.7          99             2833           5.4          96
In total for matched mRNAs         52616          99.6         98             52464          99.3         98.4
Unmapped mRNAs                     205            -            -              357            -            -

Table S20

Genome assembly QC using BAC sequences – alignment statistics.

BAC ID       BAC length (bp)   Coverage   No. of alignment blocks   Matched Oar v3.1   Gap number   Gap length (bp)   Gap ratio
AC148038.3   198,973           100%       23                        OAR4               4            2,505             1.26%
AC147928.3   190,282           100%       30                        OAR4               12           3,672             1.93%
AC147927.3   187,152           100%       17                        OAR4               8            1,197             0.64%
AC147929.3   186,963           100%       32                        OAR4               7            5,943             3.18%
AC147930.2   176,451           100%       22                        OAR4               4            709               0.40%
AC152892.3   162,012           100%       14                        OAR4               4            3,684             2.27%
AC147844.3   156,221           100%       31                        OAR4               10           5,663             3.62%
AC148245.3   153,498           100%       32                        OAR4               7            5,063             3.30%
AC147843.3   151,538           100%       17                        OAR4               3            1,149             0.76%
AC148039.3   143,308           100%       12                        OAR4               4            4,579             3.20%
AC147842.3   139,569           100%       10                        OAR4               1            186               0.13%
AC147841.3   131,638           100%       13                        OAR4               4            2,536             1.93%
AC148115.3   95,794            100%       13                        OAR4               5            6,281             6.56%
AC162117.3   80,882            100%       3                         OAR4               0            0                 0.00%
HM355886.3   78,116            100%       10                        OAR17              2            1,661             2.13%
AC159152.3   30,087            100%       2                         OAR4               0            0                 0.00%
Total        2,262,484         100%       281                                          75           44,828            1.98%
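The gap ratio column of Table S20 is consistent with gap length divided by BAC length, expressed as a percentage. The short sketch below illustrates that reading for three BACs and for the totals row; the formula is an inference from the tabulated values rather than a definition given in the text, and `gap_ratio_percent` is a hypothetical helper name.

```python
def gap_ratio_percent(gap_length_bp: int, bac_length_bp: int) -> float:
    """Gap length as a percentage of the BAC length (assumed definition)."""
    if bac_length_bp <= 0:
        raise ValueError("BAC length must be positive")
    return 100.0 * gap_length_bp / bac_length_bp


# (BAC ID, BAC length in bp, gap length in bp), copied from Table S20.
rows = [
    ("AC148038.3", 198_973, 2_505),
    ("AC147842.3", 139_569, 186),
    ("AC162117.3", 80_882, 0),
]

for bac_id, bac_length, gap_length in rows:
    print(f"{bac_id}: {gap_ratio_percent(gap_length, bac_length):.2f}%")

# Totals row: 44,828 bp of gaps over 2,262,484 bp of BAC sequence.
total_ratio = gap_ratio_percent(44_828, 2_262_484)
print(f"all 16 BACs: {total_ratio:.2f}%")  # all 16 BACs: 1.98%
```

Note that the overall figure is the ratio of the summed lengths, not the mean of the per-BAC ratios, so long BACs weigh more heavily than short ones.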

Table S21

Genome assembly QC using BAC sequences – repeat sequences.

Repeat class        Ratio of gap sequences
LTR/Copia           0.03%
TandemRepeat        0.85%
LTR/ERVL            0.08%
SINE/MIR            0.10%
SINE/tRNA-Glu       0.17%
DNA/MULE-MuDR       0.01%
LINE/L2             0.02%
DNA/Sola            0.07%
LTR/ERVK            4.04%
LINE/L1             8.12%
LTR/ERV1            0.01%
DNA/PIF-Harbinger   0.04%
LINE/CR1            0.01%
DNA/hAT-Charlie     0.41%
DNA/CMC-EnSpm       0.09%
DNA/CMC-Transib     0.09%
LTR/Gypsy           0.04%
DNA/DNA             0.04%
LTR                 0.72%
LINE/RTE-BovB       82.49%
SINE/BovA           2.58%

Table S22

Comparison of Oar v3.1 with other mammalian reference genome assemblies.

                        Horse        Pig            Cattle    Yak          Goat      Sheep
                        EquCab 2.0   Sscrofa 10.2   UMD 3.1   BosGru 2.0   Chi 1.0   Oar v3.1
Coverage                6.8×         25×            7.1×      65×          66×       150×
Total scaffold length   2.47 Gb      2.8 Gb         2.67 Gb   2.65 Gb      2.64 Gb   2.61 Gb
N50 contig              112 kb       73 kb          97 kb     20 kb        19 kb     40 kb
Heterozygosity rate     1/2000       ---            1/1700    1/1200       1/800     1/500
Anchor to chromosome    96%          93%            99%       ---          95%       99%
Repeat ratio            49.5%        48.2%          46.5%     41.8%        42.2%     42.7%

Sequence clones
  Horse, Pig, Cattle (in column order): Fosmid, BAC, Fosmid, shotgun, BAC, shotgun
  Yak:   150-500 bp, 2 kb, 5 kb, 10 kb, 20 kb
  Goat:  150-800 bp, 2 kb, 5 kb, 10 kb, 20 kb, 40 kb
  Sheep: 150-800 bp, 2 kb, 5 kb, 10 kb, 20 kb, BAC

Table S23

Sheep segmental duplications from WGAC (>90% identity) analysis.

Cutoff   Blocks   Length (Mb)
>1 kb    16,249   33.83
>5 kb    721      5.26
>10 kb   42       0.83

Table S24

Sheep segmental duplications from WSSD combined with WGAC (>95% identity) analysis.

Cutoff   Blocks   Length (Mb)
>1 kb    7,912    25.77
>5 kb    1,097    13.39
>10 kb   434      8.85

Table S27

Bovine PRD-SPRRII family genes in the EDC region.

NCBI Gene      UniGene cluster   Number of sequences   Site(s) of expression
LOC100297713   Bt.98911          11                    predominantly reticulum
LOC100300593   Bt.23535          142                   predominantly rumen
               Bt.99623          2                     rumen
               Bt.67198          30                    predominantly rumen
               Bt.99621          1                     rumen
               Bt.109358         46                    rumen
LOC100301420   Bt.102924         1                     rumen
LOC100848030   Bt.9678           504                   predominantly rumen
LOC100848041   Bt.92420          31                    predominantly rumen
LOC100848051   Bt.92418          47                    predominantly rumen
               Bt.99625          6                     rumen
LOC100848091   Bt.20608          87                    predominantly rumen
LOC100848464   not in UniGene

Captions for additional figures and tables

Fig. S11

Robust sheep RH map versus the Oar v3.1 assembly.

Table S2

Sheep linkage map SM5.

Table S3

Robust sheep RH map.

Table S5

List of segmental duplications.

Table S6

Breakpoints between the sheep Oar v3.1 and cattle UMD3 assemblies defined by BACs.

Table S7

List of tissue samples used to generate the RNA-Seq data used in the Ensembl annotation.

Table S9

Gene family expansions in ruminants.

Table S10

EDC region gene annotation sheep.

Table S11

Lipid gene expression in sheep skin.

Table S25

Gene name to Ensembl gene id translation.

Table S26

Genes with allele specific gene expression.

References

1. R. R. Hofmann, Evolutionary steps of ecophysiological adaptation and diversification of ruminants: A comparative view of their digestive system. Oecologia 78, 443–457 (1989). doi:10.1007/BF00378733

2. M. J. Wolin, Fermentation in the rumen and human large intestine. Science 213, 1463–1468 (1981). doi:10.1126/science.7280665 Medline

3. T. J. Hackmann, J. N. Spain, Invited review: Ruminant ecology and evolution: perspectives useful to ruminant livestock research and production. J. Dairy Sci. 93, 1320–1334 (2010). doi:10.3168/jds.2009-2071 Medline

4. C. A. E. Strömberg, Evolution of grasses and grassland ecosystems. Annu. Rev. Earth Planet. Sci. 39, 517–544 (2011). doi:10.1146/annurev-earth-040809-152402

5. E. J. Edwards, C. P. Osborne, C. A. Strömberg, S. A. Smith, W. J. Bond, P. A. Christin, A. B. Cousins, M. R. Duvall, D. L. Fox, R. P. Freckleton, O. Ghannoum, J. Hartwell, Y. Huang, C. M. Janis, J. E. Keeley, E. A. Kellogg, A. K. Knapp, A. D. Leakey, D. M. Nelson, J. M. Saarela, R. F. Sage, O. E. Sala, N. Salamin, C. J. Still, B. Tipple; C4 Grasses Consortium, The origins of C4 grasslands: Integrating evolutionary and ecosystem science. Science 328, 587–591 (2010). doi:10.1126/science.1177216 Medline

6. E. N. Bergman, Energy contributions of volatile fatty acids from the gastrointestinal tract in various species. Physiol. Rev. 70, 567–590 (1990). Medline

7. K. A. Johnson, D. E. Johnson, Methane emissions from cattle. J. Anim. Sci. 73, 2483–2492 (1995). Medline

8. M. E. Stewart, in Biology of the Integument, J. Bereiter-Hahn, A. G. Matoltsy, K. S. Richards, Eds. (Springer, Berlin Heidelberg, 1986), vol. 2, chap. 43, pp. 824–832.

9. M. L. Schlossman, J. P. McCarthy, Lanolin and its derivatives. J. Am. Oil Chem. Soc. 55, 447–450 (1978). doi:10.1007/BF02911911

10. Materials and methods are available as supplementary material on Science Online

11. A. Clop, F. Marcq, H. Takeda, D. Pirottin, X. Tordoir, B. Bibé, J. Bouix, F. Caiment, J. M. Elsen, F. Eychenne, C. Larzul, E. Laville, F. Meish, D. Milenkovic, J. Tobin, C. Charlier, M. Georges, A mutation creating a potential illegitimate microRNA target site in the myostatin gene affects muscularity in sheep. Nat. Genet. 38, 813–818 (2006). doi:10.1038/ng1810 Medline

12. M. Kypriotou, M. Huber, D. Hohl, The human epidermal differentiation complex: Cornified envelope precursors, S100 proteins and the ‘fused genes’ family. Exp. Dermatol. 21, 643–649 (2012). doi:10.1111/j.1600-0625.2012.01472.x Medline

13. L. Wang, R. L. Baldwin, 6th, B. W. Jesse, Identification of two cDNA clones encoding small proline-rich proteins expressed in sheep ruminal epithelium. Biochem. J. 317, 225–233 (1996). Medline

14. H. J. Song, G. Poy, N. Darwiche, U. Lichti, T. Kuroki, P. M. Steinert, T. Kartasova, Mouse Sprr2 genes: A clustered family of genes showing differential expression in epithelial tissues. Genomics 55, 28–42 (1999). doi:10.1006/geno.1998.5607 Medline

15. J. Deng, R. Pan, R. Wu, Distinct roles for amino- and carboxyl-terminal sequences of SPRR1 protein in the formation of cross-linked envelopes of conducting airway epithelial cells. J. Biol. Chem. 275, 5739–5747 (2000). doi:10.1074/jbc.275.8.5739 Medline

16. P. M. Steinert, E. Candi, T. Kartasova, L. Marekov, Small proline-rich proteins are cross-bridging proteins in the cornified cell envelopes of stratified squamous epithelia. J. Struct. Biol. 122, 76–85 (1998). doi:10.1006/jsbi.1998.3957 Medline

17. K. R. Feingold, Thematic review series: Skin lipids. The role of epidermal lipids in cutaneous permeability barrier homeostasis. J. Lipid Res. 48, 2531–2546 (2007). doi:10.1194/jlr.R700013-JLR200 Medline

18. D. Marshall, M. J. Hardman, K. M. Nield, C. Byrne, Differentially expressed late constituents of the epidermal cornified envelope. Proc. Natl. Acad. Sci. U.S.A. 98, 13031–13036 (2001). doi:10.1073/pnas.231489198 Medline

19. F. P. W. Radner, S. Grond, G. Haemmerle, A. Lass, R. Zechner, Fat in the skin: Triacylglycerol metabolism in keratinocytes and its role in the development of neutral lipid storage disease. Dermatoendocrinology 3, 77–83 (2011). doi:10.4161/derm.3.2.15472 Medline

20. D. Cheng, T. C. Nelson, J. Chen, S. G. Walker, J. Wardwell-Swanson, R. Meegalla, R. Taub, J. T. Billheimer, M. Ramaker, J. N. Feder, Identification of acyl coenzyme A:monoacylglycerol acyltransferase 3, an intestinal specific enzyme implicated in dietary fat absorption. J. Biol. Chem. 278, 13611–13614 (2003). doi:10.1074/jbc.C300042200 Medline

21. A. M. Hall, K. Kou, Z. Chen, T. A. Pietka, M. Kumar, K. M. Korenblat, K. Lee, K. Ahn, E. Fabbrini, S. Klein, B. Goodwin, B. N. Finck, Evidence for regulated monoacylglycerol acyltransferase expression and activity in human liver. J. Lipid Res. 53, 990–999 (2012). doi:10.1194/jlr.P025536 Medline

22. A. Kazantseva, A. Goltsov, R. Zinchenko, A. P. Grigorenko, A. V. Abrukova, Y. K. Moliaka, A. G. Kirillov, Z. Guo, S. Lyle, E. K. Ginter, E. I. Rogaev, Human hair growth deficiency is linked to a genetic defect in the phospholipase gene LIPH. Science 314, 982–985 (2006). doi:10.1126/science.1133276 Medline

23. A. Inoue, N. Arima, J. Ishiguro, G. D. Prestwich, H. Arai, J. Aoki, LPA-producing enzyme PA-PLA₁α regulates hair follicle development by modulating EGFR signalling. EMBO J. 30, 4248–4260 (2011). doi:10.1038/emboj.2011.296 Medline

24. G. Bobe, J. W. Young, D. C. Beitz, Invited review: Pathology, etiology, prevention, and treatment of fatty liver in dairy cows. J. Dairy Sci. 87, 3105–3124 (2004). doi:10.3168/jds.S0022-0302(04)73446-3 Medline

25. D. L. Ingle, D. E. Bauman, U. S. Garrigus, Lipogenesis in the ruminant: In vivo site of fatty acid synthesis in sheep. J. Nutr. 102, 617–623 (1972). Medline

26. A. M. Crawford, K. G. Dodds, A. J. Ede, C. A. Pierson, G. W. Montgomery, H. G. Garmonsway, A. E. Beattie, K. Davies, J. F. Maddox, S. W. Kappes, R. T. Stone, T. C. Nguyen, J. M. Penty, E. A. Lord, J. E. Broom, J. Buitkamp, W. Schwaiger, J. T. Epplen, P. Matthew, M. E. Matthews, D. J. Hulme, K. J. Beh, R. A. McGraw, C. W. Beattie, An autosomal genetic linkage map of the sheep genome. Genetics 140, 703–724 (1995). Medline

27. B. P. Dalrymple, E. F. Kirkness, M. Nefedov, S. McWilliam, A. Ratnakumar, W. Barris, S. Zhao, J. Shetty, J. F. Maddox, M. O’Grady, F. Nicholas, A. M. Crawford, T. Smith, P. J. de Jong, J. McEwan, V. H. Oddy, N. E. Cockett; International Sheep Genomics Consortium, Using comparative genomics to reorder the human genome sequence into a virtual sheep genome. Genome Biol. 8, R152 (2007). doi:10.1186/gb-2007-8-7-r152 Medline

28. Y. Dong, M. Xie, Y. Jiang, N. Xiao, X. Du, W. Zhang, G. Tosser-Klopp, J. Wang, S. Yang, J. Liang, W. Chen, J. Chen, P. Zeng, Y. Hou, C. Bian, S. Pan, Y. Li, X. Liu, W. Wang, B. Servin, B. Sayre, B. Zhu, D. Sweeney, R. Moore, W. Nie, Y. Shen, R. Zhao, G. Zhang, J. Li, T. Faraut, J. Womack, Y. Zhang, J. Kijas, N. Cockett, X. Xu, S. Zhao, J. Wang, W. Wang, Sequencing and automated whole-genome optical mapping of the genome of a domestic goat (Capra hircus). Nat. Biotechnol. 31, 135–141 (2013). doi:10.1038/nbt.2478 Medline

29. R. Geng, C. Yuan, Y. Chen, Exploring differentially expressed genes by RNA-Seq in cashmere goat (Capra hircus) skin during hair follicle development and cycling. PLOS ONE 8, e62704 (2013). doi:10.1371/journal.pone.0062704 Medline

30. R. A. Scholey, N. J. Evans, R. W. Blowey, J. P. Massey, R. D. Murray, R. F. Smith, W. E. Ollier, S. D. Carter, Identifying host pathogenic pathways in bovine digital dermatitis by RNA-Seq analysis. Vet. J. 197, 699–706 (2013). doi:10.1016/j.tvjl.2013.03.008 Medline

31. D. Takai, P. A. Jones, Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proc. Natl. Acad. Sci. U.S.A. 99, 3740–3745 (2002). doi:10.1073/pnas.052410099 Medline

32. R. Li, C. Yu, Y. Li, T. W. Lam, S. M. Yiu, K. Kristiansen, J. Wang, SOAP2: An improved ultrafast tool for short read alignment. Bioinformatics 25, 1966–1967 (2009). doi:10.1093/bioinformatics/btp336 Medline

33. R. Li, W. Fan, G. Tian, H. Zhu, L. He, J. Cai, Q. Huang, Q. Cai, B. Li, Y. Bai, Z. Zhang, Y. Zhang, W. Wang, J. Li, F. Wei, H. Li, M. Jian, J. Li, Z. Zhang, R. Nielsen, D. Li, W. Gu, Z. Yang, Z. Xuan, O. A. Ryder, F. C. Leung, Y. Zhou, J. Cao, X. Sun, Y. Fu, X. Fang, X. Guo, B. Wang, R. Hou, F. Shen, B. Mu, P. Ni, R. Lin, W. Qian, G. Wang, C. Yu, W. Nie, J. Wang, Z. Wu, H. Liang, J. Min, Q. Wu, S. Cheng, J. Ruan, M. Wang, Z. Shi, M. Wen, B. Liu, X. Ren, H. Zheng, D. Dong, K. Cook, G. Shan, H. Zhang, C. Kosiol, X. Xie, Z. Lu, H. Zheng, Y. Li, C. C. Steiner, T. T. Lam, S. Lin, Q. Zhang, G. Li, J. Tian, T. Gong, H. Liu, D. Zhang, L. Fang, C. Ye, J. Zhang, W. Hu, A. Xu, Y. Ren, G. Zhang, M. W. Bruford, Q. Li, L. Ma, Y. Guo, N. An, Y. Hu, Y. Zheng, Y. Shi, Z. Li, Q. Liu, Y. Chen, J. Zhao, N. Qu, S. Zhao, F. Tian, X. Wang, H. Wang, L. Xu, X. Liu, T. Vinar, Y. Wang, T. W. Lam, S. M. Yiu, S. Liu, H. Zhang, D. Li, Y. Huang, X. Wang, G. Yang, Z. Jiang, J. Wang, N. Qin, L. Li, J. Li, L. Bolund, K. Kristiansen, G. K. Wong, M. Olson, X. Zhang, S. Li, H. Yang, J. Wang, J. Wang, The sequence and de novo assembly of the giant panda genome. Nature 463, 311–317 (2010). doi:10.1038/nature08696 Medline

34. P. Laurent, L. Schibler, A. Vaiman, J. Laubier, C. Delcros, G. Cosseddu, D. Vaiman, E. P. Cribiu, M. Yerle, A 12 000-rad whole-genome radiation hybrid panel in sheep: Application to the study of the ovine chromosome 18 region containing a QTL for scrapie susceptibility. Anim. Genet. 38, 358–363 (2007). doi:10.1111/j.1365-2052.2007.01607.x Medline

35. C. H. Wu, W. Jin, K. Nomura, T. Goldammer, T. Hadfield, B. P. Dalrymple, S. McWilliam, J. F. Maddox, N. E. Cockett, A radiation hybrid comparative map of ovine chromosome 1 aligned to the virtual sheep genome. Anim. Genet. 40, 435–455 (2009). doi:10.1111/j.1365-2052.2009.01857.x Medline

36. J. D. Storey, R. Tibshirani, Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U.S.A. 100, 9440–9445 (2003). doi:10.1073/pnas.1530509100 Medline

37. T. Faraut, S. de Givry, P. Chabrier, T. Derrien, F. Galibert, C. Hitte, T. Schiex, A comparative genome approach to marker ordering. Bioinformatics 23, e50–e56 (2007). doi:10.1093/bioinformatics/btl321 Medline

38. S. de Givry, M. Bouchez, P. Chabrier, D. Milan, T. Schiex, CARHTA GENE: Multipopulation integrated genetic and radiation hybrid mapping. Bioinformatics 21, 1703–1704 (2005). doi:10.1093/bioinformatics/bti222 Medline

39. B. Servin, T. Faraut, N. Iannuccelli, D. Zelenika, D. Milan, High-resolution autosomal radiation hybrid maps of the pig genome and their contribution to the genome sequence assembly. BMC Genomics 13, 585 (2012). doi:10.1186/1471-2164-13-585 Medline

40. B. Servin, S. de Givry, T. Faraut, Statistical confidence measures for genome maps: Application to the validation of genome assemblies. Bioinformatics 26, 3035–3042 (2010). doi:10.1093/bioinformatics/btq598 Medline

41. J. D. White, P. G. Allingham, C. M. Gorman, D. L. Emery, P. Hynd, J. Owens, A. Bell, J. Siddell, G. Harper, B. J. Hayes, H. D. Daetwyler, J. Usmar, M. E. Goddard, J. M. Henshall, S. Dominik, H. Brewer, J. H. J. van der Werf, F. W. Nicholas, R. Warner, C. Hofmyer, T. Longhurst, T. Fisher, P. Swan, R. Forage, V. H. Oddy, Design and phenotyping procedures for recording wool, skin, parasite resistance, growth, carcass yield and quality traits of the SheepGENOMICS mapping flock. Anim. Prod. Sci. 52, 157–171 (2012). doi:10.1071/AN11085

42. J. E. Miller, S. C. Bishop, N. E. Cockett, R. A. McGraw, Segregation of natural and experimental gastrointestinal nematode infection in F2 progeny of susceptible Suffolk and resistant Gulf Coast Native sheep and its usefulness in assessment of genetic variation. Vet. Parasitol. 140, 83–89 (2006). doi:10.1016/j.vetpar.2006.02.043 Medline

43. T. C. Matise, M. Perlin, A. Chakravarti, Automated construction of genetic linkage maps using an expert system (MultiMap): A human genome linkage map. Nat. Genet. 6, 384–390 (1994). doi:10.1038/ng0494-384 Medline

44. P. Green, Construction and comparison of chromosome 21 radiation hybrid and linkage maps using CRI-MAP. Cytogenet. Cell Genet. 59, 122–124 (1992). doi:10.1159/000133221 Medline

45. A. V. Zimin, A. L. Delcher, L. Florea, D. R. Kelley, M. C. Schatz, D. Puiu, F. Hanrahan, G. Pertea, C. P. Van Tassell, T. S. Sonstegard, G. Marçais, M. Roberts, P. Subramanian, J. A. Yorke, S. L. Salzberg, A whole-genome assembly of the domestic cow, Bos taurus. Genome Biol. 10, R42 (2009). doi:10.1186/gb-2009-10-4-r42

46. R. L. Ge, Q. Cai, Y. Y. Shen, A. San, L. Ma, Y. Zhang, X. Yi, Y. Chen, L. Yang, Y. Huang, R. He, Y. Hui, M. Hao, Y. Li, B. Wang, X. Ou, J. Xu, Y. Zhang, K. Wu, C. Geng, W. Zhou, T. Zhou, D. M. Irwin, Y. Yang, L. Ying, H. Bao, J. Kim, D. M. Larkin, J. Ma, H. A. Lewin, J. Xing, R. N. Platt, 2nd, D. A. Ray, L. Auvil, B. Capitanu, X. Zhang, G. Zhang, R. W. Murphy, J. Wang, Y. P. Zhang, J. Wang, Draft genome sequence of the Tibetan antelope. Nat. Commun. 4, 1858 (2013). doi:10.1038/ncomms2860 Medline

47. G. E. Liu, M. Ventura, A. Cellamare, L. Chen, Z. Cheng, B. Zhu, C. Li, J. Song, E. E. Eichler, Analysis of recent segmental duplications in the bovine genome. BMC Genomics 10, 571 (2009). doi:10.1186/1471-2164-10-571 Medline

48. R. S. Harris, thesis, The Pennsylvania State University, University Park (2007).

49. D. M. Bickhart, Y. Hou, S. G. Schroeder, C. Alkan, M. F. Cardone, L. K. Matukumalli, J. Song, R. D. Schnabel, M. Ventura, J. F. Taylor, J. F. Garcia, C. P. Van Tassell, T. S. Sonstegard, E. E. Eichler, G. E. Liu, Copy number variation of individual cattle genomes using next-generation sequencing. Genome Res. 22, 778–790 (2012). doi:10.1101/gr.133967.111 Medline

50. E. T. Wang, R. Sandberg, S. Luo, I. Khrebtukova, L. Zhang, C. Mayr, S. F. Kingsmore, G. P. Schroth, C. B. Burge, Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476 (2008). doi:10.1038/nature07509 Medline

51. D. Kim, G. Pertea, C. Trapnell, H. Pimentel, R. Kelley, S. L. Salzberg, TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013). doi:10.1186/gb-2013-14-4-r36 Medline

52. C. Trapnell, B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. J. van Baren, S. L. Salzberg, B. J. Wold, L. Pachter, Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010). doi:10.1038/nbt.1621 Medline

53. I. Birol, S. D. Jackman, C. B. Nielsen, J. Q. Qian, R. Varhol, G. Stazyk, R. D. Morin, Y. Zhao, M. Hirst, J. E. Schein, D. E. Horsman, J. M. Connors, R. D. Gascoyne, M. A. Marra, S. J. Jones, De novo transcriptome assembly with ABySS. Bioinformatics 25, 2872–2877 (2009). doi:10.1093/bioinformatics/btp367 Medline

54. J. Gao, K. Liu, H. Liu, H. T. Blair, G. Li, C. Chen, P. Tan, R. Z. Ma, A complete DNA sequence map of the ovine major histocompatibility complex. BMC Genomics 11, 466 (2010). doi:10.1186/1471-2164-11-466 Medline

55. S. Kurtz, A. Phillippy, A. L. Delcher, M. Smoot, M. Shumway, C. Antonescu, S. L. Salzberg, Versatile and open software for comparing large genomes. Genome Biol. 5, R12 (2004). doi:10.1186/gb-2004-5-2-r12 Medline

56. G. Parra, K. Bradnam, I. Korf, CEGMA: A pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061–1067 (2007). doi:10.1093/bioinformatics/btm071 Medline

57. T. Wicker, F. Sabot, A. Hua-Van, J. L. Bennetzen, P. Capy, B. Chalhoub, A. Flavell, P. Leroy, M. Morgante, O. Panaud, E. Paux, P. SanMiguel, A. H. Schulman, A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet. 8, 973–982 (2007). doi:10.1038/nrg2165 Medline

58. N. Chen, Curr. Protoc. Bioinformatics, Chapter 4, Unit 4 10 (2004).

59. V. Curwen, E. Eyras, T. D. Andrews, L. Clarke, E. Mongin, S. M. Searle, M. Clamp, The Ensembl automatic gene annotation system. Genome Res. 14, 942–950 (2004). doi:10.1101/gr.1858004 Medline

60. UniProt Consortium, Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res. 41, (D1), D43–D47 (2013). doi:10.1093/nar/gks1068 Medline

61. A. Morgulis, E. M. Gertz, A. A. Schäffer, R. Agarwala, A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J. Comput. Biol. 13, 1028–1040 (2006). doi:10.1089/cmb.2006.13.1028 Medline

62. G. Benson, Tandem repeats finder: A program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999). doi:10.1093/nar/27.2.573 Medline

63. E. Birney, R. Durbin, Using GeneWise in the Drosophila annotation experiment. Genome Res. 10, 547–548 (2000). doi:10.1101/gr.10.4.547 Medline

64. J. E. Collins, S. White, S. M. J. Searle, D. L. Stemple, Incorporating RNA-seq data into the zebrafish Ensembl genebuild. Genome Res. 22, 2067–2078 (2012). doi:10.1101/gr.137901.112 Medline

65. H. Li, R. Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009). doi:10.1093/bioinformatics/btp324 Medline

66. G. S. Slater, E. Birney, Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31 (2005). doi:10.1186/1471-2105-6-31 Medline

67. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman, Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990). doi:10.1016/S0022-2836(05)80360-2 Medline

68. S. W. Burge, J. Daub, R. Eberhardt, J. Tate, L. Barquist, E. P. Nawrocki, S. R. Eddy, P. P. Gardner, A. Bateman, Rfam 11.0: 10 years of RNA families. Nucleic Acids Res. 41, (D1), D226–D232 (2013). doi:10.1093/nar/gks1005 Medline

69. A. Kozomara, S. Griffiths-Jones, miRBase: Integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res. 39, (Database), D152–D157 (2011). doi:10.1093/nar/gkq1027 Medline

70. X. J. Yu, H. K. Zheng, J. Wang, W. Wang, B. Su, Detecting lineage-specific adaptive evolution of brain-expressed genes in human using rhesus macaque as outgroup. Genomics 88, 745–751 (2006). doi:10.1016/j.ygeno.2006.05.008 Medline

71. J. Ruan, H. Li, Z. Chen, A. Coghlan, L. J. Coin, Y. Guo, J. K. Hériché, Y. Hu, K. Kristiansen, R. Li, T. Liu, A. Moses, J. Qin, S. Vang, A. J. Vilella, A. Ureta-Vidal, L. Bolund, J. Wang, R. Durbin, TreeFam: 2008 Update. Nucleic Acids Res. 36, (Database), D735–D740 (2008). doi:10.1093/nar/gkm1005 Medline

72. R. C. Edgar, MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004). doi:10.1093/nar/gkh340 Medline

73. J. P. Huelsenbeck, F. Ronquist, MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17, 754–755 (2001). doi:10.1093/bioinformatics/17.8.754 Medline

74. Z. Yang, PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13, 555–556 (1997). Medline

75. M. J. Benton, P. C. J. Donoghue, Paleontological evidence to date the tree of life. Mol. Biol. Evol. 24, 26–53 (2007). doi:10.1093/molbev/msl150 Medline

76. T. De Bie, N. Cristianini, J. P. Demuth, M. W. Hahn, CAFE: A computational tool for the study of gene family evolution. Bioinformatics 22, 1269–1271 (2006). doi:10.1093/bioinformatics/btl097 Medline

77. A. M. Szalkowski, Fast and robust multiple sequence alignment with phylogeny-aware gap placement. BMC Bioinformatics 13, 129 (2012). doi:10.1186/1471-2105-13-129

78. X. Huang, A. Madan, CAP3: A DNA sequence assembly program. Genome Res. 9, 868–877 (1999). doi:10.1101/gr.9.9.868 Medline

79. D. L. Adelson, G. R. Cam, U. DeSilva, I. R. Franklin, Gene expression in sheep skin and wool (hair). Genomics 83, 95–105 (2004). doi:10.1016/S0888-7543(03)00210-6 Medline

80. Y. Benjamini, T. P. Speed, Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 40, e72 (2012). doi:10.1093/nar/gks001

81. M. Nothnagel, A. Wolf, A. Herrmann, K. Szafranski, I. Vater, M. Brosch, K. Huse, R. Siebert, M. Platzer, J. Hampe, M. Krawczak, Statistical inference of allelic imbalance from transcriptome data. Hum. Mutat. 32, 98–106 (2011). doi:10.1002/humu.21396 Medline

82. R. She, J. S. Chu, K. Wang, J. Pei, N. Chen, GenBlastA: Enabling BLAST to identify homologous gene sequences. Genome Res. 19, 143–149 (2009). doi:10.1101/gr.082081.108 Medline

83. E. L. Sonnhammer, G. von Heijne, A. Krogh, A hidden Markov model for predicting transmembrane helices in protein sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol. 6, 175–182 (1998). Medline

84. A. L. Archibald, N. E. Cockett, B. P. Dalrymple, T. Faraut, J. W. Kijas, J. F. Maddox, J. C. McEwan, V. Hutton Oddy, H. W. Raadsma, C. Wade, J. Wang, W. Wang, X. Xun; International Sheep Genomics Consortium, The sheep genome reference sequence: A work in progress. Anim. Genet. 41, 449–453 (2010). doi:10.1111/j.1365-2052.2010.02100.x Medline

85. X. Xu, W. Chen, R. Talbot, K. Worley, Y. Jiang, W. Barris, B. Dalrymple, J. Maddox, T. Farault, R. Brauning, M. Xie, W. Zhang, A. Archibald, J. Kijas, N. Cockett, J. McEwan, H. Oddy, F. Nicholas, K. Kristensen, J. Wang, W. Wang, Genome data from the sheep. GigaScience (2011); http://dx.doi.org/10.5524/100023

86. S. L. Salzberg, A. M. Phillippy, A. Zimin, D. Puiu, T. Magoc, S. Koren, T. J. Treangen, M. C. Schatz, A. L. Delcher, M. Roberts, G. Marçais, M. Pop, J. A. Yorke, GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22, 557–567 (2012). doi:10.1101/gr.131383.111 Medline

87. J. W. Kijas, J. A. Lenstra, B. Hayes, S. Boitard, L. R. Porto Neto, M. San Cristobal, B. Servin, R. McCulloch, V. Whan, K. Gietzen, S. Paiva, W. Barendse, E. Ciani, H. Raadsma, J. McEwan, B. Dalrymple; International Sheep Genomics Consortium Members, Genome-wide analysis of the world’s sheep breeds reveals high levels of historic mixture and strong recent selection. PLOS Biol. 10, e1001258 (2012). doi:10.1371/journal.pbio.1001258 Medline

88. Q. Qiu, G. Zhang, T. Ma, W. Qian, J. Wang, Z. Ye, C. Cao, Q. Hu, J. Kim, D. M. Larkin, L. Auvil, B. Capitanu, J. Ma, H. A. Lewin, X. Qian, Y. Lang, R. Zhou, L. Wang, K. Wang, J. Xia, S. Liao, S. Pan, X. Lu, H. Hou, Y. Wang, X. Zang, Y. Yin, H. Ma, J. Zhang, Z. Wang, Y. Zhang, D. Zhang, T. Yonezawa, M. Hasegawa, Y. Zhong, W. Liu, Y. Zhang, Z. Huang, S. Zhang, R. Long, H. Yang, J. Wang, J. A. Lenstra, D. N. Cooper, Y. Wu, J. Wang, P. Shi, J. Wang, J. Liu, The yak genome and adaptation to life at high altitude. Nat. Genet. 44, 946–949 (2012). doi:10.1038/ng.2343 Medline

89. C. G. Elsik, R. L. Tellam, K. C. Worley, R. A. Gibbs, D. M. Muzny, G. M. Weinstock, D. L. Adelson, E. E. Eichler, L. Elnitski, R. Guigó, D. L. Hamernik, S. M. Kappes, H. A. Lewin, D. J. Lynn, F. W. Nicholas, A. Reymond, M. Rijnkels, L. C. Skow, E. M. Zdobnov, L. Schook, J. Womack, T. Alioto, S. E. Antonarakis, A. Astashyn, C. E. Chapple, H. C. Chen, J. Chrast, F. Câmara, O. Ermolaeva, C. N. Henrichsen, W. Hlavina, Y. Kapustin, B. Kiryutin, P. Kitts, F. Kokocinski, M. Landrum, D. Maglott, K. Pruitt, V. Sapojnikov, S. M. Searle, V. Solovyev, A. Souvorov, C. Ucla, C. Wyss, J. M. Anzola, D. Gerlach, E. Elhaik, D. Graur, J. T. Reese, R. C. Edgar, J. C. McEwan, G. M. Payne, J. M. Raison, T. Junier, E. V. Kriventseva, E. Eyras, M. Plass, R. Donthu, D. M. Larkin, J. Reecy, M. Q. Yang, L. Chen, Z. Cheng, C. G. Chitko-McKown, G. E. Liu, L. K. Matukumalli, J. Song, B. Zhu, D. G. Bradley, F. S. Brinkman, L. P. Lau, M. D. Whiteside, A. Walker, T. T. Wheeler, T. Casey, J. B. German, D. G. Lemay, N. J. Maqbool, A. J. Molenaar, S. Seo, P. Stothard, C. L. Baldwin, R. Baxter, C. L. Brinkmeyer-Langford, W. C. Brown, C. P. Childers, T. Connelley, S. A. Ellis, K. Fritz, E. J. Glass, C. T. Herzig, A. Iivanainen, K. K. Lahmers, A. K. Bennett, C. M. Dickens, J. G. Gilbert, D. E. Hagen, H. Salih, J. Aerts, A. R. Caetano, B. Dalrymple, J. F. Garcia, C. A. Gill, S. G. Hiendleder, E. Memili, D. Spurlock, J. L. Williams, L. Alexander, M. J. Brownstein, L. Guan, R. A. Holt, S. J. Jones, M. A. Marra, R. Moore, S. S. Moore, A. Roberts, M. Taniguchi, R. C. Waterman, J. Chacko, M. M. Chandrabose, A. Cree, M. D. Dao, H. H. Dinh, R. A. Gabisi, S. Hines, J. Hume, S. N. Jhangiani, V. Joshi, C. L. Kovar, L. R. Lewis, Y. S.

Liu, J. Lopez, M. B. Morgan, N. B. Nguyen, G. O. Okwuonu, S. J. Ruiz, J. Santibanez, R. A. Wright, C. Buhay, Y. Ding, S. Dugan-Rocha, J. Herdandez, M. Holder, A. Sabo, A. Egan, J. Goodell, K. Wilczek-Boney, G. R. Fowler, M. E. Hitchens, R. J. Lozado, C. Moen, D. Steffen, J. T. Warren, J. Zhang, R. Chiu, J. E. Schein, K. J. Durbin, P. Havlak, H. Jiang, Y. Liu, X. Qin, Y. Ren, Y. Shen, H. Song, S. N. Bell, C. Davis, A. J. Johnson, S. Lee, L. V. Nazareth, B. M. Patel, L. L. Pu, S. Vattathil, R. L. Williams, Jr., S. Curry, C. Hamilton, E. Sodergren, D. A. Wheeler, W. Barris, G. L. Bennett, A. Eggen, R. D. Green, G. P. Harhay, M. Hobbs, O. Jann, J. W. Keele, M. P. Kent, S. Lien, S. D. McKay, S. McWilliam, A. Ratnakumar, R. D. Schnabel, T. Smith, W. M. Snelling, T. S. Sonstegard, R. T. Stone, Y. Sugimoto, A. Takasuga, J. F. Taylor, C. P. Van Tassell, M. D. Macneil, A. R. Abatepaulo, C. A. Abbey, V. Ahola, I. G. Almeida, A. F. Amadio, E. Anatriello, S. M. Bahadue, F. H. Biase, C. R. Boldt, J. A. Carroll, W. A. Carvalho, E. P. Cervelatti, E. Chacko, J. E. Chapin, Y. Cheng, J. Choi, A. J. Colley, T. A. de Campos, M. De Donato, I. K. Santos, C. J. de Oliveira, H. Deobald, E. Devinoy, K. E. Donohue, P. Dovc, A. Eberlein, C. J. Fitzsimmons, A. M. Franzin, G. R. Garcia, S. Genini, C. J. Gladney, J. R. Grant, M. L. Greaser, J. A. Green, D. L. Hadsell, H. A. Hakimov, R. Halgren, J. L. Harrow, E. A. Hart, N. Hastings, M. Hernandez, Z. L. Hu, A. Ingham, T. Iso-Touru, C. Jamis, K. Jensen, D. Kapetis, T. Kerr, S. S. Khalil, H. Khatib, D. Kolbehdari, C. G. Kumar, D. Kumar, R. Leach, J. C. Lee, C. Li, K. M. Logan, R. Malinverni, E. Marques, W. F. Martin, N. F. Martins, S. R. Maruyama, R. Mazza, K. L. McLean, J. F. Medrano, B. T. Moreno, D. D. Moré, C. T. Muntean, H. P. Nandakumar, M. F. Nogueira, I. Olsaker, S. D. Pant, F. Panzitta, R. C. Pastor, M. A. Poli, N. Poslusny, S. Rachagani, S. Ranganathan, A. Razpet, P. K. Riggs, G. Rincon, N. Rodriguez-Osorio, S. L. Rodriguez-Zas, N. E. 
Romero, A. Rosenwald, L. Sando, S. M. Schmutz, L. Shen, L. Sherman, B. R. Southey, Y. S. Lutzow, J. V. Sweedler, I. Tammen, B. P. Telugu, J. M. Urbanski, Y. T. Utsunomiya, C. P. Verschoor, A. J. Waardenberg, Z. Wang, R. Ward, R. Weikard, T. H. Welsh, Jr., S. N. White, L. G. Wilming, K. R. Wunderlich, J. Yang, F. Q. Zhao; Bovine Genome Sequencing and Analysis Consortium, The genome sequence of taurine cattle: A window to ruminant biology and evolution. Science 324, 522–528 (2009). doi:10.1126/science.1169588 Medline

90. E. Gootwine, Placental hormones and fetal-placental development. Anim. Reprod. Sci. 82-83, 551–566 (2004). doi:10.1016/j.anireprosci.2004.04.008 Medline

91. C. M. Wade, E. Giulotto, S. Sigurdsson, M. Zoli, S. Gnerre, F. Imsland, T. L. Lear, D. L. Adelson, E. Bailey, R. R. Bellone, H. Blöcker, O. Distl, R. C. Edgar, M. Garber, T. Leeb, E. Mauceli, J. N. MacLeod, M. C. Penedo, J. M. Raison, T. Sharpe, J. Vogel, L. Andersson, D. F. Antczak, T. Biagi, M. M. Binns, B. P. Chowdhary, S. J. Coleman, G. Della Valle, S. Fryc, G. Guérin, T. Hasegawa, E. W. Hill, J. Jurka, A. Kiialainen, G. Lindgren, J. Liu, E. Magnani, J. R. Mickelson, J. Murray, S. G. Nergadze, R. Onofrio, S. Pedroni, M. F. Piras, T. Raudsepp, M. Rocchi, K. H. Røed, O. A. Ryder, S. Searle, L. Skow, J. E. Swinburne, A. C. Syvänen, T. Tozaki, S. J. Valberg, M. Vaudin, J. R. White, M. C. Zody, E. S. Lander, K. Lindblad-Toh, Broad Institute Genome Sequencing Platform, Broad Institute Whole Genome Assembly Team, Genome sequence, comparative analysis, and population genetics of the domestic horse. Science 326, 865–867 (2009). doi:10.1126/science.1178158 Medline

92. J. M. Kidd, S. Gravel, J. Byrnes, A. Moreno-Estrada, S. Musharoff, K. Bryc, J. D. Degenhardt, A. Brisbin, V. Sheth, R. Chen, S. F. McLaughlin, H. E. Peckham, L. Omberg, C. A. Bormann Chung, S. Stanley, K. Pearlstein, E. Levandowsky, S. Acevedo-Acevedo, A. Auton, A. Keinan, V. Acuña-Alonzo, R. Barquera-Lozano, S. Canizales-Quinteros, C. Eng, E. G. Burchard, A. Russell, A. Reynolds, A. G. Clark, M. G. Reese, S. E. Lincoln, A. J. Butte, F. M. De La Vega, C. D. Bustamante, Population genetic inference from personal genome data: Impact of ancestry and admixture on human genomic variation. Am. J. Hum. Genet. 91, 660–671 (2012). doi:10.1016/j.ajhg.2012.08.025 Medline

93. M. A. M. Groenen, A. L. Archibald, H. Uenishi, C. K. Tuggle, Y. Takeuchi, M. F. Rothschild, C. Rogel-Gaillard, C. Park, D. Milan, H. J. Megens, S. Li, D. M. Larkin, H. Kim, L. A. Frantz, M. Caccamo, H. Ahn, B. L. Aken, A. Anselmo, C. Anthon, L. Auvil, B. Badaoui, C. W. Beattie, C. Bendixen, D. Berman, F. Blecha, J. Blomberg, L. Bolund, M. Bosse, S. Botti, Z. Bujie, M. Bystrom, B. Capitanu, D. Carvalho-Silva, P. Chardon, C. Chen, R. Cheng, S. H. Choi, W. Chow, R. C. Clark, C. Clee, R. P. Crooijmans, H. D. Dawson, P. Dehais, F. De Sapio, B. Dibbits, N. Drou, Z. Q. Du, K. Eversole, J. Fadista, S. Fairley, T. Faraut, G. J. Faulkner, K. E. Fowler, M. Fredholm, E. Fritz, J. G. Gilbert, E. Giuffra, J. Gorodkin, D. K. Griffin, J. L. Harrow, A. Hayward, K. Howe, Z. L. Hu, S. J. Humphray, T. Hunt, H. Hornshøj, J. T. Jeon, P. Jern, M. Jones, J. Jurka, H. Kanamori, R. Kapetanovic, J. Kim, J. H. Kim, K. W. Kim, T. H. Kim, G. Larson, K. Lee, K. T. Lee, R. Leggett, H. A. Lewin, Y. Li, W. Liu, J. E. Loveland, Y. Lu, J. K. Lunney, J. Ma, O. Madsen, K. Mann, L. Matthews, S. McLaren, T. Morozumi, M. P. Murtaugh, J. Narayan, D. T. Nguyen, P. Ni, S. J. Oh, S. Onteru, F. Panitz, E. W. Park, H. S. Park, G. Pascal, Y. Paudel, M. Perez-Enciso, R. Ramirez-Gonzalez, J. M. Reecy, S. Rodriguez-Zas, G. A. Rohrer, L. Rund, Y. Sang, K. Schachtschneider, J. G. Schraiber, J. Schwartz, L. Scobie, C. Scott, S. Searle, B. Servin, B. R. Southey, G. Sperber, P. Stadler, J. V. Sweedler, H. Tafer, B. Thomsen, R. Wali, J. Wang, J. Wang, S. White, X. Xu, M. Yerle, G. Zhang, J. Zhang, J. Zhang, S. Zhao, J. Rogers, C. Churcher, L. B. Schook, Analyses of pig genomes provide insight into porcine demography and evolution. Nature 491, 393–398 (2012). doi:10.1038/nature11622 Medline

94. R. N. Kim, D. S. Kim, S. H. Choi, B. H. Yoon, A. Kang, S. H. Nam, D. W. Kim, J. J. Kim, J. H. Ha, A. Toyoda, A. Fujiyama, A. Kim, M. Y. Kim, K. H. Park, K. S. Lee, H. S. Park, Genome analysis of the domestic dog (Korean Jindo) by massively parallel sequencing. DNA Res. 19, 275–288 (2012). doi:10.1093/dnares/dss011 Medline

95. J. W. Kijas, D. Townley, B. P. Dalrymple, M. P. Heaton, J. F. Maddox, A. McGrath, P. Wilson, R. G. Ingersoll, R. McCulloch, S. McWilliam, D. Tang, J. McEwan, N. Cockett, V. H. Oddy, F. W. Nicholas, H. Raadsma; International Sheep Genomics Consortium, A genome wide survey of SNP variation reveals the genetic structure of sheep breeds. PLOS ONE 4, e4668 (2009). doi:10.1371/journal.pone.0004668 Medline

96. L. Iannuzzi, G. P. Di Meo, Chromosomal evolution in bovids: A comparison of cattle, sheep and goat G- and R-banded chromosomes and cytogenetic divergences among cattle, goat and river buffalo sex chromosomes. Chromosome Res. 3, 291–299 (1995). doi:10.1007/BF00713067 Medline

97. J. F. Maddox, A presentation of the differences between the sheep and goat genetic maps. Genet. Sel. Evol. 37, (Suppl 1), S1–S10 (2005). doi:10.1186/1297-9686-37-S1-S1

98. T. Goldammer, R. M. Brunner, A. Rebl, C. H. Wu, K. Nomura, T. Hadfield, J. F. Maddox, N. E. Cockett, Cytogenetic anchoring of radiation hybrid and virtual maps of sheep chromosome X and comparison of X chromosomes in sheep, cattle, and human. Chromosome Res. 17, 497–506 (2009). doi:10.1007/s10577-009-9047-9 Medline

99. A. S. Van Laere, W. Coppieters, M. Georges, Characterization of the bovine pseudoautosomal boundary: Documenting the evolutionary history of mammalian sex chromosomes. Genome Res. 18, 1884–1895 (2008). doi:10.1101/gr.082487.108 Medline

100. R. L. Jirtle, J. R. Weidman, Imprinted and more equal. Am. Sci. 95, 143–149 (2007). doi:10.1511/2007.64.1019

101. I. M. Morison, J. P. Ramsay, H. G. Spencer, A census of mammalian imprinting. Trends Genet. 21, 457–465 (2005). doi:10.1016/j.tig.2005.06.008 Medline

102. N. E. Cockett, M. A. Smit, C. A. Bidwell, K. Segers, T. L. Hadfield, G. D. Snowder, M. Georges, C. Charlier, The callipyge mutation and other genes that affect muscle hypertrophy in sheep. Genet. Sel. Evol. 37, (Suppl 1), S65–S81 (2005). doi:10.1186/1297-9686-37-S1-S65 Medline

103. E. A. Glazov, S. McWilliam, W. C. Barris, B. P. Dalrymple, Origin, evolution, and biological role of miRNA cluster in DLK-DIO3 genomic region in placental mammals. Mol. Biol. Evol. 25, 939–948 (2008). doi:10.1093/molbev/msn045 Medline

104. T. M. Jermann, J. G. Opitz, J. Stackhouse, S. A. Benner, Reconstructing the evolutionary history of the artiodactyl ribonuclease superfamily. Nature 374, 57–59 (1995). doi:10.1038/374057a0 Medline

105. D. E. Dobson, E. M. Prager, A. C. Wilson, Stomach lysozymes of ruminants. I. Distribution and catalytic properties. J. Biol. Chem. 259, 11607–11616 (1984). Medline

106. H. R. Ibrahim, U. Thomas, A. Pellegrini, A helix-loop-helix peptide at the upper lip of the active site cleft of lysozyme confers potent antimicrobial activity with membrane permeabilization action. J. Biol. Chem. 276, 43767–43774 (2001). doi:10.1074/jbc.M106317200 Medline

107. B. J. Norris, V. A. Whan, A gene duplication affecting expression of the ovine ASIP gene is responsible for white and black sheep. Genome Res. 18, 1282–1293 (2008). doi:10.1101/gr.072090.107 Medline

108. B. P. Telugu, A. M. Walker, J. A. Green, Characterization of the bovine pregnancy-associated glycoprotein gene family: Analysis of gene sequences, regulatory regions within the promoter and expression of selected genes. BMC Genomics 10, 185 (2009). doi:10.1186/1471-2164-10-185 Medline

109. J. A. Green, S. Xie, X. Quan, B. Bao, X. Gan, N. Mathialagan, J. F. Beckers, R. M. Roberts, Pregnancy-associated bovine and ovine glycoproteins exhibit spatially and temporally distinct expression patterns during pregnancy. Biol. Reprod. 62, 1624–1631 (2000). doi:10.1095/biolreprod62.6.1624 Medline

110. K. Koshi, K. Ushizawa, K. Kizaki, T. Takahashi, K. Hashizume, Expression of endogenous retrovirus-like transcripts in bovine trophoblastic cells. Placenta 32, 493–499 (2011). doi:10.1016/j.placenta.2011.04.002 Medline

111. R. Oko, C. R. Morales, A novel testicular protein, with sequence similarities to a family of lipid binding proteins, is a major component of the rat sperm perinuclear theca. Dev. Biol. 166, 235–245 (1994). doi:10.1006/dbio.1994.1310 Medline

112. W. S. Lagakos, X. Guan, S. Y. Ho, L. R. Sawicki, B. Corsico, S. Kodukula, K. Murota, R. E. Stark, J. Storch, Liver fatty acid-binding protein binds monoacylglycerol in vitro and in mouse liver cytosol. J. Biol. Chem. 288, 19805–19815 (2013). doi:10.1074/jbc.M113.473579 Medline

