Transcriptome Analysis of Extant Cotton Progenitors and Identification of Genome-Specific Single Nucleotide Polymorphisms (GNPs)
Gyoungju Nah (Dr. Z. Jeffrey Chen Lab)
University of Texas at Austin
A and D genomes as Extant Parents of AADD allotetraploid
1. How do the transcriptomes of G. arboreum (AA) and G. raimondii (DD) differ? Answering this needs AA and DD EST information.
(Hovav et al. 2008)
[Figure: time labels ~7 MYA and ~1 MYA]
Allelic Expression During Cotton Fiber Development
2. How does A- and D-allelic expression contribute to fiber development in allopolyploids? Answering this needs GNP information.
TC67135 (cyclin D) (Yang et al. 2006): A-allele-specific expression in fiber-bearing ovules
[Figure: gel with lanes labeled A, D, AD, A, A, A, A, A, A, A]
[Figure: four cotton species, each with a 1 cm scale bar]
G. arboreum (A2): African-Asian A-genome
G. raimondii (D5): New World D-genome
G. hirsutum (AADD)1 and G. barbadense (AADD)2: New World AD-genome allotetraploids
Work Flow
454/Roche Titanium sequencing:
  G. arboreum (AA): 1,699,776 reads
  G. raimondii (DD): 1,464,815 reads
  Tissues: young leaves, roots, bolls, ovules, and fibers
Assembly of 454 reads:
  Assembly               AA contigs (avg. length)    DD contigs (avg. length)
  Chen lab               62,609 (1,032 bp)           34,908 (1,107 bp)
  Udall lab              89,185 (629 bp)             68,984 (676 bp)
  Merged (Chen + Udall)  89,588 (806 bp)             65,542 (840 bp)
Merging the two labs' assemblies yielded more contigs than the Chen lab assembly and longer average contigs than the Udall lab assembly, for both AA and DD ESTs
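The contig counts and average lengths reported for each assembly are the kind of summary that can be recomputed from a FASTA file of contigs. The following is a minimal sketch, not the pipeline used for these slides; the function names and the toy sequences are hypothetical.

```python
def contig_lengths(fasta_text):
    """Return the length of every sequence in FASTA-formatted text."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(fasta_text):
    """Number of contigs and their average length in bp."""
    lengths = contig_lengths(fasta_text)
    avg = sum(lengths) / len(lengths) if lengths else 0.0
    return {"contigs": len(lengths), "avg_bp": avg}


# Toy input (hypothetical): two contigs of 12 bp and 4 bp.
toy = ">c1\nACGTACGT\nACGT\n>c2\nACGT\n"
print(summarize(toy))
```

Running the same summary on each lab's assembly and on the merged set would give the three rows of numbers compared on this slide.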
Contig Size Distribution of AA and DD EST libraries
By contig count, the A transcriptome assembly is larger than the D assembly (89,588 vs. 65,542 contigs; D is ~27% smaller)
The majority of A (88%) and D (84%) contigs range from 200 to 1,500 bp
[Figure: number of contigs vs. contig size (bp) for A and D; totals: A 89,588, D 65,542]
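The ~27% size difference on this slide can be related to the merged contig totals (89,588 for A, 65,542 for D). The sketch below shows the two ways of expressing that difference; that the slide's percentage was derived from contig counts rather than total bases is an assumption, and the helper names are hypothetical.

```python
def percent_more(a, b):
    """How much larger a is than b, as a percentage of b."""
    return 100.0 * (a - b) / b


def percent_fewer(a, b):
    """How much smaller b is than a, as a percentage of a."""
    return 100.0 * (a - b) / a


A_CONTIGS, D_CONTIGS = 89_588, 65_542  # merged contig totals from the slides

print(round(percent_more(A_CONTIGS, D_CONTIGS), 1))   # 36.7: A relative to D
print(round(percent_fewer(A_CONTIGS, D_CONTIGS), 1))  # 26.8: D relative to A
```

The choice of baseline matters: the same pair of counts gives about 27% when D is measured against A and about 37% when A is measured against D.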
AA and DD EST Coverage
Query    Subject    Matched    Unmatched
A        CGI11      61.7%      38.3%
D        CGI11      70.1%      29.9%
CGI11    A          81.2%      18.8%
CGI11    D          82.4%      17.6%
BlastN, E-value cutoff e-10; CGI11 contains 117,992 contigs from a mixture of AA, DD, and AADD ESTs
The new A and D ESTs include ~80% of the entries in CGI11
The new A and D ESTs provide an additional ~30-38% of ESTs that are not present in CGI11
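The matched and unmatched percentages come from counting how many query contigs have at least one BlastN hit under the E-value cutoff. Below is a minimal sketch of that count, assuming the hits are saved in BLAST tabular format (`-outfmt 6`), where the query ID is the first column and the E-value the eleventh. The function name and the toy hit lines are hypothetical, and this is not necessarily how the slide's numbers were produced.

```python
def matched_fraction(query_ids, blast_tab, max_evalue=1e-10):
    """Fraction of query IDs with at least one hit at or below max_evalue."""
    matched = set()
    for line in blast_tab.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:
            matched.add(fields[0])
    queries = set(query_ids)
    return len(matched & queries) / len(queries)


# Toy data (hypothetical IDs): one strong hit and one hit above the cutoff.
toy_hits = (
    "A1\tCGI_1\t98.0\t500\t10\t0\t1\t500\t1\t500\t1e-50\t900\n"
    "A2\tCGI_7\t85.0\t60\t9\t0\t1\t60\t1\t60\t1e-03\t50\n"
)
print(matched_fraction(["A1", "A2", "A3", "A4"], toy_hits))
```

Swapping query and subject (CGI11 against A or D) and rerunning the same count gives the other direction of the comparison.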
Redundancy in AA and DD ESTs
The A and D transcriptomes are highly redundant, indicating that ~50% of contigs represent isoforms or paralogs in the cotton genome
[Figure: number of contigs in A and D before and after BlastN (E-value cutoff e-100); before: 100% for both; after: A 48.3%, D 42.7%]
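One way to turn a self-BlastN result into a redundancy estimate is to group contigs that hit each other and count the groups. The slides do not say how redundant contigs were collapsed, so the single-linkage grouping below is an assumption, and all names and toy IDs are hypothetical.

```python
def count_nonredundant(contig_ids, self_hits):
    """Count groups after linking contigs that hit each other.

    self_hits is an iterable of (query_id, subject_id) pairs that have
    already passed the E-value cutoff; self-matches are ignored.
    """
    parent = {c: c for c in contig_ids}

    def find(x):
        # Follow parent links to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for q, s in self_hits:
        if q != s and q in parent and s in parent:
            parent[find(q)] = find(s)
    return len({find(c) for c in contig_ids})


# Toy data: c1, c2, c3 form one group; c4 and c5 stand alone.
ids = ["c1", "c2", "c3", "c4", "c5"]
hits = [("c1", "c1"), ("c1", "c2"), ("c2", "c3"), ("c4", "c4")]
groups = count_nonredundant(ids, hits)
print(groups, f"{100 * groups / len(ids):.1f}% remain after collapsing")
```

Single linkage is the most aggressive grouping choice, since one shared hit is enough to merge two groups; a stricter rule would leave more contigs counted as distinct.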
Estimation of Diversification of AA and DD ESTs
Reciprocal BlastN, E-value cutoff e-10
Query    Subject    Conserved    Diversified
A        D          73.2%        26.8%
D        A          80.8%        19.2%
Neither library covers the entire transcriptome
Diversification was estimated as ~27% in A and ~19% in D
AA and DD ESTs Against Known Protein Databases
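[Figure, panels (A) and (B): database matches and category percentages for A and D contigs]
Legend: matched TAIR10 peptide (E-10), matched UniProt (E-10), matched PfamA (E-05), unmatched
A: 35,335 (39.4%); 846 (0.9%); 10,493 (11.7%); 42,914 (47.9%)
D: 33,333 (50.9%); 24,581 (37.5%); 6,300 (9.6%); 1,328 (2%)
Category percentages, first series: 28.3, 27.4, 10.6, 6.8, 4.1, 4.3, 4.5, 4.2, 3.4, 2.5, 2.3, 1, 0.4
Category percentages, second series: 28.2, 27.4, 10.6, 6.6, 4.2, 4.4, 4.7, 4.4, 3.6, 2.4, 2.3, 0.9, 0.4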
D contains a higher proportion of Arabidopsis thaliana (Ath) protein homologs than A
Both A and D ESTs are enriched for the cellular process and metabolic process categories
miRNA Targets in AA and DD ESTs
[Figure (A): A and D, with values 127, 115, 106]
[Figure (B): frequency of miRNA targets: A 242/89,588 contigs (0.27%); D 233/65,542 contigs (0.36%)]
The frequency of miRNA targets is higher in DD than in AA ESTs (0.36% vs. 0.27%), suggesting that miRNAs might play an important role in D-allele regulation in the allotetraploid
Selection of High Quality GNPs
[Figure: example positions; Position A: high quality, Position B: low quality]
GNP selection criteria: >= 8X coverage, >= 90% consensus, Q >= 25
Number of GNP-containing contigs is 11,000
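The three selection criteria can be read as a per-position filter on the reads of each genome. The sketch below is one possible reading, not the pipeline itself: it assumes Q >= 25 is applied to each read base before counting coverage, and that a GNP is called only when both genomes pass and their consensus bases differ. All function names are hypothetical.

```python
from collections import Counter


def consensus_base(bases, quals, min_cov=8, min_consensus=0.90, min_q=25):
    """Return the consensus base at one position, or None if it fails a filter."""
    # Assumption: the quality cutoff is applied per read base.
    kept = [b for b, q in zip(bases, quals) if q >= min_q]
    if len(kept) < min_cov:
        return None  # fewer than 8 usable reads
    base, count = Counter(kept).most_common(1)[0]
    if count / len(kept) < min_consensus:
        return None  # reads disagree too much within this genome
    return base


def call_gnp(a_column, d_column):
    """Return (A_base, D_base) if both genomes pass and differ, else None."""
    a_base = consensus_base(*a_column)
    d_base = consensus_base(*d_column)
    if a_base is None or d_base is None or a_base == d_base:
        return None
    return (a_base, d_base)


# Toy columns: (bases, qualities) for the A reads and the D reads.
high_quality = call_gnp(("G" * 10, [30] * 10), ("A" * 10, [30] * 10))
low_coverage = call_gnp(("G" * 5, [30] * 5), ("A" * 10, [30] * 10))
print(high_quality, low_coverage)
```

Requiring both genomes to pass independently keeps sequencing errors in one library from being mistaken for a genome-specific difference.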
Allele-Separable Genes-I

Epigenetic-associated genes
Cotton EST ID     TAIR ID        Gene Name
UDcontig30230     AT1G48410.1    AGO1
cuDContig2017     AT1G48410.1    AGO1
cuDContig12969    AT1G48410.1    AGO1
cuDContig19558    AT1G31280.1    AGO2
cuDContig4863     AT2G27040.2    OCP11
cuDContig7218     AT2G27040.2    OCP11
cuDContig17255    AT2G27880.1    AGO5
UDcontig10468     AT1G01040.2    SUS1
cuDContig12083    AT3G03300.3    DCL2
UDcontig8529      AT3G03300.3    DCL2
cuDContig3778     AT1G14790.1    RDR1
cuDContig17499    AT5G14620.1    DRM2
cuDContig11889    AT4G19020.1    CMT2
cuDContig7567     AT1G69770.1    CMT3
cuDContig13379    AT1G77300.1    SDG8
cuDContig637      AT1G73100.1    SUVH3
UDcontig12385     AT2G22740.1    SUVH6
cuDContig13480    AT3G12680.1    HUA1
cuDContig19226    AT1G05460.1    SDE3
cuDContig7543     AT1G01920.2    SET-domain
cuDContig6339     AT1G05120.1    SNF2-domain

Clock-related genes
Cotton EST ID     TAIR ID        Gene Name
UDcontig31044     AT2G46830.1    CCA1
CDcontig25809     AT1G01060.4    LHY1
UDcontig11814     AT1G01060.4    LHY1
UDcontig14538     AT1G01060.5    LHY1
Allele-Separable Genes-II

Myb-related genes
Cotton EST ID     TAIR ID        Gene Name
cuDContig15019    AT1G22640.1    MYB3
cuDContig7688     AT1G68670.1    MYB-domain
cuDContig4930     AT1G74840.1    MYB-domain
cuDContig5155     AT2G01060.1    MYB-domain
UDcontig50103     AT2G03500.1    MYB-domain
cuDContig13445    AT2G23290.1    AtMYB70
cuDContig3509     AT2G38090.1    MYB-domain
cuDContig517      AT2G38090.1    MYB-domain
cuDContig13450    AT2G38090.1    MYB-domain
cuDContig4163     AT2G47190.1    MYB2
cuDContig13595    AT3G09600.1    MYB-domain
cuDContig1160     AT3G10760.1    MYB-domain
cuDContig3516     AT3G13040.2    MYB-domain
cuDContig15142    AT3G18100.1    MYB4R1
cuDContig5233     AT4G09460.1    AtMYB6
UDcontig45109     AT4G32730.2    PC-MYB1
cuDContig16746    AT4G32730.2    PC-MYB1
cuDContig3991     AT4G37260.1    MYB73
cuDContig770      AT4G38620.1    MYB4
cuDContig16699    AT5G04760.1    MYB-domain
cuDContig16484    AT5G15310.2    ATMYB16
UDcontig12659     AT5G45420.1    MYB-domain
cuDContig19287    AT5G52660.1    MYB-domain
cuDContig5755     AT5G52660.2    MYB-domain
cuDContig16730    AT5G67300.1    MYBR1

Fiber-related genes
Cotton EST ID     TAIR ID        Gene Name
cuDContig1242     AT1G05010.1    EFE
cuDContig2761     AT1G05010.1    EFE
cuDContig4895     AT1G07890.8    MEE6
cuDContig7679     AT1G12910.1    ATAN11
UDcontig42026     AT1G62660.1    BFRUCT3
cuDContig4618     AT2G01570.1    RGA1
cuDContig14416    AT2G01570.1    RGA1
cuDContig1899     AT2G28950.1    ATHEXP
cuDContig3387     AT2G40610.1    EXP8
cuDContig7169     AT3G43190.1    SUS4
cuDContig12426    AT4G03010.1    Leucine-rich
cuDContig5067     AT4G22880.2    TT18
cuDContig3717     AT5G13710.2    SMT1
cuDContig389      AT5G24520.2    TTG1
cuDContig5774     AT5G25610.1    RD22
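A gene is allele-separable when at least one GNP lets a transcript from the allotetraploid be traced to its A or D copy. The sketch below illustrates the idea for a single read; the voting rule, the function name, and the toy positions are hypothetical and are not taken from the slides.

```python
def assign_allele(read_bases, gnps):
    """Assign a read to the A or D copy of a gene using GNP positions.

    read_bases: dict mapping contig position -> base observed in the read.
    gnps: dict mapping contig position -> (A_base, D_base).
    Returns "A", "D", or "ambiguous".
    """
    a_votes = d_votes = 0
    for pos, (a_base, d_base) in gnps.items():
        base = read_bases.get(pos)  # None if the read does not cover this GNP
        if base == a_base:
            a_votes += 1
        elif base == d_base:
            d_votes += 1
    if a_votes > 0 and d_votes == 0:
        return "A"
    if d_votes > 0 and a_votes == 0:
        return "D"
    return "ambiguous"  # no GNP covered, or conflicting evidence


# Toy GNPs on one contig: position -> (base in A, base in D).
gnps = {120: ("G", "A"), 305: ("C", "T")}
print(assign_allele({120: "G", 305: "C"}, gnps))
print(assign_allele({120: "A"}, gnps))
print(assign_allele({120: "G", 305: "T"}, gnps))
```

Counting the reads assigned to each copy across a tissue sample would give the A- and D-allelic expression levels that the opening questions ask about.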
GNP Identification and Characterization
GNP = SNP + Indel: 34,059 SNPs + 926 indels = 34,985 GNPs
[Figure (A): SNP and indel counts; y-axis: number of contigs (ticks 5,000 to 35,000); other labels: 3,277 and 4,822]
[Figure (B): frequency vs. bp/SNP (ticks 200 to 1,400)]
GNP Experimental Validation
By X. Guan
[Figure (A): gel for G. arboreum (G. a), G. raimondii (G. r), and G. hirsutum (G. h); lanes: M, 16, 17, 20, 21, 22, 27, 29, 34, M, M, 39, 40, 41, 44, 45, 46, 47, 48, M]
[Figure (B): Exp#48 (CDcontig15250, cuAContig1068) and Exp#22 (cuDContig11665, cuAContig4409), each tested in G. arboreum, G. raimondii, and G. hirsutum; band labels A, D, and AD; size markers 200 bp and 400 bp]
Conclusions
We generated AA and DD EST libraries from extant progenitors of allotetraploid AADD cotton, which provide an important genomic resource for cotton fiber research and crop improvement
Comparative analysis of AA and DD ESTs provided new insights into transcriptome divergence between the G. arboreum (AA) and G. raimondii (DD) genomes
Analysis of miRNA targets in AA and DD ESTs suggests that miRNA-mediated gene regulation plays a role in the expression of target genes from the A and D subgenomes in allopolyploids
We developed a GNP pipeline that can discriminate between AA and DD alleles for a large number of ESTs (~11,000), including many involved in fiber development and epigenetic pathways
Acknowledgement
University of Texas at Austin: Dr. Z. Jeffrey Chen, Dr. Yuki Guan
Brigham Young University: Dr. Joshua Udall
Texas A&M University: Dr. David Stelly
UT GSAF: Dr. Scott Hunicke-Smith
Texas Advanced Computing Center