doi: 10.1007/978-1-0716-0621-6_27

Date post: 11-Apr-2022

Chapter 27

Bioinformatics Analysis of Plant Cell Wall Evolution

Elisabeth Fitzek, Rhiannon Balazic, and Yanbin Yin

Abstract

Over the past hundreds of millions of years, from green algae to land plants, cell walls have developed into a highly complex structure that is essential for plant growth and survival. Plant cell wall diversity and evolution can be directly investigated by chemically profiling polysaccharides and lignins in the cell walls of diverse plants and algae. With the increasingly low cost and high throughput of DNA sequencing technologies, cell wall evolution can also be studied by bioinformatics analysis of the occurrence of cell wall synthesis-related enzymes in the genomes and transcriptomes of different species. This chapter presents a bioinformatics workflow running on a Linux platform to process genomic data for such gene occurrence analysis. As a case study, cellulose synthase (CesA) and CesA-like (Csl) protein families are mined in two newly sequenced organisms: the charophyte green alga Klebsormidium flaccidum (renamed as Klebsormidium nitens) and the fern Lygodium japonicum.

Key words Cellulose synthesis, Hemicellulose synthesis, CesA, Csl, GT2, Plant cell walls

1 Introduction

Celluloses, lignins, hemicelluloses, and pectins are essential building components of plant cell walls. They give developing plant cells their shape and structural support, and act as barriers against insects and pathogens. Carbohydrate-active enzymes (CAZymes) are responsible for the biosynthesis, degradation, and modification of cell wall components [1]. CAZymes comprise six classes: glycosyltransferases (GTs), glycoside hydrolases (GHs), polysaccharide lyases (PLs), carbohydrate esterases (CEs), carbohydrate-binding modules (CBMs), and the newest class, auxiliary activities (AAs).

With the advent of low-cost, high-throughput, and high-accuracy next-generation sequencing technologies, ~100 plant genomes have been sequenced. Many of these genomes are from model organisms. For nonmodel organisms, a large number of transcriptomes are available in public genomic databases such as the National Center for Biotechnology Information (NCBI)’s sequence read archive (SRA) database and transcriptome shotgun assembly (TSA) database. Websites such as Phytozome host genomes spanning the plant kingdom from aquatic algae to early land plants to more advanced flowering plants [2].

Zoe A. Popper (ed.), The Plant Cell Wall: Methods and Protocols, Methods in Molecular Biology, vol. 2149, https://doi.org/10.1007/978-1-0716-0621-6_27, © Springer Science+Business Media, LLC, part of Springer Nature 2020

Two approaches have been used to study the diversity and evolution of plant cell walls [3, 4]. The first, the polysaccharide profiling approach, uses chemical and biochemical techniques to probe the polysaccharide compositions in different plant and algal taxonomic groups. These techniques can directly determine the compositions and structures of polysaccharides in the walls. Nevertheless, they are not suitable for large-scale sampling of many organisms and tissues because of the high labor, time, and financial costs. The second, the gene occurrence approach, performs data mining of genomes and transcriptomes to investigate the presence/absence of the enzymes responsible for the synthesis of certain polysaccharides. It is an indirect approach and has to rely on preexisting knowledge about which enzymes catalyze the biosynthesis of which plant cell wall biopolymers. However, it is much cheaper and faster than the polysaccharide profiling approach, as DNA/RNA sequencing becomes increasingly less expensive. After genomic sequence data is obtained, bioinformatics data mining techniques play a key role in identifying orthologous genes, because analyzing genomes and transcriptomes requires in-depth bioinformatics data analyses [5]. Overall, these two approaches are complementary to each other for the study of cell wall evolution.

This chapter describes a bioinformatics protocol to analyze CAZyme gene occurrence in two newly sequenced organisms, Klebsormidium flaccidum (charophyte green alga or CGA) and Lygodium japonicum (fern) [6, 7], with the focus on cellulose synthase (CesA) and hemicellulose backbone synthesis-related CesA-like (Csl) proteins. CesA and Csl proteins belong to the CAZyme family GT2, with Csl proteins further divided into ten subgroups (CslA, CslB, CslC, CslD, CslE, CslF, CslG, CslH, CslJ, and CslK) [5, 8, 9]. This protocol has been primarily used in our recent research papers [5, 9–11], and could be easily modified by replacing the query and/or database and applied to other cell wall-related CAZyme families to study their occurrence in the genomic data and infer the evolution of cell walls.

2 Materials

2.1 Computing Environment, Workflow, and Project Folder

Most bioinformatics analyses introduced here are Unix command line operations. A terminal is used to type in commands and to manipulate datasets. Single commands and command one-liners (commands connected with spaces and vertical bars in a single line), shown in italicized Courier font throughout this chapter, are used, for example, to list files (ls file), view files (less file), or count how many Fasta sequences are in a file (cat file | grep '^>' | wc -l) (note that the spaces and the vertical bars must exist). For more complex data processing needs, it is advantageous to write a small script in a text editor program such as Notepad++ (Windows) or gedit (Ubuntu Linux) and implement it as a command.
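As an illustration of turning a one-liner into a small reusable command, the sketch below wraps the Fasta count in a shell function. The function name count_fasta and the two demo sequences are assumptions made for this example only; grep -c gives the same number as piping grep into wc -l.

```shell
# Count the sequences in a Fasta file: one header line (starting with ">") per sequence.
count_fasta() {
    grep -c '^>' "$1"
}

# Build a tiny two-sequence example file (made-up sequences) and count its records.
demo_fa=$(mktemp)
printf '>seq1\nMKT\nLLV\n>seq2\nMSA\n' > "$demo_fa"
count_fasta "$demo_fa"   # prints 2
```

Once defined in a terminal session, the function can be called on any Fasta file, for example on the protein files placed in the database folder later in this protocol.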

Unix-like operating systems (OS) such as Linux and macOS have an integrated terminal and useful preinstalled programming languages (Perl, Python, etc.). If you have Windows OS, you may install Cygwin, a Linux-like environment for Windows (https://www.cygwin.com/), in order to run Unix commands and programming languages.

The bioinformatic pipeline described below is performed on a Linux system, Ubuntu 12.04.5 LTS, running on a computer with eight CPU processors (64-bit) and 8 GB of RAM.

In order to reproduce the analysis described in this chapter, the reader should have intermediate Unix command line skills and basic Perl or Python programming experience.

The computational workflow (Fig. 1) will be detailed in this protocol chapter. In brief, protein sequences of species of interest serve as subject (or database) for sequence similarity searches (blastp and hmmsearch). Known CesA/Csl protein sequences or domain models will be used as query in the searches. In the end, an annotated phylogenetic tree will be generated and used to infer the evolution of CesA/Csl protein families.

In the following sections, we will describe how to download and install the needed bioinformatics tools (Subheading 2.2), the databases (Subheading 2.3), and the query datasets (Subheading 2.4).

On Ubuntu Linux computers, by default, all files downloaded from the web browser are automatically put in the Downloads folder. On our computer the absolute path of this folder is /home/elfitzek/Downloads/ (“elfitzek” is the user account of the first author). The “path” is a very important concept in using all Linux systems. When you run a command in a terminal environment, you must provide the correct path of the program, command, file, or folder so that the computer can find it. We will install all tools in the tools folder (/home/elfitzek/project/tools/), all query data in the query folder (/home/elfitzek/project/query/), and all database files in the database folder (/home/elfitzek/project/database/). We assume our readers already know how to create, move, and copy files/folders between different folders.
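The folder layout described above can be created in one step. This is a minimal sketch; the function name make_project_dirs is hypothetical, and the project root is passed as an argument so that readers can substitute their own home folder for /home/elfitzek.

```shell
# Create the tools, query, and database folders under a given project root.
# mkdir -p also creates missing parent folders and does not fail if a folder already exists.
make_project_dirs() {
    root="$1"
    mkdir -p "$root/tools" "$root/query" "$root/database"
}
```

For the paths used in this chapter, the call would be make_project_dirs /home/elfitzek/project, after which ls /home/elfitzek/project should list the three folders.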

2.2 Bioinformatics Tools

As shown in Fig. 1, we will use blastp (a command of the BLAST package) and hmmsearch (a command of the HMMER package) to search the protein sequence sets of the two organisms (K. flaccidum and L. japonicum) for CesA/Csl homologs. BLAST (blastp) takes protein sequences as query to search against a protein sequence database, while HMMER (hmmsearch) takes HMM (hidden Markov model) profiles (see Subheading 2.4 for details) as query to search against a protein sequence database. In addition, we also need the MAFFT and FastTree tools to build phylogenetic trees.

2.2.1 Download and Install HMMER

(a) Download the latest version (v3.1b2) of HMMER from http://hmmer.org/ and uncompress the .gz file [12] (see Note 1).

(b) Copy the “binaries” folder within the “hmmer-3.1b2-linux-intel-x86_64” folder into the tools folder and rename it “hmmer.” The hmmsearch command will be in this hmmer folder. The absolute path of this folder on our computer is /home/elfitzek/project/tools/hmmer/.

[Flowchart: CesA/Csl protein sequences from Yin et al. 2009 (blastp query) and the HMM profiles GT2 (dbCAN) and Cellulose_synt (Pfam) (hmmsearch query) are searched against the Fasta protein sequences of the species of interest (L. japonicum and K. flaccidum). BLASTP and HMMER results with E-value < 1e-10 are filtered for subject IDs; the IDs are combined, the Fasta sequences are retrieved, and MAFFT, FastTree, and phylogenetic tree annotation follow.]

Fig. 1 Workflow of bioinformatics analysis of CesA/Csl protein families. Protein sequences (Fasta format) are retrieved from various sites (see Subheading 2). The BLAST package (the blastp command) is used to obtain CesA and Csl homologs with E-value <1e-10 [19]. The HMMER3 package (the hmmsearch command) is used to identify GT2 and CesA domain-containing proteins with E-value <1e-10 [20]. The resulting hits of both searches are combined and subjected to multiple sequence alignment with MAFFT [13] and then phylogenetic tree reconstruction using FastTree [14]. The phylogenetic tree of CesA/Csl hits is visualized and phylograms are made using iTOL [21]

2.2.2 Download and Install BLAST

(a) Download the latest version (version 2.3.0) of the stand-alone BLAST package (ftp://ftp.ncbi.nih.gov:/blast/executables/LATEST/) and uncompress the .gz file (see Note 2).

(b) Copy the “bin” folder within the “ncbi-blast-2.3.0+” folder into the tools folder and rename it “blast.” The blastp command will be in this blast folder (absolute path: /home/elfitzek/project/tools/blast/).

2.2.3 Download and Install the Multiple Sequence Alignment Tool MAFFT

For the multiple sequence alignment, a variety of tools are available such as Clustal Ω, MAFFT, MUSCLE, etc. In this study MAFFT is chosen for its high speed and high accuracy [13].

(a) Download the latest version of MAFFT from http://mafft.cbrc.jp/alignment/software/linuxportable.html and install it according to Note 3 [13].

(b) Choose from “mafft-linux32” and “mafft-linux64” the folder that matches your computer bit system and copy it to the tools folder. In our case, we choose mafft-linux64. The mafft.bat command in the mafft-linux64 folder (/home/elfitzek/project/tools/mafft-linux64/) is the executable MAFFT program.

2.2.4 Download and Install FastTree [14] for Phylogenetic Analysis

(a) Download the latest version of the FastTree executable program (Linux 64-bit executable (+SSE)) [14] from www.microbesonline.org/fasttree/#Install and place it in the tools folder (/home/elfitzek/project/tools/) (see Note 4).

2.3 Genome Data of Species of Interest (Database)

Protein sequences of the two newly sequenced organisms, K. flaccidum and L. japonicum, are available at the individual sites of the research groups who generated the sequence data.

2.3.1 Download Data from K. flaccidum Genome Website

(a) Download the protein sequences of the charophyte green alga K. flaccidum from http://www.plantmorphogenesis.bio.titech.ac.jp/~algae_genome_project/klebsormidium/index.html following the “Download” link. The genome of K. flaccidum was published in 2014 and the protein sequences were predicted from this genome [7]. Choose “Predicted Protein” (red arrow in Fig. 2) and download it (see Note 5). Note that the species name was later changed to Klebsormidium nitens NIES-2285 on the website, but we still use the old name Klebsormidium flaccidum in this chapter because the website had not yet been updated when the chapter was written.

(b) Change the downloaded file name to “Kfaccidum_protein.fa” and place the file in the database folder (/home/elfitzek/project/database/).

2.3.2 Download Data from L. japonicum Genome Website

(a) Open the website http://bioinf.mind.meiji.ac.jp/kanikusa/download.php, and download the compressed “lygodium_predicted_protein_ver1.0RC.fasta.tar.gz” file (red arrow in Fig. 3). Unlike those of the other organisms used in this study, these protein sequences were predicted from assembled transcriptome data instead of genome data [6].

(b) Decompress the Fasta file and copy it to the database folder.

(c) In the Fasta file, the sequence headers contain four different codes: “F,” “N,” “T,” and “P,” indicating that the sequences were predicted with different bioinformatics methods. These codes are: F = FrameDP, N = Newbler, T = Transdecoder, and P = longest amino acid sequence predicted by an in-house Perl script (personal communication with Kentaro Yano). We only select sequences containing “P” in their Fasta header for further analysis to avoid redundancy. This is done by using a self-developed Perl script (see Note 6). The output Fasta file is named “Lygodium_filtered.fa” and put in the database folder.
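The header-based selection in step (c) can also be sketched with awk. This is not the authors' Perl script (see Note 6); it is a generic filter that keeps every Fasta record whose header line matches a regular expression. The function name filter_fasta_by_header is hypothetical, and the pattern that identifies the “P” code depends on the actual header layout of the downloaded file, which is not shown here and should be checked first (for example with grep '^>' file | head).

```shell
# Keep only the Fasta records whose header line matches a regular expression.
# Usage: filter_fasta_by_header PATTERN input.fa > output.fa
filter_fasta_by_header() {
    awk -v pat="$1" '
        /^>/ { keep = ($0 ~ pat) }   # decide once per record, at its header line
        keep                         # print the header and sequence lines of kept records
    ' "$2"
}
```

Assuming, purely for illustration, that the code were the last underscore-separated field of each header, the call would look like filter_fasta_by_header '_P$' input.fa > Lygodium_filtered.fa.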

2.4 CesA/Csl Protein Sequences and Domain Models as Queries

As shown in Fig. 1, we will use both blastp and hmmsearch methods for searching CesA/Csl homologs. The reason was explained in [9] (see Note 7). The query for the blastp search is protein Fasta sequences, while for hmmsearch the query is HMMs (hidden Markov models).

Fig. 2 Website of Klebsormidium flaccidum genome project [7]. Arrow points to the link that contains the Fasta protein sequences of the whole proteome

2.4.1 CesA/Csl Sequences from Yin et al. [9]

CesA/Csl proteins from fully sequenced plants and algae (e.g., Arabidopsis thaliana and Oryza sativa) have been cataloged in Yin et al. [9]. We will use them as the query for the blastp search. Here we show how to download these sequences from the PubMed Central website:

(a) Open the webpage http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3091534/, download Additional file 3 (Fig. 4), open it using a plain text editor (gedit on our Linux computer, Notepad++ on a Windows computer), copy the A. thaliana and O. sativa sequences and the six CslA/C-like (later renamed CslK) protein sequences of chlorophyte green algae, paste them into a new plain text file, and save it as /home/elfitzek/project/query/GT2-query.fa.

Fig. 3 Website of Lygodium japonicum transcriptome project. Arrow points to the link that contains the Fasta protein sequences predicted from the assembled transcriptome

2.4.2 GT2 and CesA HMM Profiles

Hidden Markov model (HMM) profiles are widely used to represent protein domains or families (see Note 8). CslA and CslC proteins contain strong GT2 domain signals (Pfam model ID: Glycos_transf_2), while other CesA/Csl proteins contain strong CesA domain signals (Pfam model ID: Cellulose_synt).

To download these two HMM profiles:

(a) Go to the Pfam website and download the Glycos_transf_2.hmm file (https://pfam.xfam.org/family/Glycos_transf_2) (Fig. 5a), and rename it as GT2.hmm.

(b) Go to the Pfam website and download the Cellulose_synt.hmm file (http://pfam.xfam.org/family/Cellulose_synt) (Fig. 5b).

(c) Transfer the two HMM profile files to the query folder (/home/elfitzek/project/query/).

3 Methods

With the tools, query, and database sets ready, we will search for CesA/Csl homologs in the two newly sequenced organisms. By now the project folder contains (folder names end with “/”):

/home/elfitzek/project/tools/

– hmmer/

– blast/

– mafft-linux64/

– FastTree

/home/elfitzek/project/query/

– GT2-query.fa

– GT2.hmm

– Cellulose_synt.hmm

/home/elfitzek/project/database/

– Kfaccidum_protein.fa

– Lygodium_filtered.fa

We will create another folder called “analysis” and put it under the project folder. All analysis result files will be written to /home/elfitzek/project/analysis/.

Fig. 4 PubMed Central web link of the full text of Yin et al. [9]. Scroll down to look for Additional file 3 in the Supplementary Material section and download it

Fig. 5 (a) The GT2 page of the Pfam website. (b) The cellulose synthase family page of the Pfam website. The download links are marked with red arrows

3.1 Run hmmsearch in a Terminal

(a) Open a terminal and change directory to the analysis folder:

cd /home/elfitzek/project/analysis/

(b) Run hmmsearch (/home/elfitzek/project/tools/hmmer/) in a terminal on the two hmm files (/home/elfitzek/project/query/) and the two .fa files (/home/elfitzek/project/database/) (four times in total) using the command one-liners (Fig. 6):

Please note there are spaces between different parts in the one-liner:

– ../tools/hmmer/hmmsearch: since we are in the analysis folder, we have to go one level up (..) to find the tools folder, and then locate the hmmsearch command

– --domtblout kf.gt2.hmm.dm: the parameter (--domtblout) defines the tabular output file name (kf.gt2.hmm.dm), written to the current folder (analysis)

– ../query/GT2.hmm: locate the query HMM profile

– ../database/Kfaccidum_protein.fa: locate the protein sequence database

– > kf.gt2.hmm.out: direct (“>”) the complete output to a file (kf.gt2.hmm.out) in the current folder (analysis)

More detailed information about this command one-liner can be found in the HMMER user guide: http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf. It is also useful to run the command below to get the option list:

../tools/hmmer/hmmsearch -h

(c) The analysis folder now should have four .hmm.dm files and four .hmm.out files. The .hmm.dm files will be further analyzed to retrieve CesA/Csl homologs.
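The four hmmsearch runs in step (b) differ only in the profile and the protein set, so they can be generated with a loop. The sketch below prints the four command lines rather than running them, so it can be inspected without HMMER installed. The first line is assembled from the parts explained above; the .hmm.out names of the other three runs and the function name print_hmmsearch_commands are assumptions that follow the same naming pattern.

```shell
# Print (not run) the four hmmsearch command lines: 2 protein sets x 2 HMM profiles.
# Each entry is "short tag:file name"; the tags build the output file names.
print_hmmsearch_commands() {
    for db in kf:Kfaccidum_protein.fa lj:Lygodium_filtered.fa; do
        for hmm in gt2:GT2.hmm cesa:Cellulose_synt.hmm; do
            sp=${db%%:*};   fa=${db#*:}      # species tag and Fasta file name
            tag=${hmm%%:*}; model=${hmm#*:}  # profile tag and HMM file name
            echo "../tools/hmmer/hmmsearch --domtblout $sp.$tag.hmm.dm ../query/$model ../database/$fa > $sp.$tag.hmm.out"
        done
    done
}

print_hmmsearch_commands
```

To execute the commands, the output can be piped to a shell from inside the analysis folder (print_hmmsearch_commands | sh), which should leave the four .hmm.dm and four .hmm.out files described in step (c).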

Fig. 6 Command one-liners to initiate hmmsearch

3.2 Run blastp in a Terminal

(a) In a terminal, change directory to the analysis folder:

cd /home/elfitzek/project/analysis/

(b) In order to run blastp, we have to format the two .fa files in the database folder (/home/elfitzek/project/database/) using the makeblastdb command (/home/elfitzek/project/tools/blast/) (Fig. 7):

Running the command below will print the option list on the screen:

../tools/blast/makeblastdb -help

In the database folder, for each .fa file, six additional files with different suffixes will be generated (see Note 9).

(c) Run blastp (/home/elfitzek/project/tools/blast/) using the GT2-query.fa file as query (/home/elfitzek/project/query/) and the two .fa files (/home/elfitzek/project/database/) as database using the command one-liners (Fig. 8):

Running the command below will print all the options on the screen (see Note 10):

../tools/blast/blastp -help

(d) The analysis folder now should have two .blast.out files, which will be further analyzed to retrieve CesA/Csl homologs.
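Steps (b) and (c) can be sketched as follows. The exact one-liners of Figs. 7 and 8 are not reproduced here, so the option choices below are assumptions consistent with the rest of the protocol: -parse_seqids is the option that Subheading 3.4.1 says was used with makeblastdb, and -outfmt 6 is the BLAST+ option that produces the 12-column tabular output parsed in Fig. 9. The function prints the commands instead of running them; no E-value option is set because the 1e-10 cutoff is applied afterward when the hits are filtered.

```shell
# Print (not run) the makeblastdb and blastp command lines for both protein sets.
print_blast_commands() {
    for db in kf:Kfaccidum_protein.fa lj:Lygodium_filtered.fa; do
        sp=${db%%:*}; fa=${db#*:}   # species tag and Fasta file name
        # Format the Fasta file as a protein database; -parse_seqids keeps the
        # sequence IDs retrievable by blastdbcmd later on.
        echo "../tools/blast/makeblastdb -in ../database/$fa -dbtype prot -parse_seqids"
        # Search with the CesA/Csl query proteins and write tabular output.
        echo "../tools/blast/blastp -query ../query/GT2-query.fa -db ../database/$fa -outfmt 6 -out $sp.blast.out"
    done
}

print_blast_commands
```

As with the hmmsearch sketch, piping the output to sh from inside the analysis folder would run the commands and produce kf.blast.out and lj.blast.out.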

3.3 Extract Significant Hits

Now the analysis folder should contain the following files:

/home/elfitzek/project/analysis/

– kf.gt2.hmm.dm

– kf.cesa.hmm.dm

– kf.blast.out

– lj.gt2.hmm.dm

– lj.cesa.hmm.dm

– lj.blast.out

1. Save the IDs of the significant hits of the blastp results (.blast.out files) using a command one-liner. An example input file (blastp output) and this one-liner are explained in detail in Fig. 9 (see Note 11).

2. Save the IDs of the significant hits of the hmmsearch results (.hmm.dm files) using a command one-liner. An example input file (hmmsearch output) and this one-liner are explained in detail in Fig. 10 (see Notes 11 and 12).
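To see what the blastp filter does on a concrete input, the sketch below wraps it in a function with an adjustable cutoff; the function name blast_hit_ids and the default argument are illustrative assumptions. It expects the tab-delimited 12-column table (subject ID in column 2, E-value in column 11) and adds 0 to both sides of the comparison so that awk compares the values as numbers.

```shell
# Print the unique subject IDs (column 2) of hits whose E-value (column 11)
# is below a cutoff. Usage: blast_hit_ids FILE [CUTOFF], default cutoff 1e-10.
blast_hit_ids() {
    awk -F '\t' -v cutoff="${2:-1e-10}" '$11 + 0 < cutoff + 0 { print $2 }' "$1" | sort -u
}
```

For example, given hits to subject s1 with E-values 2e-50 and 1e-30, to s2 with 0.001, and to s3 with 3e-12, blast_hit_ids file prints s1 and s3: the weak hit s2 is dropped and s1 is listed only once.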

Fig. 7 Command one-liners to make blastable databases

Fig. 8 Command one-liners to run blastp search

cat kf.blast.out | awk '$11<1e-10' | cut -f2 | sort -u > kf.blast.out.1e-10.id
cat lj.blast.out | awk '$11<1e-10' | cut -f2 | sort -u > lj.blast.out.1e-10.id

Columns of the blastp tabular output: 1. Query ID; 2. Subject ID; 3. % Identity; 4. Alignment Length; 5. Mismatch Count; 6. Gap Open Count; 7. Query Start; 8. Query End; 9. Subject Start; 10. Subject End; 11. E-value; 12. Bit-score

Parts of each one-liner: 1. Print the file content; 2. Keep lines with 11th column < 1e-10; 3. Keep just the 2nd column; 4. Remove duplicates; 5. Direct the output into a file

Fig. 9 Top: The tab-delimited tabular format of blastp output; Bottom: Command one-liners to parse this tabular format file to extract hit IDs with significant E-values <1e-10

cat kf.cesa.hmm.dm | grep -v '^#' | awk '$13<1e-10' | awk '{print $1}' | sort -u > kf.cesa.hmm.dm.1e-10.id
cat kf.gt2.hmm.dm | grep -v '^#' | awk '$13<1e-10' | awk '{print $1}' | sort -u > kf.gt2.hmm.dm.1e-10.id
cat lj.cesa.hmm.dm | grep -v '^#' | awk '$13<1e-10' | awk '{print $1}' | sort -u > lj.cesa.hmm.dm.1e-10.id
cat lj.gt2.hmm.dm | grep -v '^#' | awk '$13<1e-10' | awk '{print $1}' | sort -u > lj.gt2.hmm.dm.1e-10.id

Parts of each one-liner: 1. Print the file content; 2. Remove the header lines; 3. Keep lines with 13th column < 1e-10; 4. Keep just the 1st column; 5. Remove duplicates; 6. Direct the output into a file

Fig. 10 Top: The regular space-delimited format of hmmsearch output (columns 1 to 23); Bottom: Command one-liners to parse this file to extract hit IDs with significant E-values <1e-10

3.3.1 Combine Significant Hits from the Three Methods

Now the analysis folder should contain the following .id files:

/home/elfitzek/project/analysis/

– kf.gt2.hmm.dm.1e-10.id

– kf.cesa.hmm.dm.1e-10.id

– kf.blast.out.1e-10.id

– lj.gt2.hmm.dm.1e-10.id

– lj.cesa.hmm.dm.1e-10.id

– lj.blast.out.1e-10.id

We run the following command one-liners to combine the hits from the three different queries/methods and remove duplicates (Fig. 11):
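A minimal sketch of this combining step is given below, under the assumption that the combined lists are named kf.all.id and lj.all.id (the names used by later steps of the protocol). The function name combine_ids is hypothetical. The wildcard is written so that it matches only the three .1e-10.id files of one species and not a combined file left over from an earlier run.

```shell
# Merge all per-method ID lists of one species into one de-duplicated list.
# Usage: combine_ids kf   (reads kf.*.1e-10.id, writes kf.all.id)
combine_ids() {
    cat "$1".*.1e-10.id | sort -u > "$1.all.id"
}
```

Run from inside the analysis folder as combine_ids kf and combine_ids lj; wc -l kf.all.id then gives the number of unique candidate proteins for K. flaccidum.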

3.4 Prepare Fasta Sequence File for Alignment

3.4.1 Extract Fasta Sequences of the Significant Hits in the Two Searched Organisms

In the BLAST package, there is a command called blastdbcmd that can take a given list of IDs and retrieve their Fasta sequences from a Fasta database file. This is only possible if the -parse_seqids option was applied when the makeblastdb command was run (as we did in Fig. 7). We ran the following commands (Fig. 12) to extract the Fasta sequences of our significant hits in the two organisms.

Running the command below will print all the options on the screen:

../tools/blast/blastdbcmd -help
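The extraction can be sketched as below. The exact one-liners of Fig. 12 are not reproduced here; this version assumes the combined ID lists are named kf.all.id and lj.all.id and uses the blastdbcmd options -db, -entry_batch (a file with one ID per line), and -out. The output names kf.all.id.fa and lj.all.id.fa are the ones used in Subheading 3.4.2. As in the earlier sketches, the function (a hypothetical name) prints the commands rather than running them.

```shell
# Print (not run) the blastdbcmd command lines that pull the hit sequences
# out of the two formatted protein databases.
print_blastdbcmd_commands() {
    for db in kf:Kfaccidum_protein.fa lj:Lygodium_filtered.fa; do
        sp=${db%%:*}; fa=${db#*:}   # species tag and database file name
        echo "../tools/blast/blastdbcmd -db ../database/$fa -entry_batch $sp.all.id -out $sp.all.id.fa"
    done
}

print_blastdbcmd_commands
```

A simple sanity check after running the commands is to compare the number of ">" header lines in each .fa file with the number of lines in the matching .id file; the two counts should agree.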

3.4.2 Combine Fasta Sequences from the Query Organisms and the Database Organisms

cat kf.all.id.fa lj.all.id.fa ../query/GT2-query.fa > all.species.fa

Fig. 11 Command one-liners to combine significant hits and remove duplicates. The asterisk is the Unix wildcard character that represents any characters

Fig. 12 Command one-liners to extract Fasta sequences

3.5 Build Alignment and Phylogeny

3.5.1 Run MAFFT to Build Multiple Sequence Alignment Using the Command One-Liner

../tools/mafft-linux64/mafft.bat --maxiterate 1000 --localpair all.species.fa > all.species.fa.l

Running the command below will print some help information on the screen:

../tools/mafft-linux64/mafft.bat

An online version of MAFFT can be found at http://www.ebi.ac.uk/Tools/msa/mafft/, where users can upload a Fasta format file for alignment remotely.

3.5.2 Run FastTree to Build Phylogeny Using the Command One-Liner

../tools/FastTree all.species.fa.l > all.species.fa.l.nwk

This uses the default parameters for protein phylogeny reconstruction. Running the command below will print all the options on the screen:

../tools/FastTree

3.6 Creating and Annotating Phylograms with iTOL

iTOL is a web application that allows users to upload a Newick format phylogeny tree file for making publishable phylograms. Moreover, it offers very useful utilities to annotate the tree graphs, such as coloring branches and leaves automatically by uploading color definition files. To make these color definition files, we run the command one-liners shown in Fig. 13. After all the steps, we get two definition files, one defining branch colors (all.species.br.txt) and the other defining leaf colors (all.species.lab.txt). We are then ready to upload these files to the iTOL website to make the tree graph in steps (a) to (d):
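Step #4 of Fig. 13 adds three header lines to each color definition file by hand in a text editor. As an alternative, the same three lines can be prepended from the command line. The sketch below is an optional convenience rather than part of the original protocol, and the function name add_itol_header is hypothetical.

```shell
# Prepend the three iTOL header lines to a color definition file, in place.
# Usage: add_itol_header all.species.br.txt
add_itol_header() {
    tmp=$(mktemp)
    # Write the header first, then the original content, then replace the file.
    { printf 'TREE_COLORS\nSEPARATOR SPACE\nDATA\n'; cat "$1"; } > "$tmp" && mv "$tmp" "$1"
}
```

It should be run exactly once per file, after step #3 of Fig. 13, on all.species.br.txt and on all.species.lab.txt; running it twice would duplicate the header.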

#1: make color definition file to specify the branch colors
cat ../query/GT2-query.fa | grep '>' | sed 's/>//' | grep '^AT' | awk '{print $1,"branch","#0000ff","normal","1"}' > at.all.id.br
cat ../query/GT2-query.fa | grep '>' | sed 's/>//' | grep '^LOC' | awk '{print $1,"branch","#ff0000","normal","1"}' > os.all.id.br
cat ../query/GT2-query.fa | grep '>' | sed 's/>//' | grep 'like' | awk '{print $1,"branch","#ffa500","normal","1"}' > alg.all.id.br
cat kf.all.id | awk '{print $1,"branch","#00ff00","normal","1"}' > kf.all.id.br
cat lj.all.id | awk '{print $1,"branch","#00ffff","normal","1"}' > lj.all.id.br

#2: make color definition file to specify the leaf colors
cat ../query/GT2-query.fa | grep '>' | sed 's/>//' | grep '^LOC' | awk '{print $1,"label","#ff0000","normal","1"}' > os.all.id.lab
cat ../query/GT2-query.fa | grep '>' | sed 's/>//' | grep '^AT' | awk '{print $1,"label","#0000ff","normal","1"}' > at.all.id.lab
cat ../query/GT2-query.fa | grep '>' | sed 's/>//' | grep 'like' | awk '{print $1,"label","#ffa500","normal","1"}' > alg.all.id.lab
cat kf.all.id | awk '{print $1,"label","#00ff00","normal","1"}' > kf.all.id.lab
cat lj.all.id | awk '{print $1,"label","#00ffff","normal","1"}' > lj.all.id.lab

#3: combine all species into one file
cat *.br > all.species.br.txt
cat *.lab > all.species.lab.txt

#4: open the two files in a plain text editor and add the three lines at the top
TREE_COLORS
SEPARATOR SPACE
DATA

# the top 10 lines of all.species.br.txt and all.species.lab.txt look like:
TREE_COLORS
SEPARATOR SPACE
DATA
kfl00004_0480 branch #00ff00 normal 1
kfl00007_0020 branch #00ff00 normal 1
kfl00025_0050 branch #00ff00 normal 1
kfl00029_0240 branch #00ff00 normal 1
kfl00032_0270 branch #00ff00 normal 1
kfl00053_0150 branch #00ff00 normal 1
kfl00053_0170 branch #00ff00 normal 1

TREE_COLORS
SEPARATOR SPACE
DATA
kfl00004_0480 label #00ff00 normal 1
kfl00007_0020 label #00ff00 normal 1
kfl00025_0050 label #00ff00 normal 1
kfl00029_0240 label #00ff00 normal 1
kfl00032_0270 label #00ff00 normal 1
kfl00053_0150 label #00ff00 normal 1
kfl00053_0170 label #00ff00 normal 1

Fig. 13 Command one-liners to make color definition files

(a) Open http://itol.embl.de/upload.cgi in a web browser.

(b) Choose the Newick file (all.species.fa.l.nwk) from the computer and upload it, or simply drag the file to the “Tree text” area to upload.

(c) The uncolored tree graph will be shown in a tree browser.

(d) Drag the two color definition files into the tree browser (see Note 13); the tree branches and leaves will then be colored (Fig. 14).

Fig. 14 iTOL tree browser and control panel (top-right corner). Selecting different options in the control panel changes the tree graph immediately

3.7 Interpretation of the Phylogeny

The last and probably most important step in this protocol is to manually inspect the phylogeny to make meaningful/interesting evolutionary interpretations. Figure 15 includes CesA/Csl homologs from ten organisms: the two newly sequenced organisms, K. flaccidum (genome) and L. japonicum (transcriptome); two model organisms (A. thaliana and O. sativa); and six chlorophyte green algae (Micromonas pusilla CCMP1545, Micromonas strain RCC299, Ostreococcus lucimarinus, Ostreococcus tauri, Chlamydomonas reinhardtii, and Volvox carteri f. nagariensis). The interpretation of this phylogeny is largely based on the groupings of K. flaccidum and L. japonicum proteins into known CesA/Csl clades according to the already annotated A. thaliana, O. sativa, and chlorophyte algal proteins.

With regard to K. flaccidum, we can make the following interpretations from the phylogeny, most of which are in agreement with findings reported in our paper [5]: (1) no CslA orthologs are found in the CGA (charophyte green alga) K. flaccidum; (2) three K. flaccidum proteins are monophyletically clustered with CslC proteins of land plants; (3) a single K. flaccidum protein is clustered with chlorophyte CslK proteins, more specifically monophyletically clustered with Micromonas and Ostreococcus proteins (support value = 99%), which is very interesting because CslK was thought to be chlorophyte specific; (4) four K. flaccidum proteins are clustered with land plant CesAs with a support value of 95%; (5) one K. flaccidum protein is clustered within the CslD clade but with a very long branch, suggesting this is not a reliable clustering; another K. flaccidum protein is basal to all CesA/CslD/CslF proteins. For the last point, we and others have found that some Penium and Spirogyra CGA have CesAs and Coleochaete CGA have CslD [5, 15].

For the fern L. japonicum, we can conclude that it has CesA, CslD, CslA, and CslC proteins. Although no L. japonicum proteins are found in the CslB/H/E/G clades, this is likely because the corresponding genes were not expressed in the transcriptome data that we searched. In fact, we have previously shown that some other fern species do have expressed proteins in these clades [5].

Lastly, we see that there are some proteins from K. flaccidum and L. japonicum that are not clustered with any of the CesA and Csl clades in the phylogeny. They are either non-CesA/Csl GT2 proteins that are homologous to Arabidopsis dolichyl phosphate β-glucosyltransferase (AT2G39630) and dolichol phosphate mannose synthase (AT1G20575), or even bacterial linear CesA-like proteins. The latter possibility could be tested by including published bacterial, plant, and algal linear CesA proteins in the phylogenetic analysis [16–18].

Fig. 15 Phylogeny of CesA/Csl proteins from A. thaliana, O. sativa, K. flaccidum, L. japonicum, and six chlorophyte green algae. Nodes that have >80% support values are indicated with light blue circles. Color key: blue, A. thaliana; red, O. sativa; green, K. flaccidum; cyan, L. japonicum; orange, chlorophytes. Labeled clades: CesA, CslD, CslF, CslA, CslC, CslB/H/E/G, CslK, and other GT2

4 Notes

1. In Ubuntu Linux, this can be done by right-clicking the .gz file and selecting “Extract Here” from the menu. Alternatively, open a terminal, go to the download folder (/home/elfitzek/Downloads/), and run the command:

tar xvf hmmer-3.1b2-linux-intel-x86_64.tar.gz

This will create a folder called “hmmer-3.1b2-linux-intel-x86_64”.

2. In Ubuntu Linux, this can be done by right-clicking the .gz file and selecting “Extract Here” from the menu. Alternatively, open a terminal, go to the download folder, and run the command:

tar xvf ncbi-blast-2.3.0+-x64-linux.tar.gz

This will create a folder called “ncbi-blast-2.3.0+”.

3. Open a terminal, go to the download folder, and run the command:

tar xvf mafft-7.273-linux.tgz

This will release two folders, “mafft-linux32” and “mafft-linux64”. The latter is what we need for our computer.

4. In order to run this program, we need to change the permission of the downloaded FastTree executable with the command below in a terminal:

chmod 777 /home/elfitzek/project/tools/FastTree

5. In the webpage, right-click on the link in Fig. 2 and select “Copy Link Address.” Open a terminal, move to the database folder (/home/elfitzek/project/database/), type the wget command, and paste the copied web link (there is a space after wget):

wget http://www.plantmorphogenesis.bio.titech.ac.jp/~algae_genome_project/klebsormidium/kf_download/131203_kfl_initial_genesets_v1.0_AA.fasta

6. The Perl script is written as a one-liner:

cat lygodium_predicted_potein_ver1.0RC.fasta | perl -e '@a=<>;for($i=0;$i<=$#a;$i=$i+2){print $a[$i].$a[$i+1] if $a[$i]=~/_P_/;}' > Lygodium_filtered.fa

We do not provide the explanation of this one-liner as it is beyond the scope of this chapter.
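Without going through the syntax, the effect of the one-liner can be summarized: it reads the FASTA file as pairs of lines (a header followed by its sequence) and keeps only the pairs whose header contains the string _P_. The sketch below reproduces that behavior with awk. The function name filter_fasta_by_tag is hypothetical and not part of the protocol; unlike the Perl version, it does not assume that each sequence occupies exactly one line.

```shell
# Keep only FASTA records whose header line contains a given tag.
# Usage: filter_fasta_by_tag TAG INPUT.fasta > OUTPUT.fa
# (filter_fasta_by_tag is an illustrative name, not part of the protocol.)
filter_fasta_by_tag() {
    # On each header line, decide whether the record is kept;
    # that decision then applies to every following sequence line.
    awk -v tag="$1" '
        /^>/ { keep = (index($0, tag) > 0) }
        keep
    ' "$2"
}
```

With the file names used above, it would be called as: filter_fasta_by_tag _P_ lygodium_predicted_potein_ver1.0RC.fasta > Lygodium_filtered.fa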

7. Briefly, this ensures that all CesA/Csl proteins and their closely related GT2 proteins are identified.

8. Well-known protein family/domain HMM profile databases include Pfam, PIRSF, and SMART, which are very popular for functional annotation of newly sequenced proteomes.

9. These are the index files that will be used for the blastp search. They have to be in the same folder as the FASTA file, and their file names must not be changed.

10. In particular, the “-outfmt 6” parameter specifies the blast output in tabular format, which is very easy to parse.

11. The E-value threshold can be changed and may affect the result very significantly. Here we use 1e-10 because we wanted to be very conservative in keeping significant hits. There is no universally “best” E-value. People usually explore different E-values and make a decision based on their own experience. For this chapter, we have tried E-value <1e-2, E-value <1e-5, and E-value <1e-10. Different E-value thresholds do not change any of our conclusions made in the interpretation step (Subheading 3.7).
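To explore how the threshold affects the hit list, the tabular blast output can be re-filtered without rerunning blastp. The following is a minimal sketch, assuming the default “-outfmt 6” column order, in which the subject (database) sequence ID is column 2 and the E-value is column 11. The function name and the file name in the usage line are hypothetical.

```shell
# Print the unique subject IDs of BLAST hits below an E-value cutoff.
# Assumes default "-outfmt 6" columns: subject ID = column 2, E-value = column 11.
# Usage: blast_hits_below_evalue BLAST.tsv CUTOFF
# (blast_hits_below_evalue is an illustrative name, not part of the protocol.)
blast_hits_below_evalue() {
    # "+ 0" forces a numeric comparison, so values such as 3e-50 are
    # compared as numbers rather than as strings.
    awk -F'\t' -v cutoff="$2" '$11 + 0 < cutoff + 0 { print $2 }' "$1" | sort -u
}
```

For example, blast_hits_below_evalue blast_output.tsv 1e-10 | wc -l counts the distinct hits at the threshold used here; repeating the call with 1e-2 and 1e-5 shows how many additional sequences a more permissive cutoff would admit.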

12. The hmmsearch output reports three E-values (Fig. 10): (a) the E-value, which refers to the full sequence; (b) the c-E-value, known as the conditional E-value; and (c) the i-E-value, known as the independent E-value. We choose to use the i-E-value when parsing the file to extract significant hits. A detailed explanation of the hmmsearch output can be found in the HMMER user guide: http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf.
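A minimal sketch of such a parsing step is given below. It assumes that hmmsearch was run with the --domtblout option to write a per-domain table, which may differ from the output file actually parsed in this chapter. In that table, lines beginning with # are comments, column 1 is the target sequence name, and column 13 is the i-E-value. The function name is hypothetical.

```shell
# Print unique sequence names that have at least one domain hit with an
# i-E-value below CUTOFF, reading a table written by "hmmsearch --domtblout".
# Usage: hmm_hits_below_ievalue DOMTBLOUT CUTOFF
# (hmm_hits_below_ievalue is an illustrative name, not part of the protocol.)
hmm_hits_below_ievalue() {
    # Skip comment lines; column 13 of the per-domain table is the i-E-value.
    awk -v cutoff="$2" '!/^#/ && $13 + 0 < cutoff + 0 { print $1 }' "$1" | sort -u
}
```

A sequence with several domain hits is reported once, so the output can be used directly as a list of IDs for extracting sequences.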

13. Using the Control panel in the tree browser, we can display the tree in circular/normal/unrooted modes, show/hide bootstrap values, and export the tree graph as an image file in different formats (e.g., PDF, PNG). A more detailed user guide is available in the help page of iTOL: http://itol.embl.de/help.cgi.


Acknowledgments

E.F. is supported by the Research & Artistry Award of Northern Illinois University and partially supported by the National Institutes of Health (1R15GM114706) to Y.Y. R.B. was a University Honors Program Undergraduate Student of Northern Illinois University. We acknowledge the Department of Computer Science of NIU for providing free access to the Linux computing cluster Gaea and the Yin lab members for helpful discussions.

References

1. Lombard V, Golaconda Ramulu H, Drula E, Coutinho PM, Henrissat B (2014) The carbohydrate-active enzymes database (CAZy) in 2013. Nucleic Acids Res 42:D490–D495

2. Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, Mitros T, Dirks W, Hellsten U, Putnam N et al (2012) Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res 40:D1178–D1186

3. Popper Z, Michel G, Herve C, Domozych DS, Willats WG, Tuohy MG, Kloareg B, Stengel DB (2011) Evolution and diversity of plant cell walls: from algae to flowering plants. Annu Rev Plant Biol 62:567–590

4. Fangel JU, Ulvskov P, Knox JP, Mikkelsen MD, Harholt J, Popper ZA, Willats WG (2012) Cell wall evolution and diversity. Front Plant Sci 3:152

5. Yin Y, Johns MA, Cao H, Rupani M (2014) A survey of plant and algal genomes and transcriptomes reveals new insights into the evolution and function of the cellulose synthase superfamily. BMC Genomics 15:1–15

6. Aya K, Kobayashi M, Tanaka J, Ohyanagi H, Suzuki T, Yano K, Takano T, Matsuoka M (2014) De novo transcriptome assembly of a fern, Lygodium japonicum, and a web resource database, Ljtrans DB. Plant Cell Physiol 56:e5–e5

7. Hori K, Maruyama F, Fujisawa T, Togashi T, Yamamoto N, Seo M, Sato S, Yamada T, Mori H, Tajima N et al (2014) Klebsormidium flaccidum genome reveals primary factors for plant terrestrial adaptation. Nat Commun 5:3978

8. Richmond TA, Somerville CR (2000) The cellulose synthase superfamily. Plant Physiol 124:495–498

9. Yin Y, Huang J, Xu Y (2009) The cellulose synthase superfamily in fully sequenced plants and algae. BMC Plant Biol 9:99

10. Taujale R, Yin Y (2015) Glycosyltransferase family 43 is also found in early eukaryotes and has three subfamilies in Charophycean green algae. PLoS One 10:e0128409

11. Yin Y, Chen H, Hahn MG, Mohnen D, Xu Y (2010) Evolution and function of the plant cell wall synthesis-related glycosyltransferase family 8. Plant Physiol 153:1729–1746

12. Finn RD, Clements J, Eddy SR (2011) HMMER web server: interactive sequence similarity searching. Nucleic Acids Res 39:W29–W37

13. Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30:772–780

14. Price MN, Dehal PS, Arkin AP (2009) FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol Biol Evol 26:1641–1650

15. Mikkelsen MD, Harholt J, Ulvskov P, Johansen IE, Fangel JU, Doblin MS, Bacic A, Willats WG (2014) Evidence for land plant cell wall biosynthetic mechanisms in charophyte green algae. Ann Bot 114:1217–1236

16. Harholt J, Sorensen I, Fangel J, Roberts A, Willats WG, Scheller HV, Petersen BL, Banks JA, Ulvskov P (2012) The glycosyltransferase repertoire of the spikemoss Selaginella moellendorffii and a comparative study of its cell wall. PLoS One 7:e35846

17. Michel G, Tonon T, Scornet D, Cock JM, Kloareg B (2010) The cell wall polysaccharide metabolism of the brown alga Ectocarpus siliculosus. Insights into the evolution of extracellular matrix polysaccharides in eukaryotes. New Phytol 188:82–97

18. Roberts E, Roberts AW (2009) A cellulose synthase (CesA) gene from the red alga Porphyra yezoensis (Rhodophyta). J Phycol 45:203–212

19. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403–410

20. Eddy SR (2011) Accelerated profile HMM searches. PLoS Comput Biol 7:e1002195

21. Letunic I, Bork P (2011) Interactive tree of life v2: online annotation and display of phylogenetic trees made easy. Nucleic Acids Res 39:W475–W478

