BIOINFORMATICS IN AQUACULTURE - AquaFunc:...

BIOINFORMATICS IN AQUACULTURE

Aleksei Krasnov
AKVAFORSK (Ås, Norway)
Bergen, September 21, 2007

Research area

Functional genomics of salmonids
Major in diseases, stress and toxicity

Experience is in

- Sequence analyses and annotation
- Construction and use of microarrays
- Management of gene expression data
- Microarray statistics

Genome-wide level: sequence databases and servers, large microarray platforms and gene expression warehouses

Single-gene studies in fish health, welfare, nutrition, reproduction etc

Automatic computer-forced analyses have (almost) reached the limit
Collaboration between bioinformatic experts and biologists is poor

Limited use of genomic information

Domain-specific databases linked to tools (modules) for data analyses and annotations

[Figure axes: Number of genes; Accuracy, resolution]

SEQUENCES (mRNA)

1. Processing and storage of primary data (GB)
Salmo – 435031
Oncorhynchus – 290946
(September 7, 2007)

2. Clustering (Unigene, TIGR, GRASP etc)
Unique sequences (Unigene)

3. Construction of contigs (TIGR, GRASP)
Contigs are problematic (error prone) due to
• Duplicated genes
• Multi-gene families with conserved paralogs
• Splicing variants
Unigene deliberately refrains from building contigs

IDENTIFICATION AND ANNOTATION
Blastn / blastx search across nucleotide / protein databases
• Identification fully depends on the reference set
• Search across large databases (e.g. Swissprot, Uniprot) gives best hits to fish proteins with obscure names and poor annotations
• Sequence banks change continuously
• Many genes have multiple names, nomenclature is available only for human
• No rules how to deal with low similarities and uncertain homologies

Curated inspection is required

Functional annotation (Gene Ontology - GO)
• Transfer from putative homologs
• Problems in GO
• Phylogenetic conservation of function is not guaranteed

Structural annotation, assignment to multi-gene families, search for domains (Interpro, Ensembl)

• Has not been accomplished for salmonids at the genome level (?)
• Blast is insufficient, more sophisticated approaches are required (e.g. Hidden Markov Models – HMM)

GENE INDICES (TIGR, GRASP)

+ The Gene Index provides
- Blastn across salmon or rainbow trout sequences
- Sequences and reading frames
- Links to Genbank
- Positions of EST in contigs
- GO annotations
- Links to pathways

Shall we wait until developers introduce everything we need?
No database / server is able to meet all requirements
What to do with new sequences (EST)?

- However
- Limited possibilities for search and retrieval of data, especially in a multi-gene format
- Many important annotations are missing
- Not adapted for comparative genomic studies
etc...

Our first experience in development of software
(written by Petri Pehkonen, a student of the computer science department, as a practical exercise)

- Stand-alone blast operates with user-specified sequence databases
- Parser, forms for selection of matches, export of data and iterative searches
- Database (MS Access) is used for Gene Index
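The parser step can be illustrated with a minimal sketch. It is not the software described on this slide. It assumes that stand-alone blast was run with tabular output (the 12 default columns of `-outfmt 6` in BLAST+), and it keeps the highest-scoring match per query below an E-value cutoff. The function name `best_hits`, the cutoff of 1e-5 and the identifiers in the example are hypothetical; as noted earlier, there are no agreed rules for low similarities, so any cutoff is a choice to be reviewed by a curator.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text, max_evalue=1e-5):
    """Return {query id: best hit} from BLAST tabular text.

    max_evalue is an illustrative cutoff, not a recommended value.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        hit = dict(zip(FIELDS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] > max_evalue:
            continue  # too weak to consider
        query = hit["qseqid"]
        if query not in best or hit["bitscore"] > best[query]["bitscore"]:
            best[query] = hit
    return best


# Hypothetical example: two ESTs searched against a protein set.
example = "\n".join([
    "est_001\tprotA\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-50\t310",
    "est_001\tprotB\t60.0\t180\t72\t2\t1\t540\t5\t184\t1e-20\t150",
    "est_002\tprotC\t30.0\t90\t63\t4\t1\t270\t10\t99\t0.5\t32",
])
hits = best_hits(example)
print({q: h["sseqid"] for q, h in hits.items()})
```

In this example est_002 has no hit below the cutoff and is left unidentified, which is the kind of case that still needs inspection by a curator.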

Small and simple effort in development of software helped to resolve many problems

• Analysis, structural and functional annotations of EST
• Search for members of multi-gene families and functional classes
• Design of microarrays
• Linking to web databases, e.g. Harvester knowledge base

Much more can be done if several groups join efforts!

What do we need? Personal opinion

Diverse databases / pipe-lines (adapted for research areas, projects, labs, personal preferences)

MUST be designed by biologists
Only USERS know what data and analyses they need
Design of database is a time-consuming task
The rest is done easily by computer people

Toolkit for common use
Basic sequence analysis tools adapted for database (blast, sequence alignment, translation, synonymous / non-synonymous substitutions etc)
Parsers
Interfaces and forms to launch applications, to import, query, format and export data
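Two of the basic tools named above, translation and counting of synonymous / non-synonymous substitutions, can be sketched in a few lines. This is an illustration only: it assumes the standard genetic code and two coding sequences that are already aligned in frame without gaps, and it simply counts codon differences. It does not estimate dN/dS rates, which need a proper model. The function names are hypothetical.

```python
# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate a coding sequence in frame 1; unknown codons become 'X'."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def count_substitutions(cds_a, cds_b):
    """Count codons that differ synonymously and non-synonymously.

    Both sequences must be aligned in frame and of equal length.
    Codons with ambiguous bases are skipped.
    """
    if len(cds_a) != len(cds_b):
        raise ValueError("sequences must be aligned and of equal length")
    a, b = cds_a.upper(), cds_b.upper()
    syn = nonsyn = 0
    for i in range(0, len(a) - len(a) % 3, 3):
        codon_a, codon_b = a[i:i + 3], b[i:i + 3]
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


print(translate("ATGGCCAAA"))                          # MAK
print(count_substitutions("ATGGCCAAA", "ATGGCTAGA"))   # (1, 1)
```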

Task requires joint effort

AREA FOR COLLABORATION: IMMUNOGENOMICS

A simple question: what immune genes have been (not) identified in salmonid fish?

- Interferon-dependent genes?
- Immunoglobulins?
- Homologs to surface antigens (CD)?

A simple solution:
- Retrieve all immune proteins by linking GO to Swissprot / Uniprot (use SRS at EBI)

SRS is a database of databases, an extremely useful and powerful resource

IMMUNOGENOMICS
A task:
- Retrieve all immune proteins by linking GO to Swissprot / Uniprot (use SRS at EBI)
- Run blast

False positives:
- Uncertain homology
- Errors in annotations
False negatives:
- Many important genes with immune functions are not annotated in GO
- Many fish homologs are not recognized with blast due to low sequence conservation, more powerful methods are required (e.g. HMM)

To identify fish immune genes we must compile a set of reference genes and use advanced methods of sequence analyses

To use results we need more precise and extensive annotations (Immune Ontology), description of each gene

From EST to microarrays

Genome-wide platforms (GRASP, TRAITS)
+ Good for screening and search for markers
- Lack of spot replicates means low accuracy
- High quality annotation of large numbers of genes is VERY problematic

Medium-size, specialized platforms (e.g. 1.8 K immunochip)
- Many important genes are missing, especially those with unknown functions
+ Spot replicates ensure high sensitivity and accuracy
+ Coverage of functions that are most important for a research area
+ High quality annotations are feasible

JVI considers manuscripts that include microarrays and similar parallel profiling analyses of viral or cellular gene expression. However, such manuscripts will be published only if they provide novel insight into the biology of the virus or the infected cell or if they form the basis for additional experiments that provide such insights

JVI, guide for authors, 2007

The most important and challenging task is to make sense of gene expression data

STANDARD MICROARRAY MANUSCRIPT INCLUDES

- Lists of genes divided into clusters
- GO analyses (enriched / depleted classes in the list and / or clusters)
- Genes placed on pathway maps

- Such results are produced easily and do not have any value per se
- Papers without important biological findings should not be published

- Statistics / data mining is a useful subsidiary tool
- Researchers should not be confined to any particular method or software
- Results are always noisy and should be taken with great caution
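How easily the GO analysis of enriched classes is produced can be shown with a short sketch. It assumes the usual one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test): given N genes on the platform, K of them in a GO class, and a list of n genes, it returns the probability of seeing k or more class members in the list by chance. The function name and the numbers in the example are hypothetical, and no correction for testing many GO classes is included.

```python
from math import comb


def enrichment_p(k, n, K, N):
    """One-sided hypergeometric p-value P(X >= k).

    k: genes in the list that belong to the GO class
    n: size of the gene list
    K: genes on the platform that belong to the GO class
    N: all genes on the platform
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical example: 1800 genes on the chip, 60 in the class,
# 100 differentially expressed genes of which 12 are in the class.
print(enrichment_p(12, 100, 60, 1800))
```

A small p-value here only says that the class is over-represented in the list; as the slide stresses, it does not by itself amount to a biological finding.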

Fish_Chips.exe

Fish_Chips includes:

- Simple ontology of samples (~ 300)
- Easy and flexible querying by experiments, genes, parameters (expression ratio, log-ER, ranks etc)

To find the pros and cons of data mining procedures it is important to have samples in a database with strong and flexible querying.
Solution: relational database with utilities for data querying and analyses
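The idea of a relational database with flexible querying can be sketched as follows. This is not the Fish_Chips schema, which is not shown in the slides. The three tables, their column names, the sample descriptions and all expression values are invented for illustration, and SQLite is used here only because it ships with Python.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical tables and toy data."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE sample (sample_id INTEGER PRIMARY KEY,
                             experiment TEXT, treatment TEXT);
        CREATE TABLE gene (gene_id INTEGER PRIMARY KEY,
                           name TEXT, go_class TEXT);
        CREATE TABLE expression (sample_id INTEGER REFERENCES sample(sample_id),
                                 gene_id INTEGER REFERENCES gene(gene_id),
                                 log_er REAL);
    """)
    con.executemany("INSERT INTO sample VALUES (?, ?, ?)", [
        (1, "demo", "infection"), (2, "demo", "infection"), (3, "demo", "estrogen"),
    ])
    con.executemany("INSERT INTO gene VALUES (?, ?, ?)", [
        (1, "Complement component C3", "immune response"),
        (2, "Hemopexin", "transport"),
    ])
    # Invented log expression ratios, chosen only to make the query visible.
    con.executemany("INSERT INTO expression VALUES (?, ?, ?)", [
        (1, 1, 1.0), (2, 1, 2.0), (3, 1, 0.0),
        (1, 2, 0.25), (2, 2, 0.25), (3, 2, -1.0),
    ])
    return con


def genes_by_treatment(con, treatment, min_abs_log_er):
    """Mean log-ER per gene in one treatment, filtered by absolute size."""
    return con.execute("""
        SELECT g.name, AVG(e.log_er)
        FROM expression AS e
        JOIN gene AS g ON g.gene_id = e.gene_id
        JOIN sample AS s ON s.sample_id = e.sample_id
        WHERE s.treatment = ?
        GROUP BY g.name
        HAVING ABS(AVG(e.log_er)) >= ?
        ORDER BY g.name
    """, (treatment, min_abs_log_er)).fetchall()


con = build_demo_db()
print(genes_by_treatment(con, "infection", 0.5))
```

Because samples, genes and measurements sit in separate tables, the same data can be queried by experiment, by gene or by parameter without reformatting, which is what makes it possible to compare data mining procedures on identical input.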

Fish_Chips also includes:

- Sequences, annotations (GO, KEGG), links to web databases
- Built-in statistical analyses (comparison by GO classes, t-test)
- Direct link to Statistica (ANOVA, Fisher's exact test, mean expression profiles)
- Formatting of data for external applications (cluster analysis)

Cluster analysis – why?
Lack of choice

In microarray analyses the number of measurements is greater than the number of samples

Cluster analysis – one of few methods that can work with such data

Extremely simple, entirely formal – no theory behind
Many technical and biological problems

Genes that are members of a cluster are co-regulated. Is that true?!
- Clusters are found in any data set
- Different procedures produce different clusters, clusters must be checked for strength
- Similar expression profiles are often observed only in small data sets, most clusters are destroyed by addition of samples
- Even strong clusters do not necessarily have biological significance
Example: Genes that were up- or down-regulated in only one outlier form a strong and highly significant cluster

Classification of samples. Clustering helps to see the structure of experiment and interaction of factors

[Figure: clustering of study groups T, LE, ILE, IHE, I, HE (labels shown twice in the original slide)]

Separate and combined effects of estrogen and parasitic kidney disease on hepatic gene expression in rainbow trout (Helmut Segner, Univ. Bern)

Infection and infection + estrogen

Estrogen

When the two challenges are combined, the response to the pathogen is greater

Euclidean distance metric, Ward's method (available only in statistical packages)
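The combination of Euclidean distance and Ward's method can be sketched in plain Python. At each step the two clusters are merged whose union gives the smallest increase in the within-cluster sum of squares. The group labels are taken from the slide, but the expression profiles below are invented toy numbers, not the data of the study, and the function name is hypothetical. This naive version is meant for a handful of samples only.

```python
from itertools import combinations


def sq_euclid(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ward_cluster(profiles, n_clusters):
    """Agglomerative clustering of samples by Ward's criterion.

    profiles: list of equal-length numeric lists, one per sample.
    Returns sorted lists of sample indices, one list per cluster.
    """
    clusters = {i: [i] for i in range(len(profiles))}
    dim = len(profiles[0])

    def centroid(members):
        return [sum(profiles[m][d] for m in members) / len(members)
                for d in range(dim)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if a and b are merged.
        na, nb = len(clusters[a]), len(clusters[b])
        return na * nb / (na + nb) * sq_euclid(centroid(clusters[a]),
                                               centroid(clusters[b]))

    while len(clusters) > n_clusters:
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: merge_cost(*pair))
        clusters[a] = clusters[a] + clusters.pop(b)
    return sorted(sorted(members) for members in clusters.values())


# Group labels from the slide; the three-gene profiles are invented.
names = ["T", "LE", "HE", "I", "ILE", "IHE"]
toy_profiles = [
    [0.0, 0.0, 0.0],
    [0.1, -0.5, 0.0],
    [0.2, -0.9, 0.1],
    [1.0, 0.1, 0.8],
    [1.3, -0.2, 1.0],
    [1.6, -0.5, 1.2],
]
groups = ward_cluster(toy_profiles, 2)
print([[names[i] for i in g] for g in groups])
```

For real data sets a library routine is preferable, for example `scipy.cluster.hierarchy.linkage` with `method="ward"`.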

[Figure, panel C: Log (Expression ratio) by study group (T, LE, HE, I, ILE, IHE)]

Beta E2 = - 0.199
Beta PKD = - 0.630

Complement component C3
Complement component C5
Complement component C9
Complement factor Bf-1
Complement factor H
Properdin
C type lectin receptor B
Toll-like receptor 20a
Ceruloplasmin
Endothelial leukocyte adhesion molecu
Ig kappa chain V-IV region B17-2
Profilin
CC chemokine SCYA110-2
Adenosine kinase 2
Acute phase protein
G1/S-specific cyclin D2
Histone deacetylase 4
Fibroblast growth factor-20
Bone morphogenetic protein 8-like
Metallothionein-IL
Stress 70 protein chaperone
Thioredoxin-like protein 4A

[Figure, panel D: Log (Expression ratio) by study group (T, LE, HE, I, ILE, IHE)]

Beta E2 = - 0.741
Beta PKD = 0.125

Bcl2-associated X protein
C3a anaphylatoxin chemotactic recepto
CC chemokine SCYA110-1
Cytokine inducible SH2-containing prote
Egl nine homolog 2
Hemopexin
Ig kappa chain V-I region WEA
Interferon-related regulator 2-1
Liver-expressed antimicrobial peptide 2
Membrane-type mosaic serine protease
Myelin basic protein-1
NAD-dependent deacetylase sirtuin 5
Peptidyl-prolyl cis-trans isomerase 2-2
Semaphorin 7A

Induced with parasite, no response to estrogen

Induced with parasite, suppressed with estrogen

Finding transcription modules enhances resolution of analyses

- Differentially expressed genes were clustered
- Cluster members were checked for correlation to mean expression profile (r > 0.7)
- Multiple regression evaluated effects of factors (p < 0.05) and their interaction
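The second step above, checking cluster members against the mean expression profile, can be sketched as follows. It assumes Pearson correlation and computes the mean profile over all genes of the cluster, including the gene being tested; whether the original analysis did exactly this is not stated. The gene names and values are invented, and the multiple regression step is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def module_members(profiles, r_min=0.7):
    """Keep the genes of a cluster that correlate with its mean profile.

    profiles: {gene name: list of log expression ratios per study group}
    r_min: correlation threshold (0.7 on the slide)
    """
    genes = list(profiles)
    n_groups = len(profiles[genes[0]])
    mean_profile = [sum(profiles[g][j] for g in genes) / len(genes)
                    for j in range(n_groups)]
    return [g for g in genes if pearson(profiles[g], mean_profile) > r_min]


# Invented cluster: two genes follow the same trend, one runs against it.
toy_cluster = {
    "geneA": [0.0, 1.0, 2.0, 3.0],
    "geneB": [0.0, 2.0, 4.0, 6.0],
    "geneC": [3.0, 2.0, 1.0, 0.0],
}
print(module_members(toy_cluster))
```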
