BIOINFORMATICS IN AQUACULTURE
Aleksei Krasnov
AKVAFORSK (Ås, Norway)
Bergen, September 21, 2007
Research area
Functional genomics of salmonids
Focus on diseases, stress and toxicity
Experience is in
- Sequence analyses and annotation
- Construction and use of microarrays
- Management of gene expression data
- Microarray statistics
Genome-wide level: sequence databases and servers, large microarray platforms and gene expression warehouses
Single-gene studies in fish health, welfare, nutrition, reproduction etc
Automatic, computer-driven analyses have (almost) reached their limit
Collaboration between bioinformatic experts and biologists is poor
Limited use of genomic information
Domain-specific databases linked to tools (modules) for data analyses and annotations
[Figure: number of genes versus accuracy and resolution]
SEQUENCES (mRNA)
1. Processing and storage of primary data (GB)
   Salmo – 435031
   Oncorhynchus – 290946 (September 7, 2007)
2. Clustering (Unigene, TIGR, GRASP etc)
   Unique sequences (Unigene)
3. Construction of contigs (TIGR, GRASP)
   Contigs are problematic (error prone) due to
   • Duplicated genes
   • Multi-gene families with conserved paralogs
   • Splicing variants
   Unigene deliberately refrains from building contigs
IDENTIFICATION AND ANNOTATION
Blastn / blastx search across nucleotide / protein databases
• Identification fully depends on the reference set
• Search across large databases (e.g. Swissprot, Uniprot) gives best hits to fish proteins with obscure names and poor annotations
• Sequence banks change continuously
• Many genes have multiple names, nomenclature is available only for human
• No rules on how to deal with low similarities and uncertain homologies
Curated inspection is required
Functional annotation (Gene Ontology - GO)
• Transfer from putative homologs
• Problems in GO
• Phylogenetic conservation of function is not guaranteed
Structural annotation, assignment to multi-gene families, search for domains (Interpro, Ensembl)
• Has not been accomplished for salmonids at the genome level (?)
• Blast is insufficient, more sophisticated approaches are required (e.g. Hidden Markov Models – HMM)
GENE INDICES (TIGR, GRASP)
+ The Gene Index provides
  - Blastn across salmon or rainbow trout sequences
  - Sequences and reading frames
  - Links to Genbank
  - Positions of EST in contigs
  - GO annotations
  - Links to pathways
- However
  - Limited possibilities for search and retrieval of data, especially in a multi-gene format
  - Many important annotations are missing
  - Not adapted for comparative genomic studies
  etc...
Shall we wait until developers introduce everything we need?
No database / server is able to meet all requirements
What to do with new sequences (EST)?
Our first experience in development of software
(written by Petri Pehkonen, a student of the computer science department, as a practical exercise)
- Stand-alone blast operates with user-specified sequence databases
- Parser, forms for selection of matches, export of data and iterative searches
- Database (MS Access) is used for Gene Index
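The parser and match-selection step listed above can be illustrated with a minimal sketch. It assumes BLAST tabular output with the 12 standard tab-separated columns, which the original software may not have used; the function names, the E-value threshold, and the sample rows (including the accession-like identifiers) are hypothetical.

```python
from collections import namedtuple

# Subset of the 12 standard tabular columns that this sketch uses.
Hit = namedtuple("Hit", "query subject identity length evalue bitscore")

def parse_tabular(lines):
    """Parse BLAST tabular rows (12 tab-separated standard columns)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]),
                        float(f[10]), float(f[11])))
    return hits

def best_hits(hits, max_evalue=1e-10):
    """Keep the highest-scoring hit per query among hits passing the E-value cutoff."""
    best = {}
    for h in hits:
        if h.evalue > max_evalue:
            continue  # weak similarity: left for manual inspection
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return best

# Hypothetical rows, invented for illustration only.
SAMPLE = [
    "est001\tP12345\t78.5\t200\t43\t0\t1\t200\t10\t209\t1e-50\t180.0",
    "est001\tQ99999\t60.0\t150\t60\t0\t1\t150\t5\t154\t1e-20\t95.0",
    "est002\tP54321\t35.0\t80\t52\t0\t1\t80\t3\t82\t0.5\t30.1",
]
selected = best_hits(parse_tabular(SAMPLE))
```

Queries with no hit under the cutoff (est002 here) drop out of the result, which is where curated inspection of low similarities would start.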
A small and simple effort in software development helped to resolve many problems
• Analysis, structural and functional annotations of EST
• Search for members of multi-gene families and functional classes
• Design of microarrays
• Linking to web databases, e.g. Harvester knowledge base
Much more can be done if several groups join efforts!
What do we need? Personal opinion
Diverse databases / pipelines (adapted for research areas, projects, labs, personal preferences)
MUST be designed by biologists
Only USERS know what data and analyses they need
Design of a database is a time-consuming task
The rest is done easily by computer people
Toolkit for common use
Basic sequence analysis tools adapted for database (blast, sequence alignment, translation, synonymous / non-synonymous substitutions etc)
Parsers
Interfaces and forms to launch applications, to import, query, format and export data
Task requires joint effort
AREA FOR COLLABORATION: IMMUNOGENOMICS
A simple question: what immune genes have been (not) identified in salmonid fish?
- Interferon-dependent genes?
- Immunoglobulins?
- Homologs to surface antigens (CD)?
A simple solution:
- Retrieve all immune proteins by linking GO to Swissprot / Uniprot (use SRS at EBI)
SRS is a database of databases, an extremely useful and powerful resource
IMMUNOGENOMICS
A task:
- Retrieve all immune proteins by linking GO to Swissprot / Uniprot (use SRS at EBI)
- Run blast
False positives:
- Uncertain homology
- Errors in annotations
False negatives:
- Many important genes with immune functions are not annotated in GO
- Many fish homologs are not recognized with blast due to low sequence conservation, more powerful methods are required (e.g. HMM)
To identify fish immune genes we must compile a set of reference genes and use advanced methods of sequence analysis
To use results we need more precise and extensive annotations (Immune Ontology), description of each gene
From EST to microarrays
Genome-wide platforms (GRASP, TRAITS)
+ Good for screening and search for markers
- Lack of spot replicates means low accuracy
- High quality annotation of large numbers of genes is VERY problematic
Medium-size, specialized platforms (e.g. 1.8 K immunochip)
- Many important genes are missing, especially those with unknown functions
+ Spot replicates ensure high sensitivity and accuracy
+ Coverage of functions that are most important for a research area
+ High quality annotations are feasible
JVI considers manuscripts that include microarrays and similar parallel profiling analyses of viral or cellular gene expression. However, such manuscripts will be published only if they provide novel insight into the biology of the virus or the infected cell or if they form the basis for additional experiments that provide such insights
JVI, guide for authors, 2007
The most important and challenging task is to make sense of gene expression data
STANDARD MICROARRAY MANUSCRIPT INCLUDES
- Lists of genes divided into clusters
- GO analyses (enriched / depleted classes in the list and / or clusters)
- Genes placed on the maps of pathways
- Such results are produced easily and do not have any value per se
- Papers without important biological findings should not be published
- Statistics / data mining is a useful subsidiary tool
- Researchers should not be confined to any particular method or software
- Results are always noisy and should be taken with great caution
Fish_Chips.exe
Fish_Chips includes:
- Simple ontology of samples (~ 300)
- Easy and flexible querying by experiments, genes, parameters (expression ratio, log-ER, ranks etc)
To find pros and cons of data mining procedures it is important to have samples as a database with strong and flexible querying.
Solution: relational database with utilities for data querying and analyses
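The idea of keeping samples in a relational database with flexible querying can be sketched as follows. This is not the Fish_Chips schema, which is not described here; the table layout, column names, sample categories and expression values are all assumptions invented for illustration, using an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical three-table layout: samples, genes, and one log-ER value per pair.
conn.executescript("""
CREATE TABLE samples (id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE genes (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE expression (
    sample_id INTEGER REFERENCES samples(id),
    gene_id INTEGER REFERENCES genes(id),
    log_er REAL
);
""")
conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", [
    (1, "liver_infected_1", "infection"),
    (2, "liver_infected_2", "infection"),
    (3, "liver_estrogen_1", "toxicity"),
])
conn.executemany("INSERT INTO genes VALUES (?, ?)", [
    (1, "Complement component C3"),
    (2, "Hemopexin"),
])
# Invented log expression ratios (sample_id, gene_id, log_er).
conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", [
    (1, 1, 1.2), (2, 1, 0.8), (3, 1, -0.1),
    (1, 2, 0.3), (2, 2, 0.5), (3, 2, -0.9),
])

def mean_log_er(conn, category, min_abs=0.0):
    """Mean log-ER per gene within one sample category, filtered by magnitude."""
    return conn.execute("""
        SELECT g.name, AVG(e.log_er) AS mean_ler
        FROM expression e
        JOIN samples s ON s.id = e.sample_id
        JOIN genes g ON g.id = e.gene_id
        WHERE s.category = ?
        GROUP BY g.id
        HAVING ABS(AVG(e.log_er)) >= ?
        ORDER BY mean_ler DESC
    """, (category, min_abs)).fetchall()

result = mean_log_er(conn, "infection", min_abs=0.5)
```

Selecting by sample category, gene and expression threshold in one query is the kind of flexibility that makes it possible to re-run the same data mining procedure on different subsets of samples.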
Fish_Chips also includes:
- Sequences, annotations (GO, KEGG), links to web databases
- Built-in statistical analyses (comparison by GO classes, t-test)
- Direct link to Statistica (ANOVA, Fisher’s exact test, mean expression profiles)
- Formatting of data for external applications (cluster analysis)
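As an illustration of testing whether a GO class is over-represented in a gene list, the sketch below computes a one-sided Fisher's exact test from the hypergeometric distribution. It is an independent example, not the procedure built into Fish_Chips or Statistica; the function name and the counts in the usage line are hypothetical.

```python
from math import comb

def enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test for over-representation.

    k: genes of the class in the selected list
    n: size of the selected list
    K: genes of the class on the whole platform
    N: all genes on the platform
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    total = comb(N, n)
    upper = min(n, K)
    # math.comb returns 0 when more items are requested than exist,
    # so impossible tables contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

# Hypothetical counts: 12 of 100 selected genes belong to a class
# that has 60 members on a platform of 1800 genes.
p_value = enrichment_p(12, 100, 60, 1800)
```

A small p-value only says that the class is over-represented in the list; whether that has biological meaning still has to be judged from the genes themselves.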
Cluster analysis – why? Lack of choice
In microarray analyses the number of measurements is greater than the number of samples
Cluster analysis is one of the few methods that can work with such data
Extremely simple, entirely formal – no theory behind
Many technical and biological problems
Members of a gene cluster are co-regulated. Is that true?!
- Clusters are found in any data set
- Different procedures produce different clusters, clusters must be checked for strength
- Similar expression profiles are often observed only in small data sets, most clusters are destroyed by addition of samples
- Even strong clusters do not necessarily have biological significance
Example: Genes that were up- or down-regulated in only one outlier form a strong and highly significant cluster
Classification of samples. Clustering helps to see the structure of experiment and interaction of factors
[Figure: clustering of study groups; labels in order: T, LE, ILE, IHE, I, HE]
Separate and combined effects of estrogen and parasitic kidney disease on hepatic gene expression in rainbow trout (Helmut Segner, Univ. Bern)
Infection and infection + estrogen
Estrogen
When two challenges are combined, the response to the pathogen is greater
Euclidean distance metric, Ward’s method (available only in statistical packages)
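To show what Ward's method with Euclidean distance does when classifying samples, here is a deliberately naive sketch. At each step it merges the two clusters whose union gives the smallest increase in within-cluster sum of squares. The group names follow the study groups on the slide, but the profile values are invented for illustration, and a statistical package would do this far more efficiently.

```python
def ward_clusters(profiles, n_clusters):
    """Agglomerative clustering with Ward's criterion.

    profiles: dict mapping sample name -> list of expression values.
    Returns a list of sets of sample names.
    """
    # Each cluster is (member names, centroid).
    clusters = [({name}, list(vec)) for name, vec in profiles.items()]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                na, nb = len(ma), len(mb)
                sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
                # Increase in within-cluster sum of squares if a and b merge.
                cost = na * nb / (na + nb) * sq_dist
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        na, nb = len(ma), len(mb)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        rest = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters = rest + [(ma | mb, centroid)]
    return [members for members, _ in clusters]

# Invented profiles (three genes per study group), for illustration only.
profiles = {
    "T":   [0.0, 0.1, 0.0],
    "LE":  [0.1, 0.0, -0.2],
    "HE":  [0.2, 0.1, -0.5],
    "I":   [1.0, 0.9, 0.1],
    "ILE": [1.2, 1.0, 0.0],
    "IHE": [1.5, 1.3, -0.3],
}
groups = ward_clusters(profiles, 2)
```

With these invented values the infected groups (I, ILE, IHE) separate from the uninfected ones (T, LE, HE); real data need not split this cleanly.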
[Figure, panel C: Log (Expression ratio) by study group (T, LE, HE, I, ILE, IHE)]
Beta E2 = -0.199
Beta PKD = -0.630
Genes in panel C:
- Complement component C3
- Complement component C5
- Complement component C9
- Complement factor Bf-1
- Complement factor H
- Properdin
- C type lectin receptor B
- Toll-like receptor 20a
- Ceruloplasmin
- Endothelial leukocyte adhesion molecu
- Ig kappa chain V-IV region B17-2
- Profilin
- CC chemokine SCYA110-2
- Adenosine kinase 2
- Acute phase protein
- G1/S-specific cyclin D2
- Histone deacetylase 4
- Fibroblast growth factor-20
- Bone morphogenetic protein 8-like
- Metallothionein-IL
- Stress 70 protein chaperone
- Thioredoxin-like protein 4A
[Figure, panel D: Log (Expression ratio) by study group (T, LE, HE, I, ILE, IHE)]
Beta E2 = -0.741
Beta PKD = 0.125
Genes in panel D:
- Bcl2-associated X protein
- C3a anaphylatoxin chemotactic recepto
- CC chemokine SCYA110-1
- Cytokine inducible SH2-containing prote
- Egl nine homolog 2
- Hemopexin
- Ig kappa chain V-I region WEA
- Interferon-related regulator 2-1
- Liver-expressed antimicrobial peptide 2
- Membrane-type mosaic serine protease
- Myelin basic protein-1
- NAD-dependent deacetylase sirtuin 5
- Peptidyl-prolyl cis-trans isomerase 2-2
- Semaphorin 7A
Induced with parasite, no response to estrogen
Induced with parasite, suppressed with estrogen
Finding transcription modules enhances resolution of analyses
- Differentially expressed genes were clustered
- Cluster members were checked for correlation to mean expression profile (r > 0.7)
- Multiple regression evaluated effects of factors (p < 0.05) and their interaction
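The correlation check on cluster members can be sketched as below. Only the r > 0.7 threshold comes from the slide; the function names and the expression values are hypothetical, the mean profile here includes the gene being tested (an assumption), and the multiple regression step is not covered.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no correlation information
    return sxy / sqrt(sxx * syy)

def filter_cluster(cluster, r_min=0.7):
    """Keep genes whose profile correlates with the cluster mean profile above r_min."""
    names = list(cluster)
    n_groups = len(cluster[names[0]])
    mean_profile = [sum(cluster[g][i] for g in names) / len(names)
                    for i in range(n_groups)]
    kept = {g: p for g, p in cluster.items() if pearson(p, mean_profile) > r_min}
    return kept, mean_profile

# Invented log expression ratios across six study groups, for illustration only.
cluster = {
    "gene_a": [0.0, 0.1, 0.2, 1.0, 1.1, 1.3],
    "gene_b": [0.1, 0.0, 0.1, 0.8, 1.0, 1.2],
    "gene_c": [0.0, 0.2, 0.1, 0.9, 0.9, 1.4],
    "gene_d": [0.5, -0.4, 0.6, -0.1, 0.3, -0.5],
}
module, mean_profile = filter_cluster(cluster)
```

Here gene_d is dropped because its profile does not follow the cluster mean, leaving a tighter module for the regression on experimental factors.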