
STBC2023 – Introduction to Bioinformatics Introduction to Sequence Analysis M. Firdaus Raih Room...

Date post: 22-Dec-2015
Upload: myles-lawrence
Page 1: STBC2023 – Introduction to Bioinformatics Introduction to Sequence Analysis M. Firdaus Raih Room 1166, Bangunan Sains Biologi Phone: 0389215961 Email:

STBC2023 – Introduction to Bioinformatics

Introduction to Sequence Analysis

M. Firdaus Raih
Room 1166, Bangunan Sains Biologi

Phone: 0389215961 Email: [email protected]. 23-01-09-1


Pre-session Questions
• What are nucleic acids?
• What types of nucleic acids are there?
• What functions do nucleic acids have?
• What sort of information do nucleotide sequences carry?
• What can be done with DNA sequences?
• What can be done with RNA sequences?
• Is molecular structure important for RNA sequences?
• What is a sequence alignment?
• What is the relationship of an alignment with regard to biological function?
• Is extracting the encoded information for protein synthesis the only sequence analysis which can be done?


Learning objectives
• Know the basic chemistry of nucleic acids and understand their diverse functions.
• Be able to associate the structure of bio-macromolecules with their function.
• Be able to list potential analyses for nucleic acid sequence data, and the applications of those analyses, based on an understanding of the functions of nucleic acids.
• Be able to formulate a strategy and present the processes involved in the analysis of nucleic acid sequence data.
• Be able to comprehend the basic concepts of sequence alignment in general and of aligning nucleic acids specifically, as well as the relationship between an alignment and a sequence’s biological function.

Nucleic Acids: Chemistry and Molecular Structure
What are nucleic acids?
• Nucleic acids = polymers of nucleotides; 2 types:
  • DNA – deoxyribonucleic acid
  • RNA – ribonucleic acid

What is a nucleotide?
• Nucleotide = nucleoside + 1 phosphate group
• Nucleoside = nitrogenous base + sugar (ribose in RNA, deoxyribose in DNA)


Nucleic Acids: Chemistry and Molecular Structure

What is the basic difference between RNA and DNA (in terms of chemistry)?

[Figure: RNA and DNA shown side by side; the original slide linked to an animation]

Nucleic Acids: Chemistry and Molecular Structure

How can the nucleotide polymer be represented?

5’ ACTG 3’
3’ TGAC 5’

[Figure: base pairing between the two strands]

Seq 1. ACTG
Seq 2. TGAC

What can be done with such sequence data?
How is the analysis related to biological function?

Hydrogen bonded base interactions and base stacking interactions result in stable structures of DNA / RNA.
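As a minimal sketch of the text representation above, the two strands can be held as plain strings and one derived from the other by base pairing. The names `COMPLEMENT`, `complement` and `reverse_complement` are illustrative and not part of the lecture.

```python
# Watson-Crick pairing for DNA: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(seq):
    """Return the paired strand, written in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in seq)


def reverse_complement(seq):
    """Return the paired strand rewritten in its own 5' to 3' direction."""
    return complement(seq)[::-1]


top = "ACTG"                      # Seq 1, written 5' to 3'
print(complement(top))            # TGAC: the lower strand as drawn, 3' to 5'
print(reverse_complement(top))    # CAGT: the same strand read 5' to 3'
```

Note that Seq 2 on the slide (TGAC) is the lower strand as drawn, i.e. read 3’ to 5’; sequence databases conventionally store strands 5’ to 3’.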


Nucleic Acids: Biological Functions

• What is/are the function(s) of DNA?

• What is/are the function(s) of RNA?


Nucleic Acids: Biological Functions

What is the function of DNA?
– Storage of genetic information
– Proteins such as transcription factors also interact directly with DNA as part of regulatory pathways
– Total genetic content of an organism = genome
– Genes are part of genomes
– So… what is a gene?


Nucleic Acids: Biological Functions

What is the function of DNA?
– Storage of hereditary information in genes.
– What is a gene?

While sequencing of the human genome surprised us with how many protein-coding genes there are, it did not fundamentally change our perspective on what a gene is. In contrast, the complex patterns of dispersed regulation and pervasive transcription uncovered by the ENCODE project, together with non-genic conservation and the abundance of noncoding RNA genes, have challenged the notion of the gene. To illustrate this, we review the evolution of operational definitions of a gene over the past century--from the abstract elements of heredity of Mendel and Morgan to the present-day ORFs enumerated in the sequence databanks. We then summarize the current ENCODE findings and provide a computational metaphor for the complexity. Finally, we propose a tentative update to the definition of a gene: A gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products. Our definition side-steps the complexities of regulation and transcription by removing the former altogether from the definition and arguing that final, functional gene products (rather than intermediate transcripts) should be used to group together entities associated with a single gene. It also manifests how integral the concept of biological function is in defining genes.

Nucleic Acids: Biological Functions

What are the functions of RNA?
– Information storage and transfer
  • Genomes of RNA viruses
  • mRNA
– Protein synthesis
  • tRNA
  • Peptidyl transferase
– Catalysis
  • Ribozymes
– Regulatory
  • Small ncRNAs / microRNAs
  • Riboswitches

Also see: The RNA World hypothesis – first coined by Walter Gilbert 1986, Nature

DNA (Genes): From Sequence to Function

How does a gene sequence correlate to biological function?

Let’s first look at:

Information about the amino acid sequence is contained within the nucleic acids sequence.

Is that the only analysis that can be done for DNA sequences?
What other analyses, if any, can be done for DNA sequences?


Potential Analyses for DNA Sequences

What can be done with DNA sequences?
– Genome projects: DNA sequencing data need to be assembled into complete genomes.
– Genes need to be identified / predicted.
– Comparisons of specific nucleotide level variations.
– Identification and analysis of specific nucleotide sequence level motifs and patterns.
– Identification and analysis of polymorphisms.


Potential Analyses for DNA Sequences

What can be done with DNA sequences?
Genome projects: DNA sequencing data need to be assembled into complete genomes.
– Genome sequencing generates fragments of sequence.
– These fragments need to be assembled into genes, chromosomes and finally the complete genome.
– Assembly is done by identifying contiguous sequences (contigs).
– Contigs are basically found by aligning the short DNA sequences to one another and finding where they overlap.
– More on this topic will be covered in the Genomics course in Year 3.
– After the genome is assembled, the genes need to be identified.
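The overlap idea behind contigs can be sketched in a few lines. This is a toy illustration only, assuming error-free reads given in the right order; real assemblers are far more involved. The function names and the three reads are made up for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def merge(a, b, min_len=3):
    """Join two reads on their overlap, or return None if they do not overlap."""
    n = overlap(a, b, min_len)
    return a + b[n:] if n else None


# Hypothetical short reads, already in left-to-right order.
reads = ["ATGCGTAC", "GTACCTGA", "CTGATTAG"]

contig = reads[0]
for read in reads[1:]:
    merged = merge(contig, read)
    if merged is not None:      # skip reads that do not extend the contig
        contig = merged

print(contig)                   # ATGCGTACCTGATTAG
```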


Potential Analyses for DNA Sequences

What can be done with DNA sequences?
From sequence data, genes need to be predicted.
– Several approaches to gene prediction:
  • Searching by signal – analysis of sequence signals which specify a gene.
  • Searching by content – analysis of regions showing compositional bias that has been correlated to coding regions.
  • Homology based prediction – comparison against known gene sequences. [involves sequence alignments]
  • Comparative gene prediction – comparing sequences of interest against anonymous genomic sequences. [involves sequence alignments]
– The prediction of eukaryotic genes from genomic DNA data is appreciably more difficult than that of prokaryotic genes. Why?

– For this session, we will focus on methods which involve sequence alignments.
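Although this session concentrates on alignment-based methods, "searching by signal" is easy to illustrate. The sketch below looks for the simplest possible signals, a start codon (ATG) followed in the same reading frame by a stop codon, on the forward strand only. It assumes an uninterrupted coding sequence and is not a real gene finder; `find_orfs` and `min_codons` are names invented for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each ATG...stop stretch on the forward strand.

    start/end are 0-based slice positions (end includes the stop codon);
    min_codons counts the codons from ATG up to, but not including, the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return orfs


seq = "GGATGAAACCCTTTTAGCC"                    # made-up example sequence
for start, end, frame in find_orfs(seq):
    print(frame, seq[start:end])               # 2 ATGAAACCCTTTTAG
```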


Potential Analyses for DNA Sequences

What can be done with DNA sequences?
– Comparisons of specific nucleotide level variations.
  • Enable differentiation at the individual level or between close relatives, i.e. between strains of the same species.
– Phylogenetic analysis.


Potential Analyses for DNA Sequences

What can be done with DNA sequences?
– Identification and analysis of specific nucleotide sequence level motifs and patterns.
– This will be discussed further in the following lecture.
– Examples:
  • PCR primer design
  • Searching / mapping restriction sites

Go to the corresponding BLAST exercise NOW or proceed to the next slide.
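Searching for restriction sites, one of the examples above, is a plain pattern search. A minimal sketch follows, using the EcoRI recognition sequence GAATTC and a made-up test sequence; `find_sites` is an illustrative name. EcoRI's site is palindromic, so searching one strand is enough; a non-palindromic site would also need a search on the reverse complement.

```python
def find_sites(seq, site):
    """Return the 0-based start position of every occurrence of site in seq."""
    positions = []
    i = seq.find(site)
    while i != -1:
        positions.append(i)
        i = seq.find(site, i + 1)   # step by one so overlapping hits are not missed
    return positions


ECORI = "GAATTC"                    # EcoRI recognition sequence
seq = "TTGAATTCAAGGGAATTCTT"        # hypothetical linear DNA

sites = find_sites(seq, ECORI)
print(sites)                        # [2, 12]
# A linear molecule cut at n sites gives n + 1 fragments (a circular one gives n).
print(len(sites) + 1)               # 3
```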


Potential Analyses for DNA Sequences

What can be done with DNA sequences?
– Identification and analysis of polymorphisms.
– This will be discussed further in the following lecture.
– Examples:
  • SNPs – single nucleotide polymorphisms

Go to the corresponding BLAST exercise NOW or proceed to the next slide.


Sequence Alignments
What is a sequence alignment?
A way of arranging or ‘aligning’ sequences so that their similarities line up.

Examples:

Gaps (-) are inserted to optimize alignments.
They represent ‘indel’ mutations.

Short sequences are easy to align manually. But what about longer sequences? How can those be aligned? To understand this further, let’s look at a method with which we can visualize and track the alignment. This method is called a dot plot.


Sequence Alignments
What is a dot plot?
A plot where two sequences are written along the top row and leftmost column of a two-dimensional matrix, and a dot is placed at any point where the characters in the corresponding row and column match.

Parts of the two sequences where the match is continuous can be traced as a diagonal line; this is a region where the sequences are aligned.

A sequence can be plotted against itself, and regions that share significant similarities will appear as lines off the main diagonal; this can occur when a protein consists of multiple similar structural domains.

A simple dot plot does not account for divergence, i.e. the substitutions/mutations which we know can occur.


Sequence Alignments
Dot plot for two DNA sequences
Complete the dot plot for the two DNA sequences provided below. An example can be seen below.
Seq1: CGATCGCGTAATCGGTGATCGGC
Seq2: CGGTATCGGTGATCGATCGCA

Questions:
1. Which stretch of these sequences can best be aligned to each other?
2. Can this alignment be extended?
3. Can you identify a repetitive sequence of 4 bases which keeps occurring in both sequences?
4. What can you attribute all the other plotted dots to?


Sequence Alignments
Dot plot for two DNA sequences

Questions:

1. Which stretch of these sequences can best be aligned to each other?
Answer: The longest continuous diagonal line from your dot plot: ATCGGTGATCG

2. Can this alignment be extended?
Answer: Yes, it can be extended as shown below. 2 nucleotides are not aligned and may possibly be substitutions.

3. Can you identify a repetitive sequence of 4 bases which keeps occurring in both sequences?
Answer: ATCG; this can be deduced from the repeating short diagonal lines.

4. What can you attribute all the other plotted dots to?
Answer: They are the result of random sequence similarities.
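The dot plot exercise can also be checked with a short script. This is a sketch rather than a standard tool: `dot_plot` builds the match matrix exactly as described (one row per Seq1 base, one column per Seq2 base), and `longest_diagonal` reports the longest unbroken diagonal run. Both function names are invented for the example.

```python
def dot_plot(a, b):
    """Matrix of booleans: cell [i][j] is True where a[i] == b[j]."""
    return [[x == y for y in b] for x in a]


def longest_diagonal(a, b):
    """Return the longest stretch of consecutive matches along any diagonal."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1        # extend the run from the upper-left cell
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


seq1 = "CGATCGCGTAATCGGTGATCGGC"
seq2 = "CGGTATCGGTGATCGATCGCA"

print("  " + seq2)
for base, row in zip(seq1, dot_plot(seq1, seq2)):
    print(base, "".join("*" if match else "." for match in row))

print(longest_diagonal(seq1, seq2))             # ATCGGTGATCG
```

The printed grid shows the long diagonal of answer 1, the short repeated ATCG diagonals of answer 3, and the scattered single dots of answer 4.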


Computational Sequence Alignments
• We’ve looked at manual alignments for short sequences and the dot plot… However, manual alignments cannot be done for lengthy and highly variable sequences. For such sequences, computer aided alignments are needed.
• How can computer aided alignments be done?
• Computer aided alignment uses algorithms called dynamic programming algorithms.
• Two common dynamic programming algorithms approach alignment differently, via:
1. Local alignments: Smith-Waterman algorithm
2. Global alignments: Needleman-Wunsch algorithm


Computational Sequence Alignments
What is the difference between global and local alignments?
• Local alignments (Smith-Waterman algorithm): align the best-matching regions within the sequences.
• Global alignments (Needleman-Wunsch algorithm): align the sequences over their entire lengths.
• The Smith-Waterman algorithm is currently the most used because real biological sequences are usually similar in localized portions and not over their entire lengths.
– Examples:
  • Genes from different organisms with similar exons but different intron structures
  • Proteins that share only certain domains
• Alignments can have gaps, which represent mutations. The ability to add gaps is required as sequences diverge.
• So how do we know that an alignment is meaningful?


Computational Sequence Alignments
How do we know that an alignment is meaningful?
• Insertions and deletions are slow evolutionary processes; therefore the addition of gaps MUST be controlled, otherwise a large proportion of matches could be obtained simply by inserting large numbers of gaps.
• Gap penalties are given to control the addition of gaps. The penalty system can be constant or proportional (to the gap length).
• Scores are given for matches, while penalties are given for the addition of gaps.
• The alignment algorithm then carries out alignments in order to get the best score.
• Like the dot plot, a simple system as above does not seem to fully consider divergence (i.e. point mutations) – only deletions and insertions seem to be considered.
• How can we get around this problem?
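The scoring scheme just described can be sketched as a small Smith-Waterman (local alignment) scorer. The score values (match +1, mismatch -1, gap -2 per position, i.e. a proportional penalty) are arbitrary assumptions chosen for illustration, and the function returns only the best score, not the alignment itself.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score between a and b with a per-position gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(
                0,                  # local alignment: a bad prefix can be dropped
                diagonal,           # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                cur[j - 1] + gap,   # gap in a
            )
            best = max(best, cur[j])
        prev = cur
    return best


# One extra T in the second sequence: 8 matches and one gap give 8 - 2 = 6.
print(smith_waterman("ACGTACGT", "ACGTTACGT"))            # 6
# With a much harsher gap penalty the gap no longer pays off,
# and the best local alignment is the ungapped stretch TACGT.
print(smith_waterman("ACGTACGT", "ACGTTACGT", gap=-10))   # 5
```

The two calls show how the gap penalty controls whether the algorithm is willing to open a gap at all.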


Computational Sequence Alignments
How do we know that an alignment is meaningful? (…cont.)
• Point mutations can result in a change of residue, as opposed to a deletion or insertion.
• A matrix called a substitution matrix can be used to model the possible changes and provide quantitative values for changes arising from point mutations.
• The values for substitution can take into consideration similarity such as physico-chemical properties for amino acids or transition mutations for nucleic acids.
• But there is still a probability that a search result is random, especially for large databases. How can we be certain the alignment achieved is the expected result?

Amino acid substitution matrix example

Nucleic acid substitution matrix example
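Since the matrix figures are not reproduced here, the following is a sketch of what a nucleic acid substitution matrix might look like. It penalizes transitions (purine to purine, or pyrimidine to pyrimidine) less than transversions, reflecting that transitions are the more commonly observed point mutations. The numeric values are assumptions for illustration and are not taken from the slide.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y, match=2, transition=-1, transversion=-2):
    """Score for aligning base x with base y (illustrative values only)."""
    if x == y:
        return match
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


# The full 4 x 4 matrix as a lookup table keyed by base pair.
MATRIX = {(x, y): substitution_score(x, y) for x in "ACGT" for y in "ACGT"}

print("   " + "  ".join("ACGT"))
for x in "ACGT":
    print(x, " ".join(f"{MATRIX[(x, y)]:2d}" for y in "ACGT"))
```

In an alignment algorithm, a lookup into such a matrix replaces the single match/mismatch score.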


Computational Sequence Alignments
How do we know that an alignment is meaningful? (…cont.)
• How can we be certain the alignment achieved is the expected result?
• The alignments produced are statistically evaluated.
• As an example, for the BLAST program, a value called the Expectation (E) value is given.
• The E value is the number of different alignments with scores equivalent to or better than S (the score of the alignment in question) that are expected to occur in a database search by chance.
• The lower the E value, the more significant the score.
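For reference, the form in which this statistic is usually written (the Karlin-Altschul equation, which is not given on the slide) is shown below. Here m is the query length, n is the total length of the database, S is the raw alignment score, and K and λ are constants that depend on the scoring system.

```latex
E = K \, m \, n \, e^{-\lambda S}
```

E falls exponentially as the score S rises and grows in proportion to the database size, which is why chance hits become more likely in large databases.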


Sequence Alignments
What is the rationale in doing an alignment?
• Proteins perform most cellular functions.
• The structure of a protein is an important determinant of its function.
• If proteins share a similar structure, then they may also share a similar function.
• Sequences with around 30% similarity or more generally share a similar fold (Chothia & Lesk 1986).


Sequence Alignments
What is the rationale in doing an alignment?
• If proteins share a similar function, then they may also share a similar structure.

Heme site


Sequence Alignments

What is the rationale in doing an alignment?
• If proteins share a similar structure, then they may also share a similar sequence.
• But our interest here is NUCLEIC ACID sequences…
• So what is the relevance?

Sequence Database Searching
What is a sequence database?
– A collection of biological macromolecular sequences
– Sequences can be organized by organism, protein family, source etc.
– Example: NCBI GenBank

What are we searching for, and how do we search for something in sequence databases?
– We are searching for sequence similarity.
– We can search for sequence similarity by comparing an input (query) sequence against sequences in the database.
– This comparison is done by aligning the query sequence to the database sequences; one tool we can use is BLAST.
– How is this alignment relevant biologically?


Sequence Database Searching
What is BLAST?
– Basic Local Alignment Search Tool
– Implements heuristics to approximate the Smith-Waterman algorithm and search for high scoring alignments.
– The alignment scores are then statistically evaluated – one example is the E value discussed previously.
– BLAST is actually a family of programs.

BLAST

How do we use BLAST?

(1) Select the BLAST program

(2) Input the sequence (query)

(3) Choose the database to search

(4) Choose optional parameters

Then click “BLAST”

BLAST

Is that it?... YES and NO…. Let’s look at some considerations and strategies for BLAST searching.


BLAST

Some considerations and strategies:
– Input sequence and search database – what is it that you’re really interested in? Finding similarity alone or identifying homologs? Finding homologs only, or perhaps trying to find out if genes with similar sequences encode proteins with available structures? The answers to these types of questions influence the type of search program you should use and the database to search in.

Protein vs. nucleotide?


BLAST

Some considerations and strategies:
– Are you interested in something quite specific?


BLAST

Some considerations and strategies:
– Did you forget to turn something on/off?

• Sequence filters – Low-complexity regions have fewer sequence characters in them because of repeats of the same sequence character or pattern. These sequences produce artificially high-scoring alignments that do not accurately convey sequence relationships in sequence similarity searches. Regions of low complexity or repetitive sequences may be readily visualized in a dot matrix analysis of a sequence against itself. Low-complexity regions with a repeat occurrence of the same residue can appear on the matrix as horizontal and vertical rows of dots representing repeated matches of one residue position in one copy of the sequence against a series of the same residue in the second copy. Repeats of a sequence pattern appear in the same matrix as short diagonals of identity that are offset from the main diagonal. Such sequences should be excluded from sequence similarity searches.
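To make "low complexity" concrete, the sketch below flags windows of a sequence whose Shannon entropy is low, i.e. windows dominated by one or two characters. This is a simple stand-in chosen for illustration: BLAST applies its own dedicated filters (such as DUST for nucleotide and SEG for protein queries), and the window size and threshold here are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits per character: 0 for a single repeated letter, 2 for even ACGT usage."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def low_complexity_windows(seq, size=8, threshold=1.0):
    """Start positions of windows whose entropy falls below threshold."""
    return [
        i
        for i in range(len(seq) - size + 1)
        if shannon_entropy(seq[i:i + size]) < threshold
    ]


seq = "AAAAAAAAACGT"    # made-up sequence: a poly-A run followed by mixed bases
print(low_complexity_windows(seq, size=4))   # [0, 1, 2, 3, 4, 5, 6]
```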


BLAST

Some considerations and strategies:
– Did you forget to turn something on/off?

• Options and parameter settings


Output of BLAST Searches
• What are the components of a BLAST search output?
• Example: blastn vs blastx (GenBank AF390557)

This section: overview of the output alignments

blastn

blastx


Output of BLAST Searches
• What are the components of a BLAST search output?
• Example: blastn vs blastx (GenBank AF390557)

This section: list of hits (alignments)

Read more about interpreting the output.

blastn blastx


Output of BLAST Searches
• What are the components of a BLAST search output?
• Example: blastn vs blastx (GenBank AF390557)

This section: the alignments

blastn blastx


Output of BLAST Searches
• To be a significant match, a database sequence that is listed in the program output should have a small E (expect value) and a reasonable alignment with the query sequence (or translations of protein-encoding DNA sequences should have these same features).

• The E of the alignment score between the sequences gives the statistical chance that an unrelated sequence in the database or a random sequence could have achieved such a score with the query sequence, given as many sequences as there are in the database. The smaller the E, the more significant the alignment. A cutoff value in the range of 0.01-0.05 may be used (Pearson 1996). In genome comparisons, a more stringent cutoff score (10^-100 to 10^-20) may be used to find sequences that align very well with the query sequence. However, the alignment should also be examined for absence of repeats of the same residue or residue pattern because these patterns tend to give false high alignment scores.

• Filtering of low-complexity regions from the query sequence in a database search helps to reduce the number of false positives. The alignment should also be examined for reasonable amino acid substitutions and for the appearance of a believable alignment.

• To gain further confidence that the alignment between the query and database sequences is significant, either the query sequence or the matched database sequence may be shuffled many times, and each random sequence may be realigned with the other unshuffled sequence to obtain a score distribution for a set of unrelated sequences. This distribution may then be used to evaluate the significance of the true alignment score.
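The shuffling procedure in the last point can be sketched as follows. It is a toy version under stated assumptions: it uses a small Smith-Waterman scorer with made-up score values (+1 match, -1 mismatch, -2 per gap position) rather than BLAST's own scoring, and it reports the fraction of shuffled sequences that score at least as well as the real one. `local_score` and `shuffle_test` are illustrative names.

```python
import random


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, per-position gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diagonal, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def shuffle_test(query, subject, trials=200, seed=0):
    """Return (true score, estimated chance of reaching it with a shuffled subject)."""
    rng = random.Random(seed)           # fixed seed so the sketch is repeatable
    true_score = local_score(query, subject)
    letters = list(subject)
    as_good = 0
    for _ in range(trials):
        rng.shuffle(letters)            # same composition, order destroyed
        if local_score(query, "".join(letters)) >= true_score:
            as_good += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return true_score, (as_good + 1) / (trials + 1)


score, p = shuffle_test("CGATCGCGTAATCGGTGATCGGC", "CGGTATCGGTGATCGATCGCA")
print(score, p)
```

A small estimated fraction suggests the real score is unlikely to arise from sequence composition alone.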



BLAST

Carrying out a BLAST search:

1. Select and copy the sequence from the GenBank database.

2. Go to the BLAST page and carry out database searches using the above sequence.
– First carry out a search against a nucleotide database.
  • Which BLAST programs can you use? Name two possibilities.
– Next carry out a search against a protein database.
  • Which BLAST program should you use?
  • (i) Can you further narrow down the search? (ii) Also take for example if you were to search for genes which code for proteins which have representative 3D structures; how would you conduct such a search?


BLAST

Answers to questions on carrying out a BLAST search:

• First carry out a search against a nucleotide database.
– Which BLAST programs can you use? Name two possibilities.
Answer: blastn and tblastx. tblastn is not a correct answer: although it searches a nucleotide database, it requires a protein query, and the input sequence AF390557 is a DNA sequence.

• Next carry out a search against a protein database.
– Which BLAST program should you use?
Answer: blastx

– (i) Can you further narrow down the search? (ii) Also, suppose you wanted to search for genes which code for proteins which have representative 3D structures; how would you conduct such a search?

Answer: (i) Yes, searches returning a very large number of hits can still be narrowed down. A carefully annotated protein sequence database (e.g., PIR, SwissProt) will provide a more manageable output list of matched sequences, and these proteins have probably been observed in the laboratory; i.e., the genes do produce a protein product in cells. However, investigators may also wish to expand the search to include predicted genes from gene annotations of genomic sequences that are frequently entered into the DNA sequence translation databases (e.g., DNA sequences in the GenBank DNA sequence databases automatically translated into protein sequences and placed in the GenPept protein sequence database). To compare a protein or predicted protein sequence to EST sequences, the ESTs should be translated into all six possible reading frames. (ii) Such a search can be carried out by choosing PDB as the database option. This will limit the blastx search to only protein sequences which have known 3D structures in the PDB.
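The choice of BLAST program in the answers above follows from the type of the query and the type of the database. A minimal lookup sketch is given below; the dictionary and the function name `blast_programs` are made up for this illustration and are not part of any BLAST software.

```python
# Standard BLAST programs keyed by (query type, database type).
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "protein"): ["blastp"],
    ("protein", "nucleotide"): ["tblastn"],
}


def blast_programs(query_type, database_type):
    """Return the BLAST programs suited to a query/database combination."""
    key = (query_type, database_type)
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"unknown combination: {key}")
    return BLAST_PROGRAMS[key]


if __name__ == "__main__":
    # A DNA query such as AF390557 searched against a protein database:
    print(blast_programs("nucleotide", "protein"))
```

tblastx and blastx translate the nucleotide query in all six reading frames before comparison, which is why they can be used with a DNA query.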

Page 57: STBC2023 – Introduction to Bioinformatics Introduction to Sequence Analysis M. Firdaus Raih Room 1166, Bangunan Sains Biologi Phone: 0389215961 Email:

BLAST

Carrying out a BLAST search:

Retrieve the sequence provided and use it for your BLAST search. See the GenBank page here for the sequence. Change the format of the view to FASTA by selecting FASTA from the dropdown menu marked ‘Display’ (see here). Use this sequence for a BLAST search.

Questions

- Identify the sequence which is used. What is this DNA usually used for? (Answer)

- Search for suitable primers to use for PCR. Which program can you use? (Answer)

- Identify restriction sites which can be found on this DNA. How many fragments will a digestion with the restriction enzyme BsaI generate? In order to answer this question, you will need to draw on any general web skills you already have to find the appropriate resources. BLAST is not the tool to use in such a case. (Answer)

Page 58: STBC2023 – Introduction to Bioinformatics Introduction to Sequence Analysis M. Firdaus Raih Room 1166, Bangunan Sains Biologi Phone: 0389215961 Email:

BLAST

Carrying out a BLAST search:

Questions

- Identify the sequence which is used. What is this DNA usually used for?

Answer: The pBR322 plasmid. It is used as a cloning vector for protein (IG-lambda) expression.

- Search for suitable primers to use for PCR. Which program can you use? What is the largest product size from a possible primer pair found using a default search? Answer: The Primer-BLAST program can be used. The largest possible product is 986 bp.

- Identify restriction sites which can be found on this DNA. How many fragments will a digestion with the restriction enzyme BsaI generate? Answer: One tool which can be used is NEBcutter. Cutting the pBR322 sequence with BsaI will generate 3 fragments of DNA due to cleavage at 2 sites, treating the sequence as a linear molecule (a circular plasmid cut at 2 sites would yield 2 fragments).
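The relationship between the number of restriction sites and the number of fragments can be checked with a short script. This is a simplified sketch and not how NEBcutter works internally: the function names are hypothetical, only the BsaI recognition sequence GGTCTC (searched on both strands) is assumed, and sites that span the origin of a circular sequence are not detected. BsaI cuts a few bases outside its recognition sequence, but this does not change the fragment count.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(seq, site="GGTCTC"):
    """0-based start positions of a recognition site on either strand."""
    seq = seq.upper()
    positions = set()
    for pattern in {site, revcomp(site)}:
        start = seq.find(pattern)
        while start != -1:
            positions.add(start)
            start = seq.find(pattern, start + 1)
    return sorted(positions)


def fragment_count(n_sites, circular):
    """Number of fragments produced by a complete digestion."""
    if n_sites == 0:
        return 1  # uncut molecule
    # Each cut adds one fragment to a linear molecule; the first cut
    # of a circular molecule only linearizes it.
    return n_sites if circular else n_sites + 1


if __name__ == "__main__":
    demo = "AAGGTCTCAAAAGAGACCTT"  # made-up sequence with two sites
    sites = find_sites(demo)
    print(sites, fragment_count(len(sites), circular=False),
          fragment_count(len(sites), circular=True))
```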

Page 59: STBC2023 – Introduction to Bioinformatics Introduction to Sequence Analysis M. Firdaus Raih Room 1166, Bangunan Sains Biologi Phone: 0389215961 Email:

SNPs

• A SNP (single nucleotide polymorphism, pronounced 'snip') is a DNA sequence variation which occurs when a single nucleotide (A, T, C, or G) in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual). SNPs comprise the largest known class of human genetic variation.

• SNPs may occur:
– within coding sequences of genes,
– in non-coding regions of genes, or
– in the intergenic regions between genes.

• SNPs within a coding sequence will not necessarily change the amino acid sequence of the protein that is produced, due to the degeneracy of the genetic code (refer to the codon table discussed earlier). Such changes are silent (synonymous) mutations.

• Non-synonymous changes can result in:
– Missense change: a different amino acid is coded
– Nonsense change: a premature STOP codon

• Why are SNPs important? If the changes result in non-functional gene products or no gene products, a diseased state may be the end result.

• How can we find SNPs? Methods of discovering SNPs in sequence data: the easiest and most used method is to align two sequences from the DNA of two individuals and look for high quality sequence differences.
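The points above can be combined in a small sketch: two already aligned coding sequences are compared column by column, and each single-base difference is classified as synonymous, missense or nonsense using the standard genetic code. The function names and the example sequences are made up for illustration, the reading frame is assumed to start at position 0, and sequence quality is not considered.

```python
import itertools

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in itertools.product(BASES, repeat=3)]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def find_snps(ref, alt):
    """Single-base differences between two aligned sequences."""
    if len(ref) != len(alt):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, r, a) for i, (r, a) in enumerate(zip(ref, alt))
            if r != a and "-" not in (r, a)]  # skip gap columns


def classify_snp(ref_cds, position, alt_base):
    """Effect of one substitution on the codon containing it."""
    start = position - position % 3
    ref_codon = ref_cds[start:start + 3]
    offset = position - start
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    return "missense"


if __name__ == "__main__":
    ref = "ATGGCATGGGAA"  # M A W E (hypothetical coding sequence)
    alt = "ATGGCGTGAAAA"
    for pos, r, a in find_snps(ref, alt):
        print(pos, r, a, classify_snp(ref, pos, a))
```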


Page 60: STBC2023 – Introduction to Bioinformatics Introduction to Sequence Analysis M. Firdaus Raih Room 1166, Bangunan Sains Biologi Phone: 0389215961 Email:

BLAST

Carrying out a BLAST search:

1. Select and copy the sequence from this link.

2. Go to the BLAST page and carry out a search for SNPs on the above sequence.
– Observe the output. How is it different from previous BLAST searches you have carried out? Correlate the output to what you know about SNPs.

Page 61: STBC2023 – Introduction to Bioinformatics Introduction to Sequence Analysis M. Firdaus Raih Room 1166, Bangunan Sains Biologi Phone: 0389215961 Email:

Ribonucleic Acids

• RNA molecules play crucial roles in molecular biology.

• Known functions include:
– Information storage
– Catalysis
– Regulatory roles
– Protein synthesis

• The diversity of functions is associated with the ‘RNA World’ hypothesis

• Potential applications
– Molecular scaffolding (nanotechnology)
– Drug targets (riboswitches/ribosomes)
– RNA interference (RNAi)

The Economist, June 16th-22nd 2007

Page 62: STBC2023 – Introduction to Bioinformatics Introduction to Sequence Analysis M. Firdaus Raih Room 1166, Bangunan Sains Biologi Phone: 0389215961 Email:

RNA : From Sequence to Function

What is a crucial determinant of functionality for functional RNAs?

Page 63: STBC2023 – Introduction to Bioinformatics Introduction to Sequence Analysis M. Firdaus Raih Room 1166, Bangunan Sains Biologi Phone: 0389215961 Email:

RNA : From Sequence to Function

What is a crucial determinant of functionality for functional RNAs?

For functional RNAs, like for proteins, the 3D structure is crucial for biological function.

Page 64: STBC2023 – Introduction to Bioinformatics Introduction to Sequence Analysis M. Firdaus Raih Room 1166, Bangunan Sains Biologi Phone: 0389215961 Email:

RNA Structure

What are the major factors involved in stabilizing the structure of RNA?

– Base stacking and hydrogen bonding contribute to the stabilization of nucleic acid structure, including RNA structure.

– RNA bases can form hydrogen bonds with each other, resulting in:

• complementary pairings in the canonical Watson-Crick interactions

• non-canonical interactions

– Hydrogen-bonded base interactions are therefore crucial elements of a nucleic acid’s 3D structure.

Page 65: STBC2023 – Introduction to Bioinformatics Introduction to Sequence Analysis M. Firdaus Raih Room 1166, Bangunan Sains Biologi Phone: 0389215961 Email:

RNA Base Interactions

eg. Purine-pyrimidine base pairs (10)

after I. Tinoco, Jr. In Appendix 1 of: “The RNA World” (R. F. Gesteland, J. F. Atkins, Eds.), Cold Spring Harbor Laboratory Press, 1993, pp. 603-607.

+

32 pairs

Page 66: STBC2023 – Introduction to Bioinformatics Introduction to Sequence Analysis M. Firdaus Raih Room 1166, Bangunan Sains Biologi Phone: 0389215961 Email:

RNA Structure

• Base stacking and hydrogen bonding contribute to the stabilization of nucleic acid structure/ RNA structure.

• RNA bases can form hydrogen bonds with each other, resulting in:
– complementary pairings in the canonical* Watson-Crick interactions
– non-canonical interactions

• Hydrogen-bonded base interactions are therefore crucial elements of a nucleic acid’s 3D structure.

• 3 levels of RNA structure:
– Primary sequence, secondary structure, tertiary structure.

*… ‘canonical’ derives from the Greek kanon (‘rule’), a root shared by the Arabic word Qanun; in context here it is better read as ‘rule’ as opposed to the literal meaning of ‘law’.

Page 67: STBC2023 – Introduction to Bioinformatics Introduction to Sequence Analysis M. Firdaus Raih Room 1166, Bangunan Sains Biologi Phone: 0389215961 Email:

RNA Structure

How do we get from sequence to structure?

How can we predict the structure of RNA?

Page 68: STBC2023 – Introduction to Bioinformatics Introduction to Sequence Analysis M. Firdaus Raih Room 1166, Bangunan Sains Biologi Phone: 0389215961 Email:

RNA Structure

How do we get from sequence to structure?

• Complex (non-helical) RNA structures are not easy to predict. Reliable structural information is sourced from X-ray crystal structures.

• Commonly, only the secondary structure level interactions are predicted to give some insights into what the functional structure may look like.

• However such methods lack the detail which an actual structure model is able to give, such as the exact orientation of bases and specific atomic interactions which are occurring.

• Such interaction data is important because we know that RNA bases can be involved in non-canonical interactions which are different from the canonical Watson-Crick interactions.

How can we predict the secondary structure of RNA?

• Several programs which calculate the thermodynamics of folding (energies of the base interactions) can be used.

• One such program is mfold by Michael Zuker.

• The reliability of a prediction can be assessed using multiple alignments and comparisons to other predictions and known structures.
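To give a feel for how a secondary structure can be computed from sequence alone, the sketch below uses the Nussinov base-pair maximization algorithm. This is a deliberately simpler stand-in and is not the method used by mfold, which minimizes a thermodynamic free energy; here every allowed pair (AU, GC, GU) simply counts as 1, and the minimum hairpin loop length of 3 is an assumption of this example.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"),
         ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_loop=3):
    """Return (max number of pairs, one optimal dot-bracket structure)."""
    n = len(seq)
    dp = [[0] * n for _ in range(n)]  # dp[i][j]: best pair count in seq[i..j]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case: base j is unpaired
            for k in range(i, j - min_loop):  # case: base j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best

    # Traceback to recover one structure with the maximum pair count.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if dp[i][j] == dp[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = dp[i][k - 1] if k > i else 0
                if dp[i][j] == left + 1 + dp[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return (dp[0][n - 1] if n else 0), "".join(structure)


if __name__ == "__main__":
    print(nussinov("GGGAAAUCCC"))
```

Several structures may share the same maximum pair count, so the dot-bracket string returned is only one of the possible optima.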

Page 69: STBC2023 – Introduction to Bioinformatics Introduction to Sequence Analysis M. Firdaus Raih Room 1166, Bangunan Sains Biologi Phone: 0389215961 Email:

RNA Secondary Structure Prediction – the mfold program

Predicting the secondary structure of non-coding RNA

• Copy the sequence here as input for the mfold program. All other parameters can be left at default settings.

• Questions:

– How many paired bases are you able to observe in the predicted structure? (Answer)

– How many bases are unpaired? (Answer)

– Name the two types of structures where these unpaired bases can be found. What type of secondary structure do you think can be observed for regions with canonical Watson-Crick base pairing? (Answer)

– Are you able to observe any base pairings which are non-canonical (non Watson-Crick)? If yes, how many? (Answer)

– Having answered the previous two questions, are you really able to differentiate a canonical vs a non-canonical pairing from the secondary structure diagram alone? (Answer)

Page 70: STBC2023 – Introduction to Bioinformatics Introduction to Sequence Analysis M. Firdaus Raih Room 1166, Bangunan Sains Biologi Phone: 0389215961 Email:

RNA Secondary Structure Prediction – the mfold program

Predicting the secondary structure of non-coding RNA

– How many paired bases are you able to observe in the predicted structure?
Answer: 29 pairs, i.e. 58 paired bases.

– How many bases are unpaired?
Answer: 27

– Name the two types of structures where these unpaired bases can be found. What type of secondary structure do you think can be observed for regions with canonical Watson-Crick base pairing?
Answer: Unpaired bases are found in bulges and loops. Regions with canonical Watson-Crick pairings are most likely helical.

– Are you able to observe any base pairings which are non-canonical (non Watson-Crick)? If yes, how many?
Answer: 4

– Having answered the previous two questions, are you really able to differentiate a canonical vs a non-canonical pairing from the secondary structure diagram alone?
Answer: No, not really. A GU base pair is obviously non-canonical, but the diagram cannot show whether a GC or AU base pair is actually formed through a non-canonical interaction; this cannot be determined from the secondary structure alone.
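Counts like those in the answers above can also be obtained from a structure written in dot-bracket notation, where matching parentheses mark paired bases and dots mark unpaired bases. The sketch below assumes that such a string is available for the prediction; the function names and the short example are made up for illustration. As noted in the last answer, only GU pairs can be recognised as non-canonical from this level of information.

```python
def pair_table(structure):
    """List of (i, j) index pairs from a dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def summarize(seq, structure):
    """Count pairs, paired bases, unpaired bases and GU pairs."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure lengths differ")
    pairs = pair_table(structure)
    gu = sum(1 for i, j in pairs if {seq[i], seq[j]} == {"G", "U"})
    return {
        "pairs": len(pairs),
        "paired_bases": 2 * len(pairs),
        "unpaired_bases": len(seq) - 2 * len(pairs),
        "gu_pairs": gu,
    }


if __name__ == "__main__":
    print(summarize("GGGAAAUCCU", "(((....)))"))
```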

Page 71: STBC2023 – Introduction to Bioinformatics Introduction to Sequence Analysis M. Firdaus Raih Room 1166, Bangunan Sains Biologi Phone: 0389215961 Email:

Analyses for RNA sequence data

Is predicting the secondary structure the only analysis we can do for RNA sequence data?

Page 72: STBC2023 – Introduction to Bioinformatics Introduction to Sequence Analysis M. Firdaus Raih Room 1166, Bangunan Sains Biologi Phone: 0389215961 Email:

Analyses for RNA sequence data

Is predicting the secondary structure the only analysis we can do for RNA sequence data?

– NO.

– Genomic data can be analysed for the presence of the numerous types of known non-coding or functional RNA as well as possibly novel or yet to be discovered functional RNA sequences.

– This is appreciably more difficult than the problem of predicting genes. Why?

– Currently there are no widely used or general use methods.

– Such investigations are still highly exploratory and currently remain in the domain of experts in the field.

Page 73: STBC2023 – Introduction to Bioinformatics Introduction to Sequence Analysis M. Firdaus Raih Room 1166, Bangunan Sains Biologi Phone: 0389215961 Email:

Post-session Questions

• What are nucleic acids?
• What types of nucleic acids are there?
• What functions do nucleic acids have?
• What sort of information do nucleotide sequences carry?
• What can be done with DNA sequences?
• What can be done with RNA sequences?
• Is molecular structure important for RNA sequences?
• What is a sequence alignment?
• What is the relationship of an alignment with regard to biological function?
• Is extracting the encoded information for protein synthesis the only sequence analysis which can be done?

Page 74: STBC2023 – Introduction to Bioinformatics Introduction to Sequence Analysis M. Firdaus Raih Room 1166, Bangunan Sains Biologi Phone: 0389215961 Email:

Self Study and Self Assessment

• The self study module for this series of lectures on analyses of nucleotide sequences is available for download from SPIN. The file (this file) is in PowerPoint Show (.pps) format.

• The self assessment quiz is accessible from within the SPIN interface.

• Both of these materials are for self assessment and self study use and DO NOT contribute to your final grades for this course.

• Also explore the references and texts listed in the course information file and reading list.

• Explore resources made available via the self-study material.

Page 75: STBC2023 – Introduction to Bioinformatics Introduction to Sequence Analysis M. Firdaus Raih Room 1166, Bangunan Sains Biologi Phone: 0389215961 Email:

Further Reading

Recommended Textbook (Lesk, 2nd Ed.)

• Basics – Chapter 1
– Pages 1-59

• Sequence alignments – Chapter 5, Chapter 1
– Pages 242-270
– Pages 21-59

Other Textbooks

• Baxevanis & Ouellette, 3rd edition
– Chapters 5-7

• Pevsner

