
Computer Storage of Sequences CSE730: Seminar on “Information Retrieval of Biomedical Text and...

Date post: 21-Jan-2016
Page 1: Computer Storage of Sequences CSE730: Seminar on “Information Retrieval of Biomedical Text and Data” (Chapter 2 of Bioinformatics: Sequence and Genome.

Computer Storage of Sequences

CSE730: Seminar on “Information Retrieval of Biomedical Text and Data”

(Chapter 2 of Bioinformatics: Sequence and Genome Analysis by David W. Mount)


Outline

Storing DNA and protein sequences in computer files or databases.

Sequence data formats, which place related information in the database alongside the sequence.

Online public-access databases for sequence retrieval.


Nucleotide Sequence

Nomenclature Committee of the International Union of Biochemistry

Code  Nucleic acid(s)
A     Adenine
C     Cytosine
G     Guanine
T     Thymine
U     Uracil
M     A or C (amino)
R     A or G (purine)
W     A or T (weak)
S     C or G (strong)
Y     C or T (pyrimidine)
K     G or T (keto)
V     A or C or G
H     A or C or T
D     A or G or T
B     C or G or T
N     A or G or C or T (any)
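The ambiguity codes in the table above lend themselves to a simple lookup. The following is a minimal Python sketch, not part of the original slides: the names `IUPAC_NUCLEOTIDES`, `to_regex`, and `matches` are hypothetical, and the approach (turning each ambiguity code into a regular-expression character class) is just one possible way to use the table.

```python
import re

# IUPAC nucleotide codes from the table, mapped to the bases they stand for.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def to_regex(pattern: str) -> str:
    """Turn a sequence containing ambiguity codes into a regular expression."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC_NUCLEOTIDES[code]  # raises KeyError for a non-IUPAC character
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def matches(pattern: str, sequence: str) -> bool:
    """Return True if the concrete sequence is consistent with the ambiguous pattern."""
    return re.fullmatch(to_regex(pattern), sequence.upper()) is not None
```

For example, the pattern GAYTN accepts GACTA (Y allows C or T) but rejects GAGTA.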


Protein Sequence

Code  Amino acid
A     Alanine
B     Aspartic acid or asparagine
C     Cysteine
D     Aspartic acid
E     Glutamic acid
F     Phenylalanine
G     Glycine
H     Histidine
I     Isoleucine
K     Lysine
L     Leucine
M     Methionine
N     Asparagine
P     Proline
Q     Glutamine
R     Arginine
S     Serine
T     Threonine
V     Valine
W     Tryptophan
X     Unknown
Y     Tyrosine
Z     Glutamic acid or glutamine

Adapted from IUPAC-IUB (1969, 1972, 1983)


Sequence Formats

A sequence is stored as ASCII text (i.e. a string of A, G, C, T…) along with annotations.

Different sequence analysis programs recognize different sequence formats.

A sequence format includes accessory information (gene names, source organism, investigator name, references) and the actual sequence.


Sequence Formats (continued)

  FASTA
  GenBank Flat File format
  PIR/CODATA format
  EMBL sequence entry format
  Intelligenetics sequence entry format
  GCG (Genetics Computer Group) sequence entry format
  ASN.1
  XML
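FASTA is the simplest of the formats listed above: each record is a header line beginning with ">" followed by one or more lines of sequence. The sketch below is an illustration rather than material from the slides; `parse_fasta` is a hypothetical helper name and the two records are made-up examples, not real database entries.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records into a dict of header text -> sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading ">"
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence may span several lines
    return records


# Made-up records for illustration only.
example = """>seq1 hypothetical example record
ACGTACGTAC
GGTTAA
>seq2 another made-up record
TTGACA
"""

records = parse_fasta(example.splitlines())
```

Lines that appear before the first header are ignored in this sketch; a stricter reader might treat them as an error.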


Databases

NCBI

GenBank at the National Center for Biotechnology Information (NCBI), National Library of Medicine, Bethesda, MD

NBRF

Protein Information Resource (PIR) database at the National Biomedical Research Foundation in Washington, DC


Databases (continued)

SwissProt

The SwissProt protein sequence database at ISREC, Swiss Institute for Experimental Cancer Research.

EMBL

European Molecular Biology Laboratory (EMBL) Outstation at Hinxton, England

DDBJ

DNA DataBank of Japan (DDBJ) at Mishima, Japan


Databases on Internet

NCBI: http://www.ncbi.nlm.nih.gov

PIR: http://www-nbrf.georgetown.edu/pirwww

SwissProt: http://www.expasy.ch/cgi-bin/sprot-search-de

EMBL: http://www.ebi.ac.uk/embl/index.html

DDBJ: http://www.ddbj.nig.ac.jp/


NCBI

National resource for molecular biology information.

Maintains comprehensive databases for a variety of biotechnology-related information.

Develops and manages access to a range of databases and software for the scientific and medical communities.


NCBI : Integrated Databases

Literature Databases:
  PubMed
  PubMed Central
  OMIM
  PROW
  BookShelf


NCBI : Integrated Databases (continued)

Nucleotide Databases:
  GenBank
  EST Database
  GSS Database
  SNPs Database
  RefSeq
  STS Database


NCBI : Integrated Databases (continued)

Entrez Databases:
  PubMed
  Protein Sequence Database
  Nucleotide Sequence Database
  Taxonomy
  OMIM


GenBank

GenBank is the NIH genetic sequence database.

Annotated collection of all publicly available DNA sequences.

GenBank is a part of an international collaboration of sequence databases along with EMBL and DDBJ.


GenBank DNA Sequence Format

A DNA sequence entry in GenBank is organized into distinct fields, as follows:

Locus: locus name, sequence length, division, date

Definition: description of entry

Accession: unique accession number

Version: version of sequence

Keywords: keywords for cross referencing


GenBank DNA Sequence Format (continued)

Source: source organism of DNA

Organism: description of organism

References: authors, title, journal, Medline, etc.

Features: information about sequence

Base count: number of bases in sequence

Origin: the sequence data begin on the line following Origin.

GenBank sample
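The field layout described above can be read with a few lines of code. The sketch below is a simplified illustration, not a full GenBank parser: `parse_genbank` is a hypothetical helper, the record is made up for illustration (it is not a real GenBank entry), and it assumes that top-level keywords start in column 1 with their values beginning at column 13. Indented sub-keywords such as ORGANISM are simply folded into the parent field.

```python
def parse_genbank(text: str) -> dict:
    """Read top-level fields and the sequence from one GenBank flat-file record."""
    fields = {}
    key = None
    in_sequence = False
    bases = []
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_sequence:
            # Sequence lines carry a position number and space-separated blocks.
            bases.append("".join(ch for ch in line if ch.isalpha()))
        elif line and not line[0].isspace():
            key = line[:12].strip()           # keyword occupies the first 12 columns
            fields[key] = line[12:].strip()
            in_sequence = key == "ORIGIN"     # sequence data follow ORIGIN
        elif key is not None and line.strip():
            fields[key] += " " + line.strip()  # continuation or sub-keyword line
    fields["SEQUENCE"] = "".join(bases).upper()
    return fields


# Made-up record for illustration only; not a real database entry.
record = """\
LOCUS       EXAMPLE01                 20 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up record for illustration only.
ACCESSION   EXAMPLE01
VERSION     EXAMPLE01.1
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
ORIGIN
        1 acgtacgtac ggttaacctg
//
"""

entry = parse_genbank(record)
```

Here `entry["ACCESSION"]` holds the accession string and `entry["SEQUENCE"]` holds the 20 bases with numbering and spacing stripped.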


NCBI : Tools

Tools for Data Retrieval and Submission:
  Text Term Searching
  Sequence Similarity Searching
  Taxonomy Searching
  Sequence Submission


NCBI : ENTREZ

Entrez is a search and retrieval system that integrates information from databases at NCBI.

These databases include nucleotide sequences, protein sequences, macromolecular structures, whole genomes, MEDLINE, PubMed, etc.

Entrez
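Entrez can also be queried from a program. The slides do not cover this, so the following is only a sketch under stated assumptions: it builds a request URL for NCBI's E-utilities `efetch` service using the `db`, `id`, `rettype`, and `retmode` parameters. The helper name `efetch_url` and the placeholder accession are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (an assumption not taken from the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db: str, record_id: str, rettype: str = "fasta") -> str:
    """Build an efetch URL that asks for one record as plain text."""
    query = urlencode(
        {"db": db, "id": record_id, "rettype": rettype, "retmode": "text"}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


# "ACCESSION_HERE" is a placeholder, not a real accession number.
url = efetch_url("nucleotide", "ACCESSION_HERE")
```

The resulting string could be passed to any HTTP client; NCBI's usage guidelines for request rates would apply to real queries.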


NCBI : BLAST

BLAST: Basic Local Alignment Search Tool

It is a set of similarity search programs designed to explore available sequence databases.

It uses a heuristic algorithm that can detect relationships among sequences that share only isolated regions of similarity.

Q-BLAST: a queuing system for BLAST that lets users retrieve results at their convenience and format their results.


NCBI : BLAST (continued)

Access to BLAST service:
  Web BLAST
  Standalone BLAST
  Network BLAST
  BLAST URL API


NCBI : BLAST (continued)

BLAST Programs:
  Blastp: compares an amino acid query sequence against a protein sequence database
  Blastn: compares a nucleotide query sequence against a nucleotide sequence database
  Blastx: compares a nucleotide query sequence against a protein sequence database
  Tblastn: compares a protein query sequence against a nucleotide sequence database

BLAST
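The four programs above differ only in the type of the query and the type of the database, so the choice can be written as a small lookup. This is an illustrative sketch; `BLAST_PROGRAMS` and `choose_blast_program` are hypothetical names, and the table covers only the programs listed on the slide.

```python
# (query type, database type) -> BLAST program, as listed on the slide.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type: str, database_type: str) -> str:
    """Pick the BLAST program for a given query type and database type."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"no program listed for {query_type} query vs {database_type} database"
        ) from None
```

For instance, a nucleotide query searched against a protein database maps to blastx, which translates the query before comparing it.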


NBRF: PIR

Protein Information Resource

3 Major Databases:
  PSD (Protein Sequence Database)
  iProClass
  PIR-NREF (Non-redundant REFerence protein database)


PIR: PSD

The PIR, in collaboration with MIPS and JIPID, produces and distributes the PIR-International Protein Sequence Database (PSD).

Comprehensive and expertly annotated protein sequence database.

The primary sources of PSD data are sequences from GenBank/EMBL/DDBJ translations, published literature, and direct submission to PIR-International.


PIR: PSD (continued)

The PIR-PSD data is available in XML format and NBRF, PIR/CODATA formats. The sequence file is available in FASTA format.

Also available at PIR UNIX FTP server. Address: ftp://ftp.pir.georgetown.edu/pir_databases/psd/


CODATA format

CODATA format has approximately the same information as a GenBank or EMBL sequence file, but is slightly differently formatted and has different field names.

Also called PIR format, used by NBRF.

CODATA Sample


PIR: iProClass

The iProClass database provides comprehensive descriptions of all proteins and serves as a framework for data integration in a distributed networking environment.

Very user-friendly description.


PIR: NREF (Non-redundant REFerence protein database)

Comprehensive: Containing all sequences from PIR-PSD, Swiss-Prot, TrEMBL, RefSeq, GenPept, and updated bi-weekly.

Non-Redundant: Clustered by sequence identity and taxonomy at the species level.

Source Attribution: Containing protein IDs and names from associated databases (with hypertext links), in addition to protein sequence, taxonomy, and bibliography.

The current version (July 2002) consists of more than 809,000 non-redundant PIR-PSD, SwissProt and TrEMBL proteins organized with more than 36,200 PIR superfamilies, 145,340 families, and links to over 50 molecular biology databases.
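The non-redundancy and source-attribution ideas above can be illustrated with a toy grouping step. This is a sketch under a strong simplifying assumption: "sequence identity" is taken to mean exactly identical sequence strings within one species, which may not match how PIR-NREF actually clusters entries. The function name and the sample entries are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (source database, protein ID, species, sequence)
Entry = Tuple[str, str, str, str]


def cluster_non_redundant(entries: List[Entry]) -> Dict[Tuple[str, str], List[Tuple[str, str]]]:
    """Group entries that share species and identical sequence, keeping source IDs."""
    clusters = defaultdict(list)
    for database, protein_id, species, sequence in entries:
        clusters[(species, sequence.upper())].append((database, protein_id))
    return dict(clusters)


# Made-up entries for illustration only.
entries = [
    ("PIR-PSD", "ID_A", "Species one", "MKTAYIAK"),
    ("Swiss-Prot", "ID_B", "Species one", "mktayiak"),
    ("TrEMBL", "ID_C", "Species two", "MKTAYIAK"),
]

clusters = cluster_non_redundant(entries)
```

The first two entries collapse into one cluster that still records both source IDs, while the third stays separate because it comes from a different species.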


Swiss-Prot

Swiss-Prot is a protein knowledgebase established in 1986.

Maintained collaboratively by the Department of Medical Biochemistry of the University of Geneva (now the Swiss Institute of Bioinformatics) and the EMBL Data Library.

Swiss-Prot Sequence Entry Example


Sequence Format Conversion

READSEQ: Sequence format conversion program. http://bimas.dcrt.nih.gov/molbio/readseq/

Can convert to/from:
  ASN.1
  FASTA
  CODATA
  GCG
  EMBL format
  GenBank format
  and many other formats
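To give a sense of what such a conversion involves, here is a minimal sketch that turns one GenBank-style record into FASTA. It is not how READSEQ works internally; `genbank_to_fasta` is a hypothetical helper, the record is made up, and the code assumes field values start at column 13 and that the sequence follows the ORIGIN line.

```python
def genbank_to_fasta(record: str, width: int = 60) -> str:
    """Convert one GenBank-style flat-file record to a FASTA record."""
    accession, definition, bases = "", "", []
    in_sequence = False
    for line in record.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_sequence:
            # Drop position numbers and spaces, keep only the letters.
            bases.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ACCESSION"):
            accession = line[12:].split()[0]
        elif line.startswith("DEFINITION"):
            definition = line[12:].strip()
        elif line.startswith("ORIGIN"):
            in_sequence = True
    sequence = "".join(bases).upper()
    # Wrap the sequence to a fixed line width.
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{accession} {definition}\n" + "\n".join(chunks) + "\n"


# Made-up record for illustration only; not a real database entry.
record = """\
LOCUS       EXAMPLE01                 20 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up record for illustration only.
ACCESSION   EXAMPLE01
ORIGIN
        1 acgtacgtac ggttaacctg
//
"""

fasta = genbank_to_fasta(record, width=10)
```

The annotation that FASTA cannot represent (locus details, references, features) is simply dropped, which is why conversions between rich and simple formats are lossy in one direction.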


References

http://www.ncbi.nlm.nih.gov

http://www-nbrf.georgetown.edu/pirwww

http://www.expasy.ch/cgi-bin/sprot-search-de

http://www.ebi.ac.uk/embl/index.html

http://www.ddbj.nig.ac.jp/


Thank You

Presented by: Hemal Patel & Jeetal Shah

