In the name of God, the Most Gracious, the Most Merciful


Lecturer: Dr. Farkhondeh Poursina, PhD

poursina@med.mui.ac.ir

Using NCBI Resources for Gene Discovery

National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health

http://www.ncbi.nlm.nih.gov/

PRIMARY BIOLOGICAL DATABASES

Nucleic acid & Protein

EMBL (European Molecular Biology Laboratory)

DDBJ (DNA Data Bank of Japan)

GenBank (NCBI, the National Center for Biotechnology Information)

EMBL/GenBank/DDBJ: these three databases contain essentially the same information (with a few differences in format).

They serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from:

• Genome projects and sequencing centers
• Individual scientists

Non-confidential data are exchanged daily.

Currently: 2.5 × 10^7 sequences, over 3.2 × 10^10 bp; sequences from more than 50,000 different species.

THE ‘PERFECT’ DATABASE

Comprehensive, but easy to search.

Annotated, but not “too annotated”.

A simple, easy to understand structure.

Cross-referenced.

Minimum redundancy.

Easy retrieval of data.

THE NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION

Created in 1988 as a part of the National Library of Medicine at NIH (National Institutes of Health)

– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information

Bethesda, MD

WEB ACCESS: WWW.NCBI.NLM.NIH.GOV

[Screenshot: new NCBI homepage, with a common footer and new pages]

TYPES OF MOLECULAR DATABASES(SEQUENCE) AT NCBI

Primary Databases

• Original submissions by experimentalists
• Content controlled by the submitter
• Examples: GenBank, Trace, SRA, SNP, GEO

Derivative Databases

• Derived from primary data
• Curated/expert review: compilation and correction of data
• Content controlled by a third party (NCBI)
• Examples: NCBI Protein, RefSeq, RefSNP, UniGene, HomoloGene, Structure, Conserved Domain

PRIMARY VS. DERIVATIVE SEQUENCE DATABASES

[Figure: sequencing centers and labs submit sequences to GenBank, which is updated ONLY by submitters. Algorithms (UniGene), curators (RefSeq), and genome assembly produce derivative databases, which are updated continually by NCBI.]

THE PROBLEM

Rapidly growing databases with complex and changing relationships

Rapidly changing interfaces to match the above

Result: many people don't know

• Where to begin
• Where to click on a Web page
• Why it might be useful to click there

DERIVATIVE SEQUENCE DATABASES

ENTREZ

FINDING RELEVANT INFORMATION IN NCBI DATABASES

YOU CAN SEARCH DNA SEQUENCE DATABASES

Retrieve known sequences with Entrez: http://www.ncbi.nlm.nih.gov/Entrez/

Click Nucleotide, then search by:

• Accession number, or
• Keyword
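The two search routes above (accession number or keyword) can also be expressed as request URLs. The sketch below is not from the slides: it assumes NCBI's E-utilities endpoints (`efetch.fcgi` and `esearch.fcgi`) rather than the Entrez web page, and the function names, the `nuccore` database choice, and the example query are illustrative. It only builds the URLs and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (an assumption; the slides use the web form).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession: str, db: str = "nuccore", rettype: str = "fasta") -> str:
    """Build a URL that retrieves one record by accession number."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def esearch_url(term: str, db: str = "nuccore", retmax: int = 20) -> str:
    """Build a URL that runs a keyword search and returns matching record IDs."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Accession route: the GenBank record shown later in this lecture.
    print(efetch_url("U07418.1"))
    # Keyword route: an illustrative query, not one taken from the slides.
    print(esearch_url("MLH1[Gene Name] AND Homo sapiens[Organism]"))
```

Either URL could then be opened in a browser or passed to `urllib.request.urlopen`, keeping NCBI's usage guidelines on request frequency in mind.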

Entrez is internally cross-linked:

• DNA and protein sequences are linked to other similar sequences
• Medline citations are linked to other citations that contain similar keywords
• 3-D structures are linked to similar structures

DATABASES CONTAIN MORE THAN JUST DNA & PROTEIN SEQUENCES

How to: Retrieve all sequences for an organism or taxon

Starting with... AN ORGANISM OR TAXON NAME

How to: Download the complete genome for an organism

Starting at the Genomes

How to: Find transcript sequences for a gene

Starting with... A GENE NAME, PRODUCT NAME, OR SYMBOL

How to: Obtain genomic sequence for/near a gene, marker, transcript or protein

Starting with... A GENE NAME OR SYMBOL

ENTREZ TIP: START SEARCHES IN GENE

Other Entrez DBs

HomoloGene

Entrez Protein

UniGene

HomoloGene: Gene Neighbors

How to: Display genomic annotation graphically

Starting with... A NUCLEOTIDE RECORD (e.g. NC_000001)

BY APPLYING LIMITS, THERE ARE NOW JUST TWO ENTRIES

Precise Results

A TRADITIONAL GENBANK RECORD

Locus Field

Molecule Type

GenBank Division

Modification Date

Definition Line

Taxonomy

GI (GenInfo)

Submission Field

Accession No.

Accession Version

Molecular weight

TRADITIONAL GENBANK RECORD

ACCESSION U07418

VERSION U07418.1 GI:466461

Accession

• Stable
• Reportable
• Universal

Version: tracks changes in sequence

GI number: NCBI internal use

The sequence is the data

Coding sequence
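To make the accession/version distinction concrete, here is a minimal sketch that splits an identifier such as U07418.1 into its stable accession and its version number. The function and type names are hypothetical, and the sketch assumes the simple `ACCESSION.VERSION` layout shown on the slide.

```python
from typing import NamedTuple, Optional


class VersionedAccession(NamedTuple):
    accession: str           # stable part, e.g. "U07418"
    version: Optional[int]   # changes when the sequence changes; None if absent


def split_accession_version(identifier: str) -> VersionedAccession:
    """Split 'U07418.1' into ('U07418', 1); a bare accession has version None."""
    accession, dot, version = identifier.strip().partition(".")
    if not accession:
        raise ValueError(f"no accession found in {identifier!r}")
    if dot and not version.isdigit():
        raise ValueError(f"version must be a whole number in {identifier!r}")
    return VersionedAccession(accession, int(version) if dot else None)


def same_record(a: str, b: str) -> bool:
    """True when two identifiers share an accession, whatever their versions."""
    return split_accession_version(a).accession == split_accession_version(b).accession


if __name__ == "__main__":
    print(split_accession_version("U07418.1"))
    print(same_record("U07418.1", "U07418"))
```

Comparing only the accession part is what makes the accession "stable": a hypothetical U07418.2 would still refer to the same record, with a revised sequence.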

What is an accession number?

An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.

Examples (all for retinol-binding protein, RBP4):

X02775       GenBank genomic DNA sequence
NT_030059    Genomic contig
rs7079946    dbSNP (single nucleotide polymorphism)
N91759.1     An expressed sequence tag (1 of 170)
NM_006744    RefSeq DNA sequence (from a transcript)
NP_007635    RefSeq protein
AAC02945     GenBank protein
Q28369       SwissProt protein
1KT7         Protein Data Bank structure record

Feature Table

GenPept Record

Genomic DNA Sequence

FEATURES             Location/Qualifiers
     source          1..2484
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="3"
                     /map="3p22-p23"
     gene            1..2484
                     /gene="MLH1"
     CDS             22..2292
                     /gene="MLH1"
                     /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
                     Number P14242), S. cerevisiae MLH1 (GenBank Accession
                     Number U07187), E. coli MUTL (Swiss-Prot Accession Number
                     P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
                     Number P14161) and Streptococcus pneumoniae (Swiss-Prot
                     Accession Number P14160)"
                     /codon_start=1
                     /product="DNA mismatch repair protein homolog"
                     /protein_id="AAC50285.1"
                     /db_xref="GI:463989"
                     /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
                     TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
                     ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
                     TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS
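The CDS location in the feature table (22..2292, with /codon_start=1) determines how long the coding region and its translation are. The sketch below works through that arithmetic. It handles only simple `start..end` locations, the function names are hypothetical, and it assumes the CDS is complete and ends with a stop codon.

```python
import re
from typing import Dict, Tuple


def parse_simple_location(location: str) -> Tuple[int, int]:
    """Parse a plain 'start..end' location (1-based, inclusive) into two integers."""
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location.strip())
    if match is None:
        raise ValueError(f"only simple start..end locations are handled: {location!r}")
    start, end = int(match.group(1)), int(match.group(2))
    if start > end:
        raise ValueError(f"start is after end in {location!r}")
    return start, end


def cds_summary(location: str, codon_start: int = 1) -> Dict[str, int]:
    """Count nucleotides, codons, and residues for a CDS feature.

    Assumption: the last codon is a stop codon, so it adds no residue.
    """
    start, end = parse_simple_location(location)
    nucleotides = end - start + 1 - (codon_start - 1)
    codons = nucleotides // 3
    return {"nucleotides": nucleotides, "codons": codons, "residues": codons - 1}


if __name__ == "__main__":
    # Values from the MLH1 record above: CDS 22..2292, /codon_start=1
    print(cds_summary("22..2292"))
    # {'nucleotides': 2271, 'codons': 757, 'residues': 756}
```

Real records can also use joined or complemented locations, which this sketch deliberately rejects rather than guessing.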

GENPEPT: GENBANK CDS TRANSLATIONS

>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

REFSEQ

• Reference Sequences
− Nucleotide sequences and protein translations
− Curated by NCBI or NCBI-approved programs

• Difference between GenBank and RefSeq
− GenBank has raw data and duplicated records
− Metadata in GenBank can be incomplete
− RefSeq is annotated, curated and non-redundant
− NCBI takes the best sequences from GenBank and curates them for RefSeq records

SELECTED REFSEQ ACCESSION NUMBERS

mRNAs and Proteins

NM_123456    Curated mRNA
NP_123456    Curated protein
NR_123456    Curated non-coding RNA
XM_123456    Predicted mRNA
XP_123456    Predicted protein
XR_123456    Predicted non-coding RNA

Gene Records

NG_123456    Reference genomic sequence

Chromosome

NC_123455    Microbial replicons, organelle genomes, human chromosomes
AC_123455    Alternate assemblies

Assemblies

NT_123456    Contig
NW_123456    WGS supercontig
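Because the two-letter prefix carries the record type, the table above can be turned into a small lookup. This is a sketch only: the dictionary repeats just the prefixes listed on the slide, and the function name is hypothetical.

```python
from typing import Optional

# Prefix meanings copied from the slide; other RefSeq prefixes are not covered here.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference genomic sequence",
    "NC": "Microbial replicons, organelle genomes, human chromosomes",
    "AC": "Alternate assemblies",
    "NT": "Contig",
    "NW": "WGS supercontig",
}


def describe_refseq(accession: str) -> Optional[str]:
    """Return the record type for a RefSeq-style accession, or None if unrecognized."""
    prefix, underscore, _rest = accession.strip().partition("_")
    if not underscore:
        return None  # no underscore, so not in the RefSeq 'XX_number' layout
    return REFSEQ_PREFIXES.get(prefix.upper())


if __name__ == "__main__":
    for acc in ("NM_006744", "NP_007635", "NT_030059", "U07418"):
        print(acc, "->", describe_refseq(acc))
```

The underscore check is what separates RefSeq accessions from GenBank ones such as U07418, which have no underscore.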

Over 100,000 nucleotide entries for HIV-1, but only 1 RefSeq

HOW TO SAVE?

• Choose FASTA from the Display drop-down menu.
• Transform the content of this window into plain text by choosing Text from the drop-down menu located on the far right of the menu bar.
• Save the FASTA sequence by using the following protocol:

a. In the Edit menu of your Web browser, click Select All and then click Copy.
b. Open a default Word document and, in the Edit menu of Word, click Paste.
c. Finally, save your document as dUTPaseDNA.txt by choosing the Save as type option text only (*.txt).
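The copy-and-paste protocol above ends with a plain-text file. As an alternative sketch, assuming the description and sequence are already available as strings, the same file could be written from a script. The function names, the 70-character line width, and the example record are illustrative choices, not part of the lecture.

```python
from pathlib import Path


def format_fasta(description: str, sequence: str, width: int = 70) -> str:
    """Return one FASTA record as text: a '>' description line, then wrapped sequence."""
    cleaned = "".join(sequence.split()).upper()  # drop spaces and line breaks
    chunks = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return "\n".join([f">{description}"] + chunks) + "\n"


def save_fasta(path: str, description: str, sequence: str, width: int = 70) -> Path:
    """Write one FASTA record to a plain-text file and return its path."""
    target = Path(path)
    target.write_text(format_fasta(description, sequence, width), encoding="utf-8")
    return target


if __name__ == "__main__":
    # Hypothetical description and sequence, used only to show the call.
    saved = save_fasta("dUTPaseDNA.txt", "example_record placeholder", "atgc" * 50)
    print(saved.read_text(encoding="utf-8"))
```

Writing the file directly avoids the word processor step, where saving in the wrong format is easy to do by accident.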

FASTA FORMAT DESCRIPTION

• FASTA is a DNA and protein sequence alignment software package first described (as FASTP) by David J. Lipman and William R. Pearson in 1985.
• The FASTA format is popular and commonly used.
• A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length.
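The format description above translates directly into a small reader: a line starting with ">" opens a new record, and every following line is sequence data until the next ">". The sketch below is a minimal illustration with a hypothetical function name; the two records in the example are made up.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from lines of FASTA-formatted text."""
    description: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:].strip()
            chunks = []
        elif description is None:
            raise ValueError("sequence data found before the first '>' description line")
        else:
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


if __name__ == "__main__":
    # Made-up records, used only to show the layout.
    example = ">seq1 hypothetical example\nACGT\nTTGA\n>seq2\nGGCC\n"
    for desc, seq in parse_fasta(example.splitlines()):
        print(desc, len(seq), seq)
```

Joining the sequence lines is what makes the 80-character recommendation harmless: line breaks inside a record carry no meaning.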
