Post on 15-Jan-2016
In the name of God, the Most Gracious, the Most Merciful
Lecturer: Dr. Farkhondeh Poursina, PhD
poursina@med.mui.ac.ir
1392
Using NCBI Resources for Gene Discovery
National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health
http://www.ncbi.nlm.nih.gov/
PRIMARY BIOLOGICAL DATABASES
Nucleic acid & Protein
EMBL (European Molecular Biology Laboratory)
DDBJ (DNA Data Bank of Japan)
GenBank (NCBI, the National Center for Biotechnology Information)
EMBL/GenBank/DDBJ
These three databases contain mainly the same information (with a few differences in format).
They serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from:
– Genome projects and sequencing centers
– Individual scientists
Non-confidential data are exchanged daily.
Currently: 2.5 × 10^7 sequences, over 3.2 × 10^10 bp, from more than 50,000 different species.
THE ‘PERFECT’ DATABASE
Comprehensive, but easy to search.
Annotated, but not “too annotated”.
A simple, easy to understand structure.
Cross-referenced.
Minimum redundancy.
Easy retrieval of data.
THE NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION
Created in 1988 as a part of the National Library of Medicine at NIH (National Institutes of Health)
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
Bethesda, MD
WEB ACCESS: WWW.NCBI.NLM.NIH.GOV
New homepage; common footer; new pages!
TYPES OF MOLECULAR (SEQUENCE) DATABASES AT NCBI
Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
– Examples: GenBank, Trace, SRA, SNP, GEO
Derivative Databases
– Derived from primary data
– Curated/expert review: compilation and correction of data, with content controlled by a third party (NCBI)
– Examples: NCBI Protein, RefSeq, RefSNP, UniGene, HomoloGene, Structure, Conserved Domain
PRIMARY VS. DERIVATIVE SEQUENCE DATABASES
[Figure: sequencing centers and labs submit sequence fragments to GenBank, which is updated ONLY by submitters. Algorithms and curators derive UniGene, RefSeq, and Genome Assembly from those records; these are updated continually by NCBI.]
THE PROBLEM
Rapidly growing databases with complex and changing relationships
Rapidly changing interfaces to match the above
Result: many people don't know
– Where to begin
– Where to click on a Web page
– Why it might be useful to click there
DERIVATIVE SEQUENCE DATABASES
ENTREZ
FINDING RELEVANT INFORMATION IN NCBI DATABASES
YOU CAN SEARCH THE DNA SEQUENCE DATABASE
Retrieve known sequences with Entrez: http://www.ncbi.nlm.nih.gov/Entrez/
Click Nucleotide, then search by:
– Accession number, OR
– Keyword
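The same two kinds of lookup (by keyword and by accession number) can also be expressed as URLs for NCBI's E-utilities web service. The slides only describe the web interface, so the sketch below is an assumption-laden illustration: the helper names are hypothetical, and the code only builds the request URLs without contacting NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service (not mentioned in the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nucleotide", retmax=20):
    """Build an ESearch URL for a keyword search in one Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_fasta_url(accession, db="nucleotide"):
    """Build an EFetch URL that asks for one record as plain-text FASTA."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Keyword search, then retrieval by the accession used later in the slides.
print(esearch_url("MLH1 Homo sapiens"))
print(efetch_fasta_url("U07418"))
```

Opening either URL in a browser should return the search hits or the FASTA record, subject to NCBI's usage guidelines.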
Entrez is internally cross-linked
– DNA and protein sequences are linked to other similar sequences
– Medline citations are linked to other citations that contain similar keywords
– 3-D structures are linked to similar structures
DATABASES CONTAIN MORE THAN JUST DNA & PROTEIN SEQUENCES
Retrieve all sequences for an organism or taxon
Starting with an organism or taxon name...
How to: Download the complete genome for an organism
Starting at the Genomes
How to: Find transcript sequences for a gene
Starting with... A GENE NAME, PRODUCT NAME, OR SYMBOL
How to: Obtain genomic sequence for/near a gene, marker, transcript or protein
Starting with... A GENE NAME OR SYMBOL
ENTREZ TIP: START SEARCHES IN GENE
Other Entrez DBs
HomoloGene
Entrez Protein
Gene
UniGene
BLink
HomoloGene: Gene Neighbors
How to: Display genomic annotation graphically
Starting with... A NUCLEOTIDE RECORD (e.g. NC_000001)
BY APPLYING LIMITS, THERE ARE NOW JUST TWO ENTRIES
Precise Results
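Applying limits amounts to restricting each search term to a specific field. The slide does not say which limits were applied, so the following is only a sketch: it assumes the Entrez field tags `[Gene Name]` and `[Organism]`, and `build_term` is a hypothetical helper that composes the query string.

```python
def build_term(clauses):
    """Join (value, field) pairs into one Entrez query, combined with AND.

    A field of None leaves the value unrestricted (searched in all fields).
    """
    parts = []
    for value, field in clauses:
        # Quote multi-word values so they are treated as a phrase.
        text = f'"{value}"' if " " in value else value
        parts.append(f"{text}[{field}]" if field else text)
    return " AND ".join(parts)


# Unrestricted keyword search: tends to return many loosely related records.
broad = build_term([("MLH1", None)])

# The same keyword limited by field: far fewer, more precise results.
narrow = build_term([("MLH1", "Gene Name"), ("Homo sapiens", "Organism")])

print(broad)
print(narrow)
```

The resulting string can be pasted into the Entrez search box as-is.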
A TRADITIONAL GENBANK RECORD
Locus Field
Molecule Type
GenBank Division
Modification Date
Definition Line
Taxonomy
GI (GenInfo)
Submission Field
Accession No.
Accession Version
Molecular weight
TRADITIONAL GENBANK RECORD
ACCESSION U07418
VERSION U07418.1 GI:466461
Accession
• Stable
• Reportable
• Universal
Version: tracks changes in sequence
GI number: NCBI internal use
The sequence is the data
Coding sequence
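To make the relationship between accession, version, and GI number concrete, here is a minimal sketch that splits the VERSION line shown above into its three parts. The function name and the regular expression are illustrative assumptions, not an official GenBank parser.

```python
import re

# "VERSION <accession>.<version> [GI:<number>]"; the GI part is optional here.
VERSION_LINE = re.compile(r"^VERSION\s+(\w+)\.(\d+)(?:\s+GI:(\d+))?")


def parse_version(line):
    """Split a GenBank VERSION line into accession, version, and GI number."""
    match = VERSION_LINE.match(line)
    if match is None:
        raise ValueError(f"not a VERSION line: {line!r}")
    accession, version, gi = match.groups()
    return {
        "accession": accession,          # stable identifier of the record
        "version": int(version),         # incremented when the sequence changes
        "gi": int(gi) if gi else None,   # NCBI-internal identifier
    }


record = parse_version("VERSION     U07418.1  GI:466461")
print(record)
```

A later revision of the same sequence would keep the accession U07418 but carry a higher version suffix.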
What is an accession number?
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
DNA
– X02775: GenBank genomic DNA sequence
– NT_030059: Genomic contig
– rs7079946: dbSNP (single nucleotide polymorphism)
RNA
– N91759.1: An expressed sequence tag (1 of 170)
– NM_006744: RefSeq DNA sequence (from a transcript)
Protein
– NP_007635: RefSeq protein
– AAC02945: GenBank protein
– Q28369: SwissProt protein
– 1KT7: Protein Data Bank structure record
Feature Table
GenPept Record
Genomic DNA Sequence
FEATURES             Location/Qualifiers
     source          1..2484
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="3"
                     /map="3p22-p23"
     gene            1..2484
                     /gene="MLH1"
     CDS             22..2292
                     /gene="MLH1"
                     /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
                     Number P14242), S. cerevisiae MLH1 (GenBank Accession
                     Number U07187), E. coli MUTL (Swiss-Prot Accession Number
                     P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
                     Number P14161) and Streptococcus pneumoniae (Swiss-Prot
                     Accession Number P14160)"
                     /codon_start=1
                     /product="DNA mismatch repair protein homolog"
                     /protein_id="AAC50285.1"
                     /db_xref="GI:463989"
                     /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
                     TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
                     ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
                     TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS
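The locations in the feature table are 1-based, inclusive coordinates on the record's sequence, so the length of a feature is end minus start plus one. The short sketch below works this out for the CDS above; the helper names are hypothetical, and it handles only simple `start..end` spans (not joins or complements).

```python
def parse_span(location):
    """Turn a simple 'start..end' location into a pair of integers."""
    start_text, end_text = location.split("..")
    return int(start_text), int(end_text)


def span_length(location):
    """Length in nucleotides of a 1-based, inclusive 'start..end' span."""
    start, end = parse_span(location)
    return end - start + 1


record_nt = span_length("1..2484")   # whole mRNA record
cds_nt = span_length("22..2292")     # coding sequence only
codons = cds_nt // 3                 # complete codons within the CDS

print(record_nt, cds_nt, codons)     # prints: 2484 2271 757
```

Assuming the CDS ends with a stop codon, 757 codons correspond to a 756-residue protein; the 21 nucleotides before position 22 lie upstream of the coding sequence.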
GENPEPT: GENBANK CDS TRANSLATIONS
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
REFSEQ
• Reference Sequences
− Nucleotide sequences and protein translation
− Curated by NCBI or NCBI-approved programs
• Difference between GenBank and RefSeq
− GenBank has raw data and duplicated records
− Metadata in GenBank can be incomplete
− RefSeq is annotated, curated and non-redundant
− NCBI takes the best sequences from GenBank and curates them for RefSeq records
SELECTED REFSEQ ACCESSION NUMBERS
mRNAs and Proteins
NM_123456  Curated mRNA
NP_123456  Curated Protein
NR_123456  Curated non-coding RNA
XM_123456  Predicted mRNA
XP_123456  Predicted Protein
XR_123456  Predicted non-coding RNA
Gene Records
NG_123456  Reference Genomic Sequence
Chromosome
NC_123455  Microbial replicons, organelle genomes, human chromosomes
AC_123455  Alternate assemblies
Assemblies
NT_123456  Contig
NW_123456  WGS Supercontig
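Because the two-letter prefix carries the record type, an accession can be classified with a simple lookup. The sketch below encodes only the prefixes listed on this slide; the dictionary and function names are hypothetical, and real RefSeq uses further prefixes not covered here.

```python
# Prefix -> record type, copied from the slide's list of RefSeq accessions.
REFSEQ_PREFIXES = {
    "NM": "curated mRNA",
    "NP": "curated protein",
    "NR": "curated non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "NG": "reference genomic sequence",
    "NC": "microbial replicon, organelle genome, or human chromosome",
    "AC": "alternate assembly",
    "NT": "contig",
    "NW": "WGS supercontig",
}


def describe_refseq(accession):
    """Return the record type for a RefSeq-style accession, or None."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        # No underscore: not in RefSeq format (e.g. a GenBank accession).
        return None
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ("NM_006744", "NP_007635", "NT_030059", "U07418"):
    print(acc, "->", describe_refseq(acc))
```

The last accession, U07418, is a GenBank record and therefore maps to None.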
Over 100,000 nucleotide entries for HIV-1, but only 1 RefSeq
HOW TO SAVE?
Choose FASTA from the Display drop-down menu.
Transform the content of this window into plain text by choosing Text from the drop-down menu located on the far right of the menu bar.
Save the FASTA sequence by using the following protocol:
a. In the Edit menu of your Web browser, click Select All and then click Copy.
b. Open a default Word document and, in the Edit menu of Word, click Paste.
c. Finally, save your document as dUTPaseDNA.txt by choosing the Save as type option text only (*.txt).
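As an alternative to the copy-and-paste protocol, a sequence that is already in memory can be written to a plain-text file directly. This is only a sketch: the description line and the short sequence are made-up placeholders (not the real dUTPase record), and `format_fasta` and `write_fasta` are hypothetical helper names.

```python
import tempfile
from pathlib import Path


def format_fasta(description, sequence, width=70):
    """Return one FASTA record as text, wrapping the sequence at `width`."""
    lines = [">" + description]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


def write_fasta(path, description, sequence, width=70):
    """Write one FASTA record to `path` as plain ASCII text."""
    Path(path).write_text(format_fasta(description, sequence, width),
                          encoding="ascii")


# Placeholder data; in practice the sequence would come from the database.
example_description = "example_id hypothetical dUTPase sequence"
example_sequence = "ATGCCGTTAGC" * 15

with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "dUTPaseDNA.txt"
    write_fasta(target, example_description, example_sequence)
    print(target.read_text(encoding="ascii"))
```

Writing the file this way avoids the formatting characters a word processor can introduce when the text-only option is forgotten.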
FASTA FORMAT DESCRIPTION
• FASTA is a DNA and protein sequence alignment software package first described (as FASTP) by David J. Lipman and William R. Pearson in 1985.
• The FASTA sequence format is popular and commonly used.
• A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length.
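The format rules above translate into a very small reader: a line starting with ">" opens a new record, and every following line is appended to that record's sequence. This is a minimal sketch with a hypothetical function name and made-up example records, not a replacement for an established parser.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (description, sequence)."""
    records = []
    description, chunks = None, []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


# Two made-up records; the second sequence spans two lines.
example = """>seq1 first example
ACGTACGT
>seq2 second example
TTGACA
CGTGA
"""

for name, seq in read_fasta(example):
    print(name, len(seq))
```

Sequence lines are joined without separators, so the line length used when the file was written does not affect the parsed result.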