
In the Name of God, the Most Gracious, the Most Merciful

Date post: 15-Jan-2016
Page 1

In the Name of God, the Most Gracious, the Most Merciful

Page 2

Lecturer: Dr. Farkhondeh Poursina, PhD

[email protected]

1392

Using NCBI Resources for Gene Discovery

National Center for Biotechnology Information (NCBI)
National Library of Medicine
National Institutes of Health

http://www.ncbi.nlm.nih.gov/

Page 3

PRIMARY BIOLOGICAL DATABASES

Nucleic acid & Protein

EMBL (European Molecular Biology Laboratory)

DDBJ (DNA Data Bank of Japan)

GenBank (NCBI, The National Center for Biotechnology Information)

Page 4

EMBL/GENBANK/DDBJ

These 3 databases contain mainly the same information (few differences in the format).

They serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from:
– Genome projects and sequencing centers
– Individual scientists

Non-confidential data are exchanged daily.

Currently: 2.5 × 10^7 sequences, over 3.2 × 10^10 bp; sequences from > 50,000 different species.

Page 5

THE ‘PERFECT’ DATABASE

Comprehensive, but easy to search.

Annotated, but not “too annotated”.

A simple, easy to understand structure.

Cross-referenced.

Minimum redundancy.

Easy retrieval of data.

Page 6

THE NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION

Created in 1988 as a part of the National Library of Medicine at NIH (National Institutes of Health)

– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information

Bethesda, MD

Page 7

WEB ACCESS: WWW.NCBI.NLM.NIH.GOV

[Screenshot of the NCBI website; callouts: New Homepage, Common footer, New pages!]

Page 8

TYPES OF MOLECULAR DATABASES (SEQUENCE) AT NCBI

Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
– Examples: GenBank, Trace, SRA, SNP, GEO

Derivative Databases
– Derived from primary data
– Curated/expert review: compilation and correction of data
– Content controlled by a third party (NCBI)
– Examples: NCBI Protein, RefSeq, RefSNP, UniGene, HomoloGene, Structure, Conserved Domain

Page 9

PRIMARY VS. DERIVATIVE SEQUENCE DATABASES

[Figure: Sequencing centers and labs submit sequences to GenBank, which is updated ONLY by submitters. From GenBank, algorithms produce UniGene, curators produce RefSeq, and a genome assembly is built; these derivative resources are updated continually by NCBI.]

Page 10

THE PROBLEM

Rapidly growing databases with complex and changing relationships

Rapidly changing interfaces to match the above

Result: many people don’t know
– Where to begin
– Where to click on a Web page
– Why it might be useful to click there

Page 11

DERIVATIVE SEQUENCE DATABASES

Page 12

ENTREZ

FINDING RELEVANT INFORMATION IN NCBI DATABASES

Page 13

YOU CAN SEARCH THE DNA SEQUENCE DATABASE

Retrieve known sequences with Entrez: http://www.ncbi.nlm.nih.gov/Entrez/

Click Nucleotide, then search by:
– Accession number, OR
– Keyword search
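The same two lookups (by accession number or by keyword) can also be scripted. The sketch below only builds request URLs for NCBI's public E-utilities service, which the slides do not mention; treating it as the programmatic counterpart of the Entrez web page is an assumption here, and the helper names `fetch_url` and `search_url` are made up for illustration.

```python
from urllib.parse import urlencode

# Base URL of NCBI's E-utilities service (an assumption: the slides only
# describe the Entrez web interface, not this programmatic route).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def fetch_url(accession: str) -> str:
    """Build a URL that requests one Nucleotide record in FASTA format."""
    params = {
        "db": "nuccore",      # the Entrez Nucleotide database
        "id": accession,      # e.g. an accession number such as U07418
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def search_url(keywords: str, retmax: int = 20) -> str:
    """Build a URL that runs a keyword search against Nucleotide."""
    params = {"db": "nuccore", "term": keywords, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(fetch_url("U07418"))
    print(search_url("human MLH1"))
```

The functions return URLs rather than performing the request, so the sketch runs without network access; opening the printed URL in a browser should give the same record that the web search would.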

Page 14

Entrez is Internally Cross-linked
– DNA and protein sequences are linked to other similar sequences
– Medline citations are linked to other citations that contain similar keywords
– 3-D structures are linked to similar structures

Page 15

DATABASES CONTAIN MORE THAN JUST DNA & PROTEIN SEQUENCES

Page 16

Retrieve all sequences for an organism or taxon

Starting with an organism or taxon name...

How to: Download the complete genome for an organism

Starting at the Genomes

Page 17

How to: Find transcript sequences for a gene

Starting with... A GENE NAME, PRODUCT NAME, OR SYMBOL

How to: Obtain genomic sequence for/near a gene, marker, transcript or protein

Starting with... A GENE NAME OR SYMBOL
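Starting from a gene name or symbol usually means typing a query into the Entrez search box. As a rough sketch, the helper below assembles such a query using the Entrez field tags `[Gene Name]` and `[Organism]`; the function name `gene_query` is hypothetical, and restricting by organism is an illustrative choice rather than something the slide prescribes.

```python
from typing import Optional


def gene_query(symbol: str, organism: Optional[str] = None) -> str:
    """Assemble an Entrez query string for a gene symbol.

    The field tags restrict each term to one indexed field, which
    keeps the result list short compared with a free-text search.
    """
    term = f"{symbol}[Gene Name]"
    if organism:
        # Optional restriction to a single organism or taxon.
        term += f" AND {organism}[Organism]"
    return term


if __name__ == "__main__":
    print(gene_query("MLH1", "Homo sapiens"))
```

The resulting string can be pasted into the search box of the Gene or Nucleotide database.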

Page 18

ENTREZ TIP: START SEARCHES IN GENE

[Diagram: a Gene record links out to HomoloGene, Entrez Protein, UniGene, BLink, HomoloGene: Gene Neighbors, and other Entrez DBs]

Page 19

How to: Display genomic annotation graphically

Starting with... A NUCLEOTIDE RECORD (e.g. NC_000001)

Page 20
Page 21
Page 22
Page 23
Page 24
Page 25
Page 26
Page 27

BY APPLYING LIMITS, THERE ARE NOW JUST TWO ENTRIES

Page 28
Page 29
Page 30
Page 31

Precise Results

Page 32
Page 33
Page 34

A TRADITIONAL GENBANK RECORD

Locus Field

Molecule Type

GenBank Division

Modification Date

Definition Line

Taxonomy

GI (GenInfo)

Submission Field

Accession No.

Accession Version

Molecular weight

Page 35

TRADITIONAL GENBANK RECORD

ACCESSION U07418

VERSION U07418.1 GI:466461

Accession: stable, reportable, universal

Version: tracks changes in sequence

GI number: NCBI internal use

The sequence is the data

Coding sequence
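To make the relationship between the three identifiers concrete, here is a minimal sketch that splits a VERSION line like the one on this slide into its accession, version number, and GI number. The `VersionLine` type and `parse_version_line` function are hypothetical names introduced only for this example.

```python
from typing import NamedTuple, Optional


class VersionLine(NamedTuple):
    accession: str      # stable identifier, e.g. "U07418"
    version: int        # incremented when the sequence changes
    gi: Optional[int]   # NCBI-internal number, if present on the line


def parse_version_line(line: str) -> VersionLine:
    """Parse a GenBank flat-file VERSION line."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "VERSION":
        raise ValueError(f"not a VERSION line: {line!r}")
    # "U07418.1" -> accession "U07418", version "1"
    accession, _, version = fields[1].partition(".")
    gi = None
    for field in fields[2:]:
        if field.startswith("GI:"):
            gi = int(field[3:])
    return VersionLine(accession, int(version), gi)


if __name__ == "__main__":
    print(parse_version_line("VERSION U07418.1 GI:466461"))
```

The accession stays the same across updates, while the number after the dot is what changes when the sequence itself is revised.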

Page 36

What is an accession number?

An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.

Examples (all for retinol-binding protein, RBP4):

DNA
X02775      GenBank genomic DNA sequence
NT_030059   Genomic contig
rs7079946   dbSNP (single nucleotide polymorphism)

RNA
N91759.1    An expressed sequence tag (1 of 170)
NM_006744   RefSeq DNA sequence (from a transcript)

Protein
NP_007635   RefSeq protein
AAC02945    GenBank protein
Q28369      SwissProt protein
1KT7        Protein Data Bank structure record

Page 37

Feature Table

GenPept Record

Genomic DNA Sequence

Page 38

FEATURES             Location/Qualifiers
     source          1..2484
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="3"
                     /map="3p22-p23"
     gene            1..2484
                     /gene="MLH1"
     CDS             22..2292
                     /gene="MLH1"
                     /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
                     Number P14242), S. cerevisiae MLH1 (GenBank Accession
                     Number U07187), E. coli MUTL (Swiss-Prot Accession Number
                     P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
                     Number P14161) and Streptococcus pneumoniae (Swiss-Prot
                     Accession Number P14160)"
                     /codon_start=1
                     /product="DNA mismatch repair protein homolog"
                     /protein_id="AAC50285.1"
                     /db_xref="GI:463989"
                     /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
                     TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
                     ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
                     TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS
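The CDS location in the feature table above (22..2292) is enough to work out how long the coding region and its translation are. The sketch below does that arithmetic for simple `start..end` locations only; the function names are hypothetical, and it assumes the last codon of the CDS is a stop codon, which is the usual convention for a complete CDS but is not stated on the slide.

```python
import re


def cds_length(location: str) -> int:
    """Length in nucleotides of a simple 'start..end' location.

    Coordinates are 1-based and inclusive, hence the + 1.
    Joined or complemented locations are not handled here.
    """
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location.strip())
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    start, end = int(match.group(1)), int(match.group(2))
    if end < start:
        raise ValueError("end coordinate precedes start")
    return end - start + 1


def protein_length(location: str, includes_stop: bool = True) -> int:
    """Number of amino acids encoded by the CDS at this location."""
    nucleotides = cds_length(location)
    if nucleotides % 3 != 0:
        raise ValueError("CDS length is not a multiple of 3")
    codons = nucleotides // 3
    # Assumption: the final codon is a stop codon and encodes no residue.
    return codons - 1 if includes_stop else codons


if __name__ == "__main__":
    print(cds_length("22..2292"))      # 2271 nucleotides
    print(protein_length("22..2292"))  # 756 residues
```

For the MLH1 record, 2292 - 22 + 1 = 2271 nucleotides, which is 757 codons, or 756 amino acids once the stop codon is set aside.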

GENPEPT: GENBANK CDS TRANSLATIONS

>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

Page 39
Page 40

REFSEQ

• Reference Sequences
− Nucleotide sequences and protein translation
− Curated by NCBI or NCBI-approved programs

• Difference between GenBank and RefSeq
− GenBank has raw data and duplicated records
− Metadata in GenBank can be incomplete
− RefSeq is annotated, curated and non-redundant
− NCBI takes the best sequences from GenBank and curates them for RefSeq records

Page 41

SELECTED REFSEQ ACCESSION NUMBERS

mRNAs and Proteins
NM_123456   Curated mRNA
NP_123456   Curated Protein
NR_123456   Curated non-coding RNA
XM_123456   Predicted mRNA
XP_123456   Predicted Protein
XR_123456   Predicted non-coding RNA

Gene Records
NG_123456   Reference Genomic Sequence

Chromosome
NC_123455   Microbial replicons, organelle genomes, human chromosomes
AC_123455   Alternate assemblies

Assemblies
NT_123456   Contig
NW_123456   WGS Supercontig
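Because the two-letter prefix carries the record type, a small lookup table is enough to tell what kind of RefSeq record an accession refers to. The sketch below encodes only the prefixes listed on this slide; the dictionary and the `describe_refseq` function are illustrative names, not part of any NCBI tool.

```python
from typing import Optional

# Prefix meanings copied from the slide; other RefSeq prefixes exist
# but are deliberately left out.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated Protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference Genomic Sequence",
    "NC": "Microbial replicons, organelle genomes, human chromosomes",
    "AC": "Alternate assemblies",
    "NT": "Contig",
    "NW": "WGS Supercontig",
}


def describe_refseq(accession: str) -> Optional[str]:
    """Return the record type for a RefSeq accession, or None.

    RefSeq accessions have an underscore after the prefix, which is
    what separates them from GenBank accessions such as U07418.
    """
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None
    return REFSEQ_PREFIXES.get(prefix.upper())


if __name__ == "__main__":
    for acc in ("NM_006744", "NP_007635", "NT_030059", "U07418"):
        print(acc, "->", describe_refseq(acc))
```

Accessions without an underscore, such as GenBank's U07418, return `None`, which is a quick way to tell the two families apart.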

Page 42

Only 1 RefSeq

Over 100,000 nucleotide entries for HIV-1

Page 43

HOW TO SAVE?

Choose FASTA from the Display drop-down menu.

Transform the content of this window into plain text by choosing Text from the drop-down menu located on the far right of the menu bar.

Save the FASTA sequence by using the following protocol:

a. In the Edit menu of your Web browser, click Select All and then click Copy.
b. Open a default Word document and, in the Edit menu of Word, click Paste.
c. Finally, save your document as dUTPaseDNA.txt by choosing the Save as type option text only (*.txt).

Page 44
Page 45

FASTA FORMAT DESCRIPTION

• FASTA is a DNA and protein sequence alignment software package first described (as FASTP) by David J. Lipman and William R. Pearson in 1985

• Popular and commonly used format

• A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length.
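The format rules above translate almost directly into code. The following is a minimal sketch, not a full FASTA implementation: `format_fasta` writes a description line starting with ">" and wraps the sequence so every line stays under 80 characters, and `parse_fasta` reads such text back. Both function names and the demo sequence are made up for illustration.

```python
from typing import Dict


def format_fasta(description: str, sequence: str, width: int = 70) -> str:
    """Render one record: a '>' description line, then wrapped sequence."""
    if not 0 < width < 80:
        raise ValueError("width should be between 1 and 79 characters")
    lines = [">" + description]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each description line (without '>') to its joined sequence."""
    records: Dict[str, str] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            records[current] = ""
        elif current is None:
            raise ValueError("sequence data before the first '>' line")
        else:
            records[current] += line
    return records


if __name__ == "__main__":
    text = format_fasta("demo made-up sequence", "ACGT" * 40, width=60)
    print(text, end="")
    print(parse_fasta(text))
```

Writing the string returned by `format_fasta` to a file opened in text mode gives a plain-text FASTA file directly, without a word-processor step.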

Page 46
Page 47
Page 48
Page 49
Page 50
Page 51
Page 52
Page 53

Shokouh Riazi, Faranak Kazemi

