Date post: 21-Dec-2015
Data Sequences and Other Stuff
Page 1: Data Sequences and Other Stuff. Sequence Data Nucleic Acid and Protein Sequences Sources of Genetic Sequences User GCG supplied databases Flat File Oracle.

Data Sequences and Other Stuff


Sequence Data

Nucleic Acid and Protein Sequences

Sources of Genetic Sequences
  User
  GCG supplied databases
    Flat File
    Oracle Relational Database
  NCBI supplied databases
  Other databases

Sequence Databases

Genbank
EMBL
DDBJ
NCBI
PIR
Swiss-Prot
Swiss-Prot TrEMBL

Genbank

Primary nucleic acid sequence database
Maintained by NCBI
  National Center for Biotechnology Information
  http://www.ncbi.nlm.nih.gov
Current Release 122, 2/2001
  11,720,120,326 bases
  10,896,781 sequences

How Many Organisms Are In The Sequence Databases? (April 1, 2001)

Species      1995    1996    1997    1998    1999    2000    2001   Increase (since 1995)   Increase (12 months)
all:        16109   23119   32880   43516   61952   87751   95168   490%                    40.9%
Viruses:     1845    2122    2678    2968    3573    4428    4857   163%                    32.4%
Bacteria:    2939    3847    6091    8711   14322   22758   24878   746%                    53.3%
Archaea:      162     235     385     555    1015    1709    1906   1076%                   68.8%
Eukaryota:  10366   15901   22596   29926   41420   56961   61571   493%                    37.4%
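The "Increase (since 1995)" column can be reproduced from the 1995 and 2001 counts. The sketch below is only an illustration: it assumes the slide truncates percentages to whole numbers (which is what matches the printed values), and the function name is made up for this example. The 12-month column appears to use a baseline that is not shown in the table, so it is not recomputed here.

```python
# Species counts copied from the table above: (1995, 2001).
counts = {
    "all": (16109, 95168),
    "Viruses": (1845, 4857),
    "Bacteria": (2939, 24878),
    "Archaea": (162, 1906),
    "Eukaryota": (10366, 61571),
}


def percent_increase(old: int, new: int) -> int:
    """Growth from old to new as a percentage, truncated to a whole number.

    Truncation (rather than rounding) is an assumption chosen because it
    matches the figures printed on the slide.
    """
    return (new - old) * 100 // old


increase_since_1995 = {
    name: percent_increase(first, last) for name, (first, last) in counts.items()
}

for name, pct in increase_since_1995.items():
    print(f"{name}: {pct}%")
```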

Other NCBI Databases

HTGS
EST
STS
GSS
RefSeq
Unigene
Genomic

HTGS

High Throughput Genomic Sequences
'Unfinished' DNA sequences generated by the high-throughput sequencing centers
Phase 0
  Single-few pass reads of a single clone (not contigs)
Phase 1
  Unfinished, may be unordered, unoriented contigs, with gaps
Phase 2
  Unfinished, ordered, oriented contigs, with or without gaps
Phase 3
  Primary division (Genbank)
  Finished, no gaps (with or without annotations)

EST

Expressed Sequence Tags
"Single-pass" cDNA sequences
Generally representative of the 3' ends of cDNAs
More "full-length" ESTs now available

STS

Sequence Tagged Sites
Sequence and mapping data
Short genomic landmark sequences

GSS

Genome Survey Sequences
Similar to the EST division, except that its sequences are genomic in origin, rather than cDNA
  Random "single pass read" genome survey sequences
  Cosmid/BAC/YAC end sequences
  Exon trapped genomic sequences
  Alu PCR sequences

RefSeq

NCBI Reference Sequence project
Provides reference sequence standards for the naturally occurring molecules, from chromosomes to mRNAs to proteins
Stable reference point for:
  mutation analysis
  gene expression studies
  polymorphism discovery

RefSeq…

Curated RefSeq
  transcripts and proteins
Genome Annotation
  contigs, transcripts, and proteins
Complete Genomes
  genomes, chromosomes, and proteins

Unigene

Experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters
Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.
Includes EST and cDNA sequences
Includes human, rat, mouse, cow and zebrafish


HomoloGene

Curated and calculated orthologs and homologs for genes represented in UniGene and LocusLink

Includes human, mouse, rat, zebrafish, cow and Drosophila

LocusLink

Provides a single query interface to curated sequence and descriptive information about genetic loci
  Nomenclature
  Aliases
  Sequence accessions
  Phenotypes
  EC numbers
  MIM numbers
  UniGene clusters
  Homology
  Map locations
  Web sites

EMBL and DDBJ

European Molecular Biology Laboratory
  Hinxton, UK
  http://www.ebi.ac.uk/
DNA Data Bank of Japan
  Mishima, Japan
  http://www.ddbj.nig.ac.jp/

Coordination with Genbank

Prevents duplication
Genbank enters sequences from U.S. journals and researchers
EMBL handles European data
DDBJ handles Asian data
Data exchanged daily

Sequence submissions

Sequences entered from journals
Sequences submitted by individual researchers
  BankIt
    NCBI WWW Site
  Sequin
    Multi-platform program

Sequence Names

DO NOT rely on names to find particular sequences
Few conventions
Organism
  Hum: Human
  Mus: mouse
  Eco: E. coli
  Syn: synthetic

Last Letter(s)

Sometimes gives useful information
  cg: Complete genome
    Viruses

Other Letters

Specifies a particular sequence
  vsvcg
    Vesicular stomatitis virus (Indiana serotype) complete genome

EMBL File Names

Ec: E. coli
Hs: Human

Locus name

Names are short, fairly non-descriptive, and can change from one release to another
  vsvcg
    The complete sequence for the virus VSV
Most "mnemonic" names already taken
Genbank now using accession numbers as locus names

Accession Numbers

Each sequence submitted to a database is assigned a unique primary accession number
Accession numbers do not change
If a sequence is merged with another, a new accession number is assigned, and the original number becomes a secondary accession number
Accession numbers may include version numbers
  A02428.2
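A versioned identifier is the stable accession number followed by a dot and a version number. The sketch below shows one way to separate the two parts; the function name is hypothetical, and the ".1" suffix on M28668 (the CFTR record's accession) is illustrative only, not a value taken from the record.

```python
from typing import Optional, Tuple


def split_accession(identifier: str) -> Tuple[str, Optional[int]]:
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is None when the identifier carries no '.N' suffix.
    Upper-casing the accession is a convenience for this sketch.
    """
    accession, dot, version = identifier.strip().partition(".")
    return accession.upper(), (int(version) if dot else None)


print(split_accession("M28668.1"))  # illustrative version suffix
print(split_accession("m28668"))    # no version given
```

Keeping the bare accession as the lookup key means a record can still be found after its sequence is revised and the version number changes.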

Accession Numbers

Using GCG to access sequences via their accession number
  Data Library:Accession Number
  Flatfile - vi:J02428
  RDB - gcgnuc:J02428
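The "Data Library:Accession Number" form is a two-part specification separated by a colon. As a rough sketch (not GCG code), a script that prepares such specifications could split and validate them as below. The class and function names are invented for this example, and normalizing case is an assumption based on the slides showing lower-case input returning upper-case locus names.

```python
from typing import NamedTuple


class SequenceSpec(NamedTuple):
    library: str  # logical data library name, e.g. "gcgnuc" or "vi"
    entry: str    # sequence name or accession number inside that library


def parse_spec(spec: str) -> SequenceSpec:
    """Split 'library:entry' on the first colon, tolerating stray spaces."""
    library, sep, entry = spec.partition(":")
    if not sep or not library.strip() or not entry.strip():
        raise ValueError(f"expected 'library:entry', got {spec!r}")
    # Case normalization is an assumption of this sketch.
    return SequenceSpec(library.strip().lower(), entry.strip().upper())


print(parse_spec("vi:J02428"))
print(parse_spec("gcgnuc: J02428"))
```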

The Sequence Record

Different for each database
  Locus (Name)
  Accession Number
  Keywords
  Description
  Properties
  References
  The Sequence

analyze% typedata ge:humcftrm
!!NA_SEQUENCE 1.0
LOCUS       HUMCFTRM     6129 bp    mRNA            PRI       15-DEC-1989
DEFINITION  Human cystic fibrosis mRNA, encoding a presumed transmembrane
            conductance regulator (CFTR).
ACCESSION   M28668
NID         g180331
KEYWORDS    cystic fibrosis; transmembrane conductance regulator.
SOURCE      Human, cDNA to mRNA.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 6129)
  AUTHORS   Riordan,J.R., Rommens,J.M., Kerem,B., Alon,N., Rozmahel,R.,
            Grzelczak,Z., Zielenski,J., Lok,S., Plavsic,N., Chou,J.-L.,
            Drumm,M.L., Iannuzzi,M.C., Collins,F.S. and Tsui,L.-C.
  TITLE     Identification of the cystic fibrosis gene: Cloning and
            characterization of complementary DNA
  JOURNAL   Science 245, 1066-1073 (1989)
  MEDLINE   89368940

COMMENT     A three base-pair deletion spanning positions 1654-1656 is
            observed in cDNAs from cystic fibrosis patients.
FEATURES             Location/Qualifiers
     source          1..6129
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
     CDS             133..4575
                     /note="cystic fibrosis transmembrane conductance regulator"
                     /codon_start=1
                     /db_xref="PID:g180332"
                     /translation="MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVD
                     SADNLSEKLEREWDRELASKKNPKLINALRRCFFWRFMFYGIFLYLGEVTKAVQPLLL
                     LNRFSKDIAILDDLLPLTIFDFIQLLLIVIGAIAVVAVLQPYIFVATVPVIVAFIMLR
                     AYFLQTSQQLKQLESEGRSPIFTHLVTSLKGLWTLRAFGRQPYFETLFHKALNLHTAN
                     WFLYLSTLRWFQMRIEMIFVIFFIAVTFISILTTGEGEGRVGIILTLAMNIMSTLQWA
                     VNSSIDVDSLMRSVSRVFKFIDMPTEGKPTKSTKPYKNGQLSKVMIIENSHVKKDDIW
                     PSGGQMTVKDLTAKYTEGGNAILENISFSISPGQRVGLLGRTGSGKSTLLSAFLRLLN
                     TEGEIQIDGVSWDSITLQQWRKAFGVIPQKVFIFSGTFRKNLDPYEQWSDQEIWKVAD
                     EVGLRSVIEQFPGKLDFVLVDGGCVLSHGHKQLMCLARSVLSKAKILLLDEPSAHLDP
                     VTYQIIRRTLKQAFADCTVILCEHRIEAMLECQQFLVIEENKVRQYDSIQKLLNERSL
                     FRQAISPSDRVKLFPHRNSSKCKSKPQIAALKEETEEEVQDTRL"
BASE COUNT     1886 a   1181 c   1330 g   1732 t
ORIGIN

HUMCFTRM  Length: 6129  April 13, 1998 13:00  Type: N  Check: 6781  ..

       1  AATTGGAAGC AAATGACATC ACAGCAGGTC AGAGAAAAAG GGTTGAGCGG
      51  CAGGCACCCA GAGTAGTAGG TCTTTGGCAT TAGGAGCTTG AGCCCAGACG
     101  GCCCTAGCAG GGACCCCAGC GCCCGAGAGA CCATGCAGAG GTCGCCTCTG
     151  GAAAAGGCCA GCGTTGTCTC CAAACTTTTT TTCAGCTGGA CCAGACCAAT
     201  TTTGAGGAAA GGATACAGAC AGCGCCTGGA ATTGTCAGAC ATATACCAAA
     251  TCCCTTCTGT TGATTCTGCT GACAATCTAT CTGAAAAATT GGAAAGAGAA
     301  TGGGATAGAG AGCTGGCTTC AAAGAAAAAT CCTAAACTCA TTAATGCCCT
     351  TCGGCGATGT TTTTTCTGGA GATTTATGTT CTATGGAATC TTTTTATATT
     401  TAGGGGAAGT CACCAAAGCA GTACAGCCTC TCTTACTGGG AAGAATCATA
     451  GCTTCCTATG ACCCGGATAA CAAGGAGGAA CGCTCTATCG CGATTTATCT

analyze% typedata -ref GB_PR:HUMIFNRF1A

!!NA_SEQUENCE 1.0
LOCUS       HUMIFNRF1A   7721 bp    DNA             PRI       10-NOV-1992
DEFINITION  Homo sapiens interferon regulatory factor 1 gene, complete cds.
ACCESSION   L05072
NID         g184648
KEYWORDS    interferon regulatory factor 1.
SOURCE      Homo sapiens Placenta DNA.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 7721)
  AUTHORS   Cha,Y., Sims,S.H., Romine,M.F., Kaufmann,M. and Deisseroth,A.B.
  TITLE     Human interferon regulatory factor 1: intron/exon organization
  JOURNAL   DNA Cell Biol. 11, 605-611 (1992)
  MEDLINE   93000481

FEATURES             Location/Qualifiers
     source          1..7721
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /tissue_type="Placenta"
                     /map="5q23-q31"
     exon            1..219
                     /gene="IRF1"
                     /note="putative"
                     /number=1
     5'UTR           join(1..219,1279..1287)
                     /gene="IRF1"
     gene            join(1..219,1279..1287)
                     /gene="IRF1"
     intron          220..1278
                     /gene="IRF1"
                     /number=1
     exon            1279..1374
                     /gene="IRF1"
                     /number=2
     CDS             join(1288..1374,2738..2837,3630..3806,3916..3965,
                     4073..4202,4386..4508,5040..5089,6248..6383,6670..6794)
                     /gene="IRF1"
                     /codon_start=1
                     /product="interferon regulatory factor 1"
                     /db_xref="PID:g184649"
                     /translation="MPITRMRMRPWLEMQINSNQIPGLIWINKEEMIFQIPWKHAAKH
                     GWDINKDACLFRSWAIHTGRYKAGEKEPDPKTWKANFRCAMNSLPDIEEVKDQSRNKG
                     SSAVRVYRMLPPLTKNQRKERKSKSSRDAKSKAKRKSCGDSSPDTFSDGLSSSTLPDD
                     HSSYTVPGYMQDLEVEQALTPALSPCAVSSTLPDWHIPVEVVPDSTSDLYNFQVSPMP
                     STSEATTDEDEEGKLPEDIMKLLEQSEWQPTNVDGKGYLLNEPGVQPTSVYGDFSCKE
                     EPEIDSPGGDIGLSLQRVFTDLKNMDATWLDSLLTPVRLPSIQAIPCAP"

     intron          1375..2737
                     /gene="IRF1"
                     /number=2
     exon            2738..2837
                     /gene="IRF1"
                     /number=3
     intron          2838..3629
                     /gene="IRF1"
                     /number=3
     exon            3630..3806
                     /gene="IRF1"
                     /number=4
     intron          3807..3915
                     /gene="IRF1"
                     /number=4
     exon            3916..3965
                     /gene="IRF1"
                     /number=5
     intron          3966..4072
                     /gene="IRF1"
                     /number=5

...

     exon            5040..5089
                     /gene="IRF1"
                     /number=8
     intron          5090..6247
                     /gene="IRF1"
                     /number=8
     exon            6248..6383
                     /gene="IRF1"
                     /number=9
     intron          6384..6669
                     /gene="IRF1"
                     /number=9
     exon            6670..7656
                     /gene="IRF1"
                     /number=10
     3'UTR           6795..7656
BASE COUNT     1750 a   1946 c   2253 g   1772 t
ORIGIN
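The CDS location in the IRF1 record is a join of nine exon segments, and its total length can be checked against the protein translation. The following is a minimal sketch with a hypothetical helper name; it only handles simple start..end pairs as they appear in this record, not the full feature-location syntax (complement, partial markers, and so on).

```python
import re
from typing import List


def segment_lengths(location: str) -> List[int]:
    """Return the length of each 'start..end' segment in a feature location."""
    return [
        int(end) - int(start) + 1
        for start, end in re.findall(r"(\d+)\.\.(\d+)", location)
    ]


# CDS location copied from the HUMIFNRF1A record above.
irf1_cds = (
    "join(1288..1374,2738..2837,3630..3806,3916..3965,"
    "4073..4202,4386..4508,5040..5089,6248..6383,6670..6794)"
)

lengths = segment_lengths(irf1_cds)
cds_length = sum(lengths)
codons, remainder = divmod(cds_length, 3)

print(lengths)       # one length per exon segment
print(cds_length)    # total coding bases
print(codons - 1)    # residues, assuming the last codon is a stop codon
```

Under these assumptions the segments add up to 978 bases, that is 326 codons, which is consistent with a 325-residue translation followed by a stop codon.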

analyze% typedata -ref est:hum091226f
!!NA_SEQUENCE 1.0
LOCUS       HUM091226F    152 bp    mRNA            EST       02-APR-1996
DEFINITION  Homo sapiens retinal fovea EST HFV091226 sequence.
ACCESSION   L48850
NID         g1254959
KEYWORDS    EST; expressed sequence tag.
SOURCE      Homo sapiens (clone: EST HFV091226) age normalized retinal foveae
            cDNA to mRNA.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (sites)
  AUTHORS   Adams,M.D., Kerlavage,A.R., Fields,C. and Venter,J.C.
  TITLE     3,400 new expressed sequence tags identify diversity of
            transcripts in human brain
  JOURNAL   Nature Genet. 4 (3), 256-267 (1993)
  MEDLINE   93364420
REFERENCE   2  (sites)
  AUTHORS   Liew,C.C., Hwang,D.M., Fung,Y.W., Laurenssen,C., Cukerman,E.,
            Tsui,S. and Lee,C.Y.
  TITLE     A catalogue of genes in the cardiovascular system as identified
            by expressed sequence tags
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 91 (22), 10645-10649 (1994)
  MEDLINE   95024171
REFERENCE   3  (bases 1 to 152)
  AUTHORS   Bernstein,S.L., Borst,D.E., Neuder,M.E. and Wong,P.
  TITLE     Characterization of a human fovea cDNA library and regional
            differential gene expression in the human retina
  JOURNAL   Genomics 32 (3), 301-308 (1996)

FEATURES             Location/Qualifiers
     source          1..152
                     /organism="Homo sapiens"
                     /note="Expressed sequence tags (first pass sequencing)
                     from randomly selected bacteriophage clones (mRNA-cDNA)
                     from human retinal fovea. The library is age normalized
                     from ten sets of donor foveae 2-79 years old."
                     /db_xref="taxon:9606"
                     /clone="EST HFV091226"
                     /dev_stage="age normalized"
                     /tissue_type="retinal foveae"
     mRNA            <1..>152
                     /standard_name="EST HFV091226"
BASE COUNT       31 a     42 c     41 g     36 t      2 others
ORIGIN
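One quick sanity check on a retrieved record is that the BASE COUNT figures add up to the length given on the LOCUS line. The sketch below does this for the three records shown so far; the helper name is made up for illustration, and the lines are typed in from the slides rather than parsed from real files.

```python
import re


def base_count_total(line: str) -> int:
    """Sum every number that appears on a BASE COUNT line."""
    return sum(int(number) for number in re.findall(r"\d+", line))


# (BASE COUNT line, length from the LOCUS line), copied from the records above.
records = {
    "HUMCFTRM": ("BASE COUNT 1886 a 1181 c 1330 g 1732 t", 6129),
    "HUMIFNRF1A": ("BASE COUNT 1750 a 1946 c 2253 g 1772 t", 7721),
    "HUM091226F": ("BASE COUNT 31 a 42 c 41 g 36 t 2 others", 152),
}

checks = {
    name: base_count_total(line) == length
    for name, (line, length) in records.items()
}

for name, ok in checks.items():
    print(name, "consistent" if ok else "MISMATCH")
```

Note that the EST record needs its "others" count (ambiguous bases) included for the totals to agree.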

analyze% typedata -ref sts:humswx153
!!NA_SEQUENCE 1.0
LOCUS       HUMSWX153     192 bp    DNA             STS       24-MAY-1993
DEFINITION  Human chromosome X STS sWXD153; single read.
ACCESSION   L15212
NID         g292645
KEYWORDS    STS; primer; sequence tagged site.
SOURCE      Homo sapiens DNA.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 192)
  AUTHORS   Kere,J., Nagaraja,R., Mumm,S.R., Ciccodicola,A., D'Urso,M. and
            Schlessinger,D.
  TITLE     Mapping human chromosomes by walking with sequence-tagged sites
            from end fragments of yeast artificial chromosome inserts
  JOURNAL   Genomics 14, 241-248 (1992)
  MEDLINE   93052321

COMMENT     Submitted by: David Schlessinger, Center for Genetics in Medicine,
            Washington University School of Medicine, Box 8232
            4566 Scott Avenue, St. Louis, MO 63110, USA
            e-mail: [email protected]
            Primer A: TAAAGGGATCGCCAAGGAC
            Primer B: CTTACTCATTTGCTGGATTCTC
            STS size: 85bp
            Template: 600 ng/100ul
            Primer: 40 pmoles/100ul
            dNTPs: 100 uM
            MgCl2: 1.5 mM
            KCl: 100 mM
            TrisHCl: 10 mM
            Taq Polymerase: 0.125 U
            NH4Cl: 5 mM
            pH: 8.6
            Total Vol: 5 ul
            PCR Profile:
            Denaturation: 94 degrees C for 1.00 minute(s)
            Annealing: 55 degrees C for 2.00 minute(s)
            Polymerization: 72 degrees C for 2.00 minute(s)
            PCR Cycles: 35
            Thermal Cycler: P-E.

FEATURES             Location/Qualifiers
     source          1..192
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /map="Xq13-q24"
     STS             60..144
                     /standard_name="sWXD153"
     primer_bind     60..78
     primer_bind     complement(123..144)
BASE COUNT       72 a     26 c     60 g     29 t      5 others
ORIGIN
analyze%
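The STS record ties its COMMENT block to its feature table: the feature coordinates should agree with the stated STS size and with the primer lengths. A small sketch of that cross-check follows (helper names are illustrative). Primer B binds the complementary strand, so its reverse complement is what would appear at positions 123..144 of the listed sequence.

```python
def span_length(start: int, end: int) -> int:
    """Number of bases covered by an inclusive start..end location."""
    return end - start + 1


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Values copied from the HUMSWX153 record above.
primer_a = "TAAAGGGATCGCCAAGGAC"
primer_b = "CTTACTCATTTGCTGGATTCTC"

sts_size = span_length(60, 144)            # STS feature
primer_a_site = span_length(60, 78)        # first primer_bind
primer_b_site = span_length(123, 144)      # complement(123..144)

print(sts_size, primer_a_site, primer_b_site)
print(reverse_complement(primer_b))        # top-strand view of primer B's site
```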

Swiss-Prot

http://www.expasy.ch/sprot/
Protein Database
University of Geneva
Arranged by protein function
Release 39.15, March 19, 2001
  94,152 entries
Provides annotated protein records

Swiss-Prot Names

Protein_Species
Allows easier comparisons when studying evolutionary relationships
  H1b_Human
    Human histone 1b

Swiss-Prot Names

Vgl*_*
  Viral glycoproteins
VGLG_HRSVL
  Viral GLycoprotein G
  Human Respiratory Syncytial Virus Long strain
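Because Swiss-Prot entry names follow the Protein_Species pattern, the two halves can be separated mechanically, which is what makes cross-species comparisons easy. The sketch below uses hypothetical helper names and the entry names mentioned in these slides; grouping by the protein half is just one possible use.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_entry_name(name: str) -> Tuple[str, str]:
    """Split a Swiss-Prot style 'PROTEIN_SPECIES' name into its two codes."""
    protein, sep, species = name.strip().upper().partition("_")
    if not sep or not protein or not species:
        raise ValueError(f"expected 'Protein_Species', got {name!r}")
    return protein, species


def group_by_protein(names: Iterable[str]) -> Dict[str, List[str]]:
    """Map each protein code to the species codes it was seen with."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        protein, species = split_entry_name(name)
        groups[protein].append(species)
    return dict(groups)


print(split_entry_name("H1b_Human"))
print(group_by_protein(["VGLG_HRSVL", "vglg_vsvsj", "H1B_HUMAN"]))
```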

analyze% typedata swp:H1b_Human

!!AA_SEQUENCE 1.0
ID   H1B_HUMAN      STANDARD;      PRT;   218 AA.
AC   P10412;
DT   01-MAR-1989 (REL. 10, CREATED)
DT   01-MAR-1989 (REL. 10, LAST SEQUENCE UPDATE)
DT   01-JUN-1994 (REL. 29, LAST ANNOTATION UPDATE)
DE   HISTONE H1B (H1.4).
GN   H1F4.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 92009931.
RA   ALBIG W., KARDALINOU E., DRABENT B., ZIMMER A., DOENECKE D.;
RL   GENOMICS 10:940-948(1991).
RN   [2]
RP   SEQUENCE.
RC   TISSUE=SPLEEN;
RX   MEDLINE; 87057092.
RA   OHE Y., HAYASHI H., IWAI K.;
RL   J. BIOCHEM. 100:359-368(1986).

CC   -!- FUNCTION: HISTONES H1 ARE NECESSARY FOR THE CONDENSATION OF
CC       NUCLEOSOME CHAINS INTO HIGHER ORDER STRUCTURES.
CC   -!- SUBCELLULAR LOCATION: NUCLEAR.
CC   -!- THIS VARIANT ACCOUNTS FOR 60% OF HISTONE H1.
DR   EMBL; M60748; G184074; -.
DR   PIR; A24413; HSHU1B.
DR   PIR; C40335; C40335.
DR   HSSP; P08287; 1GHC.
KW   CHROMOSOMAL PROTEIN; NUCLEAR PROTEIN; DNA-BINDING; MULTIGENE FAMILY;
KW   ACETYLATION; METHYLATION.
FT   INIT_MET      0      0
FT   MOD_RES       1      1       ACETYLATION.
FT   MOD_RES      25     25       METHYLATION (PARTIAL).
FT   DOMAIN       35    113       GLOBULAR.
SQ   SEQUENCE   218 AA;  21734 MW;  5A277FB0 CRC32;

H1B_HUMAN  Length: 218  April 13, 1998 13:19  Type: P  Check: 2701  ..

       1  SETAPAAPAA PAPAEKTPVK KKARKSAGAA KRKASGPPVS ELITKAVAAS
      51  KERSGVSLAA LKKALAAAGY DVEKNNSRIK LGLKSLVSKG TLVQTKGTGA
     101  SGSFKLNKKA ASGEAKPKAK KAGAAKAKKP AGAAKKPKKA TGAATPKKSA
     151  KKTPKKAKKP AAAAGAKKAK SPKKAKAAKP KKAPKSPAKA KAVKPKAAKP
     201  KTAKPKAAKP KKAAAKKK

analyze%
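In the typedata output, the sequence itself follows a header line that carries the name, length, type and checksum and ends in two dots. The sketch below, which is an illustration rather than GCG's own parser, reads such text outside of GCG: it looks for the header line, then strips the position numbers and spacing from everything after it. The function name is hypothetical, and the checksum is not verified.

```python
import re
from typing import Tuple

# Sequence section copied from the H1B_HUMAN output above.
GCG_TEXT = """\
H1B_HUMAN  Length: 218  April 13, 1998 13:19  Type: P  Check: 2701  ..

       1  SETAPAAPAA PAPAEKTPVK KKARKSAGAA KRKASGPPVS ELITKAVAAS
      51  KERSGVSLAA LKKALAAAGY DVEKNNSRIK LGLKSLVSKG TLVQTKGTGA
     101  SGSFKLNKKA ASGEAKPKAK KAGAAKAKKP AGAAKKPKKA TGAATPKKSA
     151  KKTPKKAKKP AAAAGAKKAK SPKKAKAAKP KKAPKSPAKA KAVKPKAAKP
     201  KTAKPKAAKP KKAAAKKK
"""


def parse_gcg(text: str) -> Tuple[int, str]:
    """Return (declared length, sequence) from GCG-style sequence text.

    The header is taken to be the first line that contains 'Length:' and
    ends in '..'; feature locations such as '1..219' do not end a line
    that way, so annotation above the header is skipped.
    """
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if "Length:" in line and line.rstrip().endswith(".."):
            break
    else:
        raise ValueError("no GCG header line ending in '..' found")
    declared = int(re.search(r"Length:\s*(\d+)", line).group(1))
    # Drop position numbers and all whitespace from the sequence lines.
    sequence = re.sub(r"[\d\s]", "", "".join(lines[index + 1:]))
    return declared, sequence


declared_length, sequence = parse_gcg(GCG_TEXT)
print(declared_length, len(sequence))
```

Comparing the declared length with the number of residues actually read is a cheap way to catch truncated copies.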

Swiss-Prot TrEMBL

Translation of all EMBL Nucleic Acid coding sequences not yet present in Swiss-Prot
Allows rapid availability without immediate annotation
Release 16.3, March 30, 2001
  436,896 entries

TrEMBL Divisions

Everything in TrEMBL: spt
  sp_bacteria
  sp_fungi
  sp_human
  sp_invertebrate
  sp_mammal
  sp_mhc
  sp_organelle
  sp_phage
  sp_plant
  sp_rodent
  sp_unclassified
  sp_vertebrate

Protein Identification Resource - PIR

http://pir.georgetown.edu/
National Biomedical Research Foundation
Georgetown University
Current Release 67.05, March 23, 2001
  219,178 Entries


National Biomedical Research Foundation

Database begun over twenty years ago by Margaret O. Dayhoff

Originally published sequences in book form

Started with sequences derived from direct amino acid sequencing

analyze% typedata -ref PIR1:HSHU1B

!!AA_SEQUENCE 1.0
P1;HSHU1B - histone H1-4 - human
N;Alternate names: histone H1.4; histone H1b
C;Species: Homo sapiens (man)
C;Date: 31-Dec-1988 #sequence_revision 12-Apr-1996 #text_change 05-Sep-1997
C;Accession: C40335; A24413
R;Albig, W.; Kardalinou, E.; Drabent, B.; Zimmer, A.; Doenecke, D.
Genomics 10, 940-948, 1991
A;Title: Isolation and characterization of two human H1 histone genes within clusters of core histone genes.
A;Reference number: A40335; MUID:92009931
A;Accession: C40335
A;Status: preliminary
A;Molecule type: DNA
A;Residues: 1-219 <ALB>
A;Cross-references: GB:M60748; NID:g184073; PID:g184074
A;Experimental source: blood
R;Ohe, Y.; Hayashi, H.; Iwai, K.
J. Biochem. 100, 359-368, 1986
A;Title: Human spleen histone H1. Isolation and amino acid sequence of a main variant, H1b.
A;Reference number: A24413; MUID:87057092
A;Accession: A24413
A;Molecule type: protein
A;Residues: 2-219 <OHE>
A;Experimental source: spleen

C;Comment: This variant accounts for 60% of histone H1.
C;Genetics:
A;Gene: GDB:H1F4
A;Cross-references: GDB:120030; OMIM:142220
A;Map position: 12q11-12q21
C;Superfamily: histone H1
C;Keywords: acetylated amino end; chromosomal protein; DNA binding; methylated amino acid; nucleosome; spleen
F;2-219/Product: histone H1-4 #status experimental <MAT>
F;2-32/Domain: amino-terminal <NH2>
F;33-110/Domain: globular <GLB>
F;111-219/Domain: carboxyl-terminal <END>
F;2/Modified site: acetylated amino end (Ser) (in mature form) #status experimental
F;26/Modified site: N6-methyllysine (Lys) (partial) #status experimental

iProClass Database - PIR

http://pir.georgetown.edu/iproclass/
Comprehensive family relationships and structural/functional classifications and features of proteins
  Superfamilies
  Families
  Domains

GCG Supplied Databases

GCG sequence database files are NOT normal UNIX files.
UNIX commands cannot be used to manipulate sequences in these databases
Stored as Data Libraries
Stored in Oracle RDB

Sequence Data Updates

Genbank
  Daily
GCG Flat file
  No longer updated
  Last update June, 2000
GCG SeqStore Oracle RDB
  Daily updates


Database listing – GCG-FF

Databases available:

GenBank Release 118.0 (06/2000)

EMBL (Abridged) Release 62.0 (03/2000)

PIR-Protein Release 65.0 (06/2000)

NRL_3D Release 27.0 (03/2000)

SWISS-PROT Release 39.0 (06/2000)

SP-TREMBL Release 14.0 (06/2000)

PROSITE Release 16.0 (07/1999)

Restriction Enzymes (REBASE) (06/2000)


Database listing – SeqStore

Databases available:

GCGNUC updated nightly by DATASERVE

GCGPROT updated weekly by DATASERVE

GCGEST updated nightly by DATASERVE

PROSITE Release 15.0 (07/1999)

Restriction Enzymes (REBASE) (06/2000)

Data Libraries

Allows rapid searches
Sequences organized into groups
Each data library can be referred to by a logical name
Individual sequences can be extracted from the data library.


GCG SeqStore (Oracle-based Sequences)

Data Library Names

Database Name   Description

Nucleic Acid Sequences
gcgnuc          All Genbank nucleotide sequences (except ESTs), updated nightly by SeqStore
gcgest          All Genbank Expressed Sequence Tags, updated nightly by SeqStore

Protein Sequences
gcgprot         All Swissprot and Swissprot TrEMBL sequences, updated nightly by SeqStore


GCG Flat-file

Data Library Names

Nucleic Acid Databases (Genbank and EMBL)

Database Name(s)                                    Description
GenEMBL, GE                                         Entire database (except tags)
genemblplus, gep, geplus                            Entire database (including tags)
Bacterial, Bacteria, Ba                             Bacterial sequences
HTG                                                 High throughput genome
Invertebrate, In                                    Invertebrate sequence
Organelle, Or                                       Organelle sequences
Other_Mammalian, OtherMammal, OtherMamm, Om         Non-rodent, non-primate Mammalian sequences
Other_Vertebrate, Ov, OtherVertebrate, OtherVert    Non-mammalian Vertebrate sequences

Nucleic Acid Databases…

Database Name(s)                 Description
Patent, Pat                      Sequences from patents and patent applications
Phage, Ph                        Phage sequences
Plant, Pl                        Plant and Fungal sequences
Primate, Pr                      Primate (Mammalian) sequences
Rodent, Ro                       Rodent (Mammalian) sequences
Structural_RNA, Structural, St   Structural RNA sequences (such as rRNAs)
Synthetic, Sy                    Synthetic sequences
Unannotated, Un                  Unannotated sequences
Viral, Vi                        Viral sequences

Sequence Tag Databases

Database Name(s)   Description
EST                Expressed sequence tags
GSS                Genome survey sequences
STS                Sequence-tagged site sequences
Tags               EST, STS, and GSS

Protein Databases

Database Name(s)             Description
PIR, P                       Entire PIR-Protein Protein Sequence Data Library
Protein, Prot, PIR1          PIR-Protein annotated sequences
New, Nw                      PIR-Protein preliminary and unverified sequences
PIR2                         PIR-Protein preliminary sequences
PIR3                         PIR-Protein unverified sequences
SwissProt, Swiss             Entire SwissProt Protein Sequence Data Library
Sptrembl, spt                Newly added preliminary sequences, translated from EMBL
swissprotplus, swplus, swp   SwissProt + SPTrEMBL
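Several logical names in these tables refer to the same data library. When preparing database specifications in a script, one option is to normalize any alias to a single label. The sketch below covers only a handful of rows from the tables above; the dictionary layout and function name are illustrative, and treating names as case-insensitive is an assumption.

```python
# A few alias groups copied from the flat-file tables above.
ALIASES = {
    "GenEMBL": ["genembl", "ge"],
    "GenEMBL plus tags": ["genemblplus", "gep", "geplus"],
    "Primate": ["primate", "pr"],
    "Viral": ["viral", "vi"],
    "PIR1": ["protein", "prot", "pir1"],
    "SwissProt": ["swissprot", "swiss"],
    "SwissProt + SPTrEMBL": ["swissprotplus", "swplus", "swp"],
}

# Invert the table once so that lookups are a single dictionary access.
LOOKUP = {
    alias: label for label, aliases in ALIASES.items() for alias in aliases
}


def resolve_library(name: str) -> str:
    """Return the label for a logical library name (case-insensitive here)."""
    try:
        return LOOKUP[name.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown data library name: {name!r}") from None


print(resolve_library("GE"))
print(resolve_library("swp"))
```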


NCBI Blast Databases

Page 67: Data Sequences and Other Stuff. Sequence Data Nucleic Acid and Protein Sequences Sources of Genetic Sequences User GCG supplied databases Flat File Oracle.

Nucleotide Databases for NetBlast Searching nr Non-redundant Genbank+EMBL+DDBJ+PDB sequences

(but no EST's or STS's)

pdb PDB nucleotide sequences

vector Vector subset of Genbank

yeast Saccharomyces cerevisiae genomic nucleotide sequences

est Non-redundant Database of Genbank+EMBL+DDBJ EST Division

sts Non-redundant Database of Genbank+EMBL+DDBJ STS Division

htgs High Throughput Genomic Sequences

mito Database of mitochondrial sequences, Rel. 1.0, July 1995

kabat Kabat Sequences of Nucleic Acid of Immunological Interest

epd Eukaryotic Promoter Database

alu Select Alu Repeats from REPBASE

gss Genome Survey Sequence, includes single_pass genomic data

ecoli E. coli genomic nucleotide sequences

Drosophila genome Drosophila genome provided by Celera and Berkeley

month All new or revised Genbank+EMBL+DDBJ+PDB sequences released in the last 30 days


Protein Databases for NetBlast Searching

nr Non-redundant Genbank CDS translations+PDB+SwissProt+PIR

pdb PDB protein sequences

swissprot SwissProt sequences

yeast Saccharomyces cerevisiae protein sequences

kabat Kabat Sequences of Proteins of Immunological Interest

alu Translations of Select Alu Repeats from REPBASE

ecoli E. coli genomic CDS translations

Drosophila genome Drosophila genome proteins provided by Celera and Berkeley

month All new or revised Genbank CDS translation+PDB+SwissProt+PIR sequences released in the last 30 days


Specifying Sequences

Filename
Data library specification
Accession number specification


Sequences within your own directories

Use the normal file specification:

lefkowit/sequences/vsvcg.seq


Sequences within a Data Library

Flatfile
  Data Library:Sequence Name
  sw:vglg_vsvsj - VSV G protein in the SwissProt library
  primate:humada - The sequence for human adenosine deaminase mRNA

SeqStore
  gcgprot:vglg_vsvsj
  gcgnuc:humada


Sequence Formats

GCG requires a specific sequence format
Sequences entered from outside GCG must be reformatted
  analyze% reformat
    GCG program
  analyze% readseq
    Non-GCG addition


Non-GCG Sequence File

analyze% cat seq.txt

ACGAAGACAAACAAACCATTATTATCATTAAAAGGCTC

AGGAGAAACTTTAACAGTAATCAAAATGTCTGTTACAG

TCAAGAGAATCATTGACAACACAG

analyze%


analyze% reformat

analyze% reformat -check seq.txt

Reformat rewrites sequence file(s), scoring matrix file(s), or enzyme

data file(s) so that they can be read by GCG programs.

Minimal Syntax: % reformat [-INfile=]reformat.txt -Default

Prompted Parameters: None

Local Data Files:

-DATa=translate.txt three-letter to one-letter codes


Optional Parameters:

[-OUTfile=]NewSeqName      names the output file
-EXTension=.seq            specifies a file name extension for the output
-LIStfile[=reformat.list]  writes a list file of output sequence names
-MSF                       reformats sequences into an MSF output file
-RSF                       reformats sequences into an RSF output file
-PROtein or -NUCleotide    insists that the sequences are reformatted as
                           protein or nucleotide sequences
-DEGap                     removes gap characters (. and ~) from the sequence
-LINesize=50               sets number of characters per line
-BLOcksize=10              sets number of characters per block
-BLAnklines=1              puts blank lines between the sequence lines
-NONUMbering               suppresses numbering
-NOCOMments                suppresses comments
-DNA                       changes U into T
-RNA                       changes T into U
-UPPer                     makes all sequence characters uppercase
-LOWer                     makes all sequence characters lowercase
-ONEIntothree              translates one-letter peptides into three-letter
-THReeintoone              translates three-letter peptides into one-letter
-NOHEAding                 input sequence from stdin contains no header information


-COMparison              reformats a scoring matrix instead of a sequence
                         (used with -PROtein or -NUCleotide, insists that the
                         matrix is reformatted as a protein or nucleotide
                         scoring matrix)
-GAPweight=12            specifies the gap creation penalty associated with
                         the scoring matrix
-LENgthweight=4          specifies the gap extension penalty associated with
                         the scoring matrix
-SCAle=10                multiplies each value in the scoring matrix by 10
                         (use any number from .01 to 100.0)
-EQUALSformat            writes the scoring matrix in a form that may be more
                         easily read
-OLDCMPformat            converts a pre-Version 9 scoring matrix into a
                         Version 9 scoring matrix (all options used with
                         -COMparison can also be used with -OLDCMPformat.
                         -PROtein or -NUCleotide must be specified with
                         -OLDCMPformat)
-TRANSlate=filename.txt  lets you name the translation table
-NOMONitor               suppresses the screen trace showing each output file

Add what to the command line ?

No ".." divider

seq.txt length: 100 bp

analyze%


analyze% cat seq.txt
!!NA_SEQUENCE 1.0
 REFORMAT of: seq.txt  check: 3430  from: 1  to: 100  April 9, 1998 14:31
 (No documentation)
 seq.txt  Length: 100  April 9, 1998 14:31  Type: N  Check: 3430  ..
       1  ACGAAGACAA ACAAACCATT ATTATCATTA AAAGGCTCAG GAGAAACTTT
      51  AACAGTAATC AAAATGTCTG TTACAGTCAA GAGAATCATT GACAACACAG
analyze%

Reformatted Sequence
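The numbered layout of the reformatted file (blocks of 10 symbols, 50 symbols per line, each line prefixed with the position of its first symbol) can be imitated in a few lines of Python. This is only a sketch of the layout, not of reformat itself: the function name `gcg_body` is made up for illustration, the column widths are a guess, and the header line and checksum are deliberately left out.

```python
def gcg_body(seq, line_size=50, block_size=10):
    """Lay out a raw sequence as numbered lines of space-separated blocks.

    line_size and block_size mirror the -LINesize=50 and -BLOcksize=10
    defaults listed in the reformat help; the number column width is assumed.
    """
    seq = "".join(seq.split())  # drop line breaks and spaces from the raw text
    lines = []
    for start in range(0, len(seq), line_size):
        chunk = seq[start:start + line_size]
        blocks = [chunk[i:i + block_size] for i in range(0, len(chunk), block_size)]
        # Positions are 1-based, as in the reformatted file.
        lines.append(f"{start + 1:>8}  {' '.join(blocks)}")
    return "\n".join(lines)


# The three raw lines of seq.txt from the earlier slide.
raw = """ACGAAGACAAACAAACCATTATTATCATTAAAAGGCTC
AGGAGAAACTTTAACAGTAATCAAAATGTCTGTTACAG
TCAAGAGAATCATTGACAACACAG"""
print(gcg_body(raw))
```

Run on the three raw lines of seq.txt, the sketch prints two numbered lines starting at positions 1 and 51, with the same blocks as the reformatted file.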


GCG Sequence Import Programs

fromstaden fromembl fromgenbank frompir fromig fromfasta fromtrace


GCG Sequence Export Programs

tostaden topir toig tofasta


ReadSeq

General reformatting program


analyze% readseq

analyze% readseq
readSeq (1Feb93), multi-format molbio sequence reader.

Name of output file (?=help, defaults to display): seq.fasta

  1. IG/Stanford           10. Olsen (in-only)
  2. GenBank/GB            11. Phylip3.2
  3. NBRF                  12. Phylip
  4. EMBL                  13. Plain/Raw
  5. GCG                   14. PIR/CODATA
  6. DNAStrider            15. MSF
  7. Fitch                 16. ASN.1
  8. Pearson/Fasta         17. PAUP/NEXUS
  9. Zuker (in-only)       18. Pretty (out-only)

Choose an output format (name or #): 8


Name an input sequence or -option: seq.txt
Name an input sequence or -option:

analyze% cat seq.fasta
>seq.txt, 100 bases, D66 checksum.
ACGAAGACAAACAAACCATTATTATCATTAAAAGGCTCAGGAGAAACTTTAACAGTAATCAAAATGTCTGTTACAGTCAAGAGAATCATTGACAACACAG
analyze%

ReadSeq Formatted Sequence


Sequence File Utilities

Chopup
  Break up long lines in a text file prior to running reformat

Breakup
  Break up long sequences into individual, overlapping sequence files


>uunt, 751719 bases, 1F08 checksum.
ATGGCTAATAATTATCAAACTTTATATGATTCAGCAATAAAAAGGATTCCATACGATCTTATTTCTGATCAAGCTTATGCAATTCTACAAAATGCTAAAACTCATAAAGTTTGCGATGGTGTTTTATATATAATTGTAGCCAATGCCTTTGAAAAAAGTATTATTAACGGTAATTTTATTAACATTATTTCTAAATATCTAAGCGAAGAATTCAAAAAGGAAAATATTGTTAATTTTGAATTTATTATAGACAATGAAAAATTATTAATTAATAGCAATTTTTTAATTAAAGAAACTAATATTAAAAATCGTTTTAATTTTAGTGATGAACTTTTACGTTACAATTTTAACAATTTAGTAATTAGTAATTTTAATCAAAAAGCGATTAAGGCGATTGAAAATTTATTTTCAAATAACTATGATAATAGTTCAATGTGTAACCCTTTATTTTTATTTGGTAAAGTTGGTGTTGGTAAAACGCATATCGTGGCTGCTGCTGGTAATCGTTTTGCTAATAGTAATCCTAATTTAAAAATTTATTATTATGAAGGGCAAGATTTTTTTCGAAAGTTTTGTTCTGCTTCGTTAAAAGGGACTAGTTATGTTGAAGAGTTTAAAAAAGAAATTGCTTCAGCAGATTTATTAATTTTTGAAGATATTCAAAATATCCAATCACGTGATTCAACGGCTGAATTGTTTTTTAATATCTTTAATGATATAAAATTAAATGGTGGAAAAATTATCTTAACATCTGACCGTACACCAAACGAACTTAATGGTTTTCATAATCGAATTATTTCGAGATTAGCGTCAGGTTTGCAGTGTAAAATTTCTCAACCCGACAAAAATGAAGCTATTAAAATTATTAATAATTGGTTTGAATTCAAAAAAAAATATCAAATTACTGACGAAGCTAAAGAATATATTGCTGAAGGTTTTCACACTGATATTAGACAGATGATtGGTAATCTAAAACAAATTTGTTTTTGAGCGGACAATGATACTAATAAAGATTTAATAATCACAAAAGATTATGTAATTGAGTGTTCAGTTGAAAACGAAATTCCACTAAATATTGTTGTTAAAAAACAATTTAAACC


analyze% readseq
readSeq (1Feb93), multi-format molbio sequence reader.

Name of output file (?=help, defaults to display): uunt.seq

  1. IG/Stanford           10. Olsen (in-only)
  2. GenBank/GB            11. Phylip3.2
  3. NBRF                  12. Phylip
  4. EMBL                  13. Plain/Raw
  5. GCG                   14. PIR/CODATA
  6. DNAStrider            15. MSF
  7. Fitch                 16. ASN.1
  8. Pearson/Fasta         17. PAUP/NEXUS
  9. Zuker (in-only)       18. Pretty (out-only)

Choose an output format (name or #): 5
Name an input sequence or -option: uunt
Name an input sequence or -option:


analyze% more uunt.seq
uunt
 uunt, Length: 751719 (today) Check: 7944 ..
       1  ATGGCTAATA ATTATCAAAC TTTATATGAT TCAGCAATAA AAAGGATTCC
      51  ATACGATCTT ATTTCTGATC AAGCTTATGC AATTCTACAA AATGCTAAAA
     101  CTCATAAAGT TTGCGATGGT GTTTTATATA TAATTGTAGC CAATGCCTTT
     151  GAAAAAAGTA TTATTAACGG TAATTTTATT AACATTATTT CTAAATATCT
     201  AAGCGAAGAA TTCAAAAAGG AAAATATTGT TAATTTTGAA TTTATTATAG
     251  ACAATGAAAA ATTATTAATT AATAGCAATT TTTTAATTAA AGAAACTAAT
     301  ATTAAAAATC GTTTTAATTT TAGTGATGAA CTTTTACGTT ACAATTTTAA
     351  CAATTTAGTA ATTAGTAATT TTAATCAAAA AGCGATTAAG GCGATTGAAA
     401  ATTTATTTTC AAATAACTAT GATAATAGTT CAATGTGTAA CCCTTTATTT
     451  TTATTTGGTA AAGTTGGTGT TGGTAAAACG CATATCGTGG CTGCTGCTGG
     501  TAATCGTTTT GCTAATAGTA ATCCTAATTT AAAAATTTAT TATTATGAAG
     551  GGCAAGATTT TTTTCGAAAG TTTTGTTCTG CTTCGTTAAA AGGGACTAGT
 ...
  751301  GAAAATAAAC TACGATTTGA TTAGAATGAA TTTTTTGTTG TTTCTTAATT
  751351  GTATCAAGTA TATCTTCATT TTTTTTTAGA CTAATAAAAT TAGCCATAAA
  751401  AATTATTTTT CACTAGAAAC TGTTAGACTA TGACGCCCTT TAAGTCTTCT
  751451  TCTAGCTAAA ACATTACGCC CATTTTTTGT TTTCATGCGT GCACGAAAAC
  751501  CATGCACTTT TGCTCTTTTA CGATTATTAG GTTGAAACGT TCTTTTCATA
  751551  AATCCACCGC CCTCTTACTT TTTTGAAAAC ATAATATGGA TTATTATAAC
  751601  ATTTTAGTTA TTTTTTATTT AATATATTTT TTTAAAAAAG TCAATGATAT
  751651  CTTTTTAAAA ATAAACATAT ATAATATGAT AATAGGACAA AGATTATTTA
  751701  TAAAAAATAG AGGTTACTA


analyze% map uunt.seq

Map maps a DNA sequence and displays both strands of the mapped sequence
with restriction enzyme cut points above the sequence and protein
translations below. Map can also create a peptide map of an amino acid
sequence.

***Error: Sequence "uunt.seq" could not be read or is not in GCG format

analyze% breakup uunt.seq

BreakUp reads a GCG-format sequence file containing more than 350,000
sequence characters and writes it as a set of separate, shorter,
overlapping sequence files that can be analyzed by Wisconsin Package programs.

uunt_0.seq length: 110000 bp
uunt_1.seq length: 110000 bp
uunt_2.seq length: 110000 bp
uunt_3.seq length: 110000 bp
uunt_4.seq length: 110000 bp
uunt_5.seq length: 110000 bp
uunt_6.seq length: 110000 bp
uunt_7.seq length: 51719 bp

analyze%
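The idea of overlapping segments can be sketched in Python. The 110,000-symbol segment length is taken from the listing above. The 10,000-symbol overlap is an assumption: it is the value that reproduces the eight file lengths shown for a 751,719-base sequence, and the slide does not state it. The function name `break_up` is hypothetical and this is not the BreakUp program's own algorithm.

```python
def break_up(seq, segment=110_000, overlap=10_000):
    """Split seq into consecutive pieces that share `overlap` symbols.

    segment comes from the BreakUp listing; overlap is an inferred,
    undocumented value (see the lead-in).
    """
    if overlap >= segment:
        raise ValueError("overlap must be smaller than segment")
    if not seq:
        return []
    step = segment - overlap  # distance between the starts of two pieces
    pieces = []
    start = 0
    while True:
        pieces.append(seq[start:start + segment])
        if start + segment >= len(seq):  # this piece reached the end
            break
        start += step
    return pieces


# A dummy sequence with the same length as uunt.
pieces = break_up("A" * 751_719)
print([len(p) for p in pieces])
```

With these settings the dummy sequence yields seven pieces of 110,000 symbols and a final piece of 51,719, matching the lengths of uunt_0.seq through uunt_7.seq.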


Specifying Multiple Sequences


Multiple sequences

If the program prompts with sequence(s), file(s), or file name(s), then it can accept more than one input file


Specifying Multiple Sequences

Wild Card Specification
File of File Names
  List Files
Multiple Sequence Format File


Wild card specification (flatfile)

GenEMBL:*      All sequences in Genbank and EMBL
Primate:*      All primate sequences in GenBank
Primate:Hum*   All Human sequences in GenBank
               EMBL uses HS for human


Wild card specification (SeqStore)

gcgnuc:* All sequences in Genbank and EMBL

Must create a query or list for most groupings


File of Sequence Names

List Files
  You or certain GCG programs can construct a file containing any number of sequence names.


Specify as @Sequence_names.fil

The @ tells the program that Sequence_names.fil is a file of sequence names

The program uses all listed sequences


Contents of a File of Sequence Names

Begin with a comment
Sequence file names follow a double period at the end of a line: ..
Other comments can be included if preceded by a !
One sequence name per line


File of Sequence Names...

Put an ! in front of a name to have the program ignore that particular entry.
A sequence name may include a wild card
The file can contain another file of sequence names as a listing
  It must be preceded by an @
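The rules for a file of sequence names can be summarized as a small Python reader. This is an illustrative sketch, not how GCG programs parse list files: the function name `read_list_file` is made up, the sample text (including `more_names.fil`) is invented, and treating a file without a `..` divider as empty is an assumption. Wild cards are returned unexpanded and trailing attributes such as `-BEGin=43` are ignored.

```python
def read_list_file(text):
    """Return (names, nested_lists) from the text of a file of sequence names."""
    lines = text.splitlines()
    # Everything up to and including the first line ending in ".." is comment.
    for i, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            body = lines[i + 1:]
            break
    else:
        body = []  # assumption: without a ".." divider nothing counts as a name
    names, nested = [], []
    for line in body:
        # "!" starts a comment; a leading "!" therefore disables the whole entry.
        entry = line.split("!", 1)[0].strip()
        if not entry:
            continue
        name = entry.split()[0]  # drop trailing attributes such as -BEGin=43
        if name.startswith("@"):
            nested.append(name[1:])  # another file of sequence names
        else:
            names.append(name)
    return names, nested


# Invented sample text that exercises each rule.
example = """Heat shock proteins, January 21, 1998 ..

SWP:Hs70_Human
!SWP:Hs70_Mouse
SWP:GR78_Yeast -BEGin=43 -END=682   ! fragment only
@more_names.fil
"""
print(read_list_file(example))
```

In the sample, the entry preceded by `!` is skipped, the trailing comment is dropped, and the `@` entry is reported separately so a caller could read it in turn.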


hsp70.fil File

January 21, 1998 ..

SWP:Hs70_Brelc
SWP:Hs70_Chick
SWP:Hs70_Human
SWP:Hs70_Leido
SWP:Hs70_Leima
SWP:Hs70_Maize
SWP:Hs70_Mouse
SWP:Hs70_Pethy
SWP:HS77_Yeast
SWP:GR78_Yeast -BEGin=43 -END=682
sequences/hsp70/ssa4.pep
ob0/users/lefkowit/sequences/hsp70/ssa1.pep
SWP:DNAK_EColi


Multiple Sequence Files (msf)

File containing multiple sequences that are related and have been aligned

Specifying msf files: filename.msf{*}
  The {*} indicates which sequences are to be used

You can exclude a sequence in subsequent analyses by preceding its name within the msf file with an ! sign.


hsp70.msf

PileUp of: @Hsp70.Fil

 Symbol comparison table: GenRunData:NWSGapPep.Cmp  CompCheck: 1254
 GapWeight: 3.0
 GapLengthWeight: 0.1

 Pileup.Msf  MSF: 738  Type: P  December 26, 1990 13:39  Check: 288 ..

 Name: Hs70_Plafa  Len: 738  Check: 9820  Weight: 1.00
 Name: Hs70_Thean  Len: 738  Check: 120  Weight: 1.00
!Name: Hs70_Leido  Len: 738  Check: 7985  Weight: 1.00

//

            1                                                   50
Hs70_Plafa  .......... .....MASAK GSKPNLPESN IAIGIDLGTT YSCVGVWRNE
Hs70_Thean  .......... .......... .......MTG PAIGIDLGTT YSCVAVYKDN
Hs70_Leido  .......... .......... ......MTFD GAIGIDLGTT YSCVGVWQNE

            51                                                 100
Hs70_Plafa  NVDIIANDQG NRTTPSYVAF T.DTERLIGD AAKNQVARNP ENTVFDAKRL
Hs70_Thean  NVEIIPNDQG NRTTPSYVAF T.DTERLIGD AAKNQEARNP ENTIFDAKRL
Hs70_Leido  RVDIIANDQG NRTTPSYVAF TSDSERLIGD AAKNQVAMNP HNTVFDAKRL


rsf Files

Rich Sequence Format
Allows entry of additional information about each sequence
File can contain multiple sequences
  Allows gaps
  Different sequences do not need to be related
Create and Edit rsf files within SeqLab


rsf Sequence Information

Creator/author of the sequence
Sequence weight
Creation date
One-line description of the sequence
Offset, or the number of leading gaps in a sequence that is part of an alignment or fragment assembly project
Known sequence features


rsf File Specification

Similar to msf files

hsp70.rsf{*}            Use all the sequences in the file
hsp70.rsf{hs70_human}   Only use this single sequence
hsp70.rsf{hs70*}        Only use sequences whose name starts with hs70
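The `file{pattern}` notation can be illustrated with a short Python sketch that splits a specification into its file name and pattern and then filters a list of sequence names. The helper names `parse_spec` and `select` are hypothetical, the sample names are borrowed from the hsp70 examples, and case-insensitive matching is an assumption made because the slide writes `hs70_human` for an entry listed elsewhere as `Hs70_Human`.

```python
import re
from fnmatch import fnmatchcase


def parse_spec(spec):
    """Split 'hsp70.rsf{hs70*}' into ('hsp70.rsf', 'hs70*')."""
    m = re.fullmatch(r"(?P<file>[^{}]+)\{(?P<pattern>[^{}]+)\}", spec)
    if m is None:
        raise ValueError(f"not a file{{pattern}} specification: {spec!r}")
    return m.group("file"), m.group("pattern")


def select(spec, names):
    """Return the names matched by the pattern inside the braces."""
    _, pattern = parse_spec(spec)
    # Assumption: names are compared without regard to case.
    return [n for n in names if fnmatchcase(n.lower(), pattern.lower())]


names = ["Hs70_Human", "Hs70_Mouse", "DnaK_Ecoli"]
print(select("hsp70.rsf{*}", names))
print(select("hsp70.rsf{hs70_human}", names))
print(select("hsp70.rsf{hs70*}", names))
```

The three calls correspond to the three specifications on the slide: every sequence, one named sequence, and all sequences whose name starts with hs70.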


analyze% more rsb.rsf
!!RICH_SEQUENCE 1.0
..
{
name  dc-62-18537
descrip  Description: PileUp of: *.seq
type  DNA
longname  dc-62-18537
checksum  8717
creation-date  4/10/98 15:45:50
strand  1
sequence
  TCCACCGTGCTCGACACAATCACTCCAAAATACACAATCCAACAGCAATCCCTCCACTCA
  ACCACCTCCGAAAACACACCCAGCTCCACACAAATACCCACAGCATCCGAGCCCTCCACA
  TTAAATCCTAAT
}
{
name  swed-60-860
descrip  Description: PileUp of: *.seq
type  DNA
longname  swed-60-860
checksum  8595
creation-date  4/10/98 15:45:50
strand  1
sequence
  TCCACCGTGATCGACACAATCACTCCAAAATACACAATCCAACAGCAATCCCTCCACTCA
  ACCACCTCCGAAAACACACCCAGCTCCACACAAATACCCACAGCATCCGAGCCCTCCACA
  TCAAATCCTACT
}


Finding and Displaying Sequences


List Refinement

Run search program 1
  Create a list of file names
Use as input to search program 2
  Create a second list of file names
Edit the listfile at each step as necessary.
etc.


Programs Which Create a List of Sequences

Names Blast Lookup StringSearch FindPatterns FastA TFastA


Names

Searches sequence names for a match
  analyze% names primate:Hum*
  Will create a file listing all human sequences present in GenBank
Dependent on knowing name features
  GenBank:Hum*
  EMBL:Hs*


analyze% names -check pr:huma*

Names identifies GCG data files and sequence entries by name. It can
show you what set of sequences is implied by any sequence specification.

Minimal Syntax: % names [-INfile=]GenEMBL:Humhb* -Default

Prompted Parameters:

[-OUTfile=]Term   output file name (defaults to your terminal)

Options:

-SHOwfiles=132    limits documentation in the output file to column 132
-NOHEAding        suppresses the heading at the top of the file.
-NOMONitor        suppresses the screen monitor

Add what to the command line ?

What (file of filenames) output file (* TERM *) ?

gb_pr1:

huma1aadr  huma1acm  huma1acmb  huma1ar1
huma1ar2   huma1at   huma1ata   huma1atb


analyze% more list.file
!!SEQUENCE_LIST 1.0
! NAMES from: pr:huma* April 13, 1998 14:55 ..

gb_pr1:huma1aadr LOCUS HUMA1AADR 2002 bp mRNA PRI 04-NOV-1991 DEFINITION Human alpha-A1-adrenergic receptor mRNA, complete cds. ACCE

gb_pr1:huma1acm LOCUS HUMA1ACM 1520 bp mRNA PRI 30-OCT-1994 DEFINITION Human alpha-1-antichymotrypsin (AACT) mRNA, complete cds. ACC

gb_pr1:huma1acmb LOCUS HUMA1ACMB 559 bp DNA PRI 30-OCT-1994 DEFINITION Human alpha-1-antichymotrypsin gene, exon 1. ACCESSION M18035

gb_pr1:huma1ar1 LOCUS HUMA1AR1 890 bp DNA PRI 30-OCT-1994 DEFINITION Human alpha-1-antitrypsin-related protein gene, exon 2. ACCESSI

gb_pr1:huma1ar2 LOCUS HUMA1AR2 3758 bp DNA PRI 30-OCT-1994 DEFINITION Human alpha-1-antitrypsin-related protein gene, exons 3, 4 and

gb_pr1:huma1at LOCUS HUMA1AT 143 bp mRNA PRI 30-OCT-1994 DEFINITION Human alpha-1-antitrypsin (alpha-1-AT) mRNA, 3' end. ACCESSION M

gb_pr1:huma1ata LOCUS HUMA1ATA 322 bp DNA PRI 30-OCT-1994 DEFINITION Human alpha-1-antitrypsin gene, exon 1 (unexpressed). ACCESSION

gb_pr1:huma1atb LOCUS HUMA1ATB 1345 bp mRNA PRI 30-OCT-1994 DEFINITION Human alpha-1-antitrypsin mRNA, complete cds. ACCESSION M1146


StringSearch

Old search method
Searches for a particular text pattern in the sequence documentation.
  Definition Search
  Record Search
    Complete search for possible text occurrences
Very Slow!!


Lookup (gcgff only)

Rapid Text Pattern Searching
Uses an index of sequence file documentation
Allows field-specific searches
Allows AND; OR; NOT matching


Lookup Considerations

Be sure that analyze is set to use a vt100 terminal:
  analyze% setenv TERM vt100
Lookup may miss some sequences
  Dependent on the annotation
  Spelling counts
Searches are case-insensitive


Logical Operators Within a Field

AND: &
  A & B means find all entries that contain both A and B.
OR: |
  A | B means find all entries that contain either A or B.
BUT-NOT: !
  A ! B means find all entries that contain A but do not contain B.
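The meaning of the three operators can be made concrete with a toy Python evaluator. It is a sketch under simplifying assumptions, not LookUp's implementation: the function name `matches` is invented, an expression may contain at most one operator, terms are compared with whole whitespace-separated words, and LookUp's index, automatic wildcard extension, and precedence rules are not modeled. Matching ignores case, as the slides say LookUp searches do.

```python
def matches(expression, text):
    """Evaluate a single 'A & B', 'A | B' or 'A ! B' expression against text."""
    words = set(text.lower().split())
    for op in "&|!":
        if op in expression:
            left, right = (term.strip().lower() for term in expression.split(op, 1))
            has_left, has_right = left in words, right in words
            if op == "&":
                return has_left and has_right      # both A and B
            if op == "|":
                return has_left or has_right       # either A or B
            return has_left and not has_right      # A but not B
    # No operator: a plain single-term query.
    return expression.strip().lower() in words


definition = "Human beta globin region"
print(matches("globin & region", definition))
print(matches("globin ! region", definition))
print(matches("globin | kinase", "protein kinase C"))
```

For the sample definition line, the AND query succeeds, the BUT-NOT query fails because "region" is present, and the OR query succeeds on the second text because "kinase" is present.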


analyze% lookup -check

LookUp identifies sequence database entries by name, accession number,
author, organism, keyword, title, reference, feature, definition, length,
or date. The output is a list of sequences.

The LookUp program is experimental in this release. LookUp sometimes
crashes or produces incorrect results if you query a nucleic acid
database and request fragment output. Please look carefully at your
results.

Minimal Syntax: % lookup [-ALLtext=]Globin -Default


Prompted Parameters:

-LIBrary=SwissProt[,...]   lookup in specified data libraries
-ALLtext=Globin            searches all text indices for globin
-DEFInition=Globin         words indexed independently "Globin & Region"
-AUThor=Smithies           for more than one "Smithies,O. & Slightom,J.L."
-KEYword=Globin            see document before using keywords
-NAMe=hsggl3               entry name
-ACCessionnumber=S12345    accession number
-ORGanism="Homo Sapiens"   genus and species
-REFerence=Cell&1981       complete reference: "Cell & 26 & 191- & 1981"
-TITle=History             title of citation "History & Duplication"
-FEAture=Gamma             any word in a feature table
-SHOrtest=100              find only sequences of length 100 or more
-LONgest=400               find only sequences of length 400 or less
-EARliest=01-apr-1992      sequences modified on or after April 1, 1992
-LATest=30-apr-1992        sequences modified on or before April 30, 1992
-MATch=OR                  specifies inter-field logic (AND is default)
-OUTfile=lookup.list       output file for list of sequences


Optional Parameters:

-NOWILdcardextension      turns off automatic wildcard extension
[email protected]         searches in lookup.list instead of libraries
-ANNotate=FEAture[,...]   shows fields from original annotation in output
                          acceptable values include: ACCession, AUThor, DATe,
                          DEFinition, FEAture, NAMe, KEYword, ORGanism,
                          REFerence, and TITle
-FRAgments                shows features as fragments instead of whole entries
-COMplete                 shows only features with unambiguous coordinates
-MONitor                  shows databases searched and how many hits found

Add what to the command line ?


LOOKUP in what sequence libraries:

  a) swissprot
  b) sptrembl
  c) pir
  d) embl
  e) genbank
  f) em_tags
  g) gb_tags
  h) All libraries
  q) quit

Please choose one or more (* h *):


Complete the query form below:

  All text:
  Definition:
  Author:
  Keyword:
  Sequence name:
  Accession number:
  Organism:
  Reference:
  Title:
  Feature:
  On or after (dd-mmm-yy):
  On or before (dd-mmm-yy):
  Shortest sequence length:
  Longest sequence length:
  Inter-field operator: AND
  Form of output list: Whole Entries

Press <Ctrl>D to continue.


SeqStore

Sequence searching


Lookup_rdb (gcgrdb)

Seqstore command-line sequence searching

Barebones – Use Seqstore Web interface


SeqStore Web Searching

Set up multiple criteria for selecting sets of sequences
Save as a query or list
  Query: Active list. Changes as new sequences are added
  List: Static list. No change with database updates
Save to SeqWeb
Powerful but can be slow


NCBI Sequence Services

Obtain sequences directly from NCBI
  Sequence Searches
  Sequence Retrieval
Other services
  BLAST Searches
  Sequence Submission
  PubMed Searches


Entrez

NCBI Databases on the Web
  Sequence retrieval
  Text pattern searches
GenBank is updated on a daily basis
Web Site: http://www.ncbi.nlm.nih.gov


Finding Sequences by Similarity

Using GCG


Sequence Similarities

What other sequences have some primary sequence similarity to my query sequence?

Time and cost of the search depend on the size of the database
  Restrict the size of the database


FindPatterns

Look for sequence patterns within sequence files

Allows complex pattern definitions
  Ambiguous sequence specifications


BLAST; NetBlast

All search combinations possible
  nt vs. nt database: blastn
  protein vs. protein database: blastp
  translated nt vs. protein database: blastx
  protein vs. translated nt database: tblastn
  translated nt vs. translated nt database: tblastx


FastA

Search nucleotide sequences with a nucleotide query

Search protein sequences with a peptide query


TFastA

Translates nucleotide sequences in all 6 reading frames

Search the translated sequences with a peptide query
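Translation in all six reading frames means three frames on the given strand and three on its reverse complement. The Python sketch below shows the idea using the standard genetic code. It is not TFastA's code: the names `translate` and `six_frames` are invented, ambiguity codes are not complemented, and any codon containing one is reported as X.

```python
# Standard genetic code, written in the usual T, C, A, G codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    dna = dna.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # X for codons with other symbols
        for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the six translations keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper().replace("U", "T")
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


for name, peptide in sorted(six_frames("ATGGCCTAA").items()):
    print(name, peptide)
```

A peptide query would then be compared against each of the six translated strings rather than against the nucleotide sequence itself.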


Displaying Data

analyze% typedata
  Displays on your screen the contents of any GCG data file
  -REF
    Display documentation only


Copying Data

analyze% fetch
  Will copy any GCG data or sequence file to your directory


Sequence Symbols

Sequence symbols
  Handout lists the sequence symbols recognized by GCG
  Ambiguity codes are as proposed by the IUB nomenclature committee
  Used by GenBank, EMBL, and NBRF


Nucleotide Symbols

IUB/GCG   Meaning                Complement   Staden/Sanger
A         A                      T            A
C         C                      G            C
G         G                      C            G
T/U       T                      A            T
M         A or C                 K            5
R         A or G                 Y            R
W         A or T                 W            7
S         C or G                 S            8
Y         C or T                 R            Y
K         G or T                 M            6
V         A or C or G            B            not supported
H         A or C or T            D            not supported
D         A or G or T            H            not supported
B         C or G or T            V            not supported
X/N       G or A or T or C       X            -/X
. (Gap)   not G or A or T or C   .            not supported
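The Complement column can be applied directly to compute a reverse complement that respects the ambiguity codes. The Python sketch below copies the pairs from the table, including its choice of X as the complement of both X and N. The function name `reverse_complement` is made up, and preserving the case of each input symbol is an assumption, not something the table specifies.

```python
# Complement pairs copied from the Nucleotide Symbols table.
IUB_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A", "U": "A",
    "M": "K", "K": "M", "R": "Y", "Y": "R", "W": "W", "S": "S",
    "V": "B", "B": "V", "H": "D", "D": "H",
    "X": "X", "N": "X", ".": ".",
}


def reverse_complement(seq):
    """Reverse seq and complement every symbol, keeping upper/lower case."""
    out = []
    for symbol in reversed(seq):
        comp = IUB_COMPLEMENT[symbol.upper()]  # KeyError for unknown symbols
        out.append(comp.lower() if symbol.islower() else comp)
    return "".join(out)


print(reverse_complement("AAGC"))
print(reverse_complement("ACGTMRWSYKVHDBN"))
```

Symbols that are their own complement (W, S, X and the gap character) pass through unchanged apart from the reversal.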

Amino Acid Symbols

IUB Symbol   3-letter   Meaning                Codons                     Depiction
A            Ala        Alanine                GCT,GCC,GCA,GCG            !GCX
B            Asp,Asn    Aspartic, Asparagine   GAT,GAC,AAT,AAC            !RAY
C            Cys        Cysteine               TGT,TGC                    !TGY
D            Asp        Aspartic               GAT,GAC                    !GAY
E            Glu        Glutamic               GAA,GAG                    !GAR
F            Phe        Phenylalanine          TTT,TTC                    !TTY
G            Gly        Glycine                GGT,GGC,GGA,GGG            !GGX
H            His        Histidine              CAT,CAC                    !CAY
I            Ile        Isoleucine             ATT,ATC,ATA                !ATH
K            Lys        Lysine                 AAA,AAG                    !AAR
L            Leu        Leucine                TTG,TTA,CTT,CTC,CTA,CTG    !TTR,CTX,YTR;YTX
M            Met        Methionine             ATG                        !ATG
N            Asn        Asparagine             AAT,AAC                    !AAY
P            Pro        Proline                CCT,CCC,CCA,CCG            !CCX
Q            Gln        Glutamine              CAA,CAG                    !CAR
R            Arg        Arginine               CGT,CGC,CGA,CGG,AGA,AGG    !CGX,AGR,MGR;MGX
S            Ser        Serine                 TCT,TCC,TCA,TCG,AGT,AGC    !TCX,AGY;WSX
T            Thr        Threonine              ACT,ACC,ACA,ACG            !ACX
V            Val        Valine                 GTT,GTC,GTA,GTG            !GTX
W            Trp        Tryptophan             TGG                        !TGG
X            Xxx        Unknown                                           !XXX
Y            Tyr        Tyrosine               TAT,TAC                    !TAY
Z            Glu,Gln    Glutamic, Glutamine    GAA,GAG,CAA,CAG            !SAR
*            End        Terminator             TAA,TAG,TGA                !TAR,TRA;TRR
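The Depiction column writes each codon set compactly using nucleotide ambiguity codes. As a small check of how that notation works, the Python sketch below expands a depiction such as GCX or RAY into the codons it covers. The function name is invented for this example, and the symbol meanings are the IUB/GCG ones (with X standing for any base).

```python
from itertools import product

# Bases represented by each IUB/GCG nucleotide symbol.
IUB_MEANING = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}


def expand_depiction(pattern):
    """Return the sorted list of codons matched by an ambiguous depiction."""
    choices = (IUB_MEANING[symbol] for symbol in pattern.upper())
    return sorted("".join(codon) for codon in product(*choices))


if __name__ == "__main__":
    print(expand_depiction("GCX"))  # the four alanine codons
    print(expand_depiction("RAY"))  # the Asp and Asn codons (symbol B)
    print(expand_depiction("ATH"))  # the three isoleucine codons
```

Expanding a depiction and comparing it with the Codons column is a quick way to confirm that a row of the table was transcribed correctly.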

Other Stuff

Non-sequence Data

Non-Sequence Data

Non-sequence data: data required to run a program

Copy to your directory with Fetch

Local Data Files

Copies of GCG Data files stored in your own directory.

May be altered as desired.

Using Local Data Files

Programs look first in your default directory for a data file with the expected name.

If no local copy is found, the public data file is used.

A user may specify a different data file name when running a program.
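The lookup order described above can be sketched in a few lines of Python. This only illustrates the rule (explicit name first, then a local copy, then the public file); it is not GCG code, and the function name, parameters, and directory layout are hypothetical.

```python
from pathlib import Path


def resolve_data_file(name, local_dir, public_dir, override=None):
    """Pick the data file a program would read, following the slide's rule.

    name       -- default file name the program expects (e.g. "enzyme.dat")
    local_dir  -- the user's default (working) directory
    public_dir -- directory holding the public data files (hypothetical path)
    override   -- file explicitly named by the user, if any
    """
    if override is not None:
        return Path(override)          # the user named a file explicitly
    local_copy = Path(local_dir) / name
    if local_copy.is_file():
        return local_copy              # a local copy takes precedence
    return Path(public_dir) / name     # otherwise use the public file


if __name__ == "__main__":
    print(resolve_data_file("enzyme.dat", ".", "/hypothetical/public/data"))
```

This is why a file copied with Fetch and then edited changes a program's behavior without any extra options: the local copy is found before the public one.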

Restriction Enzyme Files

REBASE (enzyme.dat)

REBASE 6/2000

Dr. Richard J. Roberts, Cold Spring Harbor Laboratory

Used by: Map, MapSort, MapPlot

Prosite

Dictionary of sequence motifs

Dr. Amos Bairoch, University of Geneva

Release 16, 7/1999

Over 1300 patterns

Used by: Motifs

Profiles

Database of peptide profiles

Drs. Michael Gribskov and Amos Bairoch

Over 600 profiles

Used by: ProfileScan

Eukaryotic Transcription Factor Recognition Sites

Transcription Factor Database

Dr. David Ghosh, NCBI

Release 7.5, 3/96

genmoredata:tfsites.dat

Used by: FindPatterns, Map, MapSort, MapPlot

Codon Frequency Tables

Frequency of particular codon usage

Look in genmoredata

Organisms: Human, E. coli, Drosophila

Used by: BackTranslate, CodonPreference
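To make "frequency of particular codon usage" concrete, the Python sketch below counts the codons in a coding sequence and reports each as a rate per thousand codons. It is only an illustration of the kind of numbers such a table holds; it does not read or write the GCG table format, and the function names are invented for this example.

```python
from collections import Counter


def codon_counts(cds):
    """Count complete codons in a coding sequence read in frame 1."""
    cds = cds.upper()
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))


def per_thousand(counts):
    """Convert raw codon counts to occurrences per 1000 codons."""
    total = sum(counts.values())
    return {codon: 1000.0 * n / total for codon, n in counts.items()}


if __name__ == "__main__":
    counts = codon_counts("GCTGCTGCCAAA")
    for codon, rate in sorted(per_thousand(counts).items()):
        print(codon, counts[codon], rate)
```

Tables built this way differ between organisms, which is why a program that chooses or scores codons should use the table for the organism of interest.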

Translation Tables

Standard table for translating nucleotide sequences into amino acid sequences

Look in genmoredata

Alternate translation tables: Mitochondria, Mycoplasma

Used by: Translate, Map, Frames

Symbol Comparison Tables

Amino acid similarities

What is the chance that one amino acid can substitute for another without affecting function?

Used by all sequence comparison programs: FastA, TFastA, Blast, Gap, BestFit, PileUp

Protein Analysis Data

Amino acid properties: charge, hydrophobicity, molecular weight, secondary structure predictions, etc.

Protease digestion sites

Used by: PepPlot, PlotStructure

Free Energy Values

RNA secondary structure prediction

Used by: Mfold, FoldRNA

