
GBIO0009-1 Bioinformatics Introduction to DB. Instructors Practical sessions Kyrylo Bessonov...

Date post: 12-Jan-2016
Upload: christina-eaton
Page 1: GBIO0009-1 Bioinformatics Introduction to DB. Instructors Practical sessions Kyrylo Bessonov (Kirill) Office: B37 1/16 kbessonov@ulg.ac.be Office hours:

GBIO0009-1 Bioinformatics
Introduction to DB


Instructors

• Practical sessions: Kyrylo Bessonov (Kirill)
• Office: B37 1/16
• Email: kbessonov@ulg.ac.be
• Office hours: by appointment


Overview

1. Introduction to public databases

2. Databases demo HW

3. The submission system


What are we looking for?

Data & databases


Biologists Collect Lots of Data

• Hundreds of thousands of species to explore
• Millions of written articles in scientific journals
• Detailed genetic information:
  • gene names
  • phenotype of mutants
  • location of genes/mutations on chromosomes
  • linkage (distances between genes)
• High-throughput lab technologies
  • PCR
  • Rapid, inexpensive DNA sequencing (Illumina HiSeq)
  • Microarrays (Affymetrix)
  • Genome-wide SNP chips / SNP arrays (Illumina)
• Data must be stored such that
  • Minimum data quality is checked
  • It is well annotated according to standards
  • It is made available to the wide public to foster research


What is a database?

• Organized collection of data
• Information is stored in "records", "fields" and "tables"
• Fields are categories
  • Must contain data of the same type (e.g. columns below)
• Records contain data that is related to one object (e.g. protein, SNP) (e.g. rows below)

SNP ID      SNPSeqID            Gene                          +primer                    -primer
D1Mit160_1  10.MMHAP67FLD1.seq  lymphocyte antigen 84         AAGGTAAAAGGCAATCAGCACAGCC  TCAACCTGGAGTCAGAGGCT
M-05554_1   12.MMHAP31FLD3.seq  procollagen, type III, alpha  TGCGCAGAAGCTGAAGTCTA       TTTTGAGGTGTTAATGGTTCT
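
The records/fields idea above can be tried out with a small in-memory table. The following is only a sketch using Python's built-in sqlite3 module; the table name `snp`, the column names and the helper `gene_for_snp` are assumptions made for illustration and are not part of the course material.

```python
import sqlite3

# In-memory database; the table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE snp (
        snp_id         TEXT PRIMARY KEY,  -- each field holds one type of data
        snp_seq_id     TEXT,
        gene           TEXT,
        forward_primer TEXT,              -- the "+primer" column
        reverse_primer TEXT               -- the "-primer" column
    )
    """
)

# Each tuple is one record: all the data related to a single SNP.
records = [
    ("D1Mit160_1", "10.MMHAP67FLD1.seq", "lymphocyte antigen 84",
     "AAGGTAAAAGGCAATCAGCACAGCC", "TCAACCTGGAGTCAGAGGCT"),
    ("M-05554_1", "12.MMHAP31FLD3.seq", "procollagen, type III, alpha",
     "TGCGCAGAAGCTGAAGTCTA", "TTTTGAGGTGTTAATGGTTCT"),
]
conn.executemany("INSERT INTO snp VALUES (?, ?, ?, ?, ?)", records)


def gene_for_snp(snp_id):
    """Return the gene field of the record with the given SNP ID, or None."""
    row = conn.execute(
        "SELECT gene FROM snp WHERE snp_id = ?", (snp_id,)
    ).fetchone()
    return row[0] if row else None


print(gene_for_snp("D1Mit160_1"))
```

Looking a record up by its key (here the SNP ID) and reading one field from it is the basic operation that every database search in the rest of this session builds on.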


Genome sequencing generates lots of data


Biological Databases

The number of databases is constantly growing!
• OBRC (Online Bioinformatics Resources Collection) listed over 2826 databases in 2013


Main databases by category

Literature
• PubMed: scientific & medical abstracts/citations
Health
• OMIM: Online Mendelian Inheritance in Man
Nucleotide Sequences
• Nucleotide: DNA and RNA sequences
Genomes
• Genome: genome sequencing projects by organism
• dbSNP: short genetic variations
Genes
• Protein: protein sequences
• UniProt: protein sequences and related information
Chemicals
• PubChem Compound: chemical information with structures, information and links
Pathways
• BioSystems: molecular pathways with links to genes, proteins
• KEGG Pathway: information on main biological pathways


Growth of UniProtKB database

• UniProtKB contains mainly protein sequences (entries). The database growth is exponential

• Data management issues? (e.g. storage, search, indexing?)

Source: http://www.ebi.ac.uk/uniprot/TrEMBLstats

[Figure: UniProtKB growth plot; y-axis: number of entries]


Primary and Secondary Databases

Primary databases: REAL EXPERIMENTAL DATA (raw)

Biomolecular sequences or structures and associated annotation information (organism, function, mutation linked to disease, functional/structural patterns, bibliographic references, etc.)

Secondary databases: DERIVED INFORMATION (analyzed and annotated)

Results of analyses of the primary data held in the primary sources (patterns, blocks, profiles, etc., which represent the most conserved features of multiple alignments)


Primary Databases

Sequence Information
– DNA: EMBL, GenBank, DDBJ
– Protein: SwissProt, TrEMBL, PIR, OWL

Genome Information
– GDB, MGD, ACeDB

Structure Information
– PDB, NDB, CCDB/CSD


Secondary Databases

Sequence-related Information
– ProSite, Enzyme, REBASE

Genome-related Information
– OMIM, TRANSFAC

Structure-related Information
– DSSP, HSSP, FSSP, PDBFinder

Pathway Information
– KEGG, Pathways


GenBank database

• Contains publicly available nucleotide sequences (with their protein translations) described in the scientific literature or collected in publicly funded research

• One can search by protein name to get DNA/mRNA sequences

• The search results can be filtered by species and other parameters
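
The same kind of search can also be expressed as a query string for the NCBI E-utilities web service. The sketch below only builds the URL and does not contact the server. The helper name `genbank_search_url` and the default `retmax` value are assumptions for illustration, and the `[Protein Name]` and `[Organism]` field tags should be checked against the current NCBI help pages before relying on them.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities "esearch" service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def genbank_search_url(protein_name, organism=None, retmax=20):
    """Build an esearch URL for the nucleotide database (db=nuccore).

    The field tags in square brackets restrict each term to one search
    field, mirroring the species filter on the web page.
    """
    term = f"{protein_name}[Protein Name]"
    if organism:
        term += f" AND {organism}[Organism]"
    params = {"db": "nuccore", "term": term, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = genbank_search_url("actin", organism="Saccharomyces cerevisiae")
print(url)
```

Opening the printed URL in a browser should return a list of matching record identifiers, which corresponds to searching by protein name and then filtering by species on the GenBank page.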


GenBank main fields


NCBI Databases contain more than just DNA & protein sequences

NCBI main portal: http://www.ncbi.nlm.nih.gov/


Fasta format to store sequences

Saccharomyces cerevisiae strain YC81 actin (ACT1) gene
GenBank: JQ288018.1

>gi|380876362|gb|JQ288018.1| Saccharomyces cerevisiae strain YC81 actin (ACT1) gene, partial cds
TGGCATCATACCTTCTACAACGAATTGAGAGTTGCCCCAGAAGAACACCCTGTTCTTTTGACTGAAGCTCCAATGAACCCTAAATCAAACAGAGAAAAGATGACTCAAATTATGTTTGAAACTTTCAACGTTCCAGCCTTCTACGTTTCCATCCAAGCCGTTTTGTCCTTGTACTCTTCCGGTAGAACTACTGGTATTGTTTTGGATTCCGGTGATGGTGTTACTCACGTCGTTCCAATTTACGCTGGTTTCTCTCTACCTCACGCCATTTTGAGAATCGATTTGGCCGGTAGAGATTTGACTGACTACTTGATGAAGATCTTGAGTGAACGTGGTTACTCTTTCTCCACCACTGCTGAAAGAGAAATTGTCCGTGACATCAAGGAAAAACTATGTTACGTCGCCTTGGACTTCGAGCAAGAAATGCAAACCGCTGCTCAATCTTCTTCAATTGAAAAATCCTACGAACTTCCAGATGGTCAAGTCATCACTATTGGTAAC

• The FASTA format is now universal across databases and software that handle DNA and protein sequences

• Specifications:
  • One header line
  • The header starts with > and ends with [return]
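
To make the format concrete, here is a minimal sketch of a FASTA reader in plain Python. The function name `parse_fasta` is an assumption for illustration; established libraries such as Biopython provide more complete parsers. The example text reuses the header and the first 60 bases of the ACT1 record shown above, wrapped over two lines.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):          # a header line starts a new record
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # a sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Header and first 60 bases of the ACT1 record, wrapped over two lines.
example = """>gi|380876362|gb|JQ288018.1| Saccharomyces cerevisiae strain YC81 actin (ACT1) gene, partial cds
TGGCATCATACCTTCTACAACGAATTGAGA
GTTGCCCCAGAAGAACACCCTGTTCTTTTG
"""

for header, sequence in parse_fasta(example):
    accession = header.split("|")[3]      # fourth "|"-separated field
    print(accession, len(sequence))
```

The header is kept as one string and the sequence lines are joined, so a record wrapped over many lines and the same record on a single line parse to the same result.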


OMIM database

Online Mendelian Inheritance in Man (OMIM)

• "Information on all known mendelian disorders linked to over 12,000 genes"
• "Started in the 1960s by Dr. Victor A. McKusick as a catalog of mendelian traits and disorders"
• Linked disease data
• Links disease phenotypes and causative genes
• Used by physicians and geneticists


OMIM – basic search

• Online tutorial: http://www.openhelix.com/OMIM
• Each search result entry has a *, +, # or % symbol
  • # entries are the most informative, as the molecular basis of the phenotype-genotype association is known

• We will search for: Ankylosing spondylitis (AS)
  • AS is characterized by chronic inflammation of the spine
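
As a memory aid for the entry symbols, the sketch below maps each symbol to its meaning as summarized from the OMIM documentation; the wording of the descriptions is paraphrased and the function name `classify_omim_entry` is an assumption for illustration.

```python
# Meaning of the symbol that precedes an OMIM (MIM) number.
OMIM_PREFIXES = {
    "*": "gene",
    "+": "gene and phenotype, combined",
    "#": "phenotype description, molecular basis known",
    "%": "phenotype description or locus, molecular basis unknown",
}


def classify_omim_entry(entry):
    """Split an entry such as '*607562' into (MIM number, meaning of symbol)."""
    entry = entry.strip()
    if entry and entry[0] in OMIM_PREFIXES:
        return entry[1:], OMIM_PREFIXES[entry[0]]
    return entry, "other (no symbol)"


print(classify_omim_entry("*607562"))
```

An entry marked with * describes a gene, while an entry marked with # describes a phenotype whose molecular basis is known, which is why # entries are a good starting point when looking for disease-linked genes.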


OMIM – search results

• Look for the entries that link to genes. Apply filters if needed

Filter results if a known SNP is associated with the entry

Some of the interesting entries. Try to look for the ones with the # sign


OMIM – entries


OMIM – Gene ID entries


OMIM – finding disease-linked genes

• Read the report of the top gene-linked phenotype
  • Mapping – Linkage heterogeneity section

• Go back to the original results
  • Previously seen entry *607562 – IL23R


PubMed database

• PubMed is one of the best-known databases in the scientific community

• Most biology-related literature from all related fields is indexed by this database

• It has a very powerful mechanism for constructing search queries
  • Many search fields
  • Logical operators (AND, OR)

• Provides electronic links to most journals
• Example: searching by author for articles published in 2012-2013
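
A query of this kind can be assembled from field tags and logical operators. The sketch below is one possible way to do it in Python. The helper name `pubmed_query` is an assumption for illustration, `[Author]` and `[dp]` (date of publication) are PubMed field tags, and the author name is used only as an example value.

```python
def pubmed_query(author=None, keywords=(), year_from=None, year_to=None):
    """Combine search fields with the AND operator into one PubMed query."""
    parts = [f"({keyword})" for keyword in keywords]
    if author:
        parts.append(f"{author}[Author]")
    if year_from is not None:
        year_to = year_from if year_to is None else year_to
        # "[dp]" is the publication-date field; "a:b" denotes a range.
        parts.append(f"{year_from}:{year_to}[dp]")
    return " AND ".join(parts)


# "Bessonov K" is used purely as an illustrative author name.
print(pubmed_query(author="Bessonov K", year_from=2012, year_to=2013))
```

The resulting string can be pasted into the PubMed search box. Keywords are wrapped in parentheses so that an OR inside one keyword group is evaluated before the surrounding AND.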




Demo homework

Exploring OMIM and PubMed databases


Demo HW assignment (1)

Question 1: Inherited Disease Genes

In this question, you will choose a human disease and find the GenBank accession numbers and sequences of some genes which are thought to affect it.

1) Go to the OMIM database: http://www.ncbi.nlm.nih.gov/omim

2) Perform a search for a human disease you are interested in. Some possibilities include: Leukaemia, Breast cancer, Crohn, IBD. You can choose any other disease.

3) Print the first page of the search results and circle two results in the printout that you will use to find nucleotide sequences (i.e. genes) related to the disease. (Not every item in the search results is related/linked to a sequence.)


Demo HW assignment (2)

4) For each of the two circled entries, follow the links to the GenBank database. (Note that some of the sequences you will see in the first list may not be human.)

5) Display the chosen nucleotide sequences of the disease-related genes in FASTA format as plain text, then copy and paste them below (only the first 5 lines; do not copy the whole FASTA file).
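
For those who prefer to script step 5, the following sketch shows one way to retrieve a record as plain-text FASTA through the NCBI E-utilities "efetch" service and keep only the first five lines. The helper names are assumptions for illustration, the accession is the ACT1 example from the FASTA slide, and `fetch_fasta_head` needs network access, so it is defined but not called here. The homework itself can be completed entirely in the web browser.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities "efetch" service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(accession):
    """Build an efetch URL returning one nucleotide record as plain-text FASTA."""
    params = {"db": "nuccore", "id": accession,
              "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def first_lines(text, n=5):
    """Return the first n lines of a text (header plus n-1 sequence lines)."""
    return "\n".join(text.splitlines()[:n])


def fetch_fasta_head(accession, n=5):
    """Download a record (requires network access) and keep its first n lines."""
    with urlopen(fasta_url(accession)) as response:
        return first_lines(response.read().decode("utf-8"), n)


print(fasta_url("JQ288018.1"))
```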


Demo HW assignment (3)

Question 2: Medical Articles

In this question, you will search for articles on your chosen disease and restrict your search in various ways.

1) Go to the PubMed database: http://www.ncbi.nlm.nih.gov/pubmed

2) Perform a search for the same human disease as you used for OMIM. Write down how many articles were found. Provide below the search keyword(s) used to obtain the results.

3) Perform the same search, but only for articles published in 2013. How many did you find? Provide below the exact search query used to obtain the results (e.g. ([Author] …) AND ([Journal] …) ) and/or a graphical explanation of how the publication date filter was applied.

4) Print the abstracts of the first 5 search results.
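
Steps 2 and 3 can also be cross-checked programmatically. The sketch below builds an E-utilities "esearch" URL for PubMed, optionally restricted to one publication year with the `[dp]` field tag, and reads the article count from a JSON response. The helper names are assumptions for illustration, and the sample response is a hand-written stand-in with a made-up count, not real PubMed output.

```python
import json
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities "esearch" service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_count_url(term, year=None):
    """Build an esearch URL for PubMed, optionally limited to one publication year."""
    if year is not None:
        term = f"({term}) AND {year}[dp]"
    params = {"db": "pubmed", "term": term, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_count(response_text):
    """Extract the number of matching articles from an esearch JSON response."""
    return int(json.loads(response_text)["esearchresult"]["count"])


# Hand-written stand-in for a server response; the count is made up.
sample = '{"esearchresult": {"count": "1234", "idlist": []}}'

print(pubmed_count_url("Crohn disease", year=2013))
print(parse_count(sample))
```

Comparing the count for the unrestricted query with the count for the 2013 query is a quick way to confirm that the publication date filter was applied as intended.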


Assignment Submission
Step by Step Guide


Assignment submission

• All assignments should be zipped into one file (*.zip) and submitted online

• Create a submission account
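
Any archive tool can produce the required *.zip file. As one option, the sketch below packs a homework folder with Python's standard zipfile module. The function name `zip_assignment`, the folder name `hw_demo` and the file `answers.txt` are made up for the demonstration, which runs in a temporary directory.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_assignment(source_dir, zip_path):
    """Pack every file under source_dir into a single .zip archive."""
    source_dir, zip_path = Path(source_dir), Path(zip_path)
    if zip_path.suffix.lower() != ".zip":
        raise ValueError("the submission file must have a .zip extension")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file() and path.resolve() != zip_path.resolve():
                # Store paths relative to the folder, not absolute ones.
                archive.write(path, path.relative_to(source_dir).as_posix())
    return zip_path


# Demo on a temporary folder containing one made-up answers file.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "hw_demo"
    folder.mkdir()
    (folder / "answers.txt").write_text("Question 1: ...\n")
    archive_path = zip_assignment(folder, Path(tmp) / "hw_demo.zip")
    with zipfile.ZipFile(archive_path) as archive:
        names = archive.namelist()

print(names)
```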


Account creation

• Any member of the group can submit the assignment
• Account details will be emailed to you automatically
• All GBIO0009-1 students should create an account


Submit your assignment

• After account creation, log in to the submission page
• The remaining time to the deadline is displayed. It is a good idea to check it from time to time in order to stay on top of things
• The file extension should be .zip
• You can submit the assignment as many times as you wish


Next class

• Bring a PC for R installation!
• Form groups of 2-3 persons to work on HW

