Michael Schroeder BioTechnological Center TU Dresden
Biotec Programming for Bioinformatics
Slide 2
The module
- will teach students basic programming skills relevant to bioinformatics, which will enable them to actively develop bioinformatics tools.
- will take a problem-driven approach.
- will present bioinformatics problems and show how to solve them using existing online tools and how to implement such tools.
- will revisit some of the problems and databases discussed in applied bioinformatics.
- will be a very practical, hands-on introduction to basic computer science tools such as command-line operating systems, programming in Python, and relational databases.
Slide 3
Objectives
- Students will have an understanding of different operating systems
- Students will be able to automate simple repetitive information retrieval tasks
- Students will be able to write simple programs in Python
- Students will be able to work with relational databases
- Students will appreciate the principles, limits, and possibilities of programming
- Students will be able to formulate biological questions as information processing problems
- Students will understand when and how programming can help to automate bioinformatics problems
Slide 4
Module Structure
- Introduction
- Databases
- Introduction to SQL
- A Little Exercise
- A Little Science
- Introduction to Python
- Data types and loops
- Sequences and lists
- Patterns and functions
- Dictionaries
- Advanced topics
- More Python
- Dynamic programming
- Clustering
- Revision Class
Slide 5
Books
- You will need two books for the module: a reference book on MySQL and a book on Python
Slide 6
Books: Python
- We will follow a number of online resources (see course web page)
- Further, we look in Python in a Nutshell, Alex Martelli, O'Reilly
- Wesley Chun's Core Python Programming
- Python Cookbook (O'Reilly)
- The publisher O'Reilly has many general programming books on Linux, Python, etc.
- They allow you to read all books online for free for 2 weeks. This is a good way to decide what to buy and what not.
- You can also buy electronic copies of the books.
Slide 7
Books: MySQL
- There are many, many books on MySQL
- The following two are just suggestions, as there are many other books covering the same material
- MySQL Cookbook by Paul DuBois, O'Reilly, or
- MySQL by Paul DuBois, Michael Widenius, O'Reilly
Slide 8
Structure of Labs
- Databases
  - Lab 1, 2: Simple SQL
  - Lab 3, 4: SQL to answer interesting scientific questions
- Python
  - Lab 5: Data types and loops, accessing a DB from Python
  - Lab 6: Sequences and lists
  - Lab 7: Patterns and functions
  - Lab 8: Dictionaries
  - Lab 9: BioPython
  - Lab 10: Python & PyMOL
- More Python:
  - Lab 11: Dynamic programming revisited
  - Lab 12: Clustering revisited
  - Lab 13: Revision
Slide 9
Assessment
- Lab exercises:
  - Each week during the lab you get exercises, which you start during the lab and finish on your own during the week
  - These exercises need to be handed in on paper at the next lecture
  - Results are discussed during the labs, and as part of the assessment you will have to present a solution once
  - Doing the exercises is compulsory, but there are no marks
- Project
  - You will demonstrate your programming skills by implementing and presenting a software project
- Exam
  - Pen-and-paper exam on the material covered in the lectures
Slide 10
Programming Project
- Goal 1: Demonstrate the ability to use SQL and programming
- Goal 2: Produce a science movie for the Long Night of Science
- You will work in a team and get a biological problem.
- Part 1: Programming. You have to implement some workflows, which integrate data from various sites and use various tools programmatically. This includes an animation of your target protein in PyMOL.
- Part 2: Make a movie. Tell the story of your protein based on the data collected and the analysis carried out. Create a storyboard and turn all material and PyMOL animations into a movie.
Slide 11
Motivation: Databases
- In the last term,
  - we accessed most information online via the web
  - we interacted directly and manually with databases and tools
  - we had to manually submit queries, interpret results, select interesting results, cut and paste them, and submit queries again
- Pro:
  - Reasonably easy to get hold of information
- Con:
  - Not possible to ask many queries
  - Queries limited by the interface provided by the web page
  - Difficult/impossible to integrate information from different sites
- In this term, we will look at the databases underlying the online front ends
  - How is the data stored internally?
  - How can we, and more importantly computer programs, interact directly with the underlying data, so that we can ask more powerful queries and larger queries, and integrate different systems?
Slide 12
What actually happens
You are limited by what the web server allows you to ask. Example CATH: PDB ID, CATH code, or general text.
But you cannot ask:
- In how many different PDB structures is there a P-loop domain?
- Is there a PDB entry with a P-loop and a DNA-binding domain?
- How many different superfamilies does the largest structure in PDB have?
With direct access to the underlying database you could answer all these questions (and many more).
Slide 13
Motivation: SCOP as Relational Database
- We worked with SCOP, the Structural Classification of Proteins
- Family: >30% sequence identity
- Superfamily: similar structure and function (possibly lower than 30% sequence identity)
[Figure with labels: 30%; Family; Same superfamily, but not family. Picture from www.jenner.ac.uk/YBF/DanielleTalbot.ppt]
Slide 14
Motivation: Databases
- We wish to answer the following questions:
  - How many families and superfamilies are there?
  - Do all superfamilies have roughly the same number of families?
  - How many families does the immunoglobulin superfamily have?
  - Which superfamily has the most families and how many?
  - What percentage of superfamilies have only one family?
  - Which PDB structure has the largest number of distinct superfamilies?
  - What percentage of PDB structures have only one type of superfamily, and what percentage have at least two?
  - Which is the most popular superfamily?
  - Are all superfamilies equally likely to co-occur or do they have preferences?
  - Which superfamily has the most co-occurrence partners?
  - Are the number of co-occurrence partners and the frequency of the superfamily correlated?
Slide 15
What is a Database
- SCOP contains relevant information, but we cannot answer the above questions through the web interface of SCOP
- The problem is that we do not have access to the underlying database
- What is a database anyway?
- A database provides
  - Logical organization of data
    - data models, schema design, dictionaries
  - Physical organization of data
    - fast retrieval, indexing, compact storage of data
Slide 16
Relational Database
- Central idea: data as relations in a table
- E.g. Employee

+-------+------+---------+---------+
| id    | name | salary  | role    |
+-------+------+---------+---------+
| 46457 | pete | 50.000  | director|
| 46458 | jane | 60.000  | nurse   |
| 46459 | asif | 70.000  | driver  |
+-------+------+---------+---------+
SCOP Tables

mysql> select * from cla limit 1;
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| sid     | pdb_id | sccs    | cl    | cf    | sf    | fa    | dm    | sp    | px    |
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| d1dlwa_ | 1dlw   | a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 |
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+

mysql> select * from des limit 1;
+-------+------+------+------+--------------------+
| id    | type | sccs | sid  | description        |
+-------+------+------+------+--------------------+
| 46456 | cl   | a    | -    | All alpha proteins |
+-------+------+------+------+--------------------+

mysql> select * from astral limit 1;
+---------+---------+-----------------------------------------------------------+
| sid     | sccs    | seq                                                       |
+---------+---------+-----------------------------------------------------------+
| d1dlwa_ | a.1.1.1 | slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg...|
+---------+---------+-----------------------------------------------------------+

mysql> select * from subchain limit 1;
+----+-------+----------+-------+------+
| id | px    | chain_id | begin | end  |
+----+-------+----------+-------+------+
|  1 | 14982 | A        |       |      |
+----+-------+----------+-------+------+
Slide 20
Querying Relational Databases
- SQL = Structured Query Language

    Select   Which attributes?
    from     Which tables?
    where    Which conditions?

- Select from where
- Distinct
- Like
- Union/intersect
- Join
- Count/average/sum/min/max
- Group by
- Having
- Show tables
- Show databases
- Use
- Create database
- Create table as
- Drop table
- Load data
- Insert into
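A few of the constructs listed above can be tried without a MySQL server. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption; the course database is MySQL). The table mimics a few columns of the cla table, and all rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cla (sid TEXT, pdb_id TEXT, sccs TEXT, sf INTEGER, fa INTEGER)"
)

# Invented toy rows, not real SCOP entries.
rows = [
    ("d1aaaa_", "1aaa", "a.1.1.1", 100, 1000),
    ("d1bbba_", "1bbb", "a.1.1.2", 100, 1001),
    ("d1ccca_", "1ccc", "a.1.1.2", 100, 1001),
    ("d1cccb_", "1ccc", "b.1.1.1", 200, 2000),
]
conn.executemany("INSERT INTO cla VALUES (?, ?, ?, ?, ?)", rows)

# Select ... from ... where, with Like: all domains in class "a"
alpha = conn.execute(
    "SELECT sid FROM cla WHERE sccs LIKE 'a.%' ORDER BY sid"
).fetchall()

# Count + Distinct: number of different families
n_families = conn.execute("SELECT COUNT(DISTINCT fa) FROM cla").fetchone()[0]

# Group by + Having: PDB entries with more than one domain
multi = conn.execute(
    "SELECT pdb_id, COUNT(*) FROM cla GROUP BY pdb_id HAVING COUNT(*) > 1"
).fetchall()

print(alpha)
print(n_families)
print(multi)
```

Commands such as Show tables, Show databases, Use, and Load data are specific to the MySQL client and are not covered by this sketch.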
Slide 21
Databases Given SCOP as relational database, we can answer all
the questions raised above using the SQL constructs of the previous
slide!
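As an illustration, three of the questions raised earlier can be phrased as queries over the sf, fa, and pdb_id columns of the cla table. This is a sketch under assumptions: it uses Python's sqlite3 module instead of MySQL, and the rows are invented toy data, so the numbers say nothing about the real SCOP.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cla (sid TEXT, pdb_id TEXT, sf INTEGER, fa INTEGER)")
# Invented toy rows: (domain id, PDB id, superfamily id, family id)
conn.executemany("INSERT INTO cla VALUES (?, ?, ?, ?)", [
    ("d1aaaa_", "1aaa", 100, 1000),
    ("d1bbba_", "1bbb", 100, 1001),
    ("d1ccca_", "1ccc", 100, 1002),
    ("d1cccb_", "1ccc", 200, 2000),
    ("d1ddda_", "1ddd", 200, 2000),
    ("d1dddb_", "1ddd", 300, 3000),
    ("d1dddc_", "1ddd", 100, 1000),
])

# Which superfamily has the most families and how many?
top_sf = conn.execute("""
    SELECT sf, COUNT(DISTINCT fa) AS n_fa
    FROM cla GROUP BY sf ORDER BY n_fa DESC LIMIT 1
""").fetchone()

# Which PDB structure has the largest number of distinct superfamilies?
top_pdb = conn.execute("""
    SELECT pdb_id, COUNT(DISTINCT sf) AS n_sf
    FROM cla GROUP BY pdb_id ORDER BY n_sf DESC LIMIT 1
""").fetchone()

# What percentage of superfamilies have only one family?
per_sf = conn.execute(
    "SELECT COUNT(DISTINCT fa) FROM cla GROUP BY sf"
).fetchall()
single = sum(1 for (n,) in per_sf if n == 1)
pct_single = 100.0 * single / len(per_sf)

print(top_sf, top_pdb, round(pct_single, 1))
```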
Slide 22
Programming
- We will use Python (created by Guido van Rossum, named after Monty Python) as a convenient extension to the operating system
- Easy to write quick programs
- More than just a scripting language
- Interpreted, interactive, indented
- Supports string processing well
- Widely used in bioinformatics
- Object-oriented, general purpose
- Many nice libraries for database access, graphics, web, GUI, R
- Scientific orientation: Numerical Python (math), Scientific Python, Biopython
- Beware: Python is slow compared to compiled languages, but computationally expensive parts can be included as C libraries
Slide 23
Motivation: Families and Identity
- We said that SCOP families share >30% identity
- What does that mean?
  - Do any two structures in a family share >30%?
  - Does each member have at least one other member in the family with >30%?
- What is the average sequence similarity within a family? Within a superfamily?
- Given a sequence whose superfamily we already know, can we find the family within that superfamily that is best suited for the sequence?
Slide 24
Two approaches: Blast vs. DIY
- We can answer the above easily:
  - We use the SCOP database and run database queries from a Python script
  - For a given superfamily, select all corresponding sequences from the astral table
  - For all pairs of selected sequences
    - Call Blast and record the sequence identity
    - Or run your own dynamic programming algorithm and record the sequence identity
- For the second problem: compare the sequence to all family sequences and assign it to the family which shares the highest similarity (must be >30%) with the sequence
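The DIY route can be sketched as follows. This is a minimal illustration, not the course's implementation: percent_identity is a hypothetical helper that performs a plain Needleman-Wunsch global alignment with assumed scores (match +1, mismatch -1, gap -1) and defines identity as identical columns divided by alignment length. The family names and sequences are invented.

```python
from itertools import combinations

def percent_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b and return the percent of identical columns."""
    def sub(x, y):
        return match if x == y else mismatch

    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back one optimal alignment, counting identical columns.
    i, j, same, length = n, m, 0, 0
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1])):
            same += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        length += 1
    return 100.0 * same / length if length else 0.0

def assign_family(query, families, threshold=30.0):
    """Return (family, identity) for the best hit above threshold, else (None, threshold)."""
    best_family, best_identity = None, threshold
    for name, seqs in families.items():
        for seq in seqs:
            ident = percent_identity(query, seq)
            if ident > best_identity:
                best_family, best_identity = name, ident
    return best_family, best_identity

# Invented toy families
families = {"fam1": ["acgtacgt", "acgtacga"], "fam2": ["ttttgggg", "ttttggga"]}

# All pairs within one family
pairs = {(x, y): percent_identity(x, y)
         for x, y in combinations(families["fam1"], 2)}
print(pairs)
print(assign_family("acgtacgt", families))
```

In practice the sequences would come from the astral table, and Blast or a substitution matrix such as BLOSUM would replace the simple scoring used here.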
Slide 25
Motivation: Sequence vs. Structure
- Can we verify the plot below?
- Can we create a similar plot for specific superfamilies? E.g. DNA-binding domains?
[Figure with labels: 30%; Family; Same superfamily, but not family. Picture from www.jenner.ac.uk/YBF/DanielleTalbot.ppt]
Slide 26
Motivation: Sequence vs. Structure
Again: select the relevant sequences from the astral table. Besides computing the sequence identity, we compute the structural similarity of the corresponding structures using an algorithm like Dali or CE. Then plot the two similarities against each other in a scatter plot.
Slide 27
Motivation: Amino Acid Composition of Families
- Can we characterise the amino acid composition of different families/superfamilies?
- Again: select the relevant sequences from astral and count the frequencies of amino acids
- Is the amino acid composition at the interface of a domain different from the rest of the domain?
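Counting amino acid frequencies is a small task once the sequences have been selected. The sketch below assumes the sequences are already available as Python strings (here two invented toy sequences rather than rows fetched from astral); composition is a hypothetical helper name.

```python
from collections import Counter

def composition(sequences):
    """Return the relative frequency of each residue over all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.lower())
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

# Invented toy sequences
freqs = composition(["slfeq", "lggqa"])
for aa in sorted(freqs):
    print(aa, freqs[aa])
```

Comparing two families then amounts to comparing two such dictionaries, for example residue by residue.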
Slide 28
Motivation: Let's rebuild SCOP families
- Given a SCOP superfamily and its sequences, how can we divide it into families?
- First, we need dynamic programming to determine the sequence similarity
- Then we do the following:
  - For all pairs of sequences, call the sequence similarity algorithm and record the similarity in a distance matrix
  - Next, run hierarchical clustering to cluster the sequences.
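The two steps above can be sketched as follows. Several things here are assumptions made for illustration: the distance is a placeholder (fraction of differing positions between equal-length sequences) standing in for a distance derived from dynamic programming, the clustering is naive single linkage, and the cutoff of 0.7 mirrors the 30% identity rule. The sequences are invented.

```python
def distance(a, b):
    """Placeholder distance: fraction of differing positions (equal lengths assumed)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_matrix(seqs):
    """Symmetric matrix of pairwise distances."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = distance(seqs[i], seqs[j])
    return d

def single_linkage(d, cutoff):
    """Merge the two closest clusters until the closest pair exceeds cutoff."""
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single linkage: distance between the closest members
                dist = min(d[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or dist < best[0]:
                    best = (dist, x, y)
        dist, x, y = best
        if dist > cutoff:
            break
        merged = clusters.pop(y)      # y > x, so index x stays valid
        clusters[x].extend(merged)
    return clusters

# Invented toy sequences
seqs = ["acgt", "acga", "tttt", "ttta"]
d = distance_matrix(seqs)
families = single_linkage(d, cutoff=0.7)
print(families)
```

Each resulting cluster is a list of indices into seqs. Other linkage rules (complete, average) only change how the distance between two clusters is computed.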
Hello World in Python
Given a file helloworld.py:

    #!/usr/bin/env python3
    print("Hello World")

    File: helloworld.py

Open a shell and type helloworld.py at the command prompt.
- The shell then executes your programme
- In the first line, it realises that the Python interpreter needs to be loaded and that what follows is a Python program
- The line below prints a message
Slide 32
Read a text file in Python
The command open opens a text file and creates a file object.
- "r" as second argument after the filename indicates that the file is read (this is the default, i.e. it can be left out)
- "w" as second argument indicates that the file is written to
- "a" as second argument indicates that the file is appended to
- The for-loop reads all lines of the file one by one (requires Python > 2.2)
- The body of the loop prints them on the screen (note that print adds a newline automatically; avoid that by passing end="")

    data = open("seq.txt", "r")
    for line in data:
        print("Line:", line, end="")

    File: fileIO.py

    acgt
    gggt

    File: seq.txt

    Line: acgt
    Line: gggt

    Output
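The three modes can be tried together in one short script. This is a minimal sketch: it writes to a temporary directory (an assumption, so that no existing seq.txt is overwritten) and uses the with statement, which closes the file automatically.

```python
import os
import tempfile

# Temporary location so that no existing file is touched (assumption).
path = os.path.join(tempfile.mkdtemp(), "seq.txt")

with open(path, "w") as out:   # "w": create the file or overwrite it
    out.write("acgt\n")

with open(path, "a") as out:   # "a": append to the end of the file
    out.write("gggt\n")

with open(path) as data:       # "r" is the default and can be left out
    lines = [line.rstrip("\n") for line in data]

print(lines)
```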
Slide 33
Variables in Python
The = symbol is used to assign values to variables.
The + symbol is also used to concatenate strings.

    lineNo = 1
    for line in open("seq.txt"):
        # a number must be converted with str() before it can be concatenated
        print(str(lineNo) + ": " + line, end="")
        lineNo = lineNo + 1

    File: fileIO.py

    acgt
    gggt

    File: seq.txt

    1: acgt
    2: gggt

    Output
Slide 34
If-then-else and strings in Python

    data = open("seq.txt")
    line1 = data.readline().rstrip()
    line2 = data.readline().rstrip()
    len1 = len(line1)
    len2 = len(line2)
    # compare only up to the length of the shorter line
    if len1 < len2:
        minLen = len1
    else:
        minLen = len2
    line3 = ""
    for i in range(minLen):
        # mark identical positions with "*"
        if line1[i] == line2[i]:
            line3 = line3 + "*"
        else:
            line3 = line3 + " "
    print("Sequence comparison")
    print(line1)
    print(line2)
    print(line3)

    File: seqcomp.py

    acgt
    gggt

    File: seq.txt

    Sequence comparison
    acgt
    gggt
      **

    Output