
Date post: 11-Jan-2016

Michael Schroeder, BioTechnological Center, TU Dresden, Biotec

Introduction to Databases


Structure

• Motivation
• Introduction to MySQL
• Example Queries
• Using SQL to query SCOP


Motivation

In the last term,
• we accessed most information online via the web
• we interacted directly and manually with databases and tools
• we had to manually submit queries, interpret results, select interesting results, cut & paste them, and submit queries again, …

Pro:
• Reasonably easy to get hold of information

Con:
• Not possible to ask many queries
• Queries limited by the interface provided by the web page
• Difficult or impossible to integrate information from different sites

In this term, we will look at the databases underlying the online front ends:
• How is the data stored internally?
• How can we, and more importantly computer programs, interact directly with the underlying data, so that we can ask more powerful queries and larger queries, and integrate different systems?


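What actually happens when you retrieve data online?

Messages exchanged between the client and the web server:

1. Client to web server: "Get home page". The web server gets it and sends it.
2. Web server to client: "Send home page". The client displays the home page; you enter a query and press submit.
3. Client to web server: "Send query". The web server starts a programme that evaluates the query by accessing the database, then composes the result web page and sends it.
4. Web server to client: "Send result". The client displays the result.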


What actually happens

You are limited by what the web server allows you to ask.

Example CATH:
• PDB ID,
• CATH code, or
• General text

But you cannot ask:
• In how many different PDB structures is there a P-loop domain?
• Is there a PDB entry with a P-loop and a DNA-binding domain?
• How many different superfamilies does the largest structure in PDB have?

With direct access to the underlying database you could answer all these questions (and many more).


Querying over the Web

• The problem is always the same: the web interface limits access to the underlying database
• How can we interact directly with the database?


What databases are about

Logical organization of data
• Data models, schema design, dictionaries

Physical organization of data
• Fast retrieval, indexing, compact storage of data

Other requirements:
• Logging (important to know who did what to the data)
• Security and access control (important to know who can do what)
• Transactions and concurrency control (important when more than one person is working on the database)
• Integrity (important to ensure that the database contains only valid entries)
• Recovery (important as hardware and software can sometimes fail)


Different types of databases

• Flat files
• XML
• Relational database
• (Object databases)
• (Object relational databases)


Flat files

We can store any data in a flat file, e.g. EMBL

But is this a database?
• Logical data organisation: None, unless we define one (as done for EMBL) and adhere to it, which is not enforced
• Physical data organisation: None, we cannot optimise retrieval for common queries
• Logging: No
• Access control: Implicit through Unix
• Transaction and concurrency control: None
• Integrity: None
• Recovery: If files are backed up they can be recovered. However, not on the fly

ID   BTBPTIG    standard; genomic DNA; MAM; 3998 BP.
XX
AC   X03365; K00966;
XX
SV   X03365.1
XX
DT   18-NOV-1986 (Rel. 10, Created)
DT   20-MAY-1992 (Rel. 31, Last updated, Version 3)
XX
DE   Bovine pancreatic trypsin inhibitor (BPTI) gene
XX
KW   Alu-like repetitive sequence; protease inhibitor; trypsin inhibitor.
XX
OS   Bos taurus (cow)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Cetartiodactyla; Ruminantia; Pecora; Bovoidea; Bovidae; Bovinae;
OC   Bos.
XX
RN   [1]
RP   1-3998
RX   MEDLINE; 86158754.
RX   PUBMED; 2420326.
RA   Kingston I.B., Anderson S.;
RT   "Sequences encoding two trypsin inhibitors occur in strikingly similar
RT   genomic environments";
RL   Biochem. J. 233(2):443-450(1986).
XX
RN   [2]
RX   MEDLINE; 84070725.
RX   PUBMED; 6580617.
RA   Anderson S., Kingston I.B.;
RT   "Isolation of a genomic clone for bovine pancreatic trypsin inhibitor by
RT   using a unique-sequence synthetic DNA probe.";
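Because a flat file carries no enforced schema, any program that reads it has to re-implement the logical organisation itself. The following is a minimal Python sketch of that idea, assuming the usual EMBL layout of a two-letter line code followed by three spaces; `parse_embl` and `EMBL_EXCERPT` are hypothetical names introduced only for this illustration, and the excerpt is a shortened copy of the entry above.

```python
from collections import defaultdict

# Shortened excerpt of the entry above: line code, three spaces, then data.
EMBL_EXCERPT = """\
ID   BTBPTIG    standard; genomic DNA; MAM; 3998 BP.
XX
AC   X03365; K00966;
XX
DE   Bovine pancreatic trypsin inhibitor (BPTI) gene
XX
OS   Bos taurus (cow)
"""


def parse_embl(text):
    """Group the data of each line by its two-letter line code.

    Hypothetical helper: nothing in the file enforces this layout,
    so spacer lines and empty lines are simply skipped.
    """
    fields = defaultdict(list)
    for line in text.splitlines():
        code, data = line[:2], line[5:].strip()
        if code == "XX" or not data:  # XX lines are only spacers
            continue
        fields[code].append(data)
    return dict(fields)


entry = parse_embl(EMBL_EXCERPT)
# The AC line holds several accession numbers separated by semicolons.
accessions = [a.strip() for a in entry["AC"][0].split(";") if a.strip()]
```

Every query against such a file (for example "all entries from Bos taurus") needs custom code of this kind and a full scan of the file, which is the point made by the "None" answers above.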


XML files

We can store any data in XML, the eXtensible Markup Language, e.g. Medline

But is this a database?
• Logical data organisation: Yes, XML schema, which is enforced
• Physical data organisation: None, we cannot optimise retrieval for common queries
• Logging: No
• Access control: Implicit through Unix
• Transaction and concurrency control: None
• Integrity: None
• Recovery: If files are backed up they can be recovered. However, not on the fly

<Article>

<Journal>

<ISSN>0270-7306</ISSN>

<JournalIssue>

<Volume>19</Volume>

<Issue>11</Issue>

<PubDate>

<Year>1999</Year>

<Month>Nov</Month>

</PubDate>

</JournalIssue>

</Journal>

<ArticleTitle>Differential regulation of the cell wall integrity mitogen-activated protein kinase pathway in budding yeast by the protein tyrosine phosphatases Ptp2 and Ptp3.

</ArticleTitle>

<Pagination>

<MedlinePgn>7651-60</MedlinePgn>

</Pagination>

<Abstract>

<AbstractText>Mitogen-activated protein kinases (MAPKs) are inactivated by dual-specificity and protein tyrosine phosphatases (PTPs) in yeasts. In Saccharomyces cerevisiae, two PTPs, Ptp2 and Ptp3, inactivate the MAPKs, Hog1 and Fus3, with different specificities... </AbstractText>

</Abstract>

<Affiliation>Department of Chemistry, University of Colorado, Boulder, Colorado 80309-0215, USA.

</Affiliation>…


Relational Database

Central Idea: Data as relations in a table

E.g. SCOP, Structural Classification of Proteins

+-------+------+---------+---------+--------------------------------------+
| id    | type | sccs    | sid     | description                          |
+-------+------+---------+---------+--------------------------------------+
| 46457 | cf   | a.1     | -       | Globin-like                          |
| 46458 | sf   | a.1.1   | -       | Globin-like                          |
| 46459 | fa   | a.1.1.1 | -       | Truncated hemoglobin                 |
| 46460 | dm   | a.1.1.1 | -       | Truncated hemoglobin                 |
| 46461 | sp   | a.1.1.1 | -       | Ciliate (Paramecium caudatum)        |
| 14982 | px   | a.1.1.1 | d1dlwa_ | 1dlw A:                              |
| 46462 | sp   | a.1.1.1 | -       | Green alga (Chlamydomonas eugametos) |
| 14983 | px   | a.1.1.1 | d1dlya_ | 1dly A:                              |
| 63437 | sp   | a.1.1.1 | -       | Mycobacterium tuberculosis           |
| 62301 | px   | a.1.1.1 | d1idra_ | 1idr A:                              |
+-------+------+---------+---------+--------------------------------------+


Relational Database

Central Idea: Data as relations in a table

E.g. Employee

+-------+------+--------+----------+
| id    | name | salary | role     |
+-------+------+--------+----------+
| 46457 | pete | 50.000 | director |
| 46458 | jane | 60.000 | nurse    |
| 46459 | asif | 70.000 | driver   |
+-------+------+--------+----------+


Relational Database

Central Idea: Data as relations in a table

E.g. pets

+----------+--------+---------+------+------------+------------+
| name     | owner  | species | sex  | birth      | death      |
+----------+--------+---------+------+------------+------------+
| Whistler | Gwen   | bird    |      | 0000-00-00 | NULL       |
| Chirpy   | Gwen   | bird    | f    | 1998-09-11 | 0000-00-00 |
| Bowser   | Diane  | dog     | m    | 1979-08-31 | 1995-07-29 |
| Fang     | Benny  | dog     | m    | 1990-08-27 | 0000-00-00 |
| Buffy    | Harold | dog     | f    | 1989-05-13 | 0000-00-00 |
| Claws    | Gwen   | cat     | m    | 1994-03-17 | 0000-00-00 |
| Fluffy   | Harold | cat     | f    | 1993-02-04 | 0000-00-00 |
| Slim     | Benny  | snake   | m    | 1996-04-29 | 0000-00-00 |
+----------+--------+---------+------+------------+------------+


Relational Database

Central Idea: Data as relations in a table

E.g. school

+-------+------+---------+
| id    | name | subject |
+-------+------+---------+
| 46458 | rick | bio     |
| 46459 | gerd | bio     |
| 46460 | mary | bio     |
| 46461 | ella | math    |
| 14982 | anne | math    |
| 46462 | paul | math    |
+-------+------+---------+

+-------+------+---------+
| id    | prof | subject |
+-------+------+---------+
| 51221 | bert | bio     |
| 55435 | anne | math    |
+-------+------+---------+

+---------+------+-----+------+
| subject | room | day | time |
+---------+------+-----+------+
| bio     | A    | mo  | 3pm  |
| math    | B    | tue | 1pm  |
+---------+------+-----+------+


Relational Database

• A cell in the table stores a single number or string, but not a list
• Lists and sets need to be flattened

Nested (not allowed):

+------+-------------+
| prof | subjects    |
+------+-------------+
| bert | {bio,sport} |
| anne | {math,arts} |
+------+-------------+

Flattened:

+------+---------+
| prof | subject |
+------+---------+
| bert | bio     |
| bert | sport   |
| anne | arts    |
| anne | math    |
+------+---------+
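Flattening can be sketched in a few lines of Python. This is only an illustration under the assumption that the nested data arrives as a dictionary of lists; the names `nested` and `flatten` are made up for the example.

```python
# Nested form: one professor maps to a list of subjects.
nested = {"bert": ["bio", "sport"], "anne": ["math", "arts"]}


def flatten(prof_subjects):
    """Turn {prof: [subject, ...]} into one (prof, subject) row per pair."""
    return [
        (prof, subject)
        for prof, subjects in prof_subjects.items()
        for subject in subjects
    ]


# Each row now holds a single string per cell, as a relational table requires.
rows = flatten(nested)
```

The nested view can be rebuilt from the rows by grouping on `prof`, so no information is lost; the price is that one logical object is spread over several rows.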


Bioinformatics: 10 years of resistance to flattening!

Why the resistance?
• Bioinformatics data is naturally nested
• Extensive use of sets and lists
• E.g. Swissprot: features, keywords, references

Such data can be flattened, but the resulting relational schema is hard to understand, and hence it is hard to formulate queries.
• For example, storing a Swissprot entry in a relational database would split it over 15-20 tables.


Relational Databases

• RDB introduced in 1970 by Codd
• Took off in the 80s
• In the business world, relational databases are the rule (Oracle, Sybase, MySQL, DB2, Microsoft Access).
• Large biomedical databases typically use relational technology, but there are also a lot of homegrown systems (ACeDB, SRS indexed files).
• Data is almost always viewed and exported in a variety of flat file formats (EMBL, GenBank among others).


The flood of biomedical data…

Since 1980, the number and size of biomedical databases have been growing exponentially.

How can you find the sources of information you are seeking?
• Nucleic Acids Research Database Issue in January of every year (http://nar.oupjournals.org/)
• Dbcat (http://www.infobiogen.fr/): a flat file database of 500 biological databases.

Susan B. Davidson, Biol537/CIS636, Fall 2003


Relational Schema

The schema of a database is a set of relation names, their field names and types.

Example:

Entry(ID: int, Length: int, Seq: string, Mod: date)

Feature(ID: int, Type: string, From: int, To: int)

• Entry and Feature are relation names,
• ID, Seq, Mod, etc. are attribute names, and
• int, string, date are domains

Susan B. Davidson, Biol537/CIS636, Fall 2003


Relation Instance

An instance of a relation is a set of tuples of the type of the relation.

A tuple of Entry could be:

(ID: 82814, Length: 597, Seq: "ccagctaaccg", Mod: 1-7-95)

A tuple of Feature could be:

(ID: 82814, Type: "source", From: 1, To: 8959)

Susan B. Davidson, Biol537/CIS636, Fall 2003


Tabular representation

Typically, relations are displayed as tables: the rows are the tuples and the columns are the attributes.

Sequence:

ID     Length  Seq            Mod
82814  597     "ccagctaa..."  1-07-95
98608  18976   "accgcct..."   2-14-98

Feature:

ID     Type      From  To
82814  "Source"  1     184
82814  "Gene"    23    65

Susan B. Davidson, Biol537/CIS636, Fall 2003


Entities and Relationships

There is a one-to-many relationship from Entry to Feature: each entry can have many features, but a feature can be on at most one entry.

Put another way, the existence of a feature depends on the existence of the owning entry (referential integrity).

Susan B. Davidson, Biol537/CIS636, Fall 2003


Integrity Constraints

• ID is the key of Entry, indicated by underlining: no two tuples of any instance of Entry can have the same ID.
• In Feature, there is a referential integrity constraint on ID: every ID in Feature must appear in some tuple in Entry.
• This is specified in the data definition language (DDL), and enforced by the system as updates are made to the instance.

Susan B. Davidson, Biol537/CIS636, Fall 2003


DDL for this relational schema

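-- LONGCHAR is a vendor-specific type; FROM and TO are SQL reserved words,
-- so most systems require these column names to be quoted.
CREATE TABLE Entry (
  Id       INTEGER,
  Length   INTEGER,
  Sequence LONGCHAR,
  Mod      DATE,
  PRIMARY KEY (Id)
);

CREATE TABLE Feature (
  Id   INTEGER,
  Type CHAR(15),
  From INTEGER,
  To   INTEGER,
  PRIMARY KEY (Id, Type, From, To),
  -- every feature must belong to an existing entry
  FOREIGN KEY (Id) REFERENCES Entry
    ON DELETE CASCADE ON UPDATE CASCADE
);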
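To see the key and the referential integrity constraint being enforced, the schema can be tried out in SQLite through Python's standard `sqlite3` module. This is a sketch under assumptions, not the DDL of any production system: SQLite types (`TEXT`) replace `LONGCHAR` and `DATE`, the column is called `Seq` as in the schema slide, `"From"` and `"To"` are quoted because they are keywords, and foreign key checking has to be switched on explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default

conn.executescript("""
CREATE TABLE Entry (
    Id     INTEGER PRIMARY KEY,
    Length INTEGER,
    Seq    TEXT,
    Mod    TEXT              -- SQLite has no DATE type; kept as text here
);
CREATE TABLE Feature (
    Id     INTEGER,
    Type   TEXT,
    "From" INTEGER,          -- quoted: FROM and TO are SQL keywords
    "To"   INTEGER,
    PRIMARY KEY (Id, Type, "From", "To"),
    FOREIGN KEY (Id) REFERENCES Entry (Id)
        ON DELETE CASCADE ON UPDATE CASCADE
);
""")

conn.execute("INSERT INTO Entry VALUES (82814, 597, 'ccagctaaccg', '1995-01-07')")
conn.execute("INSERT INTO Feature VALUES (82814, 'source', 1, 597)")

# A feature that points at a non-existent entry violates referential integrity.
try:
    conn.execute("INSERT INTO Feature VALUES (99999, 'gene', 1, 10)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Deleting the entry cascades to the features that depend on it.
conn.execute("DELETE FROM Entry WHERE Id = 82814")
remaining = conn.execute("SELECT COUNT(*) FROM Feature").fetchone()[0]
```

The system, not the application, refuses the orphan feature and removes dependent features on delete; with flat files or plain XML files this bookkeeping would be left to every program that touches the data.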

Susan B. Davidson, Biol537/CIS636, Fall 2003


Querying relational databases

The language SQL has become a standard for querying relational databases. Based on a curious mixture of the relational algebra and relational calculus (formal languages), it allows new relations of information to be computed from a set of relations.

Unlike the relational algebra, it allows other useful stuff: count, sum, min, max, etc.

Susan B. Davidson, Biol537/CIS636, Fall 2003


Basic Query

SELECT [DISTINCT] target-list
FROM relation-list
WHERE qualification

• relation-list: A list of relation names (possibly with a range-variable after each name).
• target-list: A list of attributes of relations in relation-list. * can be used to denote all attributes.
• qualification: Comparisons (Attr op const or Attr1 op Attr2, where op is one of <, <=, >, >=, =, <>) combined using AND, OR and NOT.
• DISTINCT: Optional keyword indicating that the answer should not contain duplicates. By default, duplicates are not eliminated!

Susan B. Davidson, Biol537/CIS636, Fall 2003


Conceptual Evaluation Strategy

1. Compute the product of relation-list
2. Discard tuples that fail qualification
3. Project over attributes in target-list
4. If DISTINCT, then eliminate duplicates

This is probably a very bad way of executing the query, and a good query optimizer will use all sorts of tricks to find efficient strategies to compute the same answer.
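The four conceptual steps can be written out literally in Python. The sketch below evaluates one fixed query, `SELECT [DISTINCT] S.ID, F.Type FROM Sequence S, Feature F WHERE S.ID = F.ID AND S.Length > 10000`, over small illustrative lists; the rows and the function name `evaluate` are assumptions made for the example, and no real database engine works this way.

```python
from itertools import product

# Illustrative rows only: (ID, Length) and (ID, Type, From, To).
sequence = [(82814, 597), (98608, 18976), (13428, 16665), (76582, 9976)]
feature = [
    (82814, "Source", 1, 597),
    (82814, "Gene", 23, 65),
    (13428, "Gene", 3, 9999),
    (13428, "Gene", 11000, 16665),
    (13428, "Source", 1, 16665),
]


def evaluate(distinct=False):
    """Naive evaluation: product, qualification, projection, DISTINCT."""
    pairs = product(sequence, feature)                 # 1. product (4 x 5 pairs)
    kept = [(s, f) for s, f in pairs
            if s[0] == f[0] and s[1] > 10000]          # 2. qualification
    projected = [(s[0], f[1]) for s, f in kept]        # 3. projection
    if distinct:                                       # 4. duplicate elimination
        projected = list(dict.fromkeys(projected))
    return projected


with_duplicates = evaluate()
without_duplicates = evaluate(distinct=True)
```

The product alone already has 20 pairs for these tiny tables; for tables with millions of rows it would be infeasible, which is why an optimizer applies the qualification as early as possible and uses indexes instead.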

Susan B. Davidson, Biol537/CIS636, Fall 2003


Sample tables

Sequence:

ID     Length  Seq            Mod
82814  597     "ccagctaa..."  1-07-95
98608  18976   "accgcct..."   2-14-98
13428  16665   "gtgtaa…."     1-19-97
76582  9976    "actgga…"      2-29-00

Feature:

ID     Type      From   To
82814  "Source"  1      597
82814  "Gene"    23     65
13428  "Gene"    3      9999
13428  "Gene"    11000  16665
13428  "Source"  1      16665

Susan B. Davidson, Biol537/CIS636, Fall 2003


Simple queries

Print all sequences with length less than 10000.

SELECT *
FROM Sequence
WHERE Length < 10000;

ID     Length  Seq            Mod
82814  597     "ccagctaa..."  1-07-95
76582  9976    "actgga…"      2-29-00

Print the type of all features.

SELECT Type
FROM Feature;

Type
"Source"
"Gene"
"Gene"
"Gene"
"Source"

Susan B. Davidson, Biol537/CIS636, Fall 2003


Distinct

Note that SQL did not eliminate duplicates. We need to request this explicitly.

Print the type of all features (no duplicates).

SELECT DISTINCT Type
FROM Feature;

Type
"Source"
"Gene"

Susan B. Davidson, Biol537/CIS636, Fall 2003


Pattern Matching

Can be used in the WHERE clause. "_" denotes any single character, "%" zero or more characters.

SELECT *
FROM Sequence
WHERE Seq LIKE 'a_%g';

ID     Length  Seq            Mod
98608  18976   "accgcct..."   2-14-98
76582  9976    "actgga…"      2-29-00
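The behaviour of `_` and `%` is easiest to check on short, complete strings. The sketch below uses Python's `sqlite3` module with four made-up sequences (the slide only shows truncated ones); note as an aside that SQLite's `LIKE` ignores ASCII case by default, which does not matter for these lowercase values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sequence (ID INTEGER, Seq TEXT)")
conn.executemany(
    "INSERT INTO Sequence VALUES (?, ?)",
    [
        (1, "accgcctg"),  # starts with a, ends with g: matches
        (2, "ag"),        # '_' needs one character between a and g: no match
        (3, "ccagctaa"),  # does not start with a: no match
        (4, "acg"),       # a, exactly one character, then g: matches
    ],
)

# 'a_%g' reads: an a, any single character, any run of characters, a final g.
matches = [
    row[0]
    for row in conn.execute(
        "SELECT ID FROM Sequence WHERE Seq LIKE 'a_%g' ORDER BY ID"
    )
]
```

The pattern therefore accepts only strings of length three or more that start with `a` and end with `g`.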

Susan B. Davidson, Biol537/CIS636, Fall 2003


Arithmetic

"as" can be used to label columns in the output; arithmetic can be used to compute results.

SELECT DISTINCT ID, To-From+1 as Length
FROM Feature;

ID     Length
82814  597
82814  43
13428  9997
13428  5666
13428  16665

Susan B. Davidson, Biol537/CIS636, Fall 2003


Set operations -- union

SELECT ID
FROM Sequence
WHERE Length < 10000
UNION
SELECT ID
FROM Feature
WHERE Type = 'Source';

• Duplicates do not occur in the union.

ID
76582
82814
13428

Susan B. Davidson, Biol537/CIS636, Fall 2003


The UNION ALL operator preserves duplicates

SELECT ID
FROM Sequence
WHERE Length < 10000
UNION ALL
SELECT ID
FROM Feature
WHERE Type = 'Source';

ID
76582
82814
82814
13428

Susan B. Davidson, Biol537/CIS636, Fall 2003


Intersection and difference

SELECT Id
FROM Sequence
INTERSECT
SELECT Id
FROM Feature;

ID
82814
13428

SELECT Id
FROM Sequence
MINUS
SELECT Id
FROM Feature;

ID
98608
76582

MINUS is the Oracle spelling; standard SQL calls this operator EXCEPT.
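All four set operations can be checked against the sample data with Python's `sqlite3` module. The sketch below is an illustration under assumptions: only the ID, Length and Type columns are loaded, the third sequence is given the ID 13428, and SQLite's `EXCEPT` stands in for `MINUS`; the helper `ids` is a made-up name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sequence (ID INTEGER, Length INTEGER);
CREATE TABLE Feature  (ID INTEGER, Type TEXT);
INSERT INTO Sequence VALUES
    (82814, 597), (98608, 18976), (13428, 16665), (76582, 9976);
INSERT INTO Feature VALUES
    (82814, 'Source'), (82814, 'Gene'), (13428, 'Gene'),
    (13428, 'Gene'), (13428, 'Source');
""")


def ids(sql):
    """Run a query and return the IDs it yields as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))


small = "SELECT ID FROM Sequence WHERE Length < 10000"
source = "SELECT ID FROM Feature WHERE Type = 'Source'"

union = ids(f"{small} UNION {source}")          # duplicates removed
union_all = ids(f"{small} UNION ALL {source}")  # duplicates kept
intersection = ids("SELECT ID FROM Sequence INTERSECT SELECT ID FROM Feature")
difference = ids("SELECT ID FROM Sequence EXCEPT SELECT ID FROM Feature")
```

The lists are sorted only to make them easy to compare; SQL itself gives no ordering guarantee for set operations without an ORDER BY.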

Susan B. Davidson, Biol537/CIS636, Fall 2003


Products

SELECT *
FROM Sequence, Feature;

Note that the ID column name is duplicated in the output.

ID     Length  Seq            Mod      ID     Type      From  To
82814  597     "ccagctaa..."  1-07-95  82814  "Source"  1     597
98608  18976   "accgcct..."   2-14-98  82814  "Source"  1     597
13428  16665   "gtgtaa…."     1-19-97  82814  "Source"  1     597
76582  9976    "actgga…"      2-29-00  82814  "Source"  1     597
… (lots more!)

Susan B. Davidson, Biol537/CIS636, Fall 2003


Conditional join

SELECT *
FROM Sequence, Feature
WHERE Sequence.Id = Feature.Id;

ID     Length  Seq            Mod      ID     Type      From   To
82814  597     "ccagctaa..."  1-07-95  82814  "Source"  1      597
82814  597     "ccagctaa..."  1-07-95  82814  "Gene"    23     65
13428  16665   "gtgtaa…."     1-19-97  13428  "Gene"    3      9999
13428  16665   "gtgtaa…."     1-19-97  13428  "Gene"    11000  16665
13428  16665   "gtgtaa…."     1-19-97  13428  "Source"  1      16665
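The difference between the plain product and the conditional join shows up clearly in the row counts. The following sketch loads a reduced copy of the sample tables into SQLite via Python's `sqlite3` module; it assumes 13428 as the ID of the third sequence, leaves out the Seq and Mod columns, and quotes `"From"` and `"To"` because they are keywords.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sequence (ID INTEGER, Length INTEGER);
CREATE TABLE Feature  (ID INTEGER, Type TEXT, "From" INTEGER, "To" INTEGER);
INSERT INTO Sequence VALUES
    (82814, 597), (98608, 18976), (13428, 16665), (76582, 9976);
INSERT INTO Feature VALUES
    (82814, 'Source', 1, 597), (82814, 'Gene', 23, 65),
    (13428, 'Gene', 3, 9999), (13428, 'Gene', 11000, 16665),
    (13428, 'Source', 1, 16665);
""")

# Product: every sequence paired with every feature (4 x 5 rows).
product_rows = conn.execute(
    "SELECT COUNT(*) FROM Sequence, Feature"
).fetchone()[0]

# Conditional join: keep only the pairs that belong together.
joined = conn.execute("""
    SELECT Sequence.ID, Length, Type, "From", "To"
    FROM Sequence, Feature
    WHERE Sequence.ID = Feature.ID
    ORDER BY Sequence.ID, "From", "To"
""").fetchall()
```

Sequences without any feature (98608 and 76582 here) do not appear in the join at all, because no pair involving them satisfies the condition.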

Susan B. Davidson, Biol537/CIS636, Fall 2003


Counting

Print the number of feature entries.

SELECT COUNT(*)
FROM Feature;

Print the number of types of features.

SELECT COUNT(Type)
FROM Feature;

Surprisingly, the answer to both of these is the following:

COUNT(TYPE)
5

(COUNT(Type) counts every row whose Type is not NULL, duplicates included.)

Susan B. Davidson, Biol537/CIS636, Fall 2003


Counting, cont.

To fix this, we use the keyword "DISTINCT":

SELECT COUNT(DISTINCT Type)
FROM Feature;

COUNT(DISTINCT Type)
2

Can also use SUM, AVG, MIN and MAX.
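The three counting variants can be compared side by side. This is a small sketch using Python's `sqlite3` module and the five rows of the sample Feature table (with `"From"` and `"To"` quoted because they are keywords); the variable names are chosen for the example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE Feature (ID INTEGER, Type TEXT, "From" INTEGER, "To" INTEGER)'
)
conn.executemany(
    "INSERT INTO Feature VALUES (?, ?, ?, ?)",
    [
        (82814, "Source", 1, 597),
        (82814, "Gene", 23, 65),
        (13428, "Gene", 3, 9999),
        (13428, "Gene", 11000, 16665),
        (13428, "Source", 1, 16665),
    ],
)

# COUNT(*) counts rows, COUNT(Type) counts non-NULL values,
# COUNT(DISTINCT Type) counts different values.
all_rows, typed_rows, distinct_types = conn.execute(
    "SELECT COUNT(*), COUNT(Type), COUNT(DISTINCT Type) FROM Feature"
).fetchone()

# MIN and MAX work the same way, here on a computed feature length.
shortest, longest = conn.execute(
    'SELECT MIN("To" - "From" + 1), MAX("To" - "From" + 1) FROM Feature'
).fetchone()
```

With this data there are only two different types, "Source" and "Gene", so the DISTINCT count is smaller than the row count.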

Susan B. Davidson, Biol537/CIS636, Fall 2003


Group by

So far, these aggregate operators have been applied to all qualifying tuples. Sometimes we want to apply them to each of several groups of tuples.

For example: “Print the type and number of features of each type.”

Susan B. Davidson, Biol537/CIS636, Fall 2003


Group by

SELECT Type, COUNT(*)
FROM Feature
GROUP BY Type;

Type      COUNT(*)
"Source"  2
"Gene"    3

Note that only the columns that appear in the GROUP BY clause and "aggregated" columns can appear in the output. So the following would generate an error in standard SQL:

SELECT Type, From, To, COUNT(*)
FROM Feature
GROUP BY Type;

Susan B. Davidson, Biol537/CIS636, Fall 2003


Group by … having

HAVING is to GROUP BY as WHERE is to FROM

"HAVING" is used to restrict the groups that appear in the result.

SELECT Type, COUNT(*)
FROM Feature
WHERE To-From > 50
GROUP BY Type
HAVING AVG(To-From) > 8500;

TYPE      COUNT(*)
"Source"  2
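The interplay of WHERE, GROUP BY and HAVING can be traced on the sample Feature table. The sketch below uses Python's `sqlite3` module and writes the span as `"To" - "From"` so that it is positive; the quoting of the two column names and the variable names are assumptions of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE Feature (ID INTEGER, Type TEXT, "From" INTEGER, "To" INTEGER)'
)
conn.executemany(
    "INSERT INTO Feature VALUES (?, ?, ?, ?)",
    [
        (82814, "Source", 1, 597),
        (82814, "Gene", 23, 65),
        (13428, "Gene", 3, 9999),
        (13428, "Gene", 11000, 16665),
        (13428, "Source", 1, 16665),
    ],
)

# One output row per group.
per_type = conn.execute(
    "SELECT Type, COUNT(*) FROM Feature GROUP BY Type ORDER BY Type"
).fetchall()

# WHERE filters rows before grouping, HAVING filters groups afterwards.
long_types = conn.execute("""
    SELECT Type, COUNT(*)
    FROM Feature
    WHERE "To" - "From" > 50
    GROUP BY Type
    HAVING AVG("To" - "From") > 8500
""").fetchall()
```

WHERE first drops the short gene (span 42). The remaining genes average (9996 + 5665) / 2 = 7830.5 and are removed by HAVING, while the two sources average (596 + 16664) / 2 = 8630 and stay.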

Susan B. Davidson, Biol537/CIS636, Fall 2003


Summary

SQL is "relationally complete": it allows you to apply the operators of an algebra of relations (the relational algebra).

Additional features: string comparisons, set membership, arithmetic and grouping.

In contrast, Entrez is a much more limited language.

Susan B. Davidson, Biol537/CIS636, Fall 2003


A Little Exercise


A Little Exercise

Given the table pet below, let us formulate some queries…

+----------+--------+---------+------+------------+------------+
| name     | owner  | species | sex  | birth      | death      |
+----------+--------+---------+------+------------+------------+
| Whistler | Gwen   | bird    |      | 0000-00-00 | NULL       |
| Chirpy   | Gwen   | bird    | f    | 1998-09-11 | 0000-00-00 |
| Bowser   | Diane  | dog     | m    | 1979-08-31 | 1995-07-29 |
| Fang     | Benny  | dog     | m    | 1990-08-27 | 0000-00-00 |
| Buffy    | Harold | dog     | f    | 1989-05-13 | 0000-00-00 |
| Claws    | Gwen   | cat     | m    | 1994-03-17 | 0000-00-00 |
| Fluffy   | Harold | cat     | f    | 1993-02-04 | 0000-00-00 |
| Slim     | Benny  | snake   | m    | 1996-04-29 | 0000-00-00 |
+----------+--------+---------+------+------------+------------+


A Little Exercise

Get all pet names…



A Little Exercise

SELECT name
FROM pet;

Get all owners and list them only once…


A Little Exercise

SELECT DISTINCT owner
FROM pet;

Select the names of all birds…


A Little Exercise

SELECT name
FROM pet
WHERE species='bird';

Select the names of all female birds…


A Little Exercise

SELECT name
FROM pet
WHERE species='bird' AND sex='f';

Select names and owners…


A Little Exercise

SELECT name, owner
FROM pet;

Select owners of birds and dogs…


A Little Exercise

SELECT owner
FROM pet
WHERE species='bird' OR species='dog';

Select all owners starting with Dia…


A Little Exercise

SELECT owner
FROM pet
WHERE owner LIKE 'Dia%';

How many pets does Gwen have?…


A Little Exercise

SELECT COUNT(name)
FROM pet
WHERE owner='Gwen';

Select owners of male pets in sorted order…

Page 55: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Exercise

SELECT owner
FROM pet
WHERE sex="m"
ORDER BY owner;

List owners and the number of pets they have …

Page 56: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Exercise

SELECT owner, COUNT(name)

FROM pet

GROUP BY owner;
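The grouped count can be tried without a MySQL server. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for MySQL (an assumption, not the setup used in the course); the birth and death columns are left out for brevity, and the extra ORDER BY only makes the output order predictable.

```python
import sqlite3

# The pet table from the slides (birth/death columns omitted for brevity).
PETS = [
    ("Whistler", "Gwen", "bird", ""),
    ("Chirpy", "Gwen", "bird", "f"),
    ("Bowser", "Diane", "dog", "m"),
    ("Fang", "Benny", "dog", "m"),
    ("Buffy", "Harold", "dog", "f"),
    ("Claws", "Gwen", "cat", "m"),
    ("Fluffy", "Harold", "cat", "f"),
    ("Slim", "Benny", "snake", "m"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pet (name TEXT, owner TEXT, species TEXT, sex TEXT)")
conn.executemany("INSERT INTO pet VALUES (?, ?, ?, ?)", PETS)

# GROUP BY produces one output row per owner;
# COUNT(name) counts the pets inside each group.
pets_per_owner = conn.execute(
    "SELECT owner, COUNT(name) FROM pet GROUP BY owner ORDER BY owner"
).fetchall()

for owner, num in pets_per_owner:
    print(owner, num)
```

Without GROUP BY, COUNT(name) would collapse the whole table into a single number; with it, the count is taken separately for each owner.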

List owners and the number of pets they have in descending order…

Page 57: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Exercise

SELECT owner, COUNT(name) AS num

FROM pet

GROUP BY owner

ORDER BY num DESC;

List owners and the number of pets they have in descending order, but only if they have more than 1 pet…

Page 58: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Exercise

SELECT owner, COUNT(name) AS num
FROM pet
GROUP BY owner HAVING num > 1
ORDER BY num DESC;

List all pairs of cats and dogs …

Page 59: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Exercise

SELECT p1.name, p1.species, p2.name, p2.species
FROM pet AS p1, pet AS p2
WHERE p1.species="dog" AND p2.species="cat";

Select all male/female pairs of the same species …

Page 60: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Exercise

SELECT p1.name, p1.species, p1.sex, p2.name, p2.species, p2.sex
FROM pet AS p1, pet AS p2
WHERE p1.species=p2.species AND p1.sex="m" AND p2.sex="f";

Can we write p1.sex != p2.sex instead? …

Page 61: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Exercise

We would get the pair Whistler and Chirpy as well
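The reason is that Whistler's sex is blank rather than "m" or "f", and a blank value is still different from "f". The sketch below compares both conditions; it assumes Python's built-in sqlite3 module in place of MySQL, and it assumes the blank cell is an empty string rather than NULL (a NULL would make the comparison unknown and drop the row).

```python
import sqlite3

# The pet table from the slides (birth/death columns omitted for brevity).
# Assumption: Whistler's blank sex is an empty string, not NULL.
PETS = [
    ("Whistler", "Gwen", "bird", ""),
    ("Chirpy", "Gwen", "bird", "f"),
    ("Bowser", "Diane", "dog", "m"),
    ("Fang", "Benny", "dog", "m"),
    ("Buffy", "Harold", "dog", "f"),
    ("Claws", "Gwen", "cat", "m"),
    ("Fluffy", "Harold", "cat", "f"),
    ("Slim", "Benny", "snake", "m"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pet (name TEXT, owner TEXT, species TEXT, sex TEXT)")
conn.executemany("INSERT INTO pet VALUES (?, ?, ?, ?)", PETS)

PAIRS = """
    SELECT p1.name, p2.name
    FROM pet AS p1, pet AS p2
    WHERE p1.species = p2.species AND {condition}
    ORDER BY p1.name, p2.name
"""

# Condition from the slide: p1 must be male and p2 female.
strict = conn.execute(PAIRS.format(condition="p1.sex = 'm' AND p2.sex = 'f'")).fetchall()

# Proposed shortcut: the two sexes merely have to differ.
loose = conn.execute(PAIRS.format(condition="p1.sex != p2.sex")).fetchall()

# Pairs that only the shortcut returns.
extra = sorted(set(loose) - set(strict))

print("strict:", strict)
print("extra with !=:", extra)
```

Besides the Whistler/Chirpy pair, the shortcut also returns every male/female pair a second time in reversed order, because nothing forces p1 to be the male.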

Page 62: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

When working with SCOP through the web interface we are limited in what we can ask

What can we get out of SCOP when it is available as a relational table?

A reminder:

Classes: all alpha, all beta, alpha/beta, alpha+beta

SCOP family: >30% sequence similarity

SCOP superfamily: good structural similarity (sequence similarity possibly <30%)

Page 63: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

At low sequence identity, good structural alignments are still possible.

[Figure: 30% sequence identity threshold; labels "Family" and "Same superfamily, but not family". Picture from www.jenner.ac.uk/YBF/DanielleTalbot.ppt]

Page 64: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

Four tables:

cla: PDB entry and reference to its class, fold, superfamily, family, domain

des: description of each node in the SCOP hierarchy

subchain: chain and possibly beginning and end on the chain for a domain instance

astral: sequence for a domain

Page 65: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

mysql> SELECT * FROM cla LIMIT 1;
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| sid     | pdb_id | sccs    | cl    | cf    | sf    | fa    | dm    | sp    | px    |
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| d1dlwa_ | 1dlw   | a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 |
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+

mysql> SELECT * FROM des LIMIT 1;
+-------+------+------+------+--------------------+
| id    | type | sccs | sid  | description        |
+-------+------+------+------+--------------------+
| 46456 | cl   | a    | -    | All alpha proteins |
+-------+------+------+------+--------------------+

mysql> SELECT * FROM astral LIMIT 1;
+---------+---------+-----------------------------------------------------------+
| sid     | sccs    | seq                                                       |
+---------+---------+-----------------------------------------------------------+
| d1dlwa_ | a.1.1.1 | slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg...|
+---------+---------+-----------------------------------------------------------+

mysql> SELECT * FROM subchain LIMIT 1;
+----+-------+----------+-------+------+
| id | px    | chain_id | begin | end  |
+----+-------+----------+-------+------+
|  1 | 14982 | A        |       |      |
+----+-------+----------+-------+------+

Page 66: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

Entity relationship diagram for SCOP

Thanks to Boris Vassilev

Page 67: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

How many nodes are there in the hierarchy of type class, fold, superfamily, family?

Page 68: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

How many nodes are there in the hierarchy of type class, fold, superfamily, family?

Let us first find out what these types are called:

SELECT DISTINCT type
FROM des;

Now let's list them with their counts.

Page 69: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

How many nodes are there in the hierarchy of type class, fold, superfamily, family?

SELECT type, COUNT(*) AS num FROM des GROUP BY type ORDER BY num;

+------+-------+
| type | num   |
+------+-------+
| cl   |    11 |
| cf   |   854 |
| sf   |  1305 |
| fa   |  2156 |
| dm   |  4567 |
| sp   |  7111 |
| px   | 44327 |
+------+-------+

There are not that many more families than superfamilies.

Which superfamily has the most families?

Page 70: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

Which superfamily has the most families?

SELECT des.sccs, des.description, COUNT(DISTINCT cla.fa) AS num
FROM des, cla WHERE des.id=cla.sf GROUP BY cla.sf ORDER BY num DESC;

+---------+------------------------------------------------------+-----+
| sccs    | description                                          | num |
+---------+------------------------------------------------------+-----+
| a.4.5   | "Winged helix" DNA-binding domain                    |  35 |
| c.69.1  | alpha/beta-Hydrolases                                |  23 |
| c.66.1  | S-adenosyl-L-methionine-dependent methyltransferases |  20 |
| c.52.1  | Restriction endonuclease-like                        |  19 |
| c.37.1  | P-loop containing nucleotide triphosphate hydrolases |  18 |
| b.18.1  | Galactose-binding domain-like                        |  15 |
| d.92.1  | Metalloproteases ("zincins"), catalytic domain       |  14 |
| b.29.1  | Concanavalin A-like lectins/glucanases               |  14 |
| f.2.1   | Membrane all-alpha                                   |  13 |
| c.47.1  | Thioredoxin-like                                     |  12 |
| c.68.1  | Nucleotide-diphospho-sugar transferases              |  12 |
| c.2.1   | NAD(P)-binding Rossmann-fold domains                 |  11 |
| a.4.1   | Homeodomain-like                                     |  10 |
| b.40.4  | Nucleic acid-binding proteins                        |  10 |
| a.118.1 | ARM repeat                                           |  10 |

Page 71: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

Which families does the DNA binding-domain superfamily have?

The sccs of the superfamily is a.4.5. Its families have sccs a.4.5.1, a.4.5.2,…, so how can we list them?

Page 72: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

Which families does the DNA binding-domain superfamily have?

SELECT DISTINCT sccs, description
FROM des WHERE sccs LIKE "a.4.5%" AND type="fa" ORDER BY sccs;

| sccs     | description
+----------+------------------------------------------------
| a.4.5.1  | Biotin repressor-like
| a.4.5.10 | Replication initiation protein
| a.4.5.11 | Helicase DNA-binding domain
| a.4.5.12 | Restriction endonuclease FokI, N-terminal (recognition) domain
| a.4.5.13 | Histone H1/H5
| a.4.5.14 | Forkhead DNA-binding domain
| a.4.5.15 | DNA-binding domain from rap30
| a.4.5.16 | C-terminal domain of RPA32
| a.4.5.17 | Cell cycle transcription factor e2f-dp
| a.4.5.18 | The central core domain of TFIIE beta
| a.4.5.19 | Z-DNA binding domain
| a.4.5.2  | LexA repressor, N-terminal DNA-binding domain
| a.4.5.20 | P4 origin-binding domain-like
| a.4.5.21 | ets domain
| a.4.5.22 | Heat-shock transcription factor
| a.4.5.23 | Interferon regulatory factor
…

Page 73: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

Which families does the DNA binding-domain superfamily have?

…
| a.4.5.24 | Iron-dependent represor protein
| a.4.5.25 | Methionine aminopeptidase, insert domain
| a.4.5.26 | mu transposase, DNA-binding domain
| a.4.5.27 | TnsA endonuclease, C-terminal domain
| a.4.5.28 | MarR-like transcriptional regulators
| a.4.5.29 | Plant O-methyltransferase, N-terminal domain
| a.4.5.3  | Arginine repressor (ArgR), N-terminal DNA-binding domain
| a.4.5.30 | C-terminal domain of the rap74 subunit of TFIIF
| a.4.5.31 | DEP domain
| a.4.5.32 | Lrp/AsnC-like transcriptional regulator N-terminal domain
| a.4.5.33 | Thanscriptional regulator IclR, N-terminal domain
| a.4.5.34 | SCF ubiquitin ligase complex WHB domain
| a.4.5.35 | C-terminal fragment of elongation factor SelB
| a.4.5.4  | CAP C-terminal domain-like
| a.4.5.5  | ArsR-like transcriptional regulators
| a.4.5.6  | GntR-like transcriptional regulators
| a.4.5.7  | Replication terminator protein (RTP)
| a.4.5.8  | N-terminal domain of molybdate-dependent transcriptional regulator ModE
| a.4.5.9  | Transcription factor MotA, activation domain
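Two details of this listing are worth a closer look. The pattern "a.4.5%" matches every sccs that starts with a.4.5, so it would also pick up families of a superfamily numbered a.4.50 to a.4.59 if the database contained one; ending the pattern with a dot avoids that. Also, sccs is sorted as text, which is why a.4.5.10 comes before a.4.5.2. The sketch below illustrates the first point with Python's built-in sqlite3 module (an assumption standing in for MySQL) and a tiny des table whose a.4.50.1 row is invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE des (type TEXT, sccs TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO des VALUES (?, ?, ?)",
    [
        ("sf", "a.4.5", '"Winged helix" DNA-binding domain'),
        ("fa", "a.4.5.1", "Biotin repressor-like"),
        ("fa", "a.4.5.10", "Replication initiation protein"),
        # Hypothetical row, not taken from SCOP: a family of another superfamily.
        ("fa", "a.4.50.1", "Invented family used only for this illustration"),
    ],
)

QUERY = "SELECT sccs FROM des WHERE sccs LIKE ? AND type = 'fa' ORDER BY sccs"

# Pattern from the slide: everything that starts with a.4.5
loose = [row[0] for row in conn.execute(QUERY, ("a.4.5%",))]

# Stricter pattern: the trailing dot limits matches to families of a.4.5
strict = [row[0] for row in conn.execute(QUERY, ("a.4.5.%",))]

print("a.4.5%  ->", loose)
print("a.4.5.% ->", strict)
```

On the data shown on the slides both patterns appear to give the same result, so the stricter pattern is a precaution rather than a correction.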

Page 74: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

Which families does the DNA binding-domain superfamily have? Let's find example PDB entries.

mysql> SELECT DISTINCT pdb_id FROM cla WHERE sccs="a.4.5.1";
+--------+
| pdb_id |
+--------+
| 1bia   |
| 1hxd   |
| 1bib   |
| 1j5y   |
+--------+

mysql> SELECT DISTINCT pdb_id FROM cla WHERE sccs="a.4.5.2";
+--------+
| pdb_id |
+--------+
| 1jhf   |
| 1jhh   |
| 1lea   |
| 1leb   |
+--------+

Page 75: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

[Figures: structures of 1bia and 1jhf]

Page 76: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

[Figure: structure of 1cgp, a.4.5.4]

Page 77: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

Some more…

1smt, a.4.5.5

1hw1, a.4.5.6

1b9n, a.4.5.8

1bm9, a.4.5.7

1f4k, which family?

1f4k, a.4.5.7

Page 78: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

What percentage of superfamilies have only 1 family, what percentage have 2, …?

Page 79: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

What percentage of superfamilies have only 1, 2, 3, … families?

First, let's store the result of the query that found the number of families for each superfamily in a table:

CREATE TABLE fa_freq AS
SELECT des.sccs, des.description, COUNT(DISTINCT cla.fa) AS num
FROM des, cla WHERE des.id=cla.sf
GROUP BY cla.sf ORDER BY num DESC;

Now we count how many superfamilies have 1, 2, 3, … families:

SELECT num AS fa_per_sf, COUNT(*) AS freq
FROM fa_freq GROUP BY num;

Page 80: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

+-----------+------+
| fa_per_sf | freq |
+-----------+------+
|         1 |  981 |
|         2 |  164 |
|         3 |   65 |
|         4 |   29 |
|         5 |   25 |
|         6 |   14 |
|         7 |    6 |
|         8 |    5 |
|         9 |    1 |
|        10 |    3 |
|        11 |    1 |
|        12 |    2 |
|        13 |    1 |
|        14 |    2 |
|        15 |    1 |
|        18 |    1 |
|        19 |    1 |
|        20 |    1 |
|        23 |    1 |
|        35 |    1 |
+-----------+------+

SELECT COUNT(*) FROM fa_freq;
+----------+
| count(*) |
+----------+
|     1305 |
+----------+

What percentage of superfamilies have only 1, 2, 3, … families?

SELECT num AS fa_per_sf, (COUNT(*)/1305) AS perc FROM fa_freq GROUP BY num;
+-----------+------+
| fa_per_sf | perc |
+-----------+------+
|         1 | 0.75 |
|         2 | 0.13 |
|         3 | 0.05 |
|         4 | 0.02 |
|         5 | 0.02 |
|         6 | 0.01 |
|         7 | 0.00 |
|         8 | 0.00 |
|         9 | 0.00 |
|        10 | 0.00 |
|        11 | 0.00 |
|        12 | 0.00 |
|        13 | 0.00 |
|        14 | 0.00 |
|        15 | 0.00 |
|        18 | 0.00 |
|        19 | 0.00 |
|        20 | 0.00 |
|        23 | 0.00 |
|        35 | 0.00 |
+-----------+------+

The perc column holds fractions, so 0.75 corresponds to 75%.
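The total 1305 is typed into the query by hand, so it has to be updated whenever fa_freq changes. One possible alternative is to let a subquery compute the total. The sketch below assumes Python's built-in sqlite3 module in place of MySQL and rebuilds a reduced fa_freq (only the num column) from the frequencies reported on the slide.

```python
import sqlite3

# (families per superfamily, number of superfamilies), as reported on the slide.
FREQ = [
    (1, 981), (2, 164), (3, 65), (4, 29), (5, 25), (6, 14), (7, 6),
    (8, 5), (9, 1), (10, 3), (11, 1), (12, 2), (13, 1), (14, 2),
    (15, 1), (18, 1), (19, 1), (20, 1), (23, 1), (35, 1),
]

conn = sqlite3.connect(":memory:")
# Reduced stand-in for fa_freq: one row per superfamily, only the num column.
conn.execute("CREATE TABLE fa_freq (num INTEGER)")
for num, freq in FREQ:
    conn.executemany("INSERT INTO fa_freq VALUES (?)", [(num,)] * freq)

total = conn.execute("SELECT COUNT(*) FROM fa_freq").fetchone()[0]

# The subquery replaces the hard-coded 1305; multiplying by 100.0
# gives true percentages and forces floating-point division.
percentages = conn.execute(
    """
    SELECT num AS fa_per_sf,
           100.0 * COUNT(*) / (SELECT COUNT(*) FROM fa_freq) AS perc
    FROM fa_freq
    GROUP BY num
    ORDER BY num
    """
).fetchall()

print("superfamilies:", total)
for fa_per_sf, perc in percentages[:3]:
    print(fa_per_sf, round(perc, 2))
```

The same subquery form should also work in MySQL, although that has not been checked against the course database.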

Page 81: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

What percentage of superfamilies have only 1, 2, 3, … families?

This is interesting! The majority of superfamilies (75%) have only one family!

What is the PDB structure with the largest number of (distinct) superfamilies?

Page 82: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

What is the PDB structure with the largest number of (distinct) superfamilies?

SELECT pdb_id, COUNT(sf) AS sf_num
FROM cla GROUP BY pdb_id ORDER BY sf_num DESC LIMIT 10;
+--------+--------+
| pdb_id | sf_num |
+--------+--------+
| 1aon   |     49 |
| 1hto   |     48 |
| 1ir2   |     48 |
| 1htq   |     48 |
| 1der   |     42 |
| 1f49   |     40 |
| 1jyy   |     40 |
| 1gho   |     40 |
| 1jyz   |     40 |
| 1jz1   |     40 |
+--------+--------+

SELECT pdb_id, COUNT(DISTINCT sf) AS distinct_sf_num
FROM cla GROUP BY pdb_id ORDER BY distinct_sf_num DESC LIMIT 10;
+--------+-----------------+
| pdb_id | distinct_sf_num |
+--------+-----------------+
| 1m1k   |              23 |
| 1k9m   |              23 |
| 1kd1   |              23 |
| 1kqs   |              23 |
| 1jj2   |              23 |
| 1k8a   |              23 |
| 1ffk   |              22 |
| 1i96   |              21 |
| 1hnz   |              20 |
| 1hr0   |              20 |
+--------+-----------------+

Page 83: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A structure with 23 different superfamilies

1k9m: Co-crystal structure of tylosin bound to the 50S ribosomal subunit of Haloarcula marismortui (ribosome)

Page 84: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

Now let's plot how many PDBs have 1, 2, 3, … distinct superfamilies.

First of all, let us put the result of the previous slide in a table. (Note: if the table already exists, we have to erase it first with DROP TABLE pdb_sf_num. But be careful using DROP: it permanently deletes the table and its contents.)

CREATE TABLE pdb_sf_num AS
SELECT pdb_id,
COUNT(DISTINCT sf) AS distinct_sf_num
FROM cla
GROUP BY pdb_id
ORDER BY distinct_sf_num DESC;

Now let us count how many PDBs there are with 1, 2, 3, … distinct superfamilies …

Page 85: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

How many PDBs have 1, 2, 3, … distinct superfamilies?

SELECT distinct_sf_num, COUNT(pdb_id) AS num
FROM pdb_sf_num GROUP BY distinct_sf_num ORDER BY distinct_sf_num;
+-----------------+-------+
| distinct_sf_num | num   |
+-----------------+-------+
|               1 | 13960 |
|               2 |  2721 |
|               3 |   495 |
|               4 |   178 |
|               5 |    33 |
|               6 |    25 |
|               7 |     1 |
|               9 |     4 |
|              20 |     9 |
|              21 |     1 |
|              22 |     1 |
|              23 |     6 |
+-----------------+-------+

Page 86: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

Let's do the same in percent.

SELECT COUNT(DISTINCT pdb_id)
FROM cla;
+----------+
| count(*) |
+----------+
|    17434 |
+----------+

There are 17434 PDB IDs.

Page 87: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

How many PDBs have 1, 2, 3, … distinct superfamilies?

SELECT distinct_sf_num, COUNT(pdb_id)/17434 AS perc
FROM pdb_sf_num GROUP BY distinct_sf_num ORDER BY distinct_sf_num;
+-----------------+------+
| distinct_sf_num | perc |
+-----------------+------+
|               1 | 0.80 |
|               2 | 0.16 |
|               3 | 0.03 |
|               4 | 0.01 |
|               5 | 0.00 |
|               6 | 0.00 |
|               7 | 0.00 |
|               9 | 0.00 |
|              20 | 0.00 |
|              21 | 0.00 |
|              22 | 0.00 |
|              23 | 0.00 |
+-----------------+------+

80% of PDB entries contain domains from only a single superfamily!

Page 88: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

What are the most popular superfamilies?

I.e., for which superfamilies are there the most PDB entries?

Page 89: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

What are the most popular superfamilies?

SELECT des.sccs, des.description, COUNT(DISTINCT cla.pdb_id) AS num_of_pdb_ids
FROM cla,des WHERE des.id=cla.sf GROUP BY cla.sf ORDER BY num_of_pdb_ids DESC LIMIT 10;
+--------+------------------------------------------------------+----------------+
| sccs   | description                                          | num_of_pdb_ids |
+--------+------------------------------------------------------+----------------+
| b.1.1  | Immunoglobulin                                       |            823 |
| d.2.1  | Lysozyme-like                                        |            777 |
| b.47.1 | Trypsin-like serine proteases                        |            649 |
| c.37.1 | P-loop containing nucleotide triphosphate hydrolases |            521 |
| c.2.1  | NAD(P)-binding Rossmann-fold domains                 |            384 |
| a.1.1  | Globin-like                                          |            384 |
| c.1.8  | (Trans)glycosidases                                  |            332 |
| b.50.1 | Acid proteases                                       |            288 |
| b.29.1 | Concanavalin A-like lectins/glucanases               |            230 |
| c.47.1 | Thioredoxin-like                                     |            217 |
+--------+------------------------------------------------------+----------------+

Page 90: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

Are all superfamilies equally likely to co-occur?

Let us generate a co-occurrence map as an answer:

Which superfamilies co-occur most frequently?

Which superfamilies have the most co-occurrence partners?

The co-occurrence map should consist of two tables:

A table with PDB ID, superfamily 1, superfamily 2 (to avoid listing each pair twice we will require sf1 < sf2, i.e. that the ID of sf1 sorts before the ID of sf2)

A table with superfamily 1 and 2 and the number of PDBs containing this co-occurrence

Page 91: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

Co-occurrence map:

SELECT DISTINCT c1.pdb_id, c1.sf, c2.sf
FROM cla AS c1, cla AS c2 WHERE c1.pdb_id=c2.pdb_id AND c1.sf<c2.sf LIMIT 10;
+--------+-------+-------+
| pdb_id | sf    | sf    |
+--------+-------+-------+
| 1cqx   | 46458 | 63380 |
| 1cqx   | 46458 | 52343 |
| 1gvh   | 46458 | 63380 |
| 1gvh   | 46458 | 52343 |
| 1b33   | 46458 | 54580 |
| 1qgw   | 46458 | 56568 |
| 1kf6   | 46548 | 46977 |
| 1kf6   | 46548 | 51905 |
| 1kf6   | 46548 | 54292 |
| 1kf6   | 46548 | 56425 |
+--------+-------+-------+
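The self-join pairs every domain of a PDB entry with every other domain of the same entry; the condition c1.sf<c2.sf keeps each unordered pair once and drops pairs of a superfamily with itself. A minimal sketch on a toy table follows, assuming Python's built-in sqlite3 module in place of MySQL. The PDB IDs, domain IDs and superfamily numbers are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced stand-in for the cla table: one row per domain.
conn.execute("CREATE TABLE cla (sid TEXT, pdb_id TEXT, sf INTEGER)")
conn.executemany(
    "INSERT INTO cla VALUES (?, ?, ?)",
    [
        # Invented entry 1abc: one domain of sf 100, two domains of sf 200.
        ("d1", "1abc", 100),
        ("d2", "1abc", 200),
        ("d3", "1abc", 200),
        # Invented entry 2xyz: superfamilies 100 and 300.
        ("d4", "2xyz", 100),
        ("d5", "2xyz", 300),
        # Invented entry 3one: a single domain, so no co-occurrence at all.
        ("d6", "3one", 100),
    ],
)

# Same shape as the query on the slide; ORDER BY is added for stable output.
cooc = conn.execute(
    """
    SELECT DISTINCT c1.pdb_id, c1.sf, c2.sf
    FROM cla AS c1, cla AS c2
    WHERE c1.pdb_id = c2.pdb_id AND c1.sf < c2.sf
    ORDER BY c1.pdb_id, c1.sf, c2.sf
    """
).fetchall()

print(cooc)
```

DISTINCT matters here: without it the pair (100, 200) of the invented entry 1abc would be listed twice, once for each of its two sf 200 domains.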

We are still missing the sf names, which we can get from the des table

Page 92: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

Co-occurrence map:

CREATE TABLE cooc AS
SELECT DISTINCT c1.pdb_id, c1.sf AS sf1, d1.description AS sf1name, c2.sf AS sf2, d2.description AS sf2name
FROM cla AS c1, cla AS c2, des AS d1, des AS d2
WHERE c1.pdb_id=c2.pdb_id AND c1.sf<c2.sf AND c1.sf=d1.id AND c2.sf=d2.id;

+--------+-------+--------------------------+-------+--------------------------------------------------------------+
| pdb_id | sf1   | sf1name                  | sf2   | sf2name                                                      |
+--------+-------+--------------------------+-------+--------------------------------------------------------------+
| 1cqx   | 46458 | Globin-like              | 63380 | Riboflavin synthase domain-like                              |
| 1cqx   | 46458 | Globin-like              | 52343 | Ferredoxin reductase-like, C-terminal NADP-linked domain     |
| 1gvh   | 46458 | Globin-like              | 63380 | Riboflavin synthase domain-like                              |
| 1gvh   | 46458 | Globin-like              | 52343 | Ferredoxin reductase-like, C-terminal NADP-linked domain     |
| 1b33   | 46458 | Globin-like              | 54580 | Allophycocyanin linker chain (domain)                        |
| 1qgw   | 46458 | Globin-like              | 56568 | Non-globular alpha+beta subunits of globular proteins        |
| 1kf6   | 46548 | alpha-helical ferredoxin | 46977 | Succinate dehydrogenase/fumarate reductase C-terminal domain |
| 1kf6   | 46548 | alpha-helical ferredoxin | 51905 | FAD/NAD(P)-binding domain                                    |
| 1kf6   | 46548 | alpha-helical ferredoxin | 54292 | 2Fe-2S ferredoxin-like                                       |
| 1kf6   | 46548 | alpha-helical ferredoxin | 56425 | Succinate dehydrogenase/fumarate reductase catalytic domain  |
+--------+-------+--------------------------+-------+--------------------------------------------------------------+

Now let us count the distinct PDB IDs for each co-occurrence.

Page 93: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

Number of instances in co-occurrence map

SELECT COUNT(DISTINCT pdb_id) AS num, sf1, sf1name, sf2, sf2name
FROM cooc GROUP BY sf1,sf2 ORDER BY num DESC LIMIT 10;
+-----+-------+-----------------------------------------------+-------+------------------------------------------------------------------+
| num | sf1   | sf1name                                       | sf2   | sf2name                                                          |
+-----+-------+-----------------------------------------------+-------+------------------------------------------------------------------+
| 137 | 48726 | Immunoglobulin                                | 54452 | MHC antigen-recognition domain                                   |
| 125 | 51011 | alpha-Amylases, C-terminal beta-sheet domain  | 51445 | (Trans)glycosidases                                              |
| 117 | 47616 | Glutathione S-transferases, C-terminal domain | 52833 | Thioredoxin-like                                                 |
|  99 | 47802 | DNA polymerase beta, N-terminal domain-like   | 56699 | Nucleotidyltransferases                                          |
|  97 | 53098 | Ribonuclease H-like                           | 56672 | DNA/RNA polymerases                                              |
|  74 | 51735 | NAD(P)-binding Rossmann-fold domains          | 55347 | Glyceraldehyde-3-phosphate dehydrogenase-like, C-terminal domain |
|  64 | 48726 | Immunoglobulin                                | 51445 | (Trans)glycosidases                                              |
|  63 | 51905 | FAD/NAD(P)-binding domain                     | 54373 | FAD-linked reductases, C-terminal domain                         |
|  58 | 50203 | Bacterial enterotoxins                        | 54334 | Superantigen toxins, C-terminal domain                           |
|  55 | 48726 | Immunoglobulin                                | 51011 | alpha-Amylases, C-terminal beta-sheet domain                     |
+-----+-------+-----------------------------------------------+-------+------------------------------------------------------------------+

Is it valid to draw any conclusions from the above table of superfamily co-occurrences and their frequencies? We should be careful, as the number of co-occurrences may be biased by the abundance of each superfamily; e.g. immunoglobulin is very frequent in the PDB.

Page 94: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

This problem is quite similar to the generation of substitution matrices (last term).

A little reminder…

Page 95: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

BLOSUM

BLOcks SUbstitution Matrix (based on the BLOCKS database)

Generation of BLOSUM x:

Group highly similar sequences and replace them by a representative sequence.

Only consider sequences with no more than x% similarity.

Align sequences (no gaps).

For any pair of amino acids a,b and for all columns c of the alignment, let q(a,b) be the number of co-occurrences of a,b in all columns c (taken as a fraction of all pairs).

Let p(a) be the overall probability of a occurring.

The BLOSUM entry for a,b is log2 ( q(a,b) / ( p(a)*p(b) ) )

Page 96: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

"Normalised" number of instances in co-occurrence map

OK, so to avoid bias we will compute the following:

Logarithm of ( probability of sf1 and sf2 co-occurring / (probability of sf1 * probability of sf2) )

To compute

the co-occurrence probabilities, we will count the frequency of sf1 and sf2 co-occurring and divide this by the overall number of co-occurrences, and

the probability of a superfamily, we will divide the frequency of the superfamily by the overall number of superfamily occurrences (the frequencies summed over all superfamilies).

To compute this, let us put "freq of sf1 and sf2 co-occurring" into a table cooc_pdb and "freq of sf" into a table sf_pdb.
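Written as a formula, the score described above is a log-odds ratio of the same shape as a BLOSUM entry. The symbols are shorthand introduced here and do not appear on the slides: n(sf_1, sf_2) is the number of PDB entries in which the pair co-occurs, n(sf) is the number of PDB entries containing a superfamily, N_cooc is the sum of n(sf_1, sf_2) over all pairs, and N_sf is the sum of n(sf) over all superfamilies.

```latex
\[
\mathrm{val}(sf_1, sf_2) =
\log \frac{n(sf_1, sf_2) / N_{\mathrm{cooc}}}
          {\bigl(n(sf_1) / N_{\mathrm{sf}}\bigr)\,\bigl(n(sf_2) / N_{\mathrm{sf}}\bigr)}
\]
```

A positive value means the pair is seen together more often than expected if the two superfamilies occurred independently; a negative value means it is seen less often.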

Page 97: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

Count PDBs for each co-occurrence:

CREATE TABLE cooc_pdb AS

SELECT COUNT(DISTINCT pdb_id) AS num,

sf1, sf1name, sf2, sf2name

FROM cooc

GROUP BY sf1,sf2

ORDER BY num DESC;

Count PDBs for each superfamily

CREATE TABLE sf_pdb AS

SELECT des.id AS sf, des.description,

COUNT(DISTINCT cla.pdb_id) AS num_of_pdb_ids

FROM cla,des

WHERE des.id=cla.sf

GROUP BY cla.sf

ORDER BY num_of_pdb_ids DESC;

Page 98: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

A Little Science

Count the overall number of PDBs over all co-occurrences:

SELECT SUM(num) AS totalCoocFreq
FROM cooc_pdb;
+---------------+
| totalCoocFreq |
+---------------+
|          9813 |
+---------------+

Count the overall number of PDBs over all superfamilies:

SELECT SUM(num_of_pdb_ids) AS total_num_of_pdb_ids
FROM sf_pdb;
+----------------------+
| total_num_of_pdb_ids |
+----------------------+
|                22318 |
+----------------------+

Page 99: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.

By Michael Schroeder, Biotec 99

A Little Science

Normalised co-occurrence map

CREATE TABLE normalised_cooc AS
SELECT LOG((cooc_pdb.num/9813) /
           ((sf_pdb1.num_of_pdb_ids/22318)*(sf_pdb2.num_of_pdb_ids/22318))) AS val,
       cooc_pdb.sf1, cooc_pdb.sf1name, cooc_pdb.sf2, cooc_pdb.sf2name
FROM cooc_pdb, sf_pdb AS sf_pdb1, sf_pdb AS sf_pdb2
WHERE cooc_pdb.sf1=sf_pdb1.sf AND cooc_pdb.sf2=sf_pdb2.sf
ORDER BY val DESC;

+-----------+----------------------------------------------+-------------------------------+
| val       | sf1name                                      | sf2name                       |
+-----------+----------------------------------------------+-------------------------------+
| 10.834834 | Arp2/3 complex 21 kDa subunit ARPC3          | Arp2/3 complex 16 kDa subunit |
| 10.834834 | tRNA splicing endonuclease, C-terminal domain| tRNA splicing endonuclease Edn|
| 10.834834 | Cell-division inhibitor MinC, C-terminal doma| Cell-division inhibitor MinC, |
| 10.834834 | Tricorn protease N-terminal domain           | Tricorn protease N-terminal do|
| 10.834834 | L-fucose isomerase, C-terminal domain        | L-fucose isomerase, N-terminal|
| 10.834834 | Catalytic domain of malonyl-CoA ACP transacyl| Probable ACP-binding domain of|
| 10.834834 | Transcription factor IIA (TFIIA), N-terminal | Transcription factor IIA (TFII|
| 10.834834 | Glutamyl tRNA-reductase dimerization domain  | Glutamyl tRNA-reductase cataly|
| 10.834834 | N-terminal domain of phosphatidylinositol tra| C-terminal domain of phosphati|
| 10.834834 | Rotavirus NSP2 fragment, C-terminal domain   | Rotavirus NSP2 fragment, N-ter|
| 10.834834 | Lipovitellin-phosvitin complex, superhelical | Lipovitellin-phosvitin complex|
| 10.834834 | Aminoimidazole ribonucleotide synthetase (Pur| Aminoimidazole ribonucleotide |
| 10.834834 | Colicin E3 translocation domain              | Colicin E3 receptor domain    |
| 10.834834 | Arp2/3 complex 21 kDa subunit ARPC3          | Arp2/3 complex subunits       |
| 10.834834 | Arp2/3 complex 16 kDa subunit ARPC5          | Arp2/3 complex subunits       |
| 10.834834 | Head domain of nucleotide exchange factor Grp| Coiled-coil domain of nucleoti|
| 10.834834 | YhbC-like, C-terminal domain                 | YhbC-like, N-terminal domain  |
| 10.141687 | TolB, C-terminal domain                      | TolB, N-terminal domain       |
+-----------+----------------------------------------------+-------------------------------+
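The score in the query has the form log(observed / expected), where the observed term is the normalised co-occurrence count and the expected term is the product of the two normalised superfamily counts. As a rough sanity check, the sketch below redoes this arithmetic in Python. The constants 9813 and 22318 are copied from the query; the function name and the example counts are assumptions, chosen only because they land on the two values printed in the table. MySQL's one-argument LOG() is the natural logarithm, as is math.log.

```python
import math

COOC_TOTAL = 9813   # divisor applied to cooc_pdb.num in the query
PDB_TOTAL = 22318   # divisor applied to num_of_pdb_ids in the query


def normalised_cooc(cooc_num, pdb_ids_sf1, pdb_ids_sf2):
    """log(observed / expected), mirroring the LOG(...) expression of the query."""
    observed = cooc_num / COOC_TOTAL
    expected = (pdb_ids_sf1 / PDB_TOTAL) * (pdb_ids_sf2 / PDB_TOTAL)
    return math.log(observed / expected)


# Hypothetical counts: a pair seen once, each superfamily in one PDB entry ...
top_score = normalised_cooc(1, 1, 1)
# ... and the same pair when one superfamily occurs in two PDB entries.
next_score = normalised_cooc(1, 1, 2)
print(top_score, next_score)
```

With these assumed counts the two results come out close to the 10.834834 and 10.141687 shown above, which suggests that the top of the list is dominated by superfamily pairs that are each seen only once or twice.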

Page 100: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


A Little Science

Which superfamily has the most co-occurrence partners?

Page 101: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


A Little Science

Which superfamily has the most co-occurrence partners?

CREATE TABLE cooc_partner AS
SELECT COUNT(DISTINCT c2.sf) AS distinctnum, COUNT(c2.sf) AS num,
       des.sccs, des.description
FROM cla AS c1, cla AS c2, des
WHERE c1.pdb_id=c2.pdb_id AND des.id=c1.sf
GROUP BY c1.sf;

SELECT * FROM cooc_partner ORDER BY distinctnum DESC LIMIT 10;

+-------------+-------+--------+------------------------------------+
| distinctnum | num   | sccs   | description                        |
+-------------+-------+--------+------------------------------------+
|          74 | 19177 | b.1.1  | Immunoglobulin                     |
|          68 |  5228 | c.37.1 | P-loop                             |
|          67 |  1602 | b.40.4 | Nucleic acid-binding proteins      |
|          42 |   435 | c.55.4 | Translational machinery components |
|          42 |   580 | g.39.1 | Glucocorticoid receptor-like       |
|          31 |   561 | b.43.3 | Translation proteins               |
|          28 |  1575 | b.47.1 | Trypsin-like serine proteases      |
|          28 |   725 | d.14.1 | Ribosomal protein S5 domain 2-like |
|          25 |   982 | a.4.5  | "Winged helix" DNA-binding domain  |
|          25 |   283 | d.52.3 | Prokaryotic type KH domain (pKH-…  |
+-------------+-------+--------+------------------------------------+
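To see what the self-join on cla counts, the query can be tried on a toy dataset. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL; the three PDB entries, the superfamily ids and the labels sf.A to sf.C are invented for illustration, and an ORDER BY is added so that the result order is fixed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cla (pdb_id TEXT, sf INTEGER);
CREATE TABLE des (id INTEGER, sccs TEXT, description TEXT);
-- toy data: p1 contains superfamilies 1 and 2, p2 contains 1 and 3, p3 only 2
INSERT INTO cla VALUES ('p1', 1), ('p1', 2), ('p2', 1), ('p2', 3), ('p3', 2);
INSERT INTO des VALUES (1, 'sf.A', 'toy superfamily A'),
                       (2, 'sf.B', 'toy superfamily B'),
                       (3, 'sf.C', 'toy superfamily C');
""")

# Same shape as the cooc_partner query: pair every domain with every domain
# of the same PDB entry, then count the partners per superfamily.
rows = conn.execute("""
SELECT COUNT(DISTINCT c2.sf) AS distinctnum, COUNT(c2.sf) AS num,
       des.sccs, des.description
FROM cla AS c1, cla AS c2, des
WHERE c1.pdb_id = c2.pdb_id AND des.id = c1.sf
GROUP BY c1.sf
ORDER BY distinctnum DESC, num DESC
""").fetchall()

for row in rows:
    print(row)
```

Because c1 and c2 range over the same table, every domain is also paired with itself. A superfamily is therefore counted among its own partners, which is why the smallest possible distinctnum is 1 rather than 0.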

Page 102: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


A Little Science

Which superfamily has the most co-occurrence partners?

SELECT * FROM cooc_partner ORDER BY num DESC LIMIT 10;

+-------------+-------+---------+----------------------------------------+
| distinctnum | num   | sccs    | description                            |
+-------------+-------+---------+----------------------------------------+
|          74 | 19177 | b.1.1   | Immunoglobulin                         |
|           4 |  7532 | b.1.4   | beta-Galactosidase/glucuronidase       |
|          17 |  6355 | c.2.1   | NAD(P)-binding Rossmann-fold domains   |
|          24 |  6233 | f.2.1   | Membrane all-alpha                     |
|           7 |  6007 | d.153.1 | N-terminal nucleophile aminohydrolases |
|           1 |  5687 | i.1.1   | Ribosome and ribosomal fragments       |
|          68 |  5228 | c.37.1  | P-loop                                 |
|          18 |  5154 | c.1.8   | (Trans)glycosidases                    |
|           5 |  3915 | a.1.1   | Globin-like                            |
|          11 |  3880 | b.18.1  | Galactose-binding domain-like          |
+-------------+-------+---------+----------------------------------------+

Page 103: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Scale-free Networks

Small-world property: “everybody on earth is connected to everybody else through a chain of about six intermediaries”

Reason: network structure
  A few highly connected nodes
  Many nodes with few connections

Consequence: very short average distance between any two nodes

Formally: the number of interaction partners follows a power law, i.e. the fraction of nodes with k interaction partners is proportional to k^(-γ) for some constant γ > 0. This falls off much more slowly than an exponential function, so highly connected nodes are rare but do occur.

Protein interactions are small-world networks!

Page 104: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Scale-free Networks

Is our co-occurrence map a scale-free network?

Page 105: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Scale-free Networks

Is our co-occurrence map a scale-free network?

Let’s get the distribution of the number of distinct co-occurrence partners:

SELECT distinctnum, COUNT(*)
FROM cooc_partner
GROUP BY distinctnum
ORDER BY distinctnum;

Let’s plot it.

+-------------+----------+
| distinctnum | count(*) |
+-------------+----------+
|           1 |      586 |
|           2 |      291 |
|           3 |      152 |
|           4 |       72 |
|           5 |       47 |
|           6 |       27 |
|           7 |       26 |
|           8 |        9 |
|           9 |       16 |
|          10 |        8 |
|          11 |        9 |
|          12 |        3 |
|          13 |        3 |
|          14 |        1 |
|          15 |        1 |
|          16 |        2 |
|          17 |        1 |
|          18 |        3 |
|          20 |        1 |
|          21 |       16 |
|          23 |       18 |
|          24 |        1 |
|          25 |        4 |
|          28 |        2 |
|          31 |        1 |
|          42 |        2 |
|          67 |        1 |
|          68 |        1 |
|          74 |        1 |
+-------------+----------+

Page 106: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Scale-free Networks

[Figure: distribution of the number of co-occurrence partners (x-axis = number of partners, y-axis = frequency)]
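A power law f(k) ∝ k^(-γ) turns into a straight line with slope -γ when both axes are logarithmic, so a quick first check is to fit a line to log(frequency) against log(number of partners). The sketch below does this with a plain least-squares fit on the distribution table from the previous slide. The function name is an assumption, and such a fit is only a rough indication, not a rigorous test for scale-freeness.

```python
import math

# (distinctnum, count(*)) pairs copied from the distribution table
distribution = [
    (1, 586), (2, 291), (3, 152), (4, 72), (5, 47), (6, 27), (7, 26),
    (8, 9), (9, 16), (10, 8), (11, 9), (12, 3), (13, 3), (14, 1), (15, 1),
    (16, 2), (17, 1), (18, 3), (20, 1), (21, 16), (23, 18), (24, 1),
    (25, 4), (28, 2), (31, 1), (42, 2), (67, 1), (68, 1), (74, 1),
]


def loglog_slope(points):
    """Least-squares slope of log(frequency) against log(number of partners)."""
    xs = [math.log(k) for k, _ in points]
    ys = [math.log(n) for _, n in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


slope = loglog_slope(distribution)
print(f"slope on log-log axes: {slope:.2f}")  # -slope is a crude estimate of γ
```

A clearly negative slope with points scattered around a line is what a power law predicts; the bumps at 21 and 23 partners in the table show that the fit is far from perfect.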

Page 107: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Scale-free Networks

But maybe the number of co-occurrence partners is simply correlated with the number of PDB entries we have for that superfamily, in which case the power-law relationship would be an artefact.

Let’s test this by extending the previous table with a column for the average number of underlying co-occurrences, given by the num column in cooc_partner.

Page 108: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Scale-free Networks

Distinct number of co-occurrence partners and average number of co-occurrences

SELECT distinctnum, COUNT(*), AVG(num)
FROM cooc_partner
GROUP BY distinctnum
ORDER BY distinctnum;

+-------------+----------+------------+
| distinctnum | count(*) | avg(num)   |
+-------------+----------+------------+
|           1 |      586 |    66.8328 |
|           2 |      291 |   124.0481 |
|           3 |      152 |   251.1316 |
|           4 |       72 |   324.1389 |
|           5 |       47 |   344.2340 |
|           6 |       27 |   289.7778 |
|           7 |       26 |   883.1923 |
|           8 |        9 |   663.8889 |
|           9 |       16 |   375.0625 |
|          10 |        8 |   772.3750 |
|          11 |        9 |  1454.3333 |
|          12 |        3 |   726.6667 |
|          13 |        3 |   642.3333 |
|          14 |        1 |   489.0000 |
|          15 |        1 |  2530.0000 |
|          16 |        2 |   645.0000 |
|          17 |        1 |  6355.0000 |
|          18 |        3 |  2254.3333 |
|          20 |        1 |   979.0000 |
|          21 |       16 |   254.1875 |
|          23 |       18 |   290.2778 |
|          24 |        1 |  6233.0000 |
|          25 |        4 |   433.0000 |
|          28 |        2 |  1150.0000 |
|          31 |        1 |   561.0000 |
|          42 |        2 |   507.5000 |
|          67 |        1 |  1602.0000 |
|          68 |        1 |  5228.0000 |
|          74 |        1 | 19177.0000 |
+-------------+----------+------------+

Page 109: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Scale-free Networks

No good correlation between the number of distinct co-occurrence partners and the number of underlying instances.

Hence: the scale-free property does not appear to be an artefact.

Page 110: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Limits

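List all superfamily pairs which can interact directly or indirectly (“Transitive Closure”).

This query cannot be expressed in SQL without recursion, which older MySQL versions do not support (recursive queries, WITH RECURSIVE, entered the standard with SQL:1999 and MySQL with version 8.0).

Without recursion, SQL does not have the same “power” as a programming language like Python.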
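In a general-purpose language the transitive closure is a few lines of graph search. The following is a minimal sketch in Python, assuming the co-occurrence pairs have already been fetched from the database as a list of tuples; the function name and the toy pairs are illustrative and not part of the course material.

```python
from collections import defaultdict, deque


def transitive_closure(pairs):
    """Return all ordered pairs (x, y), x != y, linked directly or indirectly."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)  # co-occurrence is treated as symmetric

    closure = set()
    for start in list(graph):
        # breadth-first search collects everything reachable from start
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        closure.update((start, other) for other in seen if other != start)
    return closure


# Toy input: b.1.1 - c.37.1 - b.40.4 form a chain, a.1.1 - b.18.1 a separate pair.
toy_pairs = [("b.1.1", "c.37.1"), ("c.37.1", "b.40.4"), ("a.1.1", "b.18.1")]
closure = transitive_closure(toy_pairs)
print(("b.1.1", "b.40.4") in closure)  # indirect link through c.37.1
```

The loop runs as often as the data requires, which is exactly what a fixed, non-recursive SQL query cannot do: each additional step of indirection would need another self-join written out by hand.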

Page 111: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Introduction to MySQL

Page 112: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


SQL: “Structured Query Language”, the most common standardized language used to access databases.

SQL has several parts:

DDL – Data Definition Language {defining, deleting, modifying relation schemas}

DML – Data Manipulation Language {inserting, deleting, modifying tuples in the database}

Embedded SQL – defines how SQL statements can be used within general-purpose programming languages

Page 113: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


MySQL, the most popular Open Source SQL database, was originally developed, distributed and supported by MySQL AB (now part of Oracle).

MySQL is a relational database management system. MySQL software is Open Source.

• Written in C and C++. Tested with a broad range of different compilers. Works on many different platforms. APIs for C, C++, Eiffel, Java, Perl, PHP, Python, Ruby, and Tcl.
• You can find the MySQL manual and documentation at: www.mysql.com/documentation/
• You can download and install MySQL on your own computer (both under Windows and Linux)

Page 114: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


MySQL

To see a list of options provided by mysql, invoke it with the --help option:

shell> mysql --help

Using SQL:

On Linux, log on to MySQL with one of the following commands (-h names the server host, -u the user and -D the database to use; -p makes mysql prompt for the password):

shell> /usr/local/mysql/bin/mysql -h hostname -D loginname -p

shell> mysql -h host -u user -p
Enter password: ********

The ******** represents your password; enter it when mysql displays the Enter password: prompt.

Page 115: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Basic Query:

select A1, A2,…,An
from r1, r2, …,rm
where P;

A1, A2,…,An represent attributes
r1, r2, …,rm represent relations
P represents a predicate (guard condition)

Keywords may be entered in any letter case:

mysql> SELECT VERSION(), CURRENT_DATE;
mysql> select version(), current_date;
mysql> SeLeCt vErSiOn(), current_DATE;

Page 116: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Prompt   Meaning
mysql>   Ready for new command.
->       Waiting for next line of multiple-line command.
'>       Waiting for next line, collecting a string that begins with a single quote (').
">       Waiting for next line, collecting a string that begins with a double quote (").

A command may span several lines:

mysql> SELECT *
    -> FROM my_table
    -> WHERE name = "Smith" AND age < 30;

If a string is left unterminated, mysql keeps waiting for the closing quote. Type the closing quote followed by \c to get back to the mysql> prompt:

mysql> SELECT * FROM my_table WHERE name = "Smith AND age < 30;
    "> "\c
mysql>

\c cancels the command being entered.

Page 117: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Basic Database Operation

Create a database
Create a table
Load data into the table
Retrieve data from the table in various ways
Use multiple tables

Suppose you have several pets in your home (your menagerie) and you'd like to keep track of various types of information about them. You can do so by creating tables to hold your data and loading them with the desired information. Then you can answer different sorts of questions about your animals by retrieving data from the tables.

Page 118: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Creating and Using a Database

mysql> SHOW DATABASES;

The SHOW statement can be used to find out which databases currently exist on the server.

mysql> USE testdb

testdb is a database name. The USE command does not need a semicolon and must be given on a single line. A database has to be selected with USE before it can be used.

mysql> CREATE DATABASE example;

Unlike keywords, database names are case-sensitive on Unix systems (on Windows they are not, by default); the same applies to table names. So example != Example != EXAMPLE or some other variant.

Page 119: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Creating a Table

mysql> SHOW TABLES;

Displays the current list of tables.

mysql> CREATE TABLE pet (name VARCHAR(20), owner VARCHAR(20),
    -> species VARCHAR(20), sex CHAR(1), birth DATE, death DATE);

mysql> SHOW TABLES;

Now lists the new table pet.

The table structure can be verified with the DESCRIBE command:

mysql> DESCRIBE pet;
+---------+-------------+------+-----+---------+-------+
| Field   | Type        | Null | Key | Default | Extra |
+---------+-------------+------+-----+---------+-------+
| name    | varchar(20) | YES  |     | NULL    |       |
| owner   | varchar(20) | YES  |     | NULL    |       |
| species | varchar(20) | YES  |     | NULL    |       |
| sex     | char(1)     | YES  |     | NULL    |       |
| birth   | date        | YES  |     | NULL    |       |
| death   | date        | YES  |     | NULL    |       |
+---------+-------------+------+-----+---------+-------+

Page 120: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Loading Data into a Table

LOAD DATA reads a text file that holds one record per line, with the values separated by tabs and given in the order of the table's columns. It is useful when many records have to be inserted at once.

Example: pet.txt is a text file with a single record; \N stands for NULL.

name      owner  species  sex  birth       death
Whistler  Gwen   bird     \N   1997-12-09  \N

mysql> LOAD DATA LOCAL INFILE "pet.txt" INTO TABLE pet;

The INSERT command can be used when records need to be inserted one at a time. NULL can be written directly as a column value.

Example:

mysql> INSERT INTO pet
    -> VALUES ('Puffball','Diane','hamster','f','1999-03-30',NULL);
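The same two ways of loading data can be imitated from a program. The sketch below is an illustration under assumptions: it uses Python's built-in sqlite3 module instead of MySQL, keeps the content of pet.txt in a string, and the helper parse_records is a made-up name. It parses tab-separated lines, maps the \N marker to NULL, and inserts the Puffball record separately.

```python
import csv
import io
import sqlite3

# Stand-in for pet.txt: tab-separated values, one record per line, \N marks NULL.
PET_TXT = "Whistler\tGwen\tbird\t\\N\t1997-12-09\t\\N\n"


def parse_records(text):
    """Yield one tuple per line, replacing the NULL marker by None."""
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        yield tuple(None if value == r"\N" else value for value in fields)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pet (name TEXT, owner TEXT, species TEXT,"
    " sex TEXT, birth TEXT, death TEXT)"
)

# Bulk load, playing the role of LOAD DATA
conn.executemany("INSERT INTO pet VALUES (?, ?, ?, ?, ?, ?)", parse_records(PET_TXT))

# Single-record INSERT; None is stored as NULL
conn.execute(
    "INSERT INTO pet VALUES (?, ?, ?, ?, ?, ?)",
    ("Puffball", "Diane", "hamster", "f", "1999-03-30", None),
)

rows = conn.execute("SELECT name, sex, death FROM pet ORDER BY name").fetchall()
print(rows)
```

Passing the values as parameters (the ? placeholders) rather than pasting them into the SQL string avoids quoting problems with names that contain apostrophes.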

Page 121: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Retrieving Information from a Table

The SELECT statement is used to pull information from a table. The general form of the statement is:

SELECT what_to_select
FROM which_table
WHERE conditions_to_satisfy

The simplest form of SELECT retrieves everything from a table:

mysql> SELECT * FROM pet;

You can select only particular rows from your table:

mysql> SELECT * FROM pet WHERE name = "Bowser";

You can specify conditions on any column, not just name. For example, if you want to know which animals were born during or after 1998, test the birth column:

mysql> SELECT * FROM pet WHERE birth >= "1998-1-1";

You can combine conditions, for example, to locate female dogs:

mysql> SELECT * FROM pet WHERE species = "dog" AND sex = "f";
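The three kinds of WHERE condition can be tried on a handful of rows. This is a small sketch with Python's sqlite3 module standing in for MySQL. Whistler and Puffball appear on the earlier slides; the other rows and the helper function names are assumptions made for the example. SQLite stores the dates as text, so the comparison value is written zero-padded as 1998-01-01.

```python
import sqlite3

SAMPLE_PETS = [  # only Whistler and Puffball are taken from the slides
    ("Buffy", "Harold", "dog", "f", "1989-05-13", None),
    ("Bowser", "Diane", "dog", "m", "1979-08-31", "1995-07-29"),
    ("Chirpy", "Gwen", "bird", "f", "1998-09-11", None),
    ("Puffball", "Diane", "hamster", "f", "1999-03-30", None),
    ("Whistler", "Gwen", "bird", None, "1997-12-09", None),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pet (name TEXT, owner TEXT, species TEXT,"
    " sex TEXT, birth TEXT, death TEXT)"
)
conn.executemany("INSERT INTO pet VALUES (?, ?, ?, ?, ?, ?)", SAMPLE_PETS)


def names(where, params=()):
    """Return the names of all pets matching the WHERE condition, sorted."""
    query = f"SELECT name FROM pet WHERE {where} ORDER BY name"
    return [row[0] for row in conn.execute(query, params)]


by_name = names("name = ?", ("Bowser",))
born_1998_or_later = names("birth >= ?", ("1998-01-01",))
female_dogs = names("species = ? AND sex = ?", ("dog", "f"))
print(by_name, born_1998_or_later, female_dogs)
```

Whistler is missing from the last result not because it is a bird only: its sex is NULL, and a comparison with NULL never evaluates to true.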

Page 122: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Selecting Particular Columns

If you don't want to see entire rows from your table, just name the columns in which you're interested, separated by commas.

For example, if you want to know when your animals were born, select the name and birth columns:

mysql> SELECT name, birth FROM pet;

To find out who owns pets, use the first query below. It returns one row per pet, so an owner with several pets appears several times; add DISTINCT to list each owner only once:

mysql> SELECT owner FROM pet;
mysql> SELECT DISTINCT owner FROM pet;

You can use a WHERE clause to combine row selection with column selection. For example, to get birth dates for dogs and cats only, use this query:

mysql> SELECT name, species, birth FROM pet
    -> WHERE species = "dog" OR species = "cat";

Page 123: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Sorting Rows

To sort a result, use an ORDER BY clause.

Here are animal birthdays, sorted by date:

mysql> SELECT name, birth FROM pet ORDER BY birth;

To sort in reverse order, add the DESC (descending) keyword to the name of the column you are sorting by:

mysql> SELECT name, birth FROM pet ORDER BY birth DESC;

You can sort on multiple columns. For example, to sort by type of animal, then by birth date within animal type with youngest animals first, use the following query:

mysql> SELECT name, species, birth FROM pet ORDER BY species, birth DESC;
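The effect of sorting on two columns is easiest to see on concrete rows. The sketch below runs the last query with Python's sqlite3 module as a stand-in for MySQL; the five sample rows are assumptions for illustration (only Whistler's data appears on the slides).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pet (name TEXT, species TEXT, birth TEXT)")
conn.executemany(
    "INSERT INTO pet VALUES (?, ?, ?)",
    [
        ("Buffy", "dog", "1989-05-13"),
        ("Bowser", "dog", "1979-08-31"),
        ("Fang", "dog", "1990-08-27"),
        ("Chirpy", "bird", "1998-09-11"),
        ("Whistler", "bird", "1997-12-09"),
    ],
)

# species ascending first; within one species the youngest animal comes first
ordered = conn.execute(
    "SELECT name, species, birth FROM pet ORDER BY species, birth DESC"
).fetchall()

for row in ordered:
    print(row)
```

DESC applies only to the column it follows (birth); species is still sorted in ascending order, so the birds are listed before the dogs.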

Page 124: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Pattern Matching

MySQL provides standard SQL pattern matching as well as a form of pattern matching based on extended regular expressions similar to those used by Unix utilities such as grep.

SQL pattern matching allows you to use `_' to match any single character and `%' to match an arbitrary number of characters (including zero characters). In MySQL, SQL patterns are case-insensitive by default. Some examples are shown here. Note that you do not use = or <> when you use SQL patterns; use the LIKE or NOT LIKE comparison operators instead.

To find names beginning with `b':

mysql> SELECT * FROM pet WHERE name LIKE "b%";

To find names containing exactly five characters, use the `_' pattern character:

mysql> SELECT * FROM pet WHERE name LIKE "_____";
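Both patterns can be checked quickly on a few names. The sketch below uses Python's sqlite3 module, whose LIKE operator treats % and _ the same way and is also case-insensitive for ASCII letters by default; the list of names and the helper function are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pet (name TEXT)")
conn.executemany(
    "INSERT INTO pet VALUES (?)",
    [("Buffy",), ("Bowser",), ("Fang",), ("Claws",), ("Fluffy",)],
)


def matching(pattern):
    """Return all names matching the SQL LIKE pattern, sorted."""
    cursor = conn.execute(
        "SELECT name FROM pet WHERE name LIKE ? ORDER BY name", (pattern,)
    )
    return [row[0] for row in cursor]


starts_with_b = matching("b%")     # % matches any number of characters
five_letters = matching("_____")   # each _ matches exactly one character
print(starts_with_b, five_letters)
```

The lower-case pattern b% finds Buffy and Bowser because the comparison ignores case; the five underscores find only the names of exactly five characters.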

Page 125: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Counting Rows

For example, you might want to know how many pets you have, or how many pets each owner has. Counting the total number of animals you have is the same question as “How many rows are in the pet table?”

COUNT(*) counts the number of rows (COUNT(column) counts only the non-NULL values of that column), so the query to count your animals looks like this:

mysql> SELECT COUNT(*) FROM pet;

You can use COUNT() together with GROUP BY if you want to find out how many pets each owner has:

mysql> SELECT owner, COUNT(*) FROM pet GROUP BY owner;
+--------+----------+
| owner  | COUNT(*) |
+--------+----------+
| Benny  |        2 |
| Diane  |        2 |
| Gwen   |        3 |
| Harold |        2 |
+--------+----------+
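The difference between COUNT(*) and COUNT(column) shows up as soon as a column contains NULL. The sketch below uses Python's sqlite3 module as a stand-in for MySQL. The nine sample rows are assumptions, chosen so that the per-owner counts agree with the table above; Whistler's sex is NULL, as in the earlier pet.txt example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pet (name TEXT, owner TEXT, sex TEXT)")
conn.executemany(
    "INSERT INTO pet VALUES (?, ?, ?)",
    [
        ("Fluffy", "Harold", "f"), ("Claws", "Gwen", "m"),
        ("Buffy", "Harold", "f"), ("Fang", "Benny", "m"),
        ("Bowser", "Diane", "m"), ("Chirpy", "Gwen", "f"),
        ("Whistler", "Gwen", None), ("Slim", "Benny", "m"),
        ("Puffball", "Diane", "f"),
    ],
)

# COUNT(*) counts rows, COUNT(sex) skips the row where sex is NULL
all_rows, known_sex = conn.execute("SELECT COUNT(*), COUNT(sex) FROM pet").fetchone()

# one count per owner
per_owner = conn.execute(
    "SELECT owner, COUNT(*) FROM pet GROUP BY owner ORDER BY owner"
).fetchall()

print(all_rows, known_sex, per_owner)
```

GROUP BY owner splits the rows into one group per owner before counting, so COUNT(*) is evaluated once for each group instead of once for the whole table.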

Page 126: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Examples of some common queries

CREATE TABLE shop (
    article INT(4) UNSIGNED ZEROFILL DEFAULT '0000' NOT NULL,
    dealer  CHAR(20)                 DEFAULT ''     NOT NULL,
    price   DOUBLE(16,2)             DEFAULT '0.00' NOT NULL,
    PRIMARY KEY(article, dealer));

INSERT INTO shop VALUES
    (1,'A',3.45),(1,'B',3.99),(2,'A',10.99),(3,'B',1.45),(3,'C',1.69),
    (3,'D',1.25),(4,'D',19.95);

mysql> SELECT * FROM shop;
+---------+--------+-------+
| article | dealer | price |
+---------+--------+-------+
|    0001 | A      |  3.45 |
|    0001 | B      |  3.99 |
|    0002 | A      | 10.99 |
|    0003 | B      |  1.45 |
|    0003 | C      |  1.69 |
|    0003 | D      |  1.25 |
|    0004 | D      | 19.95 |
+---------+--------+-------+

The maximum value for a column
The row holding the maximum of a certain column
Maximum of column per group
The rows holding the group-wise maximum of a certain field

Page 127: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


“What's the highest price?”

SELECT MAX(price) AS price FROM shop;
+-------+
| price |
+-------+
| 19.95 |
+-------+

“Find number, dealer, and price of the most expensive article.”

In ANSI SQL (and MySQL from version 4.1 on) this is easily done with a subquery:

SELECT article, dealer, price FROM shop WHERE price = (SELECT MAX(price) FROM shop);

In MySQL versions prior to 4.1, you have to do it in two steps:
1. Get the maximum price value from the table with a SELECT statement.
2. Using this value, compile the actual query:

SELECT article, dealer, price FROM shop WHERE price=19.95;

Or use a user variable (written @variable-name and assigned with :=, e.g. @temp := 5):

SELECT @max_price := MAX(price) FROM shop;
SELECT article, dealer, price FROM shop WHERE price = @max_price;
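The subquery variant and the two-step variant can be compared side by side. The sketch below loads the shop rows from the slide into SQLite through Python's sqlite3 module; SQLite is only a stand-in for MySQL here (it has no ZEROFILL, so article numbers print as 1 to 4), and a Python variable plays the role of the user variable @max_price.

```python
import sqlite3

SHOP_ROWS = [  # rows from the INSERT statement on the slide
    (1, "A", 3.45), (1, "B", 3.99), (2, "A", 10.99), (3, "B", 1.45),
    (3, "C", 1.69), (3, "D", 1.25), (4, "D", 19.95),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shop (article INTEGER, dealer TEXT, price REAL,"
    " PRIMARY KEY (article, dealer))"
)
conn.executemany("INSERT INTO shop VALUES (?, ?, ?)", SHOP_ROWS)

# One statement: the inner SELECT is evaluated first and yields a single value.
with_subquery = conn.execute(
    "SELECT article, dealer, price FROM shop"
    " WHERE price = (SELECT MAX(price) FROM shop)"
).fetchall()

# Two steps: fetch the maximum, then use it as a parameter of a second query.
(max_price,) = conn.execute("SELECT MAX(price) FROM shop").fetchone()
in_two_steps = conn.execute(
    "SELECT article, dealer, price FROM shop WHERE price = ?", (max_price,)
).fetchall()

print(with_subquery, in_two_steps)
```

Both variants return the same row; the subquery form has the advantage that the table cannot change between the two steps.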

Page 128: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


Maximum of Column per Group

“What's the highest price per article?”

SELECT article, MAX(price) AS price FROM shop GROUP BY article;

+---------+-------+
| article | price |
+---------+-------+
|    0001 |  3.99 |
|    0002 | 10.99 |
|    0003 |  1.69 |
|    0004 | 19.95 |
+---------+-------+

Page 129: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


The Rows Holding the Group-wise Maximum of a Certain Field

“For each article, find the dealer(s) with the highest price.”

In ANSI SQL (and MySQL version 4.1 or later), do it with a subquery:

SELECT article, dealer, price
FROM shop s1
WHERE price = (SELECT MAX(s2.price)
               FROM shop s2
               WHERE s1.article = s2.article);
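The correlated subquery can be run on the shop data to confirm that it picks one row per article. This is a sketch with Python's sqlite3 module standing in for MySQL; an ORDER BY is added so that the result order is fixed, and article numbers appear without zero-fill.

```python
import sqlite3

SHOP_ROWS = [  # rows from the INSERT statement on the earlier slide
    (1, "A", 3.45), (1, "B", 3.99), (2, "A", 10.99), (3, "B", 1.45),
    (3, "C", 1.69), (3, "D", 1.25), (4, "D", 19.95),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shop (article INTEGER, dealer TEXT, price REAL,"
    " PRIMARY KEY (article, dealer))"
)
conn.executemany("INSERT INTO shop VALUES (?, ?, ?)", SHOP_ROWS)

# For every row s1 the inner query computes the maximum price of s1's article;
# the row is kept only if its own price equals that maximum.
top_per_article = conn.execute("""
SELECT article, dealer, price
FROM shop s1
WHERE price = (SELECT MAX(s2.price)
               FROM shop s2
               WHERE s1.article = s2.article)
ORDER BY article
""").fetchall()

for row in top_per_article:
    print(row)
```

The inner query refers to s1.article from the outer query, which is what makes it "correlated": it is conceptually re-evaluated for each outer row. If two dealers share the top price for an article, both rows are returned.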

Page 130: Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction to Databases.


In MySQL versions prior to 4.1, which do not support subqueries (nested queries), it has to be done in several steps with a temporary table.

CREATE TEMPORARY TABLE tmp

( article INT(4) UNSIGNED ZEROFILL DEFAULT '0000' NOT NULL,

price DOUBLE(16,2) DEFAULT '0.00' NOT NULL);

LOCK TABLES shop read;

INSERT INTO tmp

SELECT article, MAX(price) FROM shop GROUP BY article;

SELECT shop.article, dealer, shop.price

FROM shop, tmp

WHERE shop.article=tmp.article AND shop.price=tmp.price;

UNLOCK TABLES; DROP TABLE tmp;
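The temporary-table steps can be replayed in the same way. The sketch below uses Python's sqlite3 module as a stand-in for MySQL and makes two stated simplifications: SQLite has no LOCK TABLES / UNLOCK TABLES, so those statements are left out, and the column types are reduced to INTEGER and REAL.

```python
import sqlite3

SHOP_ROWS = [  # rows from the INSERT statement on the earlier slide
    (1, "A", 3.45), (1, "B", 3.99), (2, "A", 10.99), (3, "B", 1.45),
    (3, "C", 1.69), (3, "D", 1.25), (4, "D", 19.95),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shop (article INTEGER, dealer TEXT, price REAL,"
    " PRIMARY KEY (article, dealer))"
)
conn.executemany("INSERT INTO shop VALUES (?, ?, ?)", SHOP_ROWS)

# Step 1: store the maximum price per article in a temporary table.
conn.execute("CREATE TEMPORARY TABLE tmp (article INTEGER, price REAL)")
conn.execute("INSERT INTO tmp SELECT article, MAX(price) FROM shop GROUP BY article")

# Step 2: join back to shop to recover the dealer that offers this price.
via_tmp_table = conn.execute("""
SELECT shop.article, dealer, shop.price
FROM shop, tmp
WHERE shop.article = tmp.article AND shop.price = tmp.price
ORDER BY shop.article
""").fetchall()

# Step 3: clean up.
conn.execute("DROP TABLE tmp")
print(via_tmp_table)
```

The result is the same set of rows as with the correlated subquery; the join on both article and price does the work that the subquery's equality test does in the single-statement version.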

