Date post: | 11-Jan-2016 |
Category: |
Documents |
Upload: | sherilyn-green |
View: | 223 times |
Download: | 0 times |
Michael Schroeder BioTechnological CenterTU Dresden Biotec
Introduction to Databases
By Michael Schroeder, Biotec 2
Structure
Motivation Introduction to MySQL Example Queries Using SQL to query SCOP
By Michael Schroeder, Biotec 3
Motivation In the last term,
we accessed most information online via the web we interacted directly and manually with databases and tools we had to manually submit queries, interpret results. select interesting
results, cut&paste them, and submit queries again,… Pro:
Reasonably easy to get hold of information Con:
Not possible to ask many queries Queries limited by interface provided by web page Difficult/impossible to integrate information from different sites
In this term, we will look at the databases underlying the online front ends How is the data internally stored? How can we - and more important computer programs - directly interact
with the underlying data, so that we can ask more powerful queries, large queries, and integrate different systems
By Michael Schroeder, Biotec 4
What actually happens when you retrieve data online? LLNE
YLEEVE EYEEDE
LLNEYLEEVE EYEEDE
… Compose result web page and send it
Send resultDisplay result4
Start programme that evaluates query by accessing database…
Send queryDisplay home page, enter query, and press submit
3
Send home page2
Get it and send itGet home page1
Web ServerMessageClient
By Michael Schroeder, Biotec 5
What actually happensYou are limited by what web server allows you to ask:Example CATH:
•PDB ID, •CATH code, or •General text
But you cannot ask:•In how many different PDB structures is there a P-loop domain?•Is there a PDB entry with a P-loop and a DNA-binding domain•How many different superfamilies does the largest structure in PDB have?
•With direct access to the underlying database you could answer all these questions (and many more)
By Michael Schroeder, Biotec 6
Querying over the Web
Problem is always the same: The web interface limits access to
the underlying database How can we interact directly with
the database
By Michael Schroeder, Biotec 7
What databases are about Logical organization of data
data models, schema design, dictionaries Physical organization of data
Fast retrieval, indexing, compact storage of data
Other requirements: Logging (important to know who did what to the data) Security and access control (important to know who can
do what) Transactions and concurrency control (important when
more than one person is working on database) Integrity (important to ensure that only valid entries in the
database) Recovery (important as hardware and software can
sometimes fail
By Michael Schroeder, Biotec 8
Different types of databases
Flat files XML Relational database (Object databases) (Object relational databases)
By Michael Schroeder, Biotec 9
Flat files We can store any data in a flat
file, e.g. EMBL But is this a database?
Logical data organisation: None, unless we define one (as done for EMBL) and adhere to it, which is not enforced
Physical data organisation: None, we cannot optimise retrieval for common queries
Logging: No Access control: Implicit
through Unix Transaction and concurrency
control: None Integrity: None Recovery: If files are backed-
up they can be recovered. However, not on the fly
ID BTBPTIG standard; genomic DNA; MAM; 3998 BP.XXAC X03365; K00966;XXSV X03365.1XXDT 18-NOV-1986 (Rel. 10, Created)DT 20-MAY-1992 (Rel. 31, Last updated, Version 3)XXDE Bovine pancreatic trypsin inhibitor (BPTI) geneXXKW Alu-like repetitive sequence; protease inhibitor;
trypsin inhibitor.XXOS Bos taurus (cow)OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
Euteleostomi; Mammalia;OC Eutheria; Cetartiodactyla; Ruminantia; Pecora;
Bovoidea; Bovidae; Bovinae;OC Bos.XXRN [1]RP 1-3998RX MEDLINE; 86158754.RX PUBMED; 2420326.RA Kingston I.B., Anderson S.;RT "Sequences encoding two trypsin inhibitors occur in
strikingly similarRT genomic environments";RL Biochem. J. 233(2):443-450(1986).XXRN [2]RX MEDLINE; 84070725.RX PUBMED; 6580617.RA Anderson S., Kingston I.B.;RT "Isolation of a genomic clone for bovine pancreatic
trypsin inhibitor byRT using a unique-sequence synthetic DNA probe.";
By Michael Schroeder, Biotec 10
XML files We can store any data in XML,
the eXtentable Mark-up Language, e.g. Medline
But is this a database? Logical data organisation:
yes, XML schema, which is enforced
Physical data organisation: None, we cannot optimise retrieval for common queries
Logging: No Access control: Implicit
through Unix Transaction and concurrency
control: None Integrity: None Recovery: If files are backed-
up they can be recovered. However, not on the fly
<Article>
<Journal>
<ISSN>0270-7306</ISSN>
<JournalIssue>
<Volume>19</Volume>
<Issue>11</Issue>
<PubDate>
<Year>1999</Year>
<Month>Nov</Month>
</PubDate>
</JournalIssue>
</Journal>
<ArticleTitle>Differential regulation of the cell wall integrity mitogen-activated protein kinase pathway in budding yeast by the protein tyrosine phosphatases Ptp2 and Ptp3.
</ArticleTitle>
<Pagination>
<MedlinePgn>7651-60</MedlinePgn>
</Pagination>
<Abstract>
<AbstractText>Mitogen-activated protein kinases (MAPKs) are inactivated by dual-specificity and protein tyrosine phosphatases (PTPs) in yeasts. In Saccharomyces cerevisiae, two PTPs, Ptp2 and Ptp3, inactivate the MAPKs, Hog1 and Fus3, with different specificities... </AbstractText>
</Abstract>
<Affiliation>Department of Chemistry, University of Colorado, Boulder, Colorado 80309-0215, USA.
</Affiliation>…
By Michael Schroeder, Biotec 11
Relational Database
Central Idea: Data as relations in a table E.g. SCOP, Structural Classification of Proteins
+-------+------+---------+---------+--------------------------------------+| id | type | sccs | sid | description |+-------+------+---------+---------+--------------------------------------+| 46457 | cf | a.1 | - | Globin-like || 46458 | sf | a.1.1 | - | Globin-like || 46459 | fa | a.1.1.1 | - | Truncated hemoglobin || 46460 | dm | a.1.1.1 | - | Truncated hemoglobin || 46461 | sp | a.1.1.1 | - | Ciliate (Paramecium caudatum) || 14982 | px | a.1.1.1 | d1dlwa_ | 1dlw A: || 46462 | sp | a.1.1.1 | - | Green alga (Chlamydomonas eugametos) || 14983 | px | a.1.1.1 | d1dlya_ | 1dly A: || 63437 | sp | a.1.1.1 | - | Mycobacterium tuberculosis || 62301 | px | a.1.1.1 | d1idra_ | 1idr A: |+-------+------+---------+---------+--------------------------------------+
By Michael Schroeder, Biotec 12
Relational Database
Central Idea: Data as relations in a table E.g. Employee
+-------+------+---------+---------+| id | name | salary | role |+-------+------+---------+---------+| 46457 | pete | 50.000 | director|| 46458 | jane | 60.000 | nurse || 46459 | asif | 70.000 | driver |+-------+------+---------+---------+
By Michael Schroeder, Biotec 13
Relational Database
Central Idea: Data as relations in a table E.g. pets
+----------+--------+---------+------+------------+------------+| name | owner | species | sex | birth | death |+----------+--------+---------+------+------------+------------+| Whistler | Gwen | bird | | 0000-00-00 | NULL || Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 || Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 || Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 || Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 || Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 || Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 || Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 |+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec 14
Relational Database
Central Idea: Data as relations in a table E.g. school
+-------+------+---------+| id | name | subject |+-------+------+---------+| 46458 | rick | bio || 46459 | gerd | bio || 46460 | mary | bio || 46461 | ella | math || 14982 | anne | math || 46462 | paul | math |+-------+------+---------+
+-------+------+---------+| id | prof | subject |+-------+------+---------+| 51221 | bert | bio || 55435 | anne | math |+-------+------+---------+
+---------+------+-----+------+| subject | room | day | time |+---------+------+-----+------+| bio | A | mo | 3pm || math | B | tue | 1pm |+---------+------+-----+------+
By Michael Schroeder, Biotec 15
Relational Database
A cell in the table stores a single number or string, but not a list
Lists, sets need to be flattened
+------+-------------+| prof | subjects |+------+-------------+| bert | {bio,sport} || anne | {math,arts} |+------+-------------+
+------+-------------+| prof | subject |+------+-------------+| bert | bio || bert | sport || anne | arts || anne | math |+------+-------------+
By Michael Schroeder, Biotec 16
Bioinformatics: 10 years of resistance to flattening!
Why the resistance? Bioinformatics data is naturally nested Extensive Use of sets and lists
E.g. Swissprot: Features, keywords, References
Such data can be flattened, but the resulting relational schema
is hard to understand hence it is hard to formulate queries. For example, storing the SWISSPROT entry in a relational
database would split it over 15-20 tables.
By Michael Schroeder, Biotec 17
Relational Databases
RDB introduced in 1970 by Codd Took off in the 80s
In the business world, relational databases are the rule (Oracle, Sybase, mySQL, DB2, Microsoft Access).
Large biomedical databases typically use a relational technology; but there are also a lot of homegrown systems (ACeDB, SRS indexed files). Data is almost always viewed and exported in a variety of flat file formats (EMBL, GenBank among others)
By Michael Schroeder, Biotec 18
The flood of biomedical data…
Since 1980, the number and size of biomedical databases has been growing exponentially.
How can you find sources of information you are seeking? Nucleic Acids Research Database Issue in January of
every year (http://nar.oupjournals.org/) Dbcat (http://www.infobiogen.fr/ ): a flat file database
of 500 biological databases.
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 19
Relational Schema
The schema of a database is a set of relation names, their field names and types.
Example:
Entry(ID: int, Length: int, Seq: string, Mod: date)
Feature(ID: int, Type: string, From: int,To: int)
Entry and Feature are relation names,
ID, Seq, Mod, etc are attribute names, and
int, string, date are domains
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 20
Relation Instance
An instance of a relation is a set of tuples of the type of the relation.
A tuple of Entry could be:
( ID: 82814, Length: 597, Seq:“ccagctaaccg”, Mod: 1-7-95)
A tuple of Feature could be:
(ID: 82814, Type:“source”, From:1,To:8959)
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 21
Tabular representation
Typically, relations are displayed as tables
Sequence:
Feature:ID Type From To
82814 “Source” 1 18482814 “Gene” 23 65
tuples
attributes
ID Length Seq Mod
82814 597 “ccagctaa...” 1-07-9598608 18976 “accgcct...” 2-14-98
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 22
Entities and Relationships
There is a one-many relationship from Entry to Feature; each entry can have many features, but a feature can be on at most one entry.
Put another way, the existence of a feature depends on the existence of the owning entry referential integrity
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 23
Integrity Constraints
ID is the key of Entry, indicated by underlining: No two tuples of any instance of Entry can have the
same ID. In Feature, there is a referential integrity constraint
on ID: Every ID in Feature must appear in some tuple in
Entry. This is specified in the data definition language
(DDL), and enforced by the system as updates are made to the instance.
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 24
DDL for this relational schema
CREATE TABLE Entry (Id INTEGER, Length INTEGER, Sequence LONGCHAR, Mod DATE, PRIMARY KEY (Id) )
CREATE TABLE Feature (Id INTEGER, Type CHAR(15), From INTEGER, To INTEGER, PRIMARY KEY (Id, Type, From, To) FOREIGN KEY (Id) REFERENCES Entry ON DELETE CASCADE ON UPDATE CASCADE)
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 25
Querying relational databases
The language SQL has become a standard for querying relational databases. Based on a curious mixture of the relational algebra and relational calculus (formal languages), it allows new relations of information to be computed from a set of relations.
Unlike the relational algebra, it allows other useful stuff: count, sum, min, max, etc.
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 26
Basic Query relation-list A list of relation names (possibly with a
range-variable after each name). target-list A list of attributes of relations in relation-list.
* can be used to denote all atts. qualification Comparisons (Attr op const or Attr1 op
Attr2, where op is one of <, <=, >, >=, =, <> combined using AND, OR and NOT.
DISTINCT (optional) keyword indicates that the answer should not contain duplicates. Default is that duplicates are not eliminated!
SELECT [DISTINCT] target-listFROM relation-listWHERE qualification
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 27
Conceptual Evaluation Strategy
Compute the product of relation-list Discard tuples that fail qualification Project over attributes in target-list If DISTINCT then eliminate duplicates
This is probably a very bad way of executing the query, and a good query optimizer will use all sorts of tricks to find efficient strategies to compute the same answer.
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 28
Sample tables
Sequence:
Feature:ID Type From To82814 “Source” 1 59782814 “Gene” 23 651 “Gene” 3 999913428 “Gene” 11000 1666513428 “Source” 1 16665
ID Length Seq Mod
82814 597 “ccagctaa...” 1-07-9598608 18976 “accgcct...” 2-14-981 16665 “gtgtaa….” 1-19-9776582 9976 “actgga…” 2-29-00
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 29
Simple queriesSELECT * FROM SequenceWHERE Length < 10000;
SELECT TypeFROM Feature;
Print all sequences with length less than 10000.
Print the type of all features.
ID Length Seq Mod
82814 597 “ccagctaa...” 1-07-9576582 9976 “actgga…” 2-29-00
Type“Source” “Gene” “Gene” “Gene” “Source”
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 30
Distinct Note that SQL did not eliminate duplicates. We
need to request this explicitly.
SELECT DISTINCT TypeFROM Feature;
Print the type of all features(no duplicates).
Type“Source” “Gene”
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 31
Pattern Matching Can be used in where clause. “_” denotes any
character, “%” 0 or more characters.
SELECT * FROM SequenceWHERE Seq LIKE ‘a_%g'
ID Length Seq Mod
98608 18976 “accgcct...” 2-14-9876582 9976 “actgga…” 2-29-00
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 32
Arithmetic “as” can be used to label columns in the output;
arithmetic can be used to compute results
SELECT DISTINCT ID, To-From+1 as LengthFROM Feature;
ID Length82814 59782814 43 13428 999713428 566613428 16665
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 33
Set operations -- union
SELECT ID FROM SequenceWHERE Length<10000UNIONSELECT IDFROM FeatureWHERE Type=“Source”;
• Duplicates do not occur in the union.
ID 765828281413428
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 34
The UNION ALL operator preserves duplicates
SELECT ID FROM SequenceWHERE Length<10000UNION ALLSELECT IDFROM FeatureWHERE Type=“Source”;
ID 76582828148281413428
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 35
Intersection and difference
SELECT Id FROM SequenceINTERSECTSELECT IdFROM Feature;
SELECT Id FROM SequenceMINUSSELECT IdFROM Feature;
ID 9860876582
ID 8281413428
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 36
Products
Note that the ID column name is duplicated in the output.
SELECT *FROM Sequence,Feature;
ID Length Seq Mod ID Type From To
82814 597 “ccagctaa...” 1-07-95 82814 “Source” 1 59798608 18976 “accgcct...” 2-14-98 82814 “Source” 1 5971 16665 “gtgtaa….” 1-19-97 82814 “Source” 1 59776582 9976 “actgga…” 2-29-00 82814 “Source” 1 597……. (lots more!)
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 37
Conditional join
SELECT *FROM Sequence, FeatureWHERE Sequence.Id = Feature.Id;
ID Length Seq Mod ID Type From To
82814 597 “ccagctaa...” 1-07-95 82814 “Source” 1 59782814 597 “ccagctaa...” 1-07-95 82814 “Gene” 23 6513428 16665 “gtgtaa….” 1-19-97 13428 “Gene” 3 99991 16665 “gtgtaa….” 1-19-97 13428 “Gene” 11000 1666513428 16665 “gtgtaa….” 1-19-97 13428 “Source” 1 16665
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 38
Counting
Surprisingly, the answer to both of these is the following:
SELECT COUNT(*)FROM Feature;
SELECT COUNT(Type)FROM Feature;
COUNT(TYPE) 5
Print the numberof feature entries.
Print the numberof types of features.
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 39
Counting, cont.
To fix this, we use the keyword “DISTINCT”:
Can also use SUM, AVG, MIN and MAX.
SELECT COUNT(DISTINCT Type)FROM Feature;
COUNT(DISTINCT Type) 3
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 40
Group by
So far, these aggregate operators have been applied to all qualifying tuples. Sometimes we want to apply them to each of several groups of tuples.
For example: “Print the type and number of features of each type.”
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 41
Group by
Note that only the columns that appear in the GROUP BY statement and “aggregated” columns can appear in the output. So the following would generate an error.
SELECT Type, COUNT(*)FROM FeatureGROUP BY Type;
Type COUNT(*)“Source” 2“Gene” 3
SELECT Type, From, To, COUNT(*)FROM FeatureGROUP BY Type;
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 42
Group by … having
HAVING is to GROUP BY as WHERE is to FROM
“HAVING” is used to restrict the groups that appear in the result.
SELECT Type, COUNT(*)FROM FeatureWHERE From-To > 50GROUP BY TypeHAVING AVG(From-To)> 8500
TYPE COUNT(*)“Source” 2
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 43
Summary
SQL is “relationally complete”: allows you to perform operators in an algebra of relations (the relational algebra).
Additional features: string comparisons, set membership, arithmetic and grouping.
In contrast, Entrez is a much more limited language.
Susan B. Davidson, Biol537/CIS636, Fall 2003
By Michael Schroeder, Biotec 44
A Little Exercise
By Michael Schroeder, Biotec 45
A Little Exercise
Given the table pet below let us formulate some queries…
+----------+--------+---------+------+------------+------------+| name | owner | species | sex | birth | death |+----------+--------+---------+------+------------+------------+| Whistler | Gwen | bird | | 0000-00-00 | NULL || Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 || Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 || Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 || Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 || Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 || Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 || Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 |+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec 46
A Little Exercise
Get all pet names…
+----------+--------+---------+------+------------+------------+| name | owner | species | sex | birth | death |+----------+--------+---------+------+------------+------------+| Whistler | Gwen | bird | | 0000-00-00 | NULL || Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 || Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 || Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 || Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 || Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 || Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 || Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 |+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec 47
A Little Exercise
SELECT name
FROM pet; Get all owners and list them only once
+----------+--------+---------+------+------------+------------+| name | owner | species | sex | birth | death |+----------+--------+---------+------+------------+------------+| Whistler | Gwen | bird | | 0000-00-00 | NULL || Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 || Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 || Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 || Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 || Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 || Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 || Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 |+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec 48
A Little Exercise
SELECT DISTINCT owner
FROM pet; Select the names of all birds…
+----------+--------+---------+------+------------+------------+| name | owner | species | sex | birth | death |+----------+--------+---------+------+------------+------------+| Whistler | Gwen | bird | | 0000-00-00 | NULL || Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 || Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 || Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 || Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 || Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 || Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 || Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 |+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec 49
A Little ExerciseSELECT name
FROM pet
WHERE species=“bird”; Select the names of all female birds…
+----------+--------+---------+------+------------+------------+| name | owner | species | sex | birth | death |+----------+--------+---------+------+------------+------------+| Whistler | Gwen | bird | | 0000-00-00 | NULL || Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 || Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 || Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 || Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 || Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 || Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 || Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 |+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec 50
A Little ExerciseSELECT name
FROM pet
WHERE species=“bird” AND sex=“f”; Select names and owners
+----------+--------+---------+------+------------+------------+| name | owner | species | sex | birth | death |+----------+--------+---------+------+------------+------------+| Whistler | Gwen | bird | | 0000-00-00 | NULL || Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 || Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 || Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 || Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 || Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 || Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 || Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 |+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec 51
A Little Exercise
SELECT name,owner
FROM pet; Select owners of birds and dogs…
+----------+--------+---------+------+------------+------------+| name | owner | species | sex | birth | death |+----------+--------+---------+------+------------+------------+| Whistler | Gwen | bird | | 0000-00-00 | NULL || Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 || Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 || Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 || Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 || Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 || Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 || Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 |+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec 52
A Little ExerciseSELECT owner
FROM pet
WHERE species=“bird” OR species=“dog”; Select all owners starting with Dia…
+----------+--------+---------+------+------------+------------+| name | owner | species | sex | birth | death |+----------+--------+---------+------+------------+------------+| Whistler | Gwen | bird | | 0000-00-00 | NULL || Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 || Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 || Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 || Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 || Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 || Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 || Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 |+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec 53
A Little ExerciseSELECT owner
FROM pet
WHERE owner LIKE “Dia%”; How many pets has Gwen?...
+----------+--------+---------+------+------------+------------+| name | owner | species | sex | birth | death |+----------+--------+---------+------+------------+------------+| Whistler | Gwen | bird | | 0000-00-00 | NULL || Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 || Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 || Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 || Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 || Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 || Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 || Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 |+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec 54
A Little ExerciseSELECT count(name)
FROM pet
WHERE owner=“Gwen”; Select owners of male pets in sorted order…
+----------+--------+---------+------+------------+------------+| name | owner | species | sex | birth | death |+----------+--------+---------+------+------------+------------+| Whistler | Gwen | bird | | 0000-00-00 | NULL || Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 || Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 || Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 || Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 || Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 || Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 || Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 |+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec 55
A Little ExerciseSELECT owner
FROM pet
WHERE sex=“m”
ORDER BY owner;+----------+--------+---------+------+------------+------------+| name | owner | species | sex | birth | death |+----------+--------+---------+------+------------+------------+| Whistler | Gwen | bird | | 0000-00-00 | NULL || Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 || Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 || Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 || Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 || Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 || Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 || Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 |+----------+--------+---------+------+------------+------------+
List owners and the number of pets they have…
By Michael Schroeder, Biotec 56
A Little ExerciseSELECT owner, COUNT(name)
FROM pet
GROUP BY owner;
+----------+--------+---------+------+------------+------------+| name | owner | species | sex | birth | death |+----------+--------+---------+------+------------+------------+| Whistler | Gwen | bird | | 0000-00-00 | NULL || Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 || Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 || Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 || Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 || Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 || Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 || Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 |+----------+--------+---------+------+------------+------------+
List owners and the number of pets they have in descending order…
By Michael Schroeder, Biotec 57
A Little ExerciseSELECT owner, COUNT(name) AS num
FROM pet
GROUP BY owner
ORDER BY num DESC;
+----------+--------+---------+------+------------+------------+| name | owner | species | sex | birth | death |+----------+--------+---------+------+------------+------------+| Whistler | Gwen | bird | | 0000-00-00 | NULL || Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 || Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 || Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 || Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 || Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 || Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 || Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 |+----------+--------+---------+------+------------+------------+
List owners and the number of pets they have in descending order, but only if they have more than 1 pet…
By Michael Schroeder, Biotec 58
A Little ExerciseSELECT owner, COUNT(name) AS num
FROM pet
GROUP BY owner HAVING num > 1
ORDER BY num DESC; List all pairs of cats and dogs…+----------+--------+---------+------+------------+------------+| name | owner | species | sex | birth | death |+----------+--------+---------+------+------------+------------+| Whistler | Gwen | bird | | 0000-00-00 | NULL || Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 || Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 || Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 || Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 || Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 || Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 || Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 |+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec 59
A Little ExerciseSELECT p1.name, p1.species, p2.name, p2.species
FROM pet AS p1, pet AS p2
WHERE p1.species=“dog” AND p2.species=“cat”; Select all male/female pairs of the same species…
+----------+--------+---------+------+------------+------------+| name | owner | species | sex | birth | death |+----------+--------+---------+------+------------+------------+| Whistler | Gwen | bird | | 0000-00-00 | NULL || Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 || Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 || Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 || Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 || Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 || Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 || Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 |+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec 60
A Little ExerciseSELECT p1.name, p1.species, p1.sex, p2.name, p2.species, p2.sexFROM pet AS p1, pet AS p2WHERE p1.species=p2.species AND p1.sex=“m” AND p2.sex=“f”;
Can we write p1.sex != p2.sex instead? …
+----------+--------+---------+------+------------+------------+| name | owner | species | sex | birth | death |+----------+--------+---------+------+------------+------------+| Whistler | Gwen | bird | | 0000-00-00 | NULL || Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 || Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 || Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 || Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 || Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 || Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 || Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 |+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec 61
A Little Exercise
We would get the pair Whistler and Chirpy as well
+----------+--------+---------+------+------------+------------+| name | owner | species | sex | birth | death |+----------+--------+---------+------+------------+------------+| Whistler | Gwen | bird | | 0000-00-00 | NULL || Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 || Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 || Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 || Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 || Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 || Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 || Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 |+----------+--------+---------+------+------------+------------+
By Michael Schroeder, Biotec 62
A Little Science
When working with SCOP through the web interface we are limited in what we can ask
What can we get out of SCOP when it is available as a relational table?
A reminder Classes: all alpha, all beta, alpha/beta, alpha+beta SCOP family: >30% sequence similarity SCOP superfamily: good structural similiary (possibly
<30%)
By Michael Schroeder, Biotec 63
A Little Science At low sequence identity, good structural
alignments possible
Picture from www.jenner.ac.uk/YBF/DanielleTalbot.ppt
30%
Family
Same Superfamily,
But not family
By Michael Schroeder, Biotec 64
A Little Science
Three tables: cla, PDB entry and reference to its class, fold,
superfamily, family, domain des, description of each node in the SCOP hierarchy subchain, chain and possibly beginning and end on
chain for a domain instance astral, sequence for a domain
By Michael Schroeder, Biotec 65
A Little Sciencemysql> SELECT * FROM cla LIMIT 1;+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+| sid | pdb_id | sccs | cl | cf | sf | fa | dm | sp | px |+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+| d1dlwa_ | 1dlw | a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 |+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
mysql> SELECT * FROM des LIMIT 1;+-------+------+------+------+--------------------+| id | type | sccs | sid | description |+-------+------+------+------+--------------------+| 46456 | cl | a | - | All alpha proteins |+-------+------+------+------+--------------------+
mysql> SELECT * FROM astral LIMIT 1;+---------+---------+-----------------------------------------------------------+| sid | sccs | seq |+---------+---------+-----------------------------------------------------------+| d1dlwa_ | a.1.1.1 | slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg...|+---------+---------+-----------------------------------------------------------+
mysql> SELECT * FROM subchain LIMIT 1;+----+-------+----------+-------+------+| id | px | chain_id | begin | end |+----+-------+----------+-------+------+| 1 | 14982 | A | | |+----+-------+----------+-------+------+
Entity relationship diagram for SCOP
Thanks to Boris VassilevT
By Michael Schroeder, Biotec 67
A Little Science
How many nodes are there in the hierarchy of type class, fold, superfamily, family?
des+-------+------+------+------+--------------------+| id | type | sccs | sid | description |+-------+------+------+------+--------------------+| 46456 | cl | a | - | All alpha proteins |+-------+------+------+------+--------------------+
By Michael Schroeder, Biotec 68
A Little Science
How many nodes are there in the hierarchy of type class, fold, superfamily, family?
Let us first find out how these types are called:SELECT DISTINCT type
FROM des; Now let’s list them with the numbers
des+-------+------+------+------+--------------------+| id | type | sccs | sid | description |+-------+------+------+------+--------------------+| 46456 | cl | a | - | All alpha proteins |+-------+------+------+------+--------------------+
By Michael Schroeder, Biotec 69
A Little Science How many nodes are there in the hierarchy of type
class, fold, superfamily, family? SELECT type, COUNT(*) AS num FROM des GROUP BY type ORDER BY num;
There are not that many morefamilies than superfamilies.
Which superfamily has the mostfamiliesdes+-------+------+------+------+--------------------+| id | type | sccs | sid | description |+-------+------+------+------+--------------------+| 46456 | cl | a | - | All alpha proteins |+-------+------+------+------+--------------------+
+------+-------+| type | num |+------+-------+| cl | 11 || cf | 854 || sf | 1305 || fa | 2156 || dm | 4567 || sp | 7111 || px | 44327 |+------+-------+
By Michael Schroeder, Biotec 70
A Little ScienceWhich superfamily has the most families?SELECT des.sccs, des.description, COUNT(DISTINCT cla.fa) AS numFROM des, cla WHERE des.id=cla.sf GROUP BY cla.sf ORDER BY num DESC;+---------+---------------------------------------------------------+-----+| sccs | description | num |+---------+---------------------------------------------------------+-----+| a.4.5 | "Winged helix" DNA-binding domain | 35 || c.69.1 | alpha/beta-Hydrolases | 23 || c.66.1 | S-adenosyl-L-methionine-dependent methyltransferases | 20 || c.52.1 | Restriction endonuclease-like | 19 || c.37.1 | P-loop containing nucleotide triphosphate hydrolases | 18 || b.18.1 | Galactose-binding domain-like | 15 || d.92.1 | Metalloproteases ("zincins"), catalytic domain | 14 || b.29.1 | Concanavalin A-like lectins/glucanases | 14 || f.2.1 | Membrane all-alpha | 13 || c.47.1 | Thioredoxin-like | 12 || c.68.1 | Nucleotide-diphospho-sugar transferases | 12 || c.2.1 | NAD(P)-binding Rossmann-fold domains | 11 || a.4.1 | Homeodomain-like | 10 || b.40.4 | Nucleic acid-binding proteins | 10 || a.118.1 | ARM repeat | 10 |
By Michael Schroeder, Biotec 71
A Little Science
Which families does the DNA binding-domain superfamily have?
The sccs of the superfamily is a.4.5. Its families have sccs a.4.5.1, a.4.5.2,…, so how can we list them?
mysql> SELECT * FROM cla LIMIT 1;+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+| sid | pdb_id | sccs | cl | cf | sf | fa | dm | sp | px |+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+| d1dlwa_ | 1dlw | a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 |+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
mysql> SELECT * FROM des LIMIT 1;+-------+------+------+------+--------------------+| id | type | sccs | sid | description |+-------+------+------+------+--------------------+| 46456 | cl | a | - | All alpha proteins |+-------+------+------+------+--------------------+
By Michael Schroeder, Biotec 72
A Little ScienceWhich families does the DNA binding-domain superfamily
have? SELECT DISTINCT sccs, descriptionFROM des WHERE sccs LIKE “a.4.5%” AND type=“fa” ORDER BY sccs;
| sccs | description+----------+------------------------------------------------| a.4.5.1 | Biotin repressor-like | a.4.5.10 | Replication initiation protein| a.4.5.11 | Helicase DNA-binding domain| a.4.5.12 | Restriction endonuclease FokI, N-terminal (recognition) domain| a.4.5.13 | Histone H1/H5| a.4.5.14 | Forkhead DNA-binding domain| a.4.5.15 | DNA-binding domain from rap30| a.4.5.16 | C-terminal domain of RPA32| a.4.5.17 | Cell cycle transcription factor e2f-dp| a.4.5.18 | The central core domain of TFIIE beta| a.4.5.19 | Z-DNA binding domain| a.4.5.2 | LexA repressor, N-terminal DNA-binding domain| a.4.5.20 | P4 origin-binding domain-like| a.4.5.21 | ets domain| a.4.5.22 | Heat-shock transcription factor| a.4.5.23 | Interferon regulatory factor…
By Michael Schroeder, Biotec 73
A Little Science
Which families does the DNA binding-domain superfamily have?...| a.4.5.24 | Iron-dependent represor protein| a.4.5.25 | Methionine aminopeptidase, insert domain| a.4.5.26 | mu transposase, DNA-binding domain| a.4.5.27 | TnsA endonuclease, C-terminal domain| a.4.5.28 | MarR-like transcriptional regulators| a.4.5.29 | Plant O-methyltransferase, N-terminal domain| a.4.5.3 | Arginine repressor (ArgR), N-terminal DNA-binding domain| a.4.5.30 | C-terminal domain of the rap74 subunit of TFIIF| a.4.5.31 | DEP domain| a.4.5.32 | Lrp/AsnC-like transcriptional regulator N-terminal domain| a.4.5.33 | Thanscriptional regulator IclR, N-terminal domain| a.4.5.34 | SCF ubiquitin ligase complex WHB domain| a.4.5.35 | C-terminal fragment of elongation factor SelB| a.4.5.4 | CAP C-terminal domain-like| a.4.5.5 | ArsR-like transcriptional regulators| a.4.5.6 | GntR-like transcriptional regulators| a.4.5.7 | Replication terminator protein (RTP)| a.4.5.8 | N-terminal domain of molybdate-dependent transcriptional regulator ModE | a.4.5.9 | Transcription factor MotA, activation domain
By Michael Schroeder, Biotec 74
A Little Science Which families does the DNA binding-domain superfamily
have? Let’s find example pdb’s
mysql> SELECT DISTINCT pdb_id FROM cla WHERE sccs="a.4.5.1";+--------+| pdb_id |+--------+| 1bia || 1hxd || 1bib || 1j5y |+--------+
mysql> SELECT DISTINCT pdb_id FROM cla WHERE sccs="a.4.5.2";+--------+| pdb_id |+--------+| 1jhf || 1jhh || 1lea || 1leb |+--------+
By Michael Schroeder, Biotec 75
A Little Science1bia1jhf
By Michael Schroeder, Biotec 76
A Little Science
1cgpa.4.5.4
By Michael Schroeder, Biotec 77
A Little Science Some more…
1smt, a.4.5.5
1hw1, a.4.5.6
1b9n, a.4.5.8
1bm9, a.4.5.7
1f4k, which family?
1f4k, a.4.5.7
By Michael Schroeder, Biotec 78
A Little Science
How many percent of superfamilies have only 1 family, how many 2,… ?
By Michael Schroeder, Biotec 79
A Little Science How many percent of superfamilies have only 1,2,3,… families?
First let’s deposit the result of the query that found the number of families for each superfamily in a tableCREATE TABLE fa_freq AS
SELECT des.sccs, des.description, COUNT(DISTINCT cla.fa) AS num
FROM des, cla WHERE des.id=cla.sf
GROUP BY cla.sf ORDER BY num desc;
Now we count how many superfamilies have 1,2,3,… familiesSELECT num AS fa_per_sf, COUNT(*) AS freq
FROM fa_freq GROUP BY num;
By Michael Schroeder, Biotec 80
A Little Science+-----------+------+| fa_per_sf | freq |+-----------+------+| 1 | 981 || 2 | 164 || 3 | 65 || 4 | 29 || 5 | 25 || 6 | 14 || 7 | 6 || 8 | 5 || 9 | 1 || 10 | 3 || 11 | 1 || 12 | 2 || 13 | 1 || 14 | 2 || 15 | 1 || 18 | 1 || 19 | 1 || 20 | 1 || 23 | 1 || 35 | 1 |+-----------+------+
SELECT COUNT(*) FROM fa_freq ;+----------+| count(*) |+----------+| 1305 |+----------+
+-----------+------+| fa_per_sf | perc |+-----------+------+| 1 | 0.75 || 2 | 0.13 || 3 | 0.05 || 4 | 0.02 || 5 | 0.02 || 6 | 0.01 || 7 | 0.00 || 8 | 0.00 || 9 | 0.00 || 10 | 0.00 || 11 | 0.00 || 12 | 0.00 || 13 | 0.00 || 14 | 0.00 || 15 | 0.00 || 18 | 0.00 || 19 | 0.00 || 20 | 0.00 || 23 | 0.00 || 35 | 0.00 |+-----------+------+
How many percent of superfamilies have only 1,2,3,… families?
SELECT num AS fa_per_sf, (COUNT(*)/1305) AS perc FROM fa_freq GROUP BY num;
By Michael Schroeder, Biotec 81
A Little Science
How many percent of superfamilies have only 1,2,3,… families?
This is interesting! For the majority of superfamilies there is only one family!
What is the PDB structure with the largest number of (distinct) superfamilies?
By Michael Schroeder, Biotec 82
A Little ScienceSELECT pdb_id, COUNT(sf) AS sf_numFROM cla GROUP BY pdb_id ORDER BY sf_num DESC LIMIT 10;+--------+--------+| pdb_id | sf_num |+--------+--------+| 1aon | 49 || 1hto | 48 || 1ir2 | 48 || 1htq | 48 || 1der | 42 || 1f49 | 40 || 1jyy | 40 || 1gho | 40 || 1jyz | 40 || 1jz1 | 40 |+--------+--------+
SELECT pdb_id, COUNT(DISTINCT sf) AS distinct_sf_numFROM cla GROUP BY pdb_id ORDER BY distinct_sf_num DESC LIMIT 10;+--------+-----------------+| pdb_id | distinct_sf_num |+--------+-----------------+| 1m1k | 23 || 1k9m | 23 || 1kd1 | 23 || 1kqs | 23 || 1jj2 | 23 || 1k8a | 23 || 1ffk | 22 || 1i96 | 21 || 1hnz | 20 || 1hr0 | 20 |+--------+-----------------+
What is the PDB structure with the largest number of (distinct) superfamilies?
By Michael Schroeder, Biotec 83
A structure with
23 different superfamilies
1k9m Co Crystal Structure Of Tylosin Bound To The 50S Ribosomal Subunit Of Haloarcula MarismortuiRibosome
By Michael Schroeder, Biotec 84
A Little Science Now let’s plot how many PDBs have 1, 2, 3,… distinct
superfamilies First of all let us put the result of the previous slide in a table
(note if the table already exists we have to erase it first: DROP TABLE pdb_sf_num. But be careful using DROP
CREATE TABLE pdb_sf_num AS
SELECT pdb_id,
COUNT(DISTINCT sf) AS distinct_sf_num
FROM cla
GROUP BY pdb_id
ORDER BY distinct_sf_num DESC; Now let us count how many PDBs with 1,2,3,… distinct
superfamilies…
By Michael Schroeder, Biotec 85
A Little ScienceHow many PDBs have 1, 2, 3,… distinct superfamiliesSELECT distinct_sf_num, COUNT(pdb_id) AS numFROM pdb_sf_num GROUP BY distinct_sf_num ORDER BY distinct_sf_num;+-----------------+-------+| distinct_sf_num | num |+-----------------+-------+| 1 | 13960 || 2 | 2721 || 3 | 495 || 4 | 178 || 5 | 33 || 6 | 25 || 7 | 1 || 9 | 4 || 20 | 9 || 21 | 1 || 22 | 1 || 23 | 6 |+-----------------+-------+
By Michael Schroeder, Biotec 86
A Little Science
Let’s do the same in percent
SELECT COUNT(DISTINCT pdb_id)
FROM cla;
+----------+
| count(*) |
+----------+
| 17434 |
+----------+
There are 17434 PDB IDs
By Michael Schroeder, Biotec 87
A Little ScienceHow many PDBs have 1, 2, 3,… distinct superfamiliesSELECT distinct_sf_num, COUNT(pdb_id)/17434 AS percFROM pdb_sf_num GROUP BY distinct_sf_num ORDER BY distinct_sf_num;+-----------------+------+| distinct_sf_num | perc |+-----------------+------+| 1 | 0.80 || 2 | 0.16 || 3 | 0.03 || 4 | 0.01 || 5 | 0.00 || 6 | 0.00 || 7 | 0.00 || 9 | 0.00 || 20 | 0.00 || 21 | 0.00 || 22 | 0.00 || 23 | 0.00 |+-----------------+------+
80% of PDB entries consist only of one type of superfamily!
By Michael Schroeder, Biotec 88
A Little Science
What are the most popular superfamilies?
I.e. for which are there the most PDB entries
By Michael Schroeder, Biotec 89
A Little ScienceWhat are the most popular superfamilies?SELECT des.sccs, des.description, COUNT(DISTINCT cla.pdb_id) AS num_of_pdb_idsFROM cla,des WHERE des.id=cla.sf GROUP BY cla.sf ORDER BY num_of_pdb_ids DESC LIMIT 10;+--------+------------------------------------------------------+----------------+| sccs | description | num_of_pdb_ids |+--------+------------------------------------------------------+----------------+| b.1.1 | Immunoglobulin | 823 || d.2.1 | Lysozyme-like | 777 || b.47.1 | Trypsin-like serine proteases | 649 || c.37.1 | P-loop containing nucleotide triphosphate hydrolases | 521 || c.2.1 | NAD(P)-binding Rossmann-fold domains | 384 || a.1.1 | Globin-like | 384 || c.1.8 | (Trans)glycosidases | 332 || b.50.1 | Acid proteases | 288 || b.29.1 | Concanavalin A-like lectins/glucanases | 230 || c.47.1 | Thioredoxin-like | 217 |+--------+------------------------------------------------------+----------------+
By Michael Schroeder, Biotec 90
A Little Science
Are all superfamilies equally likely to co-occur?
Let us generate a co-occurrence map as an answer Which superfamilies co-occur most frequently Which superfamilies have the most co-occurrence partners
The co-occurrence map should consist of two tables A table with PDB ID, superfamily 1, superfamily 2 (to avoid
repetition we will require that sf1 is alphabetically before sf2) A table with superfamily 1 and 2 and the number of PDBs
containing this co-occurrence
By Michael Schroeder, Biotec 91
A Little ScienceCo-occurrence map:SELECT DISTINCT c1.pdb_id, c1.sf, c2.sfFROM cla AS c1, cla AS c2 WHERE c1.pdb_id=c2.pdb_id AND c1.sf<c2.sf LIMIT 10;+--------+-------+-------+| pdb_id | sf | sf |+--------+-------+-------+| 1cqx | 46458 | 63380 || 1cqx | 46458 | 52343 || 1gvh | 46458 | 63380 || 1gvh | 46458 | 52343 || 1b33 | 46458 | 54580 || 1qgw | 46458 | 56568 || 1kf6 | 46548 | 46977 || 1kf6 | 46548 | 51905 || 1kf6 | 46548 | 54292 || 1kf6 | 46548 | 56425 |+--------+-------+-------+
We are still missing the sf names, which we can get from the des table
By Michael Schroeder, Biotec 92
A Little ScienceCo-occurrence Map:CREATE TABLE cooc AS SELECT DISTINCT c1.pdb_id, c1.sf AS sf1, d1.description AS sf1name, c2.sf AS sf2, d2.description AS sf2name FROM cla AS c1, cla AS c2, des AS d1, des AS d2 WHERE c1.pdb_id=c2.pdb_id AND c1.sf<c2.sf AND c1.sf=d1.id AND c2.sf=d2.id;
+--------+-------+--------------------------+-------+--------------------------------------------------------------+| pdb_id | sf1 | sf1name | sf2 | sf2name |+--------+-------+--------------------------+-------+--------------------------------------------------------------+| 1cqx | 46458 | Globin-like | 63380 | Riboflavin synthase domain-like || 1cqx | 46458 | Globin-like | 52343 | Ferredoxin reductase-like, C-terminal NADP-linked domain || 1gvh | 46458 | Globin-like | 63380 | Riboflavin synthase domain-like || 1gvh | 46458 | Globin-like | 52343 | Ferredoxin reductase-like, C-terminal NADP-linked domain || 1b33 | 46458 | Globin-like | 54580 | Allophycocyanin linker chain (domain) || 1qgw | 46458 | Globin-like | 56568 | Non-globular alpha+beta subunits of globular proteins || 1kf6 | 46548 | alpha-helical ferredoxin | 46977 | Succinate dehydrogenase/fumarate reductase C-terminal domain || 1kf6 | 46548 | alpha-helical ferredoxin | 51905 | FAD/NAD(P)-binding domain || 1kf6 | 46548 | alpha-helical ferredoxin | 54292 | 2Fe-2S ferredoxin-like || 1kf6 | 46548 | alpha-helical ferredoxin | 56425 | Succinate dehydrogenase/fumarate reductase catalytic domain |+--------+-------+--------------------------+-------+--------------------------------------------------------------+
Now let us count the distinct PDB IDs for each co-occurrence
By Michael Schroeder, Biotec 93
A Little ScienceNumber of instances in co-occurrence mapSELECT COUNT(DISTINCT pdb_id) AS num, sf1, sf1name, sf2, sf2nameFROM cooc GROUP BY sf1,sf2 ORDER BY num DESC LIMIT 10;+-----+-------+-----------------------------------------------+-------+------------------------------------------------------------------+| num | sf1 | sf1name | sf2 | sf2name |+-----+-------+-----------------------------------------------+-------+------------------------------------------------------------------+| 137 | 48726 | Immunoglobulin | 54452 | MHC antigen-recognition domain || 125 | 51011 | alpha-Amylases, C-terminal beta-sheet domain | 51445 | (Trans)glycosidases || 117 | 47616 | Glutathione S-transferases, C-terminal domain | 52833 | Thioredoxin-like || 99 | 47802 | DNA polymerase beta, N-terminal domain-like | 56699 | Nucleotidyltransferases || 97 | 53098 | Ribonuclease H-like | 56672 | DNA/RNApolymerases || 74 | 51735 | NAD(P)-binding Rossmann-fold domains | 55347 | Glyceraldehyde-3-phosphate dehydrogenase-like, C-terminal domain || 64 | 48726 | Immunoglobulin | 51445 | (Trans)glycosidases || 63 | 51905 | FAD/NAD(P)-binding domain | 54373 | FAD-linked reductases, C-terminal domain || 58 | 50203 | Bacterial enterotoxins | 54334 | Superantigen toxins, C-terminal domain || 55 | 48726 | Immunoglobulin | 51011 | alpha-Amylases, C-terminal beta-sheet domain |+-----+-------+-----------------------------------------------+-------+------------------------------------------------------------------+
Is it valid to draw any conclusions from the above table of superfamilyco-occurrences with their frequencies? We should be careful, as the number of co-occurrences may be biased by the abundance of each superfamily, e.g. immunoglobulin is very frequent in the PDB
By Michael Schroeder, Biotec 94
A Little Science
This is quite similar a problem to the generation of substitution matrices (last term)
A little reminder…
By Michael Schroeder, Biotec 95
BLOSUM
BLOcks SUbstitution Matrix (based on BLOCKS database) Generation of BLOSUM x
Group highly similar sequences and replace them by a representative sequences.
Only consider sequences with no more than x % similarity Align sequences (no gaps)
For any pair of amino acids a,b and for all columns c of the alignment, let q(a,b) be the number of co-occurrences of a,b in all columns c.
Let p(a) be the overall probability of a occurring
BLOSUM entry for a,b is log2 ( q(a,b) / ( p(a)*p(b) ) )
A Little Science
By Michael Schroeder, Biotec 96
A Little Science
“Normalised” number of instances in co-occurrence map
Ok, so to avoid bias we will compute the following Logarithm of probability of sf1 and sf2 co-occurring /
(probability of sf1 * probability of sf2) To compute
the co-occurrence probabilities we will count the frequency of sf1 and sf2 co-occurring and divide this by the overall number of co-occurrences and
the probability of a superfamily we will divide the frequency of the superfamily by the overall number of superfamilies
To compute this let us put “freq of sf1 and sf2 co-occurring” into a table cooc_pdb and “freq of sf” into a table sf_pdb
By Michael Schroeder, Biotec 97
A Little ScienceCount PDBs for each co-occurrence
CREATE TABLE cooc_pdb AS
SELECT COUNT(DISTINCT pdb_id) AS num,
sf1, sf1name, sf2, sf2name
FROM cooc
GROUP BY sf1,sf2
ORDER BY num DESC;
Count PDBs for each superfamily
CREATE TABLE sf_pdb AS
SELECT des.id AS sf, des.description,
COUNT(DISTINCT cla.pdb_id) AS num_of_pdb_ids
FROM cla,des
WHERE des.id=cla.sf
GROUP BY cla.sf
ORDER BY num_of_pdb_ids DESC;
By Michael Schroeder, Biotec 98
A Little Science
Count overall number of PDBs for each co-occurrenceSELECT SUM(num) AS totalCoocFreqFROM cooc_pdb;+---------------+| totalCoocFreq |+---------------+| 9813 |+---------------+Count PDBs for each superfamilySELECT SUM(num_of_pdb_ids) AS total_num_of_pdb_idsFROM sf_pdb;+----------------------+| total_num_of_pdb_ids |+----------------------+| 22318 |+----------------------+
By Michael Schroeder, Biotec 99
A Little ScienceNormalised co-occurrence mapCREATE TABLE normalised_cooc AS SELECT LOG((cooc_pdb.num/9813)/ ((sf_pdb1.num_of_pdb_ids/22318)*(sf_pdb2.num_of_pdb_ids/22318))) AS
val, cooc_pdb.sf1, cooc_pdb.sf1name, cooc_pdb.sf2, cooc_pdb.sf2name FROM cooc_pdb, sf_pdb AS sf_pdb1, sf_pdb AS sf_pdb2 WHERE cooc_pdb.sf1=sf_pdb1.sf AND cooc_pdb.sf2=sf_pdb2.sf ORDER BY val desc;+-----------+----------------------------------------------+-------------------------------+| val | sf1name | sf2name |+-----------+----------------------------------------------+-------------------------------+| 10.834834 | Arp2/3 complex 21 kDa subunit ARPC3 | Arp2/3 complex 16 kDa subunit || 10.834834 | tRNA splicing endonuclease, C-terminal domain| tRNA splicing endonuclease Edn|| 10.834834 | Cell-division inhibitor MinC, C-terminal doma| Cell-division inhibitor MinC, || 10.834834 | Tricorn protease N-terminal domain | Tricorn protease N-terminal do|| 10.834834 | L-fucose isomerase, C-terminal domain | L-fucose isomerase, N-terminal|| 10.834834 | Catalytic domain of malonyl-CoA ACP transacyl| Probable ACP-binding domain of|| 10.834834 | Transcription factor IIA (TFIIA), N-terminal | Transcription factor IIA (TFII|| 10.834834 | Glutamyl tRNA-reductase dimerization domain | Glutamyl tRNA-reductase cataly|| 10.834834 | N-terminal domain of phosphatidylinositol tra| C-terminal domain of phosphati|| 10.834834 | Rotavirus NSP2 fragment, C-terminal domain | Rotavirus NSP2 fragment, N-ter|| 10.834834 | Lipovitellin-phosvitin complex, superhelical | Lipovitellin-phosvitin complex|| 10.834834 | Aminoimidazole ribonucleotide synthetase (Pur| Aminoimidazole ribonucleotide || 10.834834 | Colicin E3 translocation domain | Colicin E3 receptor domain || 10.834834 | Arp2/3 complex 21 kDa subunit ARPC3 | Arp2/3 complex subunits || 10.834834 | Arp2/3 complex 16 kDa subunit ARPC5 | Arp2/3 complex subunits || 10.834834 | Head domain of nucleotide exchange factor Grp| Coiled-coil domain of nucleoti|| 10.834834 | YhbC-like, C-terminal domain | YhbC-like, N-terminal domain || 10.141687 | TolB, C-terminal domain | TolB, N-terminal domain |+-----------+----------------------------------------------+-------------------------------+
By Michael Schroeder, Biotec 100
A Little Science
Which superfamily has the most co-occurrence partners?
By Michael Schroeder, Biotec 101
A Little ScienceWhich superfamily has the most co-occurrence partners?CREATE TABLE cooc_partner AS SELECT COUNT(DISTINCT c2.sf) AS distinctnum, COUNT(c2.sf) AS num, des.sccs, des.description FROM cla AS c1, cla AS c2, des WHERE c1.pdb_id=c2.pdb_id AND des.id=c1.sf GROUP BY c1.sf;
SELECT *FROM cooc_partner ORDER BY distinctnum DESC LIMIT 10;
| distinctnum | num | sccs | description |+-------------+-------+--------+------------------+| 74 | 19177 | b.1.1 | Immunoglobulin| 68 | 5228 | c.37.1 | P-loop| 67 | 1602 | b.40.4 | Nucleic acid-binding proteins| 42 | 435 | c.55.4 | Translational machinery components| 42 | 580 | g.39.1 | Glucocorticoid receptor-like| 31 | 561 | b.43.3 | Translation proteins | 28 | 1575 | b.47.1 | Trypsin-like serine proteases | 28 | 725 | d.14.1 | Ribosomal protein S5 domain 2-like | 25 | 982 | a.4.5 | "Winged helix" DNA-binding domain | 25 | 283 | d.52.3 | Prokaryotic type KH domain (pKH-…+-------------+-------+--------+------------------+
By Michael Schroeder, Biotec 102
A Little ScienceWhich superfamily has the most co-occurrence partners?SELECT *FROM cooc_partner ORDER BY num DESC LIMIT 10;+-------------+-------+---------+--------------+| distinctnum | num | sccs | description+-------------+-------+---------+--------------+| 74 | 19177 | b.1.1 | Immunoglobulin| 4 | 7532 | b.1.4 | beta-Galactosidase/glucuronidase| 17 | 6355 | c.2.1 | NAD(P)-binding Rossmann-fold domains| 24 | 6233 | f.2.1 | Membrane all-alpha| 7 | 6007 | d.153.1 | N-terminal nucleophile aminohydrolases | 1 | 5687 | i.1.1 | Ribosome and ribosomal fragments| 68 | 5228 | c.37.1 | P-loop| 18 | 5154 | c.1.8 | (Trans)glycosidases| 5 | 3915 | a.1.1 | Globin-like| 11 | 3880 | b.18.1 | Galactose-binding domain-like+-------------+-------+---------+---------------+
By Michael Schroeder, Biotec 103
Scale-free Networks
Small-world property “everybody on earth is related with a degree of six
intermediaries” Reason: Network structure
A few highly connected nodes Many nodes with few connections Consequence: very short average distance between any
two nodes Formally: Number of interaction partners follows power-
law, i.e. distribution of number of interaction partners is an exponential function
Protein interactions are small-world networks!
By Michael Schroeder, Biotec 104
Scale-free Networks
Is our co-occurrence map a scale-free network?
By Michael Schroeder, Biotec 105
Scale-free Networks
Is our co-occurrence map a scale-free network?
Let’s get the distribution of the number of distinct co-occurrence partners SELECT distinctnum, COUNT(*)
FROM cooc_partner
GROUP BY distinctnum
ORDER BY distinctnum; Let’s plot it
+-------------+----------+| distinctnum | count(*) |+-------------+----------+| 1 | 586 || 2 | 291 || 3 | 152 || 4 | 72 || 5 | 47 || 6 | 27 || 7 | 26 || 8 | 9 || 9 | 16 || 10 | 8 || 11 | 9 || 12 | 3 || 13 | 3 || 14 | 1 || 15 | 1 || 16 | 2 || 17 | 1 || 18 | 3 || 20 | 1 || 21 | 16 || 23 | 18 || 24 | 1 || 25 | 4 || 28 | 2 || 31 | 1 || 42 | 2 || 67 | 1 || 68 | 1 || 74 | 1 |+-------------+----------+
By Michael Schroeder, Biotec 106
Scale-free Networks
Distribution of number of co-occurrence partners
(x-axis = number of partners, y-axis=frequency
By Michael Schroeder, Biotec 107
Scale-free Networks
But maybe the number of co-occurrence partners is simply correlated to the number of PDB entries we have for that superfamily, in which case the power-law relationship would be an artefact
Let’s test this by extending the previous table with a column for the average number of PDBs given as the num column in cooc_partner
By Michael Schroeder, Biotec 108
Scale-free Networks
Distinct number of co-occurrence partners and average number of co-occurrences
SELECT distinctnum,
COUNT(*),AVG(num)
FROM cooc_partner
GROUP BY distinctnum
ORDER BY distinctnum;
+-------------+----------+------------+| distinctnum | count(*) | avg(num) |+-------------+----------+------------+| 1 | 586 | 66.8328 || 2 | 291 | 124.0481 || 3 | 152 | 251.1316 || 4 | 72 | 324.1389 || 5 | 47 | 344.2340 || 6 | 27 | 289.7778 || 7 | 26 | 883.1923 || 8 | 9 | 663.8889 || 9 | 16 | 375.0625 || 10 | 8 | 772.3750 || 11 | 9 | 1454.3333 || 12 | 3 | 726.6667 || 13 | 3 | 642.3333 || 14 | 1 | 489.0000 || 15 | 1 | 2530.0000 || 16 | 2 | 645.0000 || 17 | 1 | 6355.0000 || 18 | 3 | 2254.3333 || 20 | 1 | 979.0000 || 21 | 16 | 254.1875 || 23 | 18 | 290.2778 || 24 | 1 | 6233.0000 || 25 | 4 | 433.0000 || 28 | 2 | 1150.0000 || 31 | 1 | 561.0000 || 42 | 2 | 507.5000 || 67 | 1 | 1602.0000 || 68 | 1 | 5228.0000 || 74 | 1 | 19177.0000 |+-------------+----------+------------+
By Michael Schroeder, Biotec 109
Scale-free Networks No good correlation between frequency of number of distinct
co-occurrence partners and number of underlying instances. Hence: scale-free property appears to be no artefact
By Michael Schroeder, Biotec 110
Limits
List all superfamily pairs which can interact directly or indirectly (“Transitive Closure”).
This query cannot be expressed in SQL SQL does not have the same “power” as a
programming language like Python
By Michael Schroeder, Biotec 111
Introduction to MySQL
By Michael Schroeder, Biotec 112
SQL: “Structured Query Language”—the most common standardized language used to access databases.
SQL has several parts: DDL – Data Definition Language
{Defining, Deleting, Modifying relation schemas}
DML – Data Manipulation Language{Inserting, Deleting, Modifying tuples in database}
Embedded SQL – defines how SQL statements can be used with general-purposed programming
By Michael Schroeder, Biotec 113
MySQL, the most popular Open Source SQL database, is developed, distributed and supported by MySQL AB.
MySQL is a relational database management system. MySQL software is Open Source.
• Written in C and C++. Tested with a broad range of different compilers. Works on many different platforms. APIs for C, C++, Eiffel, Java, Perl, PHP, Python, Ruby,
and Tcl.• You can find MySQl manual and documentation at:
www .mysql.com/documentation/• You can download and install MySQL on your own computer
(both under Windows and Linux)
By Michael Schroeder, Biotec 114
MySQL To see a list of options provided by mysql, invoke it with the --help option: shell> mysql --help
Using SQL:
On any linux you have to use this to log on to MySQL:
shell> /usr/local/mysql/bin/mysql -h hostname -D loginname -p
shell> mysql -h host -u user -p Enter password: ********
The ******** represents your password; enter it when mysql displays the Enter password: prompt.
By Michael Schroeder, Biotec 115
Basic Query:select A1, A2,…,An
from r1, r2, …,rm
where P;
A1, A2,…,An represent attributesr1, r2, …rm represent relationsP represents predicate (guard condition)
Keywords may be entered in any letter case:mysql> SELECT VERSION(), CURRENT_DATE; mysql> select version(), current_date; mysql> SeLeCt vErSiOn(), current_DATE;
By Michael Schroeder, Biotec 116
Prompt Meaning mysql> Ready for new command. -> Waiting for next line of multiple-line command. ‘> Waiting for next line, collecting a string that
begins with a single quote (` ’ ’). “> Waiting for next line, collecting a string that
begins with a double quote (` ” ’).
mysql> SELECT * -> FROM my_table -> WHERE name = “Smith” AND age < 30;
mysql> SELECT * FROM my_table WHERE name = "Smith AND age < 30;
"> "\c mysql>
\c to cancel the execution of a command
By Michael Schroeder, Biotec 117
Basic Database Operation Create a database Create a table Load data into the table Retrieve data from the table in various ways Use multiple tables
Suppose you have several pets in your home (your menagerie) and you'd like to keep track of various types of information about them. You can do so by creating tables to hold your data and loading them with the desired information. Then you can answer different sorts of questions about your animals by retrieving data from the tables.
By Michael Schroeder, Biotec 118
Creating and Using a Database
mysql> SHOW DATABASES;
SHOW statement can be used to find out the databases currently existing on the server
mysql> USE testdb
testdb is a database name. USE command does not need a semi colon and must be given in a single line.Database needs to be invoked in order to use it.
mysql> CREATE DATABASE example;
Database names are case-sensitive unlike keywords; Same applies for table namesSo example != Example != EXAMPLE or some other variant
By Michael Schroeder, Biotec 119
Creating a Table
mysql> SHOW TABLES; Displays the current list of tables
mysql> CREATE TABLE pet (name VARCHAR(20), owner VARCHAR(20), -> species VARCHAR(20), sex CHAR(1), birth DATE, death
DATE);
mysql> SHOW TABLES;Will display the table with the table name pet
Verification of the table can be done with DESCRIBE commandmysql> DESCRIBE pet; +---------+-------------+---------------+---------+---------+---------+| Field | Type | Null | Key | Default | Extra | +---------+-------------+---------------+---------+---------+---------+| name | varchar(20) | YES | | NULL | | | owner | varchar(20) | YES | | NULL | | | species | varchar(20) | YES | | NULL | || sex | char(1) | YES | | NULL | | | birth | date | YES | | NULL | | | death | date | YES | | NULL | |
By Michael Schroeder, Biotec 120
Loading Data into a Table
LOAD DATA uses a text file with single record in a line that match the attributes in the table.Useful for inserting when multiple records are involved.
Example: pet.txt is a text file with a single record
Name owner species sex birth death Whistler Gwen bird \N 1997-12-09 \N
mysql> LOAD DATA LOCAL INFILE "pet.txt" INTO TABLE pet;
INSERT command can be used when records needs to be inserted one at a time. NULL can be directly inserted in the field column
Example:mysql> INSERT INTO pet -> VALUES ('Puffball','Diane','hamster','f','1999-03-30',NULL);
By Michael Schroeder, Biotec 121
Retrieving Information from a Table
The SELECT statement is used to pull information from a table. The general form of the statement is:
SELECT what_to_select FROM which_table WHERE conditions_to_satisfy
The simplest form of SELECT retrieves everything from a table: mysql> SELECT * FROM pet;
You can select only particular rows from your table. mysql> SELECT * FROM pet WHERE name = "Bowser";
You can specify conditions on any column, not just name. For example, if you want to know which animals were born after 1998, test the birth column:
mysql> SELECT * FROM pet WHERE birth >= "1998-1-1";
You can combine conditions, for example, to locate female dogs: mysql> SELECT * FROM pet WHERE species = "dog" AND sex = "f";
By Michael Schroeder, Biotec 122
Selecting Particular ColumnsIf you don't want to see entire rows from your table, just name the columns in which you're interested, separated by commas.
For example, if you want to know when your animals were born, selectthe name and birth columns: mysql> SELECT name, birth FROM pet;
To find out who owns pets, use this query: mysql> SELECT owner FROM pet; mysql> SELECT DISTINCT owner FROM pet;
You can use a WHERE clause to combine row selection with column selection. For example, to get birth dates for dogs and cats only, use this query:
mysql> SELECT name, species, birth FROM pet -> WHERE species = "dog" OR species = "cat";
By Michael Schroeder, Biotec 123
Sorting Rows
To sort a result, use an ORDER BY clause.
Here are animal birthdays, sorted by date: mysql> SELECT name, birth FROM pet ORDER BY birth;
To sort in reverse order, add the DESC (descending) keyword to the name of the column you are sorting by:
mysql> SELECT name, birth FROM pet ORDER BY birth DESC;
You can sort on multiple columns. For example, to sort by type of animal, then by birth date within animal type with youngest animals first, use the following query:
mysql> SELECT name, species, birth FROM pet ORDER BY species, birth DESC;
By Michael Schroeder, Biotec 124
Pattern Matching
MySQL provides standard SQL pattern matching as well as a form of pattern matching based on extended regular expressions similar to those used by Unix utilities such as grep.
SQL pattern matching allows you to use `_' to match any single character and `%' to match an arbitrary number of characters (including zero characters). In MySQL, SQL patterns are case-insensitive by default. Some examples are shown here. Note that you do not use = or <> when you use SQL patterns; use the LIKE or NOT LIKE comparison operators instead.
To find names beginning with `b':
mysql> SELECT * FROM pet WHERE name LIKE "b%";
To find names containing exactly five characters, use the `_' pattern character:
mysql> SELECT * FROM pet WHERE name LIKE “_____”;
By Michael Schroeder, Biotec 125
Counting Rows
For example, you might want to know how many pets each owner has, Counting the total number of animals you have is the same question as “How many rows
are in the pet table?”
The COUNT() function counts the number of non-NULL results, so the query to count your animals looks like this:
mysql> SELECT COUNT(*) FROM pet;
You can use COUNT() if you want to find out how many pets each owner has:
mysql> SELECT owner, COUNT(*) FROM pet GROUP BY owner; +---------+---------------+ | owner | COUNT(*) | +---------+---------------+ | Benny | 2 || Diane | 2 | | Gwen | 3 | | Harold | 2 | +---------+---------------+
By Michael Schroeder, Biotec 126
Examples of some common queries
CREATE TABLE shop ( article INT(4) UNSIGNED ZEROFILL DEFAULT ‘0000' NOT NULL, dealer CHAR(20) DEFAULT ‘’ NOT NULL, price DOUBLE(16,2) DEFAULT '0.00’ NOT NULL, PRIMARY KEY(article, dealer));
INSERT INTO shop VALUES (1,'A',3.45),(1,'B',3.99),(2,'A',10.99),(3,'B',1.45),(3,'C',1.69), (3,'D',1.25),(4,'D',19.95);
mysql> SELECT * FROM shop; +--------------+---------+--------+ | article | dealer | price | +--------------+---------+--------+ | 0001 | A | 3.45 | | 0001 | B | 3.99 | | 0002 | A | 10.99 | | 0003 | B | 1.45 | | 0003 | C | 1.69 | | 0003 | D | 1.25 | | 0004 | D | 19.95 | +--------------+---------+--------+
The maximum value for a columnThe row holding the maximum of a certain columnMaximum of column per groupThe rows holding the group-wise maximum of a certain field
By Michael Schroeder, Biotec 127
“What's the highest price?”SELECT MAX(price) AS price FROM shop; +---------+ | price | +---------+ | 19.95 | +---------+
“Find number, dealer, and price of the most expensive article.”
In ANSI SQL (and MySQL Version 4.1) this is easily done with a subquery:
SELECT article, dealer, price FROM shop WHERE price = (SELECT MAX(price) FROM shop)
In MySQL versions prior to 4.1, you have to do it in two steps: 1. Get the maximum price value from the table with a SELECT statement. 2. Using this value compile the actual query: SELECT article, dealer, price FROM shop WHERE price=19.95
Or, Using User Variables ( @variable-name, @temp := 5 )Select @max_price := max(price) from shop;Select article, dealer, price from shop where price = @max_price;
By Michael Schroeder, Biotec 128
Maximum of Column per Group
“What's the highest price per article?”
SELECT article, MAX(price) AS price FROM shop GROUP BY article;
+---------+---------+ | article | price | +---------+---------+ | 0001 | 3.99 | | 0002 | 10.99 | | 0003 | 1.69 | | 0004 | 19.95 | +---------+---------+
By Michael Schroeder, Biotec 129
The Rows Holding the Group-wise Maximum of a Certain Field
“For each article, find the dealer(s) with the most expensive price.”
In ANSI SQL (MySQL Version 4.1 or greater), do it with a subquery
SELECT article, dealer, price
FROM shop s1
WHERE price=(SELECT MAX(s2.price)
FROM shop s2
WHERE s1.article = s2.article);
By Michael Schroeder, Biotec 130
But, In MySQL versions prior to 4.1, it has to be done in several steps, with a temporary table (It doesn’t support nested-query \subquery).
CREATE TEMPORARY TABLE tmp
( article INT(4) UNSIGNED ZEROFILL DEFAULT '0000' NOT NULL,
price DOUBLE(16,2) DEFAULT '0.00' NOT NULL);
LOCK TABLES shop read;
INSERT INTO tmp
SELECT article, MAX(price) FROM shop GROUP BY article;
SELECT shop.article, dealer, shop.price
FROM shop, tmp
WHERE shop.article=tmp.article AND shop.price=tmp.price;
UNLOCK TABLES; DROP TABLE tmp;