+ All Categories
Home > Documents > Databases A/Prof Alan Fekete University of Sydney.

Databases A/Prof Alan Fekete University of Sydney.

Date post: 04-Jan-2016
Category:
Upload: cecily-webster
View: 216 times
Download: 0 times
Share this document with a friend
34
Databases A/Prof Alan Fekete University of Sydney
Transcript
Page 1: Databases A/Prof Alan Fekete University of Sydney.

Databases

A/Prof Alan Fekete

University of Sydney

Page 2: Databases A/Prof Alan Fekete University of Sydney.

Overview

• Data, databases, and database management systems

• Querying a database with SQL

• Creating a database with a RDBMS

• Tradeoffs

Page 3: Databases A/Prof Alan Fekete University of Sydney.

Data

• Facts about the world that are worth knowing– Eg NGC 450 is at 18.876,-0.861– Eg The AAT has a 3.9 m reflector– The redshift of NGC 450 was measured with the

AAT’s 2dF spectrograph

• Data is very valuable– Data can be analysed to produce theories, or to

test them– It takes a lot of effort to gather data

Page 4: Databases A/Prof Alan Fekete University of Sydney.

Persistence

• Values of variables in a running program are stored in the memory (RAM) – They are lost when the program ends, or the

machine crashes

• Computers have persistent storage (hard disk, flash memory etc) whose values persist

• The operating system organizes the persistent storage into files– So store valuable data in files

Page 5: Databases A/Prof Alan Fekete University of Sydney.

Sharing

• Once someone has collected data and got it into a computer, they can use it for many purposes, and many others can use it

• The naïve approach: copy the file around– Or email, or set up an ftp site, etc

• This is very bad, if the data might change– More data (more galaxies, more facts about each

galaxy!)– Corrections to existing data

• Copies can become inconsistent; so share one store instead

Page 6: Databases A/Prof Alan Fekete University of Sydney.

A Database

• A store of data relevant to some domain, stored in a computer, and shared (between application programs or between users)

• Typically, the data is structured– Many facts in each of a few types– Eg many galaxies, each has name, ra, dec,

redshift etc

Page 7: Databases A/Prof Alan Fekete University of Sydney.

Clients and servers

• The database server is the machine where the database is stored

• Each user must run some presentation client code on their machine, to display the data– Usually this is just a web browser

• There may be other machines in-between– Eg web server, application server– Note that a machine may be a server which receives some

requests, and also a client of others as it tries to respond!• We focus on the database server, and the “database client”

where the application code is running that understands the data

Page 8: Databases A/Prof Alan Fekete University of Sydney.

Two approaches

• Purpose-written application code, accessing data stored in files, using knowledge of the structure of the files

• Use a database management system (DBMS) which manages the data and provides higher-level facilities to access it– You may write

applications that call the DBMS, or even use it directly, interactively

This is common in science

This is standard in industry,and is becoming more frequent in science

Page 9: Databases A/Prof Alan Fekete University of Sydney.

DBMS Features

• A query language– Describe the data to retrieve– Can also describe updates etc

• A data definition language– Describe the logical structure of the data

• Storage management– Control the layout in files, and with facilities to

improve performance

• Access control

Page 10: Databases A/Prof Alan Fekete University of Sydney.

Different Data Models

• Each database has a model or schema which tells what sort of data is stored– Eg each galaxy has id, name, ra, dec, etc

• Using a DBMS means that the data model must fit the style supported by the DBMS

• The most common DBMSs all support the relational style of data model

Page 11: Databases A/Prof Alan Fekete University of Sydney.

Relational Data Models

• Database consists of tables, each with a name

• Each table has many rows, each with exactly the same structure– several attributes, each with a name and type

(integer, float, string of length 20, etc)– Each row has a value for each attribute

• Rows are distinguished by the values of a primary key (one or more attributes)

Page 12: Databases A/Prof Alan Fekete University of Sydney.

Example

• Schema: – galaxy(gal_id INTEGER,

gal_ra FLOAT, gal_decl FLOAT, gtype STRING)

– observation(obs_id INTEGER, obs_ra FLOAT, obs_decl FLOAT, obs_flux FLOAT, gal_id INTEGER, source STRING)

• An Instance

1 13.1 7.2 S

2 14.7 5.3 C

3 11.2 4.8 S

4 15.3 4.6 S

1 14.6 5.4 13.4 2 A

2 11.2 4.9 9.6 3 A

3 14.5 4.3 13.3 2 B

4 11.1 4.8 9.7 3 C

Note use of “foreign key” to connect information in different tables

Page 13: Databases A/Prof Alan Fekete University of Sydney.

Views

• You can define a view on a database

• Just like a table, but the data isn’t inserted or stored explicitly– It is computed when needed from the other

tables

• A view can be used in queries just like a table

Page 14: Databases A/Prof Alan Fekete University of Sydney.

Queries

• User can write a query, and run it– Answer is a collection of relevant data

• Query may be typed interactively, or written into a program for repeated use – Perhaps with parameters that vary

• Answer may be viewed on the screen, imported into a program, printed, or stored back in the database

Page 15: Databases A/Prof Alan Fekete University of Sydney.

SQL

• A standardized language in which queries can be written against databases in relational DBMSs– But each platform has slight variations

• SQL is declarative– Query describes what data is needed– RDBMS has the job of working out how to trawl

through the database to compute the answer

Page 16: Databases A/Prof Alan Fekete University of Sydney.

Select-from-where

SELECT gal_id, gtype

FROM galaxyWHERE gal_ra

between 13.0 and 16.0

AND gal_decl between 4.0 and 6.0;

1 13.1 7.2 S

2 14.7 5.3 C

3 11.2 4.8 S

4 15.3 4.6 S

2 C

4 S

Answer is:

Page 17: Databases A/Prof Alan Fekete University of Sydney.

Remember

• FROM tells which table to use

• WHERE tells which rows you are interested in– A boolean condition that is evaluated for

each row

• SELECT says which attributes (columns) you want in the answer

Note query is stated SELECT-FROM-WHERE

Page 18: Databases A/Prof Alan Fekete University of Sydney.

Limiting the answers

• Real scientific databases have tables with many many rows

• A query which returns too many is expensive to run– And it’s hard to understand the answer

• Especially when exploring the data, get a small sample first– Ask for the top few rows, perhaps with respect to some

order on the selected rows

• Syntax details are not standard– MySQL has “LIMIT n” clause at end of query

Page 19: Databases A/Prof Alan Fekete University of Sydney.

Joins• A query can

combine data from several tables– use the foreign keys

to pick values from the related row

SELECT observation.obs_id, galaxy.gal_ra, galaxy.gal_decl, galaxy.gtypeFROM galaxy JOIN observation ON observation.gal_id =

galaxy.gal_idWHERE observation.source=‘A’;

1 14.6 5.4 13.4 2 A 14.7 5.3 C

2 11.2 4.9 9.6 3 A 11.2 4.8 S

3 14.5 4.3 13.3 2 B 14.7 5.3 C

4 11.1 4.8 9.7 3 C 11.2 4.8 S

From observation From galaxy

Page 20: Databases A/Prof Alan Fekete University of Sydney.

Unstructured joins

• Sometimes you want to take all combinations of rows from tables, not just the rows related through foreign key– Then filter by a condition

SELECT galaxy.gal_id, observation.obs_idFROM galaxy, observationWHERE ABS(galaxy.gal_ra - observation.obs_ra)<0.2AND ABS(galaxy.gal_decl -

observation.obs_decl)<0.3;

Page 21: Databases A/Prof Alan Fekete University of Sydney.

Aggregation

• Combine the values from a column, among all the rows that satisfy WHERE clause– MAX, MIN, SUM, AVG,

COUNT– COUNT(DISTINCT

colname) is usefully different if values are duplicated in the column

SELECT MAX(obs_flux)

FROM observationWHERE gal_id = 3;

Find the largest flux in any observation of galaxy 3

Page 22: Databases A/Prof Alan Fekete University of Sydney.

Grouping

• Divide the rows into groups, based on the values in some column(s)

• Aggregate the values in some other column(s), for each group

• Display the grouping column along with the aggregate

SELECT source, COUNT(DISTINCT obs_id)

FROM observationWHERE obs_flux > 10.0GROUP BY source;

For each source, report how many observations that have high flux were made by that source.

Page 23: Databases A/Prof Alan Fekete University of Sydney.

Subqueries

• SQL is block-structured• A subquery can be used

in the FROM clause, instead of an existing table

• A subquery can be used to produce a value used in the WHERE clause– eg in a comparison

SELECT obs_idFROM observationWHERE obs_flux = (SELECT

MAX(obs_flux) FROM

observation);

Find the observation with the highest flux

Page 24: Databases A/Prof Alan Fekete University of Sydney.

Creating a database

• Decide on the platform to use

• Decide on the schema that will represent your data

• Create a database and tables

• Load the data

• Allow users to access it

Page 25: Databases A/Prof Alan Fekete University of Sydney.

Relational DBMS platforms

• Enterprise solutions (Oracle, IBM DB2, SQL Server) are very expensive– And require professional administration too

• Vendors also offer light-weight cheap/free variants

• Free open-source platforms: PostreSQL, MySQL, Cloudscape

Page 26: Databases A/Prof Alan Fekete University of Sydney.

Table Design

• One table for each type of object in the domain– Columns for each attribute of importance– Make sure there is a primary key

• Invent a new identifier for this purpose if necessary

• Use foreign keys to represent relationships between objects– Or have a separate “association table” where one

object can be related to many, in each direction

Page 27: Databases A/Prof Alan Fekete University of Sydney.

Normalisation

• It is important that the schema doesn’t allow the same facts to be stored repeatedly in different rows, or different tables– Eg observation(obs_id, obs_ra,…,gal_id, gal_ra,

…, gtype, …) would have same gal_ra stored in several rows of observations of this galaxy

– This would risk inconsistency when updates occur

• There is theory about this, but sensibly defining one table per object will avoid the problems

Page 28: Databases A/Prof Alan Fekete University of Sydney.

Declaring tables

• Syntax isn’t quite standard– Details of types etc

• MySQL variant isCREATE TABLE galaxy ( gal_id INT UNSIGNED, gal_ra FLOAT, gal_decl FLOAT, gtype VARCHAR(3), PRIMARY KEY (gal_id)) ;

Page 29: Databases A/Prof Alan Fekete University of Sydney.

Indices

• The performance of queries can vary greatly depending on how the data is stored in the DBMS

• Having an index on a column usually speeds up queries that select rows with given values (or ranges of values) in that column– SQL has CREATE INDEX statement

Page 30: Databases A/Prof Alan Fekete University of Sydney.

Permissions

• Each DBMS can control which users are allowed to run different sorts of statements on each table separately

• Typically, for scientific data one allows SELECT access to all users (or all registered users)– INSERT access to a few trusted users– DELETE, UPDATE only for the administrators

• Usually have a separate account, so even the admin people can’t do this accidentally while acting as scientists

Page 31: Databases A/Prof Alan Fekete University of Sydney.

Transactions

• Nasty problems can occur if some users are modifying data while others are looking at it– Or worse, if several modify concurrently

• This is serious for enterprise applications, but not common in scientific situations, as updates are rare

• A transaction is a whole collection of SQL statements that should be done as a single block– Not interrupted by others, and completely undone

unless the whole collection succeeds

Page 32: Databases A/Prof Alan Fekete University of Sydney.

Loading data

• SQL has INSERT statement which adds rows to a table

INSERT INTO galaxy (gal_id, gal_ra, gal_decl, gtype)VALUES (1, 13.1, 7.2, 'S'), (2, 14.7, 5.3, 'C'), (3, 11.2, 4.8, 'S'), (4, 15.3, 4.6, 'S');

Page 33: Databases A/Prof Alan Fekete University of Sydney.

Bulk load

• If data is already in a file in a well-known format (eg csv), then DBMS has non-standard commands to import it directly into a table

• MySQLLOAD DATA LOCAL INFILE “gal.csv” INTO TABLE galaxy

Page 34: Databases A/Prof Alan Fekete University of Sydney.

Trade-offs

Compared to writing applications against data stored directly in files, using DBMS to manage the data:

• Allows users to perform unpredicted (ad-hoc) queries – without being programmers, and without knowing the structure in

files

• Gives better support as data schema evolves– Existing queries can continue to run

• Has more predictable performance– Easier to avoid very slow execution– But you may not be able to get very fast execution

• Has better security– But considerable overhead in administration


Recommended