Introduction to SQL and ADQL Tom McGlynn NASA/GSFC (with thanks to Maria Nieto- Santisteban and...

Post on 27-Mar-2015

227 views 4 download

Tags:

transcript

Introduction to SQL and ADQL

Tom McGlynnNASA/GSFC

(with thanks to Maria Nieto-Santisteban and Gretchen Greene)

What are SQL and ADQL?

• SQL (‘sequel’ sometimes) is Structured Query Language a standard for creating and getting information from relational databases.

• ADQL (Astronomy Data Query Language) is an adaptation of SQL to meet specific needs for astronomical queries, especially positional queries.

• These enable astronomers to make sophisticated queries of astronomical databases.

Lots of places to find info

• NVO Book chapters on SQL• Web site with lots of links about SQL:

– http://www.thefreecountry.com/documentation/onlinesql.shtml• On-line tutorials

• http://nvo-twiki.stsci.edu/twiki/pub/Main/NVOSS3CourseNotes/SQL2006.html– http://www.w3schools.com/sql/default.asp– http://www.sql-tutorial.net/– http://www.firstsql.com/tutor.htm– GIYF

• ADQL standard – http://www.ivoa.net/Documents/latest/ADQL.html

• SDSS Online database– http://casjobs.sdss.org/CasJobs/

• OpenSkyQuery– http://openskyquery.net/Sky/skysite/

Background: The kinds of databases

• Network databases: pointers– Internal data in programs

• Hierarchical database: structure– XML files

• Relational databases: common indices– Relational database management systems

Network database

Dora

Supervisor

Jamie

CMS

Joe

Salary:

75,000

Admin

Roles

Personnel

Team

The WorldWideWeb is the most successful of all network databases.

Hierarchical Database

<Company> <Team name=CMS> <TeamLeader> <Person name=Dora> <Salary>75,000</Salary> </Person> </TeamLeader> <TeamMembers> <Person name=Joe><Salary>… <Person name=Jamie><Salary>… </TeamMembers> </Team> ….</Company>

With XML, hierarchical databases are making a comeback. (e.g., Carnivore registry)

Relational Database

Personnel

UID Name Salary

1 Dora 75,000

2 Joe 30,000

3 Jamie 66,000

Teams

TeamID Name LeadUID

1 CMS 1

2 Test 13

TeamMembers

TeamID UID

1 2

1 3

What about astronomy?

• Relational DBMS’s used by all major astronomical data providers:– ADS, MAST, IPAC, SDSS, …– Only RDBMS can scale to size of modern

astronomical tables (100’s of columns, 109 rows)

– Flexible interactions between tables

• Standard SQL provides limited support for positional queries

• Some RDBMS’s have support for objects

RDBMS servers

• MySQL: free, widely used, fast• Postgres: free widely used, better standards

compliance, object support• Sybase (commercial)• SQLServer (Microsoft)• Gazillions more• MySQL installation at NOAO available for use

in summer school projects -- or just download a copy to run on your machine. (http://dev.mysql.com/downloads/)

Web availability

• CASJOBS– SQL based, SDSS database. Allows users to

generate their own tables.

• OpenSkyQuery– ADQL based, lots of missions but more

fragile.

Using CASJOBS

1. Connect to CASJOBS web site:1. http://casjobs.sdss.org/CasJobs/

2. Get account or login3. Build and query dat

This is a production service and does not always respond in ‘webtime’.

Basic SQL commands

CREATE TABLE tablename (col1 type1, col2 type2, …)

DROP TABLE tablename;

INSERT INTO TABLE tablename (col1,col2,…) VALUES(val1,val2,…)

DELETE FROM TABLE tablename WHERE condition

UPDATE tablename SET col1=val1,col2=val2,… WHERE condition

SELECT fields FROM tables WHERE conditions ORDER BY col1,col2

Create an account in CASJOBS

Select MYDB to get access your private database

Then use CREATE TABLE command

Table name

Column names

Column types

MySQL doesn’t like this name .Must be escaped as `dec`

Types

• Numeric types– int,bigint,smallint– real,float– As in C, the size of types is not well standardized– Typically lots of aliases for various sizes of integers and

floating point numbers

• Character types: char(n), varchar(n), text– Use varchar for long, variable length strings– Use text for very long strings that you won’t need to

compare with others (e.g., file content)

• Business oriented types (money, dates, decimal values)– Some are occasionally useful

Add data to small table: The INSERT command

Use single quotes for strings.

Values in same order as in create statement.

RA,Dec normally stored as decimal degrees

How do I get rid of rows in a table?

DELETE FROM table WHERE conditionsDELETE FROM stars WHERE ra is null

To delete all rows (but not delete the table entirely)

DELETE from table

How do I modify a table?

• Update values in existing rows:– UPDATE table SET field=value,field=value,… WHERE condition UPDATE stars SET ra=1.14983, dec=-31.243 WHERE starid=49

• Adding columns.– Not supported by standard SQL. Create a new table with new

columns and copy old values. Some DB’s support ALTER TABLE … ADD COLUMN …

• Copying tables (differs a bit from DMBS to DBMS)– Create second table thenSELECT INTO table2 * FROM table1 orINSERT INTO TABLE2 (SELECT * FROM table1)

Null values: no name for these stars

Specify parameters to fill

Include an explicit null in the list

What’s in the table?

Status of my tables

Let’s try to query the table we created: use the SELECT command

But it fails! Try again…

This query works. CasJobs doesn’t like null values. The lesson is that one needs to be chary of nulls (and that CasJobs is not a full featured DB).

CasJobs put the results in another table for us. We click on MYDB in the top bar, then the created table name, then on Sample to see the results.

The SELECT statement

The SELECT statement is used to query and existing table or set of tables

A typical query is:SELECT field1,field2,… FROM table1, table2, …

WHERE condition1 AND/OR condition2 AND/OR … ORDER BY sortfield1,sortfield2,…

The list of fields to be returned can usually be specified as ‘*’ to get all the fields in the table.

There are also GROUP BY and HAVING clauses for advanced queries.

The FROM clause

• The table, or list of tables, to be queried.– … FROM mystars …

• There may be a ‘database’ specified, a collection of related tables– … FROM mydb.mystars …

• Each table in the list may have an alias that can be used to identify that table elsewhere– … FROM mystars m ….

Fields (all other clauses)Can be:• Simple name of column in table: mag• Expression: ra-dec• A constant: 3.14159, or ‘pi’

• You can name the result if you like using AS , e.g.select ra AS myra,ra-dec AS diff from mystarswill return columns named myra and diff• When there are multiple tables, table columns can be distinguished

by table names and aliases.– Select t1.ra-t2.ra from mystars t1, yourstars, t2 where t1.id=t2.id– Select mystars.ra-yourstars.ra from mystars,yourstars where

mystars.id=yourstars.id

• One difference between SQL and ADQL is that ADQL requires all column names to specify their origin table, not just where there is an ambiguity.

Conditions

• Comparisons:– mag < 6– mag between 4 and 6– name=‘star1’– ra < 3*dec– sin(ra)*cos(dec)-cos(ra)*sin(dec) > .234– name like ‘star%’

• ‘%’ is the wild card. ‘_’ matches one character.

• Lists– name in (‘star1’, ’star2’, ‘star2’)

• Null tests:– name is null

• Negation– not between 4 and 6– name is not null

• Compound conditions– (mag < 6 and mag > 10 and mag>2*flux)

Querying multiple table: Joins

Joins ‘join’ the rows in two or more tables– One to one relationship

• The two tables are logically one bigger table• A salary table and address table with one row for each

employee.

– Many to one• Components• Checks to Checking account

– Each check belongs to exactly one account but each account can have many checks

– Many to Many• Observations/Sources

– An single observation may include many sources– A single source may be seen in many observations

How does a query work…

Table1a11,a12,a13,…a21,a22,a23,..a31,a32,a33,..…

Table2b11,b12,b13,…b21,b22,b23,..b31,b32,b33,..…

Naïve view:First looking at the FROM clause the database creates the product of all the rows of all the participating tables.

If there are n rows in table 1 and m in table 2, then there are nxm in the join. For many way joins this can very large very quickly!

Data base optimization is all about trying to reduce the number of rows that have to be looked at.

a11,a12,a13…,b11,b12,b13…a11,a12,a13,…b21,b22,b23,…a11,a12,a13,…b31,b32,b33,…a21,a22,a23,…b11,b12,b13,…a21,a22,a23,…b11,b12,b13,…a21,a22,a23,…b11,b12,b13,…a21,a22,a23,…b11,b12,b13,…a31,a32,a33,…b11,b12,b13,…a31,a32,a33,…b11,b12,b13,…a31,a32,a33,…b11,b12,b13,……

WHERE clause: Filtering the intermediate table

The WHERE clause filters the rows in the product table.

If a condition involves only one table it can be applied to the inputs before cross-product.

If a condition applies to both tables, they can sometimes be organized (indexed) such that only a few rows in one table need to be checked for each row in the other.

Specifying the output

• The selection list tells the DBMS which columns to include (or generate) on the output

• The sort tells the DBMS the order.

Astronomy and RDBMS’s

• RDBMS’s expect exact joins on keys (user id, SSN’s, product number) which matches business use.– Sometimes correct for astronomical tables, e.g., targets

in a proposal.

• Many astronomical joins are soft: nearby in position and/or time looking for counterparts.– 1-D is not a problem, but 2-D soft joins are hard.– GIS systems provide some non-standard features– ADQL extensions– Indexing of tables and planning of queries requires lots

of work to do efficiently.

Create sample tables

create table pets (owner varchar(12), pet varchar(12), species varchar(12))

insert into pets values('fred', 'fido', 'dog')insert into pets values('joe', 'ruff', 'dog')insert into pets values('mary', 'ears', 'dog')insert into pets values('andie', 'max', 'dog')insert into pets values('joe', 'silvertip', 'cat')insert into pets values('marlene', 'buster', 'cat')insert into pets values('bill', 'meow', 'cat')insert into pets values('simone', 'furball', 'cat')insert into pets values('mary', 'silver', 'horse')insert into pets values('nell', 'fairmane', 'horse')insert into pets values('effie', 'belle', 'horse')insert into pets values('nell', 'slither', 'salamander')insert into pets values('nell', 'hairy', 'tarantula')

create table vets (vet varchar(12), species varchar(12))

insert into vets values('merriwether', 'dog')insert into vets values('parell', 'dog')insert into vets values('parell', 'cat')insert into vets values('parell', 'horse')insert into vets values('nestor', 'salamander')

Sample Queries:

Which vets should each owner know?select p.owner,v.vet from pets p, vets vwhere p.species = v.speciesorder by p.owner

Are any pets not covered by a vet?select pet from petswhere species not in (select species from vets)

How many pets can each vet treat?select v.vet,count(p.pet) as pcnt from pets p, vets v where v.species = p.speciesgroup by v.vet

Upload SPOCS data

Query for Sun-like stars

SELECT s.spocs, s.Name, s.Teff,s.Log_g,s.M_o_H FROM spocs s WHERE s.Teff BETWEEN 5720 AND 5820

AND s.Log_g BETWEEN 4.34 AND 4.54 AND s.m_o_h BETWEEN -0.1 AND +0.1

ORDER BY s.Teff

Correlate with SDSS: Can it be done?

SELECT p.objID,p.field, p.ra, p.decFROM dr2..PhotoObj p,mydb..spocs sWHERE s._dej2000 between p.dec-.001 and

p.dec+.001and s._raj2000 between p.ra-.001 and p.dec+.001

Making this possible is what ADQL is really about!

~7 ½ minutes to run this little query.

Other SQL topics

• TOP/LIMIT/SET ROWCOUNT– Different ways to limit the rows output

• Functions and procedures• Indices, clusters, and ensuring efficient access• How do we maintain a dynamic database?

– Referential integrity, triggers, transactions.– Less critical to typical, relatively static astronomical

databases.

• Groups and group functions– AVG, MIN, MAX, COUNT, SUM– GROUP BY and HAVING clauses

What to do if DB needed in project

• Databases and accounts set up and available for use at NOAO.

• Lots of free databases available. If you want help setting one up just ask….