YouTunes: A Music Library Data Set

Online (read-only) spreadsheet

The Worksheets

• Employees: One line per employee in your company
• Songs: One line per song in your store
• Customers: One line per customer using your service
• Invoice: One line per invoice

A Quick Introduction to Analysis with Spreadsheets

What we’ll cover

• Getting started: viewing data
• Naming cells
• Writing formulae
• Built-in functions
• Getting help
• Formatting your spreadsheet

Getting Started: Viewing Data

• We’ll be using the same data here as you’ll be using in Assignment 1.

• The data is available here.

Absolute and Relative Addresses

• If there are no $ signs, then addresses are relative and the cells referenced will change when you copy and paste a cell.

• The $ sign indicates absolute addressing: the reference keeps the exact row and/or column when you copy and paste. For example, $A$1 always refers to cell A1, A$1 fixes only the row, and $A1 fixes only the column.

• This will become crucial when we start applying formulae to entire rows/columns.

Walkthrough: Sample Spreadsheet Analyses

YouTune Analysis

1. You’d like to get a better sense of your orders. What is your average invoice size?

2. You are considering creating album pricing. Doing this requires that you get a sense of how many tracks are on each album. On the album sheet, create a list of album titles with the number of tracks on each album.

3. It’s time to establish an employee recognition program. What is the name of the employee that has been with you the longest?

4. How long has the employee in #3 worked for you?

From Spreadsheets to Databases

An Introduction to Relational Databases and SQL

Relational Databases

• A Bit of History
• The Relational Model
• YouTunes Data is Relational Data

In the Beginning

• Where “beginning” equals the 1960s
• Computers
  • Centralized systems
  • Spiffy new data channels let CPU and I/O overlap.
  • Persistent storage is on drums.
  • Buffering and interrupt handling are done in the OS.
  • Making these systems fast is becoming a research focus.
• Data
  • What did data look like?

Organizing Data: ISAM

• Indexed sequential access method (ISAM)
  • Pioneered by IBM for its mainframes
  • Fixed-length records
  • Each record lives at a specific location
  • Rapid record access even on a sequential medium (tape)
• All indexes are ‘secondary’
  • Allow key lookup, but …
  • Do not correspond to physical organization
  • Key is to build small, efficient index structures
• Fundamental access method in COBOL
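The ISAM ideas above (fixed-length records at computable locations, plus a small separate index for key lookup) can be sketched in a few lines. This is only an illustration: the record layout, the sample rows, and the function names are hypothetical, and a bytes object stands in for the data file.

```python
import bisect
import struct

# Hypothetical fixed-length layout: 4-byte integer key + 20-byte name = 24 bytes.
RECORD = struct.Struct("<i20s")

def build_file(rows):
    """Pack rows in arrival order; record i lives at byte offset i * RECORD.size."""
    return b"".join(RECORD.pack(key, name.encode()) for key, name in rows)

def build_index(data):
    """Secondary index: sorted (key, slot) pairs, independent of physical order."""
    count = len(data) // RECORD.size
    pairs = [(RECORD.unpack_from(data, slot * RECORD.size)[0], slot)
             for slot in range(count)]
    return sorted(pairs)

def lookup(data, index, key):
    """Binary-search the index, then jump straight to the record's offset."""
    pos = bisect.bisect_left(index, (key, -1))
    if pos == len(index) or index[pos][0] != key:
        return None
    slot = index[pos][1]
    _, name = RECORD.unpack_from(data, slot * RECORD.size)
    return name.rstrip(b"\x00").decode()

data = build_file([(30, "Album C"), (10, "Album A"), (20, "Album B")])
index = build_index(data)
found = lookup(data, index, 20)    # record stored in slot 2
missing = lookup(data, index, 25)  # no such key
```

Because every record has the same size, the index only needs to store a slot number; the byte offset is computed rather than searched for.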

Organizing Data: The Network Model

• Early data management systems represented data using something called a network model.
• Data are represented by collections of records (today we would call those key/data pairs).
• Relationships among records are expressed via links between records (today we can think of those links as pointers).
• Applications interacted with data by navigating through it:
  • Find a record
  • Follow links
  • Find other records
  • Repeat

The Network Model: Inside Records

• Records are composed of attributes.
• Attributes are single-valued.
• Links connect exactly two records.
• Represent N-way relationships via link records

[Figure: three example records, Track, Artist, and Album, each with ID and name attributes]
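The navigational style of the network model can be sketched with plain objects and references. This is a minimal illustration, not how any historical system was implemented: the Record class, the link helper, the role names, and the sample values are all hypothetical; only the Track, Artist, and Album records with ID and name attributes come from the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    """A record with single-valued attributes plus named links to other records."""
    id: int
    name: str
    # Links are excluded from repr/compare because records point at each other.
    links: dict = field(default_factory=dict, repr=False, compare=False)

def link(a, a_role, b, b_role):
    """A link connects exactly two records; remember it on both ends."""
    a.links.setdefault(a_role, []).append(b)
    b.links.setdefault(b_role, []).append(a)

artist = Record(1, "Example Artist")
album = Record(10, "Example Album")
track1 = Record(100, "First Track")
track2 = Record(101, "Second Track")

link(artist, "albums", album, "artist")
link(album, "tracks", track1, "album")
link(album, "tracks", track2, "album")

# Navigation: find a record, follow links, find other records, repeat.
track_names = [t.name for a in artist.links["albums"] for t in a.links["tracks"]]
artist_of_track2 = track2.links["album"][0].links["artist"][0].name
```

Note how the application code has to know which links exist and in what order to follow them, which is the coupling the slides go on to criticize.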

The Relational Model: The Competition

• The Network Model had some problems
  • Applications had to know the structure of the data
  • Changing the representation required a massive rewrite
  • Fundamentally: the physical arrangement was tightly coupled to the application and the application logic.
• 1970: Ted Codd proposes the relational model
  • Decouple physical representation from logical representation
  • Store “records” as “tables”
  • Replace links with implicit joins among tables
• The big question: could it perform?

The Relational Model

• Basic concepts:
  • A database consists of a collection of tables.
• Example:

SKU       Description                                  Price
12345678  Perky-Pet Mason Jar Wild Bird Feeder         $17.15
90123456  No-No Greed Seed Ball Wild Bird Feeder       $7.80
78901234  Perky-Pet Squirrel-Be-Gone Wild Bird Feeder  $19.98
56789012  Wilderness Lantern Wild Bird Feeder          $18.99

• Column = field = attribute
• Columns have a type = domain
• Row = tuple = record
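As a rough sketch of how the example table maps onto a relational database, the snippet below loads the same four rows into an in-memory SQLite database using Python's sqlite3 module. The table name feeders and the choice of column types are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each column (attribute) gets a name and a type (domain).
conn.execute("CREATE TABLE feeders (SKU INTEGER, Description TEXT, Price REAL)")

# Each row (tuple, record) supplies one value per column.
conn.executemany("INSERT INTO feeders VALUES (?, ?, ?)", [
    (12345678, "Perky-Pet Mason Jar Wild Bird Feeder", 17.15),
    (90123456, "No-No Greed Seed Ball Wild Bird Feeder", 7.80),
    (78901234, "Perky-Pet Squirrel-Be-Gone Wild Bird Feeder", 19.98),
    (56789012, "Wilderness Lantern Wild Bird Feeder", 18.99),
])

cur = conn.execute("SELECT * FROM feeders WHERE Price < 18 ORDER BY Price")
columns = [d[0] for d in cur.description]  # column names of the result
cheap = cur.fetchall()                     # rows as tuples
```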

Spreadsheet Data is (mostly) Relational

Relational Databases: Schemas and SQL

Topics for Today

• Structured (English) Query Language (SQL)
  • What is SQL
  • Data Definition Language
  • Data Manipulation Language
• Learning objectives
  • Create and delete tables
  • Use SELECT to retrieve data

SQL Commands

• DDL: data definition language
  • CREATE TABLE, DROP TABLE
  • CREATE INDEX, DROP INDEX
  • CREATE VIEW, DROP VIEW
• DML: data manipulation language
  • SELECT: retrieve tuples
  • UPDATE: modify tuples
  • DELETE: remove tuples
  • INSERT: add tuples

• Note: In relational algebra, relations are sets, so they contain no duplicates. SQL treats relations (tables) as bags (multisets), which allows duplicate rows!
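The set-versus-bag distinction is easy to see in practice. In this small sketch (the plays table and its rows are made up for illustration), the same row is inserted twice; a plain SELECT returns both copies, while SELECT DISTINCT recovers set behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (Name TEXT)")  # hypothetical table
conn.executemany("INSERT INTO plays VALUES (?)",
                 [("Song A",), ("Song A",), ("Song B",)])

# Bag semantics: the duplicate row is kept.
all_rows = conn.execute("SELECT Name FROM plays ORDER BY Name").fetchall()

# DISTINCT removes duplicates from the result.
distinct_rows = conn.execute(
    "SELECT DISTINCT Name FROM plays ORDER BY Name").fetchall()
```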

Create Table

• What it does: creates a relation with the specified schema
• Basic syntax:

CREATE TABLE relation_name(attr_name1 attr_type1, … attr_nameN attr_typeN);

• The system we are using (SQLite) has a somewhat unusual type system. It has only five types (the name before the colon on each line below). The other names are the more traditional SQL types, which all “work” in SQLite, but all the entries on a line are treated the same.
  • INTEGER: INT, INTEGER, TINYINT, SMALLINT, MEDIUMINT, BIGINT, UNSIGNED BIG INT, INT2, INT8
  • TEXT: CHARACTER(20), VARCHAR(255), VARYING CHARACTER(255), NCHAR(55), NATIVE CHARACTER(70), NVARCHAR(100), TEXT, CLOB
  • BLOB: BLOB (or no datatype specified)
  • REAL: REAL, DOUBLE, DOUBLE PRECISION, FLOAT
  • NUMERIC: DECIMAL(10,5), BOOLEAN, DATE, DATETIME

Examples

CREATE TABLE songs (Name text, Composer text, Album text, Artist text);

CREATE TABLE employee (Manager text, LastName text, FirstName text, Title text, BirthDate DATETIME, HireDate DATETIME, Address text, City text, State text, Country text, PostalCode text, Phone text, Fax text, Email text);
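A quick way to try CREATE TABLE and DROP TABLE, and to see how SQLite collapses declared types into its five storage types, is the sketch below. The songs statement is taken from the examples; the affinity_demo table and its values are hypothetical and exist only to exercise SQLite's built-in typeof() function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE songs (Name text, Composer text, Album text, Artist text)")

# Hypothetical table: four traditional type names, one from each of four groups.
conn.execute(
    "CREATE TABLE affinity_demo (a SMALLINT, b VARCHAR(255), c DOUBLE, d DATETIME)")
conn.execute("INSERT INTO affinity_demo VALUES ('42', 7, '3.5', '2012-07-15')")

# typeof() reports how each value was actually stored.
types = conn.execute(
    "SELECT typeof(a), typeof(b), typeof(c), typeof(d) FROM affinity_demo"
).fetchall()[0]

# DROP TABLE deletes the table and its rows.
conn.execute("DROP TABLE affinity_demo")
tables = [row[0] for row in
          conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]
```

Here the text '42' is stored as an integer and the number 7 as text, following the declared column types. The date string stays text, because a DATETIME column is treated as NUMERIC and the string does not look like a number.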

Select

• What it does: performs a query (read-only)
• Implements most of the relational-algebra operations (and then some)
• Basic syntax:

SELECT attributes FROM table [WHERE selection-predicate];

• Extended syntax:

SELECT a1, a2, …
FROM R1, R2, …
WHERE selection-predicate
GROUP BY attribute(s)
ORDER BY attribute(s)
LIMIT n;
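The sketch below exercises both the basic and the extended SELECT forms against the songs schema from the earlier example. The rows are invented sample data, and COUNT(*) is a standard SQL aggregate used here to count tracks per album.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE songs (Name text, Composer text, Album text, Artist text)")
conn.executemany(
    "INSERT INTO songs (Name, Album, Artist) VALUES (?, ?, ?)", [
        ("Track 1", "Album A", "Artist X"),
        ("Track 2", "Album A", "Artist X"),
        ("Track 3", "Album A", "Artist X"),
        ("Track 4", "Album B", "Artist Y"),
        ("Track 5", "Album B", "Artist Y"),
        ("Track 6", "Album C", "Artist X"),
    ])

# Basic form: attributes, one table, a selection predicate.
by_artist = conn.execute(
    "SELECT Name FROM songs WHERE Artist = 'Artist Y' ORDER BY Name").fetchall()

# Extended form: filter, group, sort, and cap the number of rows returned.
top_albums = conn.execute("""
    SELECT Album, COUNT(*) AS Tracks
    FROM songs
    WHERE Artist = 'Artist X'
    GROUP BY Album
    ORDER BY Tracks DESC
    LIMIT 2
""").fetchall()
```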

Handy Options in Select

• Use shorthands for tables

SELECT fields FROM Songs AS S, Invoices AS I …

• Use built-in functions

SELECT MAX(InvoiceDate) FROM invoice;
SELECT DateTime('now');

See the documentation for a list.

• Store results into tables

CREATE TABLE new AS SELECT …

• Assign names to results

CREATE TABLE roster AS
SELECT FirstName || LastName AS FullName FROM Employee;
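These options can be combined in one short session. The sketch below assumes minimal Employee and invoice tables with made-up rows. It also adds a space between the two names when concatenating, which the slide's version omits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (FirstName text, LastName text)")
conn.execute("CREATE TABLE invoice (InvoiceId integer, InvoiceDate DATETIME)")
conn.executemany("INSERT INTO Employee VALUES (?, ?)",
                 [("Ada", "Lovelace"), ("Alan", "Turing")])
conn.executemany("INSERT INTO invoice VALUES (?, ?)",
                 [(1, "2012-01-15 00:00:00"), (2, "2012-03-02 00:00:00")])

# Table shorthand (E), a named result column (FullName), stored into a new table.
conn.execute("""
    CREATE TABLE roster AS
    SELECT E.FirstName || ' ' || E.LastName AS FullName FROM Employee AS E
""")
roster = [row[0] for row in
          conn.execute("SELECT FullName FROM roster ORDER BY FullName")]

# Built-in functions: MAX over a column, and the current date and time.
latest = conn.execute("SELECT MAX(InvoiceDate) FROM invoice").fetchall()[0][0]
now = conn.execute("SELECT DateTime('now')").fetchall()[0][0]
```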

Schema Normalization and Joins

• Notice that we have a lot of redundancy in our data.
• Examples:
  • Employees: a Manager who manages a lot of people has his/her name replicated for each employee
  • Songs: Each album name appears once per song on that album
  • Invoices: Every time a customer purchases a song, we include its price.
• So what?

Why is Duplication Bad?

• Updating the data becomes difficult and/or expensive
  • If we discover a typo in an album name, we have to update every song on that album.
  • What if our manager gets married? We’d have to update every employee who worked for that manager.
• Having to consistently update multiple data items is typically a costly operation.

What’s a Schema Designer to do?

• The relational model and its query language, SQL, are intended to be used in a manner that avoids this duplication.

• In general, whenever we might be inclined to duplicate data, we create a separate table and associate data between the two tables.

• SQL lets you combine data from multiple tables using something called a join.

• A join lets you “connect” two tables based on their values.
• Once you can do that, it’s relatively easy to get rid of some of the duplication we have in our data.
• Let’s tackle the big table first: invoices.
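To make the idea concrete, here is a minimal sketch of removing one kind of duplication, the album name repeated on every song. The albums table, the AlbumId column, and the sample rows are assumptions for illustration and are not the schema used in class.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The album title is stored once; each song refers to it by AlbumId.
conn.execute("CREATE TABLE albums (AlbumId INTEGER PRIMARY KEY, Title TEXT)")
conn.execute("CREATE TABLE songs (Name TEXT, AlbumId INTEGER)")
conn.execute("INSERT INTO albums VALUES (1, 'Albun A')")  # note the typo
conn.executemany("INSERT INTO songs VALUES (?, ?)",
                 [("Track 1", 1), ("Track 2", 1)])

# Fixing the typo now touches exactly one row, however many songs there are.
conn.execute("UPDATE albums SET Title = 'Album A' WHERE AlbumId = 1")

# The join "connects" the two tables wherever their AlbumId values match.
joined = conn.execute("""
    SELECT S.Name, A.Title
    FROM songs AS S, albums AS A
    WHERE S.AlbumId = A.AlbumId
    ORDER BY S.Name
""").fetchall()
```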

More Normalization

• The price information is still duplicated in every invoice item in which it appears.

• The song title (which is kind of long in some cases) also appears multiple times.

• Couldn’t we place those in the song relation?
• Yes! And you’ll do that in class.

SQL: Modifying Data

INSERT, DELETE, and UPDATE

INSERT: Adding Tuples (Rows)

• What it does: Adds a row to a table
• Basic syntax:

INSERT INTO relation VALUES (v1, v2, v3 …);
INSERT INTO relation(a1, a2) VALUES (v1, v2);

• Examples:

INSERT INTO Songs VALUES ('Gangnam Style', 'Psy and Yoo Gun-hyung', 'Psy 6 (Six Rules), Part 1', 'PSY');
INSERT INTO Songs (Name, Artist) VALUES ('Gangnam Style', 'PSY');

• If an attribute’s value is not given, use the default value.
• If there is no default, set the value to NULL.
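The default and NULL rules can be checked directly. This sketch uses a hypothetical variant of the Songs table in which Album has a DEFAULT clause; the other columns have none, so an omitted Composer comes back as NULL (None in Python).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical variant of Songs: only Album declares a default value.
conn.execute("""
    CREATE TABLE Songs (
        Name text,
        Composer text,
        Album text DEFAULT 'Unknown album',
        Artist text
    )
""")

# Composer and Album are not listed, so they are filled in automatically.
conn.execute("INSERT INTO Songs (Name, Artist) VALUES ('Gangnam Style', 'PSY')")

row = conn.execute(
    "SELECT Name, Composer, Album, Artist FROM Songs").fetchall()[0]
```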

INSERT: Nested query version

• What it does: SELECTs data and inserts the result into a relation.
• Basic syntax:

INSERT INTO relation SELECT …;

• Examples:

INSERT INTO invoiceItem SELECT InvoiceId, Item1, Price1 FROM invoices;

• This is the query we used to create the initial invoiceItem table; for each of the remaining items, we did:

INSERT INTO invoiceItem SELECT InvoiceId, Item2, Price2 FROM invoices WHERE Item2 != '';
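The two statements above can be tried on a cut-down invoices table. The sketch assumes a wide invoices table with just two item/price column pairs and an invoiceItem table with columns (InvoiceId, Item, Price); the real tables may have more columns and different names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed shapes: one wide row per invoice, one narrow row per purchased item.
conn.execute("""CREATE TABLE invoices
                (InvoiceId integer, Item1 text, Price1 real,
                 Item2 text, Price2 real)""")
conn.execute("CREATE TABLE invoiceItem (InvoiceId integer, Item text, Price real)")
conn.executemany("INSERT INTO invoices VALUES (?, ?, ?, ?, ?)", [
    (1, "Song A", 0.99, "Song B", 1.29),
    (2, "Song C", 0.99, "", None),      # invoice 2 has no second item
])

# First item of every invoice, then second items where one exists.
conn.execute(
    "INSERT INTO invoiceItem SELECT InvoiceId, Item1, Price1 FROM invoices")
conn.execute(
    "INSERT INTO invoiceItem SELECT InvoiceId, Item2, Price2 "
    "FROM invoices WHERE Item2 != ''")

items = conn.execute(
    "SELECT InvoiceId, Item FROM invoiceItem ORDER BY InvoiceId, Item").fetchall()
```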

DELETE: Removing Tuples

• What it does: Removes rows from a table
• Basic syntax:

DELETE FROM relation WHERE predicate;

• Example:

DELETE FROM invoices WHERE InvoiceId = 2;
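A small sketch of the DELETE example, run against a made-up invoices table with only two columns. The cursor's rowcount attribute reports how many rows the statement removed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (InvoiceId integer, Total real)")  # assumed columns
conn.executemany("INSERT INTO invoices VALUES (?, ?)",
                 [(1, 1.98), (2, 0.99), (3, 3.96)])

# Only rows matching the predicate are removed.
cur = conn.execute("DELETE FROM invoices WHERE InvoiceId = 2")
deleted = cur.rowcount

remaining = [row[0] for row in
             conn.execute("SELECT InvoiceId FROM invoices ORDER BY InvoiceId")]
```

Be careful with the predicate: a DELETE with no WHERE clause removes every row in the table.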

UPDATE: Modifying Data

• What it does: Changes values in a table
• Basic syntax:

UPDATE relation SET attr1 = val1, attr2 = val2, … WHERE predicate;

• Examples:

UPDATE employees SET address = '2468 39th Avenue' WHERE FirstName = 'Steve';

UPDATE customer SET Company = '2U' WHERE CustomerId = '17';
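The sketch below runs an UPDATE that sets two attributes at once on a made-up customer table. The City column and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (CustomerId integer, Company text, City text)")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [(16, "Acme", "Boston"), (17, "Old Co", "Boston")])

# SET lists the attributes to change; WHERE picks the rows to change.
conn.execute(
    "UPDATE customer SET Company = '2U', City = 'New York' WHERE CustomerId = 17")

rows = conn.execute(
    "SELECT CustomerId, Company, City FROM customer ORDER BY CustomerId"
).fetchall()
```

As with DELETE, leaving out the WHERE clause applies the change to every row.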
