Overview of Databases
Gerome Miklau
CMPSCI 645 – Database Design & Implementation
UMass Amherst
Feb 1, 2006
Some slide content courtesy of Zack Ives, Ramakrishnan & Gehrke, Dan Suciu, Ullman & Widom
Today
• Student information form
• Overview of databases
• Course topics
• Course requirements
Databases & DBMS’s
• A database is a large, integrated collection of data.
• A database management system (DBMS) is a software package designed to store and manage databases. It provides:
– A way to define the kind of data stored
– A querying/updating interface
– Reliable storage & recovery of 100s of GB
– Controlled access to data from many concurrent users
Can filesystems do it?
• Schema for files is limited
• No query language for data in files
• Files can store large amounts of data, but
– no recovery from failure
– no efficient access to items within file
• Concurrent access not safe
No
Evolution
• Early DBMS’s (1960’s) evolved from file systems.
• Data with many small items & many queries or modifications:
– Airline reservations
– Banking
Early DB systems
• Tree-based hierarchical data model
• Graph-based network data model
• Encouraged users to think about data the way it was stored.
• No high level query language
Data model
The data model includes basic assumptions about what’s an “item” of data, how to represent it and interpret it.
The Relational Model
• The relational data model (Codd, 1970):
– Data independence: details of physical storage are hidden from users
– High-level declarative query language
• say what you want, not how to compute it.
• mathematical foundation
– A theory of normalization guides the design of relations
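To make "say what you want, not how to compute it" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name Students, the sample rows, and both function names are assumptions chosen for illustration; SQLite stands in for any relational DBMS.

```python
import sqlite3

# Hypothetical sample rows (sid, name); not prescribed by the slides.
ROWS = [(1, "Jill"), (2, "Bo"), (3, "Maya")]

def names_imperative(rows, min_sid):
    """'How': we loop, filter, and sort by hand."""
    result = []
    for sid, name in rows:
        if sid >= min_sid:
            result.append(name)
    return sorted(result)

def names_declarative(rows, min_sid):
    """'What': we describe the answer in SQL and let the DBMS compute it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Students (sid INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO Students VALUES (?, ?)", rows)
    cur = conn.execute(
        "SELECT name FROM Students WHERE sid >= ? ORDER BY name", (min_sid,)
    )
    names = [row[0] for row in cur]
    conn.close()
    return names

print(names_imperative(ROWS, 2))
print(names_declarative(ROWS, 2))
```

Both functions return the same names; only the second leaves the choice of access method to the system.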
Side-note: Turing Awards in Databases
1973: Bachman, network data model
1981: Codd, relational model
1998: Jim Gray, transaction processing
DBMS Benefit #1: Generality and Declarativity
• The programmer or user does not need to know details like indices, sort orders, machine speeds, disk speeds, concurrent users, etc.
• Instead, the programmer/user programs with a logical model in mind
• The DBMS “makes it happen” based on an understanding of relative costs of different methods
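As a small illustration of this benefit, the sketch below runs the same query before and after a physical design change (adding an index). It assumes SQLite via Python's sqlite3 module; the Courses table, its rows, and the index name idx_courses_sem are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Courses (cid INTEGER, name TEXT, sem TEXT)")
conn.executemany(
    "INSERT INTO Courses VALUES (?, ?, ?)",
    [(645, "DB", "F05"), (683, "AI", "S05"), (635, "Arch", "F05")],
)

# The query is written against the logical model only.
QUERY = "SELECT name FROM Courses WHERE sem = 'F05' ORDER BY name"

before = conn.execute(QUERY).fetchall()

# Physical design change: add an index (hypothetical name).
conn.execute("CREATE INDEX idx_courses_sem ON Courses (sem)")

# Same query text, same answer; the DBMS is free to use the index or not.
after = conn.execute(QUERY).fetchall()

print(before, after)
```

The query text never mentions the index, so the application does not change when the physical layout does.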
Benefit #2: Efficiency and Scale
• Efficient storage of hundreds of GBs of data
• Efficient access to data
• Rapid processing of transactions
Benefit #3: Management of Concurrency and Reliability
• Simultaneous transactions handled safely.
• Recovery of system data after system failure.
• More formally: the ACID properties
– Atomicity - all or nothing
– Consistency - each transaction leaves the database in a sensible (valid) state
– Isolation - separated from the effects of other concurrent transactions
– Durability - once completed, never lost
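Atomicity and consistency can be sketched with a bank transfer, one of the classic workloads mentioned earlier. This is an illustrative example using Python's sqlite3 module, not a description of any particular system; the Accounts table, the CHECK constraint, and the transfer function are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint encodes a consistency rule: no negative balances.
conn.execute(
    "CREATE TABLE Accounts (id INTEGER PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO Accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one transaction; True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE Accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE Accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    """Return {account id: balance}."""
    return dict(conn.execute("SELECT id, balance FROM Accounts"))

ok = transfer(conn, 1, 2, 30)       # valid transfer, commits
failed = transfer(conn, 1, 2, 500)  # would overdraw account 1, rolls back
print(ok, failed, balances(conn))
```

In the failed transfer the credit to account 2 has already been applied when the debit violates the constraint; the rollback undoes it, so the transfer is all or nothing.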
How Does One Build a Database?
• Start with a conceptual model
• Design & implement schema
• Write applications using DBMS and other tools
– Many ways of doing this (DBMS, API writers, library authors, web server, etc.)
– Common applications include PHP/JSP/servlet-driven web sites
• The DBMS takes care of query optimization and execution
Conceptual Design
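[ER diagram: entities STUDENT (sid, name), COURSE (cid, name), and PROFESSOR (fid, name); relationships Takes (between STUDENT and COURSE) and Teaches (between PROFESSOR and COURSE); attribute semester]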
Designing a Schema (Set of Relations)
• Convert to tables + constraints
• Then need to do “physical” design: the layout on disk, indices, etc.
STUDENT
sid  name
1    Jill
2    Bo
3    Maya

Takes
sid  cid
1    645
1    683
3    635

COURSE
cid  name  sem
645  DB    F05
683  AI    S05
635  Arch  F05

PROFESSOR
fid  name
1    Diao
2    Saul
8    Weems

Teaches
fid  cid
1    645
2    683
8    635
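One way the "tables + constraints" step might look in SQL is sketched below, again using Python's sqlite3 module. The choice of primary keys, NOT NULL, and foreign keys is an assumption about what constraints fit this schema; the slides do not list them. The helper names make_db and insert_takes are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Student   (sid INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Professor (fid INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Course    (cid INTEGER PRIMARY KEY, name TEXT NOT NULL, sem TEXT);
CREATE TABLE Takes (
    sid INTEGER NOT NULL REFERENCES Student(sid),
    cid INTEGER NOT NULL REFERENCES Course(cid),
    PRIMARY KEY (sid, cid)
);
CREATE TABLE Teaches (
    fid INTEGER NOT NULL REFERENCES Professor(fid),
    cid INTEGER NOT NULL REFERENCES Course(cid),
    PRIMARY KEY (fid, cid)
);
"""

DATA = """
INSERT INTO Student   VALUES (1, 'Jill'), (2, 'Bo'), (3, 'Maya');
INSERT INTO Professor VALUES (1, 'Diao'), (2, 'Saul'), (8, 'Weems');
INSERT INTO Course    VALUES (645, 'DB', 'F05'), (683, 'AI', 'S05'), (635, 'Arch', 'F05');
INSERT INTO Takes     VALUES (1, 645), (1, 683), (3, 635);
INSERT INTO Teaches   VALUES (1, 645), (2, 683), (8, 635);
"""

def make_db():
    """Create an in-memory database with the sample schema and rows."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA + DATA)
    return conn

def insert_takes(conn, sid, cid):
    """Try to enroll a student; return False if a constraint rejects the row."""
    try:
        with conn:
            conn.execute("INSERT INTO Takes VALUES (?, ?)", (sid, cid))
        return True
    except sqlite3.IntegrityError:
        return False

conn = make_db()
print(insert_takes(conn, 2, 645))  # Bo enrolls in an existing course
print(insert_takes(conn, 2, 999))  # course 999 does not exist
```

The constraints let the DBMS, rather than every application, reject rows that refer to missing students or courses.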
Queries
• Find all courses that “Mary” takes

SELECT C.name
FROM Students S, Takes T, Courses C
WHERE S.name = 'Mary' AND S.sid = T.sid AND T.cid = C.cid

• What happens behind the scenes?
– Query processor figures out how to answer the query efficiently.
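The query can be tried against the sample rows from the schema slide with Python's sqlite3 module. This is a sketch under a few assumptions: the tables use the names from the query (Students, Takes, Courses), the student name is passed as a parameter, and an ORDER BY is added only to make the output order predictable. The function name courses_taken_by is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (sid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Courses  (cid INTEGER PRIMARY KEY, name TEXT, sem TEXT);
CREATE TABLE Takes    (sid INTEGER, cid INTEGER);
INSERT INTO Students VALUES (1, 'Jill'), (2, 'Bo'), (3, 'Maya');
INSERT INTO Courses  VALUES (645, 'DB', 'F05'), (683, 'AI', 'S05'), (635, 'Arch', 'F05');
INSERT INTO Takes    VALUES (1, 645), (1, 683), (3, 635);
""")

def courses_taken_by(student_name):
    """Names of all courses taken by the given student."""
    rows = conn.execute(
        """
        SELECT C.name
        FROM Students S, Takes T, Courses C
        WHERE S.name = ? AND S.sid = T.sid AND T.cid = C.cid
        ORDER BY C.name
        """,
        (student_name,),
    )
    return [row[0] for row in rows]

print(courses_taken_by("Jill"))
print(courses_taken_by("Mary"))  # no student named Mary in the sample rows
```

With these sample rows, "Mary" matches no student, so the query returns an empty result; "Jill" is used to show a non-empty answer.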
Queries, behind the scenes

Declarative SQL query:

SELECT C.name
FROM Students S, Takes T, Courses C
WHERE S.name = 'Mary' AND S.sid = T.sid AND T.cid = C.cid

Query execution plan:

[Figure: query plan tree over Students, Takes, and Courses, with operators labeled sid=sid, cid=cid, name=“Mary”, and sname]
The optimizer chooses the best execution plan for a query
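To show why the choice of plan matters, here is a toy sketch of two hand-written plans for the same query, in plain Python. It is not how a real optimizer works; the operators select_rows, join, and project, the nested-loop join, and the comparison counter are all assumptions made for illustration. The sample rows come from the schema slide, and "Jill" is used because "Mary" is not among them.

```python
STUDENTS = [{"S.sid": 1, "S.name": "Jill"}, {"S.sid": 2, "S.name": "Bo"},
            {"S.sid": 3, "S.name": "Maya"}]
TAKES = [{"T.sid": 1, "T.cid": 645}, {"T.sid": 1, "T.cid": 683},
         {"T.sid": 3, "T.cid": 635}]
COURSES = [{"C.cid": 645, "C.name": "DB"}, {"C.cid": 683, "C.name": "AI"},
           {"C.cid": 635, "C.name": "Arch"}]

def select_rows(rows, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]

def join(left, right, left_key, right_key, stats):
    """Nested-loop equijoin; counts every pair of rows it compares."""
    out = []
    for left_row in left:
        for right_row in right:
            stats["comparisons"] += 1
            if left_row[left_key] == right_row[right_key]:
                out.append({**left_row, **right_row})
    return out

def project(rows, column):
    """Projection: keep a single column."""
    return [row[column] for row in rows]

def plan_pushdown(name):
    """Filter Students first, then join with Takes and Courses."""
    stats = {"comparisons": 0}
    s = select_rows(STUDENTS, lambda r: r["S.name"] == name)
    st = join(s, TAKES, "S.sid", "T.sid", stats)
    stc = join(st, COURSES, "T.cid", "C.cid", stats)
    return project(stc, "C.name"), stats["comparisons"]

def plan_late_filter(name):
    """Join everything first, then filter on the student name."""
    stats = {"comparisons": 0}
    st = join(STUDENTS, TAKES, "S.sid", "T.sid", stats)
    stc = join(st, COURSES, "T.cid", "C.cid", stats)
    filtered = select_rows(stc, lambda r: r["S.name"] == name)
    return project(filtered, "C.name"), stats["comparisons"]

print(plan_pushdown("Jill"))
print(plan_late_filter("Jill"))
```

Both plans return the same courses, but the first compares fewer pairs of rows because it filters before joining. A real optimizer estimates such costs from statistics rather than by running every plan.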
An Issue: 80% of the World’s Data is Not in a DB!
Examples:
– Scientific data (large images, complex programs that analyze the data)
– Personal data
– WWW and email (some of it is stored in something resembling a DBMS)
Data management is expanding to tackle these problems
DBMSs in the Real World
A huge industry for 20% of the world’s data!
• Big, mature relational databases
– IBM DB2, Oracle, Microsoft SQL Server
– Adding advanced features, including “native XML” support
• “Middleware” above these systems
– SAP, Siebel, PeopleSoft, dozens of special-purpose apps
• Integration and warehousing systems
– BEA AquaLogic, DB2 Information Integrator
• Current trends:
– Web services; XML everywhere
– Smarter, self-tuning systems
Database Research
• One of the broadest, most exciting areas in CS!
• A microcosm of CS in general
– languages, operating systems, concurrent programming, data structures, algorithms, theory, distributed systems, statistical techniques.
• Theory and systems well-integrated.
Recent Trends in Databases
• XML
– Relational databases with XML support
– Middleware between XML and relational databases
– Large-scale XML message systems
• Main memory database systems
• Peer data management
• Stream data management
• Model management, provenance
• Security and privacy
• Modeling uncertainty, probabilistic databases
What is the Field of Databases?
• To an applied researcher (SIGMOD/VLDB/ICDE)
– Query optimization
– Query processing (yet-another join algorithm)
– Transaction processing, recovery (but most stuff is already done)
– Novel applications: data mining, high-dimensional search
• To a theoretical researcher (PODS/ICDT/LICS)
– Focus on the query languages
– Query language = logic = complexity classes
Course topics
• Fundamentals: relational design, query languages.
• Database internals: storage, indexing, query processing, query optimization, transaction management.
• Theory: expressiveness of query languages, static analysis, complexity.
• XML and semi-structured data models.
• Security: access control, privacy.
• Advanced topics: streaming data, information integration.
Prerequisites
• Official: undergrad course in DB or OS
• Also:
– Elementary complexity theory
Grading
• Homework: 20%
• Paper reviews: 10%
• Project: 25%
• Midterm: 20%
• Final: 25%
Homework: 20%
• 4 assignments throughout the course
– written problem sets
– practical experience with SQL, XQuery
Paper Reviews: 10%
• Approximately 5 classic papers will be assigned
• Short written reviews are due before the day of class. Email to:
– [email protected]
First paper review:
Read Sec 1 of Codd’s paper
Due Wed Feb 8th
Project: 25%
• General theme: apply database principles to a new problem
• Suggested topics will be discussed next Monday
• Groups of 2 preferred. 3 possible.
• Project work will include:
– Reading some of the research literature
– Implementation
– Written report
– In-class presentation
• Periodic consultation with one of the instructors
Exams
• Midterm (20%)
– in-class, Monday, Apr 3
• Final (25%)
– Thursday, May 25, 10:30am
Textbook
Database Management Systems (Ramakrishnan and Gehrke)
Other useful resources
• Database Systems: The Complete Book (Ullman, Widom and Garcia-Molina)
• Readings in Database Systems (Stonebraker and Hellerstein)
• Foundations of Databases (Abiteboul, Hull, Vianu)
• Data on the Web (Abiteboul, Buneman, Suciu)
• Parallel and Distributed DBMS (Ozsu and Valduriez)
• Transaction Processing (Gray and Reuter)
• Data and Knowledge based Systems (volumes I, II) (Ullman)
• Proceedings of SIGMOD, VLDB, PODS conferences.
Communication
• Instructors
– Office hours: (see website)
– Email: [email protected]
• Check the course webpage often
• Course bulletin board, mailing list
Questions about the course?