1
L01: Course Overview
CS3200 sp18 s2: Database design, 1/8/2018
2
The world is increasingly driven by data…
This class teaches the basics of how to use & manage data.
3
Increasingly many companies see themselves as data driven.
4
Key Questions We Will Answer
• How can we collect and store large amounts of data?
  - By building tools and data structures to efficiently index and serve data
• How can we efficiently query data?
  - By compiling high-level declarative queries into efficient low-level plans
• How can we safely update data?
  - By managing concurrent access to state as it is read and written
• How do different database systems manage design trade-offs?
  - e.g., at scale, in a distributed environment?
5
When you'll use this material
• Building almost any software application
  - e.g., mobile, cloud, consumer, enterprise, analytics, machine learning
  - Corollary: every application you use uses a database
  - Bonus: every program consumes data (even if only the program text!)
• Performing data analytics
  - Business intelligence, data science, predictive modeling
  - (Even if you're using Pandas, https://pandas.pydata.org/, you're using relational algebra!)
• Building data-intensive tools and applications
  - Many core concepts power everything from deep learning frameworks to self-driving cars
6
Today's Lecture
1. Introduction, admin & setup
2. Overview of the relational data model
3. Overview of DBMS topics: Key concepts & challenges
7
What you will learn about in this section
1. Motivation for studying DBs
2. Administrative structure
3. Course logistics
4. Overview of lecture coverage
5. Some thoughts on pedagogy
8
Big Data Landscape… Infrastructure is Changing
http://www.bigdatalandscape.com/
New tech. Same principles.
9
Some "birth-years". When was SQL born?
• 2004: Facebook
• 1998: Google
• 1995: Java, Ruby
• 1993: World Wide Web
• 1991: Python
• 1985: Windows
10
Some "birth-years"
• 2004: Facebook
• 1998: Google
• 1995: Java, Ruby
• 1993: World Wide Web
• 1991: Python
• 1985: Windows
• 1974: SQL
11
Why should you study databases?
• Mercenary: make more $$$
  - Startups need DB talent right away = low employee #
  - Massive industry…
• Intellectual:
  - Science: data poor to data rich
    • No idea how to handle the data!
  - Fundamental ideas to/from all of CS:
    • Systems, theory, AI, logic, stats, analysis….
Many great computer systems ideas started in DB.
12
What this course is (and is not)
• Discuss fundamentals of data management
  - How to design databases, query databases, build applications with them.
  - How to debug them when they go wrong!
  - Not how to be a DBA or how to tune Oracle 12g.
• We'll cover how database management systems work
• And some (but not all of) the principles of how to build them
13
Who we are…
• Instructor (me): Wolfgang Gatterbauer
  - Faculty in the DATA lab (https://db.ccis.northeastern.edu/)
  - First year at Northeastern!
  - Taught before at University of Washington and CMU's business school
  - Research: theoretic foundations for scalable data management
  - Office hours: W 2:00-4:00, WVH 450
14
Teaching Assistants
15
https://course.ccs.neu.edu/cs3200sp18s2/
Not: https://course.ccs.neu.edu/cs3200sp18s3/
16
Communication w/ Course Staff
• Piazza
• Office hours
• By appointment!
TA office hours will be listed on the course website!
Meeting location: TBD (either 4th floor or 1st floor WVH)
17
Piazza
The goal is to get you to answer each other’s questions so you can benefit and learn from each other.
18
Please use this simple way to let me know what works or not!
https://goo.gl/sLJJeH
Piazza is visible to everyone in this class. This form is visible only to me.
19
Important!
• Students with documented disabilities should send me their accommodation letter from the Disability Resource Center at 20 Dodge Hall by the end of this week.
20
Lectures
• Lecture slides cover essential material
  - This is your best reference.
  - We are trying to get away from the book, but do have pointers
• We try to cover the same thing in many ways: lecture, lecture notes, homework, exams (no shock)
  - Attendance makes your life easier…
21
Attendance
• I dislike mandatory attendance… but in the past we noticed…
  - People who did not attend did worse :(
  - People who did not attend used more course resources :(
  - People who did not attend were less happy with the course :(
• In previous school: mandatory attendance
• This year: voluntary (to start!), with the right reserved to change
22
Graded Elements
• Gradiance quizzes + participation (10%)
• Homeworks (25%)
• Group project (25%)
• Three exams (40% = 10% + 10% + 20%)
Homeworks are typically due Wednesday end of day, and are posted at least 1 week before the due date
23
Un-Graded Elements
• Readings provided to help you!
  - Only items in lecture, homework, or project are fair game.
• In-class activities are mainly to help / be fun!
  - Will occur during class
  - Not graded, but count as part of lecture material (fair game as well)
24
What is expected from you
• Attend lectures
  - If you don't, it's at your own peril
• Be active and think critically
  - Ask questions, post comments on forums
• Do programming and homework projects
  - Start early and be honest
• Study for exams
25
Interested in Research?
26
• R. Li, M. Riedewald, Xinyan Deng: Submodularity of Distributed Join Computation
Poster presentation at Northeast Database Day 2018
Paper at SIGMOD 2018
• R. Li, Aditya Ghosh, M. Riedewald, W. Gatterbauer: Optimizing Data Partitioning for Distributed Band Joins
• P. Ojha, Paul Langton, W. Gatterbauer: Scalable Compatibility Estimation in Large Network Data
In progress: http://queryviz.com
27
Lectures: 1st half - from a user's perspective
1. SQL: Relational data models & Queries
  - ~5 lectures
  - How to manipulate data with SQL, a declarative language
    • reduced expressive power but the system can do more for you
2. Database Design: Design theory and constraints
  - ~6 lectures
  - Designing relational schema to keep your data from getting corrupted
3. Transactions: Syntax & supporting systems
  - ~3 lectures
  - A programmer's abstraction for data consistency
28
Lectures: 2nd half - understanding how it works
4. Database internals: Query Processing
  - ~7 lectures
  - Indexing
  - External Memory Algorithms (IO model) for sorting, joins, etc.
  - Basics of query optimization (Cost Estimates)
  - Relational algebra
5. NoSQL
  - ~0-2 lectures
  - Key-Value Stores
  - (More in CS6240: Large-Scale Parallel Data Processing)
29
https://course.ccs.neu.edu/cs3200sp18s2/sched.html
30
Studying material: "Under which study condition do you think you learn better?"
[Figure: judged performance (= what people think) vs. actual performance (= what is actually working), for passive reading vs. active Q&A]
Source: Karpicke & Blunt, "Retrieval Practice Produces More Learning than Elaborative Studying with Concept Mapping," Science, 2011.
31
The year 2000 imagined in 1900
Source: http://5.mshcdn.com/wp-content/gallery/the-year-2000-as-imagined-in-1900/future.jpg
32
Sequencing Material: "Under which teaching condition do you think you learn better?"
Source: Bjork & Bjork, "Making things hard on yourself, but in a good way: Creating desirable difficulties to enhance learning," Psychology and the real world (...), 2011.
from the textbook for 70-451 MIS
33
Spaced Repetition
• Ebbinghaus Forgetting Curve
• Leitner System (Pimsleur's graduated interval recall)
[Figure: review intervals of 1 day, 3 days, 1 week, 1 month, 6 months, with transitions labeled "correct" and "incorrect"]
Sources: http://www.wired.com/2008/04/ff-wozniak/, Gatterbauer & Suciu, "Managing Structured Collections of Community Data," CIDR 2011.
34
The "Surfer Analogy" for time management
Source: http://stwww.surfermag.com/files/2013/10/Yak_Charlie-970x646.jpg
35
Today's Lecture
1. Introduction, admin & setup
2. Overview of the relational data model
3. Overview of DBMS topics: Key concepts & challenges
36
What you will learn about in this section
1. Definition of DBMS
2. Data models & the relational data model
3. Schemas & data independence
37
What is a DBMS?
• A database is a large, integrated collection of data
• It models a real-world enterprise
  - Entities (e.g., Students, Courses)
  - Relationships (e.g., Alice is enrolled in 145)
A Database Management System (DBMS) is a piece of software designed to store and manage databases
38
A Motivating, Running Example
• Consider building a course management system (CMS):
  - Entities: Students, Courses, Professors
  - Relationships: Who takes what, Who teaches what
Data models
• A data model is a collection of concepts for describing data
  - The relational model of data is the most widely used model today
    • Main Concept: the relation - essentially, a table
• A schema is a description of a particular collection of data, using the given data model
  - E.g. every relation in a relational data model has a schema describing types, etc.
40
“Relational databases are the foundation of western civilization”
Bruce Lindsay, IBM Research
As quoted in: https://dl.acm.org/citation.cfm?id=1083803
41
Modeling the CMS
• Logical Schema
  - Students(sid: string, name: string, gpa: float)
  - Courses(cid: string, cname: string, credits: int)
  - Enrolled(sid: string, cid: string, grade: string)

Relations:

Students
sid   Name   Gpa
101   Bob    3.2
123   Mary   3.8

Courses
cid   cname   credits
564   564-2   4
308   417     2

Enrolled
sid   cid   Grade
123   564   A
42
Modeling the CMS
• Same logical schema and relations as on the previous slide
  - Corresponding keys: Enrolled.sid matches Students.sid, and Enrolled.cid matches Courses.cid
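
The logical schema and example rows above can be tried out directly. The following is a minimal sketch using Python's built-in sqlite3 module, not the course's own setup: the choice of SQLite, the function names `build_cms` and `who_takes_what`, and the primary-key declarations are assumptions made for illustration. The "corresponding keys" are expressed as foreign keys.

```python
import sqlite3

def build_cms(conn):
    """Create the three relations from the slide and load the example rows."""
    # SQLite does not enforce foreign keys unless asked to.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE Students (sid TEXT PRIMARY KEY, name TEXT, gpa REAL);
        CREATE TABLE Courses  (cid TEXT PRIMARY KEY, cname TEXT, credits INTEGER);
        CREATE TABLE Enrolled (
            sid   TEXT REFERENCES Students(sid),  -- corresponding key
            cid   TEXT REFERENCES Courses(cid),   -- corresponding key
            grade TEXT,
            PRIMARY KEY (sid, cid)  -- assumption: one grade per student and course
        );
    """)
    conn.executemany("INSERT INTO Students VALUES (?, ?, ?)",
                     [("101", "Bob", 3.2), ("123", "Mary", 3.8)])
    conn.executemany("INSERT INTO Courses VALUES (?, ?, ?)",
                     [("564", "564-2", 4), ("308", "417", 2)])
    conn.execute("INSERT INTO Enrolled VALUES (?, ?, ?)", ("123", "564", "A"))
    conn.commit()

def who_takes_what(conn):
    """Follow the corresponding keys from Enrolled to Students and Courses."""
    return conn.execute("""
        SELECT s.name, c.cname, e.grade
        FROM Enrolled e
        JOIN Students s ON s.sid = e.sid
        JOIN Courses  c ON c.cid = e.cid
    """).fetchall()

conn = sqlite3.connect(":memory:")
build_cms(conn)
print(who_takes_what(conn))
```

With the rows from the slide, the only enrollment is Mary in course 564 with grade A. Trying to enroll a student id that does not exist in Students is rejected by the foreign key.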
43
Other Schemata…
• External Schema: (Views)
  - Course_info(cid: string, enrollment: integer)
  - Derived from other tables
• Logical Schema: Previous slide
• Physical Schema: describes data layout
  - Relations as unordered files
  - Some data in sorted order (index)
Administrators
Applications
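
One way to picture an external schema is as a view derived from the logical tables. The sketch below is an illustration under assumptions, not the course's definition: it uses SQLite through Python's sqlite3 module, and it assumes that `enrollment` means the number of Enrolled rows per course.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Courses  (cid TEXT PRIMARY KEY, cname TEXT, credits INTEGER);
    CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);
    INSERT INTO Courses  VALUES ('564', '564-2', 4), ('308', '417', 2);
    INSERT INTO Enrolled VALUES ('123', '564', 'A');

    -- External schema: Course_info(cid, enrollment), derived from other tables.
    -- Assumption: enrollment = number of students enrolled in the course.
    CREATE VIEW Course_info AS
        SELECT c.cid AS cid, COUNT(e.sid) AS enrollment
        FROM Courses c LEFT JOIN Enrolled e ON e.cid = c.cid
        GROUP BY c.cid;
""")

# An application queries the view like any other table.
rows = conn.execute(
    "SELECT cid, enrollment FROM Course_info ORDER BY cid").fetchall()
print(rows)
```

The application sees only `Course_info`; how Courses and Enrolled are laid out underneath is hidden from it.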
44
Data independence
• Concept: Applications do not need to worry about how the data is structured and stored
Logical data independence: protection from changes in the logical structure of the data
I.e. should not need to ask: can we add a new entity or attribute without rewriting the application?
Physical data independence: protection from physical layout changes
I.e. should not need to ask: which disks are the data stored on? Is the data indexed?
One of the most important reasons to use a DBMS
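
A small sketch of both kinds of independence, assuming SQLite via Python's sqlite3 module. The index name `idx_students_gpa` and the `email` column are hypothetical additions. The same application query is run unchanged before and after a physical change (adding an index) and a logical change (adding an attribute).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (sid TEXT, name TEXT, gpa REAL)")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?)",
                 [("101", "Bob", 3.2), ("123", "Mary", 3.8)])

# The application's query: it never mentions files, disks, or indexes.
QUERY = "SELECT name FROM Students WHERE gpa > 3.5 ORDER BY name"

before = conn.execute(QUERY).fetchall()

# Physical change: add an index (hypothetical name).
conn.execute("CREATE INDEX idx_students_gpa ON Students(gpa)")
after_index = conn.execute(QUERY).fetchall()

# Logical change: add a new attribute (hypothetical column).
conn.execute("ALTER TABLE Students ADD COLUMN email TEXT")
after_column = conn.execute(QUERY).fetchall()

print(before, after_index, after_column)
```

The query text and its answer stay the same across both changes; only the DBMS's internal plan may differ.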
45
Today's Lecture
1. Introduction, admin & setup
2. Overview of the relational data model
3. Overview of DBMS topics: Key concepts & challenges
46
What you will learn about in this section
1. Transactions
2. Concurrency & locking
3. Atomicity & logging
4. Summary
47
Challenges with Many Users
• Suppose that our CMS application serves 1000's of users or more - what are some challenges?
• Security: Different users, different roles
  - We won't look at this too much in this course, but it is extremely important
• Performance: Need to provide concurrent access
  - Disk/SSD access is slow; the DBMS hides the latency by doing more CPU work concurrently
• Consistency: Concurrency can lead to update problems
  - The DBMS allows users to write programs as if they were the only user
48
Transactions
• A key concept is the transaction (TXN): an atomic sequence of db actions (reads/writes)
Atomicity: An action either completes entirely or not at all

Transfer $3k from a10 to a20:
1. Debit $3k from a10
2. Credit $3k to a20

Before:
Acct   Balance
a10    20,000
a20    15,000

After:
Acct   Balance
a10    17,000
a20    18,000

Written naively, in which states is atomicity preserved?
• Crash before 1,
• After 1 but before 2,
• After 2.
DB always preserves atomicity!
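
The transfer example can be sketched with a real transaction. This is a minimal illustration using SQLite through Python's sqlite3 module. The table name `Accounts`, the helper names, and the `crash_after_debit` flag (which simulates a failure between steps 1 and 2 by raising an exception) are assumptions made for the example.

```python
import sqlite3

def make_bank():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Accounts (acct TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO Accounts VALUES (?, ?)",
                     [("a10", 20000), ("a20", 15000)])
    conn.commit()
    return conn

def transfer(conn, src, dst, amount, crash_after_debit=False):
    """Debit src and credit dst as one transaction."""
    try:
        # 'with conn' commits if the block succeeds and rolls back on an exception.
        with conn:
            conn.execute("UPDATE Accounts SET balance = balance - ? WHERE acct = ?",
                         (amount, src))
            if crash_after_debit:
                raise RuntimeError("simulated crash after step 1, before step 2")
            conn.execute("UPDATE Accounts SET balance = balance + ? WHERE acct = ?",
                         (amount, dst))
    except RuntimeError:
        pass  # the debit has already been rolled back

def balances(conn):
    return dict(conn.execute("SELECT acct, balance FROM Accounts"))

conn = make_bank()
transfer(conn, "a10", "a20", 3000, crash_after_debit=True)
after_crash = balances(conn)    # unchanged: the debit was undone
transfer(conn, "a10", "a20", 3000)
after_success = balances(conn)  # both steps applied
print(after_crash, after_success)
```

The interesting case is the middle one from the slide: after the simulated crash between debit and credit, both balances are back at their starting values rather than $3k having vanished.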
49
Transactions
• A key concept is the transaction (TXN): an atomic sequence of db actions (reads/writes)
  - If a user cancels a TXN, it should be as if nothing happened!
• Transactions leave the DB in a consistent state
  - Users may write integrity constraints, e.g., 'each course is assigned to exactly one room'
Atomicity: An action either completes entirely or not at all
Consistency: An action results in a state which conforms to all integrity constraints
However, note that the DBMS does not understand the real meaning of the constraints; the consistency burden is still on the user!
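
A rough sketch of how such an integrity constraint might be declared, again assuming SQLite via Python's sqlite3 module. The table `CourseRoom`, the helper `try_insert`, and the room names are hypothetical. The primary key allows at most one room row per course and NOT NULL requires that row to name a room, which only approximates 'exactly one room' (a course with no row at all is not caught).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE CourseRoom (
        cid  TEXT PRIMARY KEY,  -- at most one row (room) per course
        room TEXT NOT NULL      -- and that row must name a room
    )
""")
conn.execute("INSERT INTO CourseRoom VALUES ('564', 'WVH 108')")
conn.commit()

def try_insert(cid, room):
    """Return True if the DBMS accepts the row, False if a constraint rejects it."""
    try:
        with conn:
            conn.execute("INSERT INTO CourseRoom VALUES (?, ?)", (cid, room))
        return True
    except sqlite3.IntegrityError:
        return False

second_room = try_insert("564", "WVH 210")       # rejected: 564 already has a room
missing_room = try_insert("308", None)           # rejected: room must not be NULL
nonsense_room = try_insert("308", "no such room")  # accepted!
print(second_room, missing_room, nonsense_room)
```

The last insert is accepted even though the room does not exist: the DBMS checks the declared constraints, not their real-world meaning.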
50
Challenge: Scheduling Concurrent Transactions
• The DBMS ensures that the execution of {T1, …, Tn} is equivalent to some serial execution
A set of TXNs is isolated if their effect is as if all were executed serially
• One way to accomplish this: Locking
  - Before reading or writing, a transaction requires a lock from the DBMS and holds it until the end
• Key Idea: If Ti wants to write to an item x and Tj wants to read x, then Ti and Tj conflict. Solution via locking:
  - only one winner gets the lock
  - loser is blocked (waits) until winner finishes
What if Ti and Tj need X and Y, and Ti asks for X before Tj, and Tj asks for Y before Ti? -> Deadlock! One is aborted…
All concurrency issues handled by the DBMS…
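
The deadlock scenario can be traced with a toy lock table. This is a deterministic, single-threaded simulation written for illustration, not how any particular DBMS implements locking: the class `LockTable` and its methods are hypothetical, and it uses exclusive locks only and detects deadlock by following a "waits for" chain.

```python
class LockTable:
    """Toy lock manager: exclusive locks held until the transaction ends."""

    def __init__(self):
        self.owner = {}      # item -> transaction holding its lock
        self.waits_for = {}  # blocked transaction -> transaction it waits on

    def acquire(self, txn, item):
        holder = self.owner.get(item)
        if holder is None or holder == txn:
            self.owner[item] = txn
            return "granted"
        # Someone else holds the lock: txn would have to wait for the holder.
        self.waits_for[txn] = holder
        if self._waits_in_cycle(txn):
            del self.waits_for[txn]
            return "deadlock"
        return "blocked"

    def _waits_in_cycle(self, start):
        """Follow the waits-for chain; a path back to start means deadlock."""
        seen = set()
        cur = self.waits_for.get(start)
        while cur is not None and cur not in seen:
            if cur == start:
                return True
            seen.add(cur)
            cur = self.waits_for.get(cur)
        return False

    def release_all(self, txn):
        """Called when txn commits or is aborted."""
        for item in [i for i, t in self.owner.items() if t == txn]:
            del self.owner[item]
        self.waits_for.pop(txn, None)
        for waiter in [w for w, h in self.waits_for.items() if h == txn]:
            del self.waits_for[waiter]

locks = LockTable()
steps = [
    locks.acquire("Ti", "X"),  # Ti asks for X before Tj
    locks.acquire("Tj", "Y"),  # Tj asks for Y before Ti
    locks.acquire("Ti", "Y"),  # Ti must wait for Tj
    locks.acquire("Tj", "X"),  # Tj would wait for Ti, which waits for Tj
]
locks.release_all("Tj")                    # one transaction is aborted
retry = locks.acquire("Ti", "Y")           # Ti can now proceed
print(steps, retry)
```

Aborting Tj releases Y, so Ti's request succeeds on retry. Real systems differ in how they pick the victim and in supporting shared (read) locks.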
51
Ensuring Atomicity & Durability
• DBMS ensures atomicity even if a TXN crashes!
• One way to accomplish this: Write-ahead logging (WAL)
• Key Idea: Keep a log of all the writes done.
  - After a crash, the partially executed TXNs are undone using the log
Write-ahead Logging (WAL): Before any action is finalized, a corresponding log entry is forced to disk
We assume that the log is on "stable" storage
All atomicity issues also handled by the DBMS…
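
The undo idea can be sketched in a few lines. This toy model is an assumption-laden illustration, not a real recovery algorithm: the class `ToyDB` keeps "disk" data in a dict and the "stable" log in a list, every write logs the old value before changing the data, and `recover` undoes the writes of transactions that never committed.

```python
class ToyDB:
    """Toy write-ahead log with undo-only recovery."""

    def __init__(self, data):
        self.data = dict(data)  # stands in for data pages on disk
        self.log = []           # stands in for the log on stable storage

    def write(self, txn, key, value):
        # Write-ahead: record the old value in the log first ...
        self.log.append(("update", txn, key, self.data[key]))
        # ... and only then change the data.
        self.data[key] = value

    def commit(self, txn):
        self.log.append(("commit", txn))

    def recover(self):
        """After a crash, undo writes of TXNs that neither committed nor aborted."""
        finished = {rec[1] for rec in self.log if rec[0] in ("commit", "abort")}
        undone = set()
        for rec in reversed(list(self.log)):  # newest write first
            if rec[0] == "update" and rec[1] not in finished:
                _, txn, key, old_value = rec
                self.data[key] = old_value
                undone.add(txn)
        for txn in sorted(undone):
            self.log.append(("abort", txn))   # so a later recovery skips them

db = ToyDB({"a10": 20000, "a20": 15000})

db.write("T1", "a10", 17000)  # debit done, then the system crashes
db.recover()
after_crash = dict(db.data)

db.write("T2", "a10", 17000)
db.write("T2", "a20", 18000)
db.commit("T2")
db.recover()                  # a committed TXN is left alone
after_commit = dict(db.data)
print(after_crash, after_commit)
```

Real systems also redo committed work that had not reached disk yet and must force log records to disk in the right order; the sketch covers only the undo half described on the slide.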
52
A Well-Designed DBMS makes many people happy!
• End users and DBMS vendors
  - Reduces cost and makes money
• DB application programmers
  - Can handle more users, faster, for cheaper, and with better reliability/security guarantees!
• Database administrators (DBA)
  - Easier time of designing logical/physical schema, handling security/authorization, tuning, crash recovery, and more…
Must still understand DB internals
53
Summary of DBMS
• DBMS are used to maintain, query, and manage large datasets.
  - Provide concurrency, recovery from crashes, quick application development, integrity, and security
• Key abstractions give data independence
• DBMS R&D is one of the broadest, most exciting fields in CS. Fact!