CS639: Data Management for Data Science
Lecture 3: Principles of Data Management
Theodoros Rekatsinas

Announcements
• Mix-up with due dates :( It should be fixed now.
• No changes to the midterm
• Updates and hints on PA1 assignment on Piazza
• Questions?

Today’s Lecture
1. Data Management
2. Data Models
3. RDBMSs and the Relational Data Model

1. Data Management

Data Management
• Data represents the traces of real-world processes.
• Data is valuable but hard and costly to manage
  • Storage, representation complexity, collection
• Data management seeks to answer two questions:
  • What operations do we want to perform on this data?
  • What functionality do we need to manage this data?

Required Functionality
• Describe real-world entities in terms of stored data
• Create & persistently store large datasets
• Efficiently query & update
  • Must handle complex questions about the data
  • Must handle sophisticated updates
  • Performance matters
• Change structure (e.g., add attributes)
• Concurrency control: enable simultaneous queries, updates, etc.
• Crash recovery
• Access control, security, integrity
It is difficult and costly to implement all these features!

Systems providing data management features
• Relational database management systems
• HDFS-based systems (e.g., Hadoop)
• Stream management systems: Apache Kafka
• Others?

2. Data Models

What you will learn about in this section
1. Types of Data
2. Data Models

Data is highly heterogeneous
• Structured data
• Semi-structured data
• Unstructured data
Increasing amounts of data

Structured data
• Information with a high degree of organization
• All data conforms to a schema. Ex: business data
• Easy to query, search over, aggregate
• Example: tables in a database, tables in Excel, etc.

Semi-structured data
• Some structure in the data but implicit and irregular
• It contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data
• Example: JSON, HTML, XML
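
As a rough illustration of "implicit and irregular" structure, the sketch below parses two JSON records whose fields differ. The records and field names are made up for this example and do not come from the lecture.

```python
import json

# Two hypothetical records: tags mark the fields, but no schema forces
# both records to carry the same set of fields.
raw = '[{"name": "Bob", "courses": ["564"]}, {"name": "Mary", "gpa": 3.8}]'
records = json.loads(raw)

# Collect the field names each record actually uses.
fields_per_record = [sorted(r.keys()) for r in records]

# A missing field has to be handled by whoever reads the data.
gpas = [r.get("gpa") for r in records]
```

Here the first record has no "gpa" field at all, so the reader gets None for it, which is the kind of irregularity a fixed schema would rule out.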

Unstructured data
• Information that either does not have a pre-defined structure or is not organized in a pre-defined manner.
• Text, video, images, etc.
• Abundant and extremely valuable. Hard to query, aggregate, analyze, search.

Data Model
• A data model is a collection of concepts for describing data
• A schema is a description of a particular collection of data, using the given data model
• A data model enables users to define the data using high-level constructs without worrying about many low-level details of how data will be stored on disk.

Levels of abstraction

Data models
• Relational (most database management systems)
• Key/Value, Graph, Document, Column-family (NoSQL)
• Array/Matrix (machine learning, scientific applications)
• Hierarchical, Network (obsolete/rare)

3. RDBMSs and the Relational Data Model

What you will learn about in this section
1. Definition of DBMS
2. Data models & the relational data model
3. Schemas & data independence

What is a DBMS?
• A database is a large, integrated collection of data
• It models a real-world enterprise
  • Entities (e.g., Students, Courses)
  • Relationships (e.g., Alice is enrolled in CS564)
A Database Management System (DBMS) is a piece of software designed to store and manage databases

A Motivating, Running Example
• Consider building a course management system (CMS):
  • Entities
    • Students
    • Courses
    • Professors
  • Relationships
    • Who takes what
    • Who teaches what

Data models
• A data model is a collection of concepts for describing data
  • The relational model of data is the most widely used model today
  • Main Concept: the relation - essentially, a table
• A schema is a description of a particular collection of data, using the given data model
  • E.g. every relation in a relational data model has a schema describing types, etc.

Modeling the Course Management System
• Logical Schema
  • Students(sid: string, name: string, gpa: float)
  • Courses(cid: string, cname: string, credits: int)
  • Enrolled(sid: string, cid: string, grade: string)

Relations

Students
sid   name   gpa
101   Bob    3.2
123   Mary   3.8

Courses
cid   cname   credits
564   564-2   4
308   417     2

Enrolled
sid   cid   grade
123   564   A

Corresponding keys: sid 123 in Enrolled matches sid 123 in Students, and cid 564 in Enrolled matches cid 564 in Courses.
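
A minimal sketch of this logical schema, assuming SQLite (through Python's built-in sqlite3 module) as the DBMS. The mapping of string/float/int to TEXT/REAL/INTEGER is an assumption of this example, not something the lecture prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (sid TEXT, name TEXT, gpa REAL);
CREATE TABLE Courses  (cid TEXT, cname TEXT, credits INTEGER);
CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);
INSERT INTO Students VALUES ('101', 'Bob', 3.2), ('123', 'Mary', 3.8);
INSERT INTO Courses  VALUES ('564', '564-2', 4), ('308', '417', 2);
INSERT INTO Enrolled VALUES ('123', '564', 'A');
""")

# Follow the corresponding keys:
# Enrolled.sid -> Students.sid and Enrolled.cid -> Courses.cid
rows = conn.execute("""
    SELECT s.name, c.cname, e.grade
    FROM Enrolled e
    JOIN Students s ON s.sid = e.sid
    JOIN Courses  c ON c.cid = e.cid
""").fetchall()
```

The join returns one row per enrollment, combining the student's name and the course name with the grade.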

Other Schemata…
• Physical Schema: describes data layout
  • Relations as unordered files
  • Some data in sorted order (index)
• Logical Schema: Previous slide
• External Schema: (Views)
  • Course_info(cid: string, enrollment: integer)
  • Derived from other tables
Administrators work with the physical schema; applications work with the logical and external schemas.
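
One way the Course_info view could be derived, sketched with SQLite. The view definition (a count of enrollments per course) and the extra enrollment rows are assumptions made for illustration; the slide only gives the view's name and attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);
-- Illustrative rows; only ('123', '564', 'A') appears in the lecture tables.
INSERT INTO Enrolled VALUES ('123', '564', 'A'), ('101', '564', 'B'), ('123', '537', 'A+');

-- External schema: a view derived from the Enrolled table.
CREATE VIEW Course_info AS
    SELECT cid, COUNT(*) AS enrollment FROM Enrolled GROUP BY cid;
""")

course_info = conn.execute(
    "SELECT cid, enrollment FROM Course_info ORDER BY cid"
).fetchall()
```

An application can query Course_info like any other table without knowing how Enrolled is laid out.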

Data independence
Concept: Applications do not need to worry about how the data is structured and stored
• Logical data independence: protection from changes in the logical structure of the data
  • I.e. should not need to ask: can we add a new entity or attribute without rewriting the application?
• Physical data independence: protection from physical layout changes
  • I.e. should not need to ask: which disks are the data stored on? Is the data indexed?
One of the most important reasons to use a DBMS
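
A small sketch of both kinds of independence, assuming SQLite. The index name idx_students_sid and the added email column are hypothetical; the point is only that the application's query text stays the same across both changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (sid TEXT, name TEXT, gpa REAL)")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?)",
    [("101", "Bob", 3.2), ("123", "Mary", 3.8)],
)

# The application's query, written once.
query = "SELECT name FROM Students WHERE sid = '123'"
before = conn.execute(query).fetchall()

# Physical change: add an index. The query text does not change.
conn.execute("CREATE INDEX idx_students_sid ON Students(sid)")
after_index = conn.execute(query).fetchall()

# Logical change: add an attribute. The query still runs unchanged.
conn.execute("ALTER TABLE Students ADD COLUMN email TEXT")
after_column = conn.execute(query).fetchall()
```

All three results are identical, even though the physical layout and the logical structure both changed underneath the query.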

Relational Model
• Structure: The definition of relations and their contents.
• Integrity: Ensure the database’s contents satisfy constraints.
• Manipulation: How to access and modify a database’s contents.

Tables in the Relational Model

Product
PName         Price     Manufacturer
Gizmo         $19.99    GizmoWorks
Powergizmo    $29.99    GizmoWorks
SingleTouch   $149.99   Canon
MultiTouch    $203.99   Hitachi

A relation or table is a multiset of tuples having the attributes specified by the schema
Let’s break this definition down

A multiset is an unordered list (or: a set with multiple duplicate instances allowed)
List: [1, 1, 2, 3]
Set: {1, 2, 3}
Multiset: {1, 1, 2, 3}
i.e. no next(), etc. methods!
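
The list/set/multiset distinction can be sketched in Python, using collections.Counter as a stand-in for a multiset. That choice is an assumption of this example; the lecture does not tie the concept to any particular data structure.

```python
from collections import Counter

as_list = [1, 1, 2, 3]          # ordered, duplicates kept
as_set = set(as_list)           # unordered, duplicates dropped
as_multiset = Counter(as_list)  # unordered, duplicates counted

# Order does not matter for a multiset, but multiplicity does.
same_multiset = Counter([3, 1, 2, 1]) == as_multiset
different_list = [3, 1, 2, 1] != as_list
```

Two multisets with the same elements and counts compare equal regardless of order, while the corresponding lists do not.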

An attribute (or column) is a typed data entry present in each tuple in the relation
Attributes must have an atomic type, i.e. not a list, set, etc.

A tuple or row is a single entry in the table having the attributes specified by the schema
Also referred to sometimes as a record

The number of tuples is the cardinality of the relation
The number of attributes is the arity of the relation
n-ary Relation = Table with n columns
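
For the Product table, cardinality and arity can be read off directly. The sketch below stores the relation as a plain Python list of tuples, which is only a convenient stand-in for illustration.

```python
# Schema attributes and tuples of the Product relation from the slides.
attributes = ("PName", "Price", "Manufacturer")
product = [
    ("Gizmo", 19.99, "GizmoWorks"),
    ("Powergizmo", 29.99, "GizmoWorks"),
    ("SingleTouch", 149.99, "Canon"),
    ("MultiTouch", 203.99, "Hitachi"),
]

cardinality = len(product)  # number of tuples (rows)
arity = len(attributes)     # number of attributes (columns)
```

Product is therefore a 3-ary relation with four tuples.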

Data Types in Relational Model
• Atomic types:
  • Characters: CHAR(20), VARCHAR(50)
  • Numbers: INT, BIGINT, SMALLINT, FLOAT
  • Others: MONEY, DATETIME, …
• Every attribute must have an atomic type
  • Hence tables are flat

Table Schemas
• The schema of a table is the table name, its attributes, and their types:
  Product(Pname: string, Price: float, Category: string, Manufacturer: string)
• A key is an attribute whose values are unique; we underline a key
  Product(Pname: string, Price: float, Category: string, Manufacturer: string)

Key constraints
A key is a minimal subset of attributes that acts as a unique identifier for tuples in a relation
• A key is an implicit constraint on which tuples can be in the relation
  • i.e. if two tuples agree on the values of the key, then they must be the same tuple!
Students(sid: string, name: string, gpa: float)
1. Which would you select as a key?
2. Is a key always guaranteed to exist?
3. Can we have more than one key?
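
A sketch of a key constraint being enforced, assuming SQLite and assuming sid is chosen as the key of Students (the slide leaves that choice as a question). The second student row is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: sid is declared as the key of Students.
conn.execute("CREATE TABLE Students (sid TEXT PRIMARY KEY, name TEXT, gpa REAL)")
conn.execute("INSERT INTO Students VALUES ('101', 'Bob', 3.2)")

# A second tuple that agrees on the key value is rejected by the DBMS.
try:
    conn.execute("INSERT INTO Students VALUES ('101', 'Robert', 3.5)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

student_count = conn.execute("SELECT COUNT(*) FROM Students").fetchone()[0]
```

The table still holds a single tuple after the failed insert, since two tuples may not share a key value.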

NULL and NOT NULL
• To say “don’t know the value” we use NULL
  • NULL has (sometimes painful) semantics, more details later

Students(sid: string, name: string, gpa: float)
sid   name   gpa
123   Bob    3.9
143   Jim    NULL

Say, Jim just enrolled in his first class.
We may constrain a column to be NOT NULL, e.g., “name” in this table
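
A brief sketch of why NULL semantics can be painful, assuming SQLite. The 3.0 threshold and the rejected row with sid '150' are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (sid TEXT, name TEXT NOT NULL, gpa REAL)")
conn.execute("INSERT INTO Students VALUES ('123', 'Bob', 3.9)")
conn.execute("INSERT INTO Students VALUES ('143', 'Jim', NULL)")

# Jim's unknown gpa satisfies neither comparison, so he appears in neither result.
high = conn.execute("SELECT name FROM Students WHERE gpa >= 3.0").fetchall()
not_high = conn.execute("SELECT name FROM Students WHERE gpa < 3.0").fetchall()

# NULL has to be tested for explicitly.
unknown = conn.execute("SELECT name FROM Students WHERE gpa IS NULL").fetchall()

# The NOT NULL constraint on name rejects a missing name.
try:
    conn.execute("INSERT INTO Students VALUES ('150', NULL, 3.0)")
    null_name_rejected = False
except sqlite3.IntegrityError:
    null_name_rejected = True
```

Note that the two gpa conditions together do not cover every student, which is one of the surprises NULL brings.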

Foreign Key constraints
• A foreign key specifies that an attribute from one relation has to map to a tuple in another relation.
• Suppose we have the following schema:
  Students(sid: string, name: string, gpa: float)
  Enrolled(student_id: string, cid: string, grade: string)
• And we want to impose the following constraint:
  • ‘Only real students may enroll in courses’ i.e. a student must appear in the Students table to enroll in a class

Students
sid   name   gpa
101   Bob    3.2
123   Mary   3.8

Enrolled
student_id   cid   grade
123          564   A
123          537   A+

We say that student_id is a foreign key that refers to Students
student_id alone is not a key - what is?
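
The constraint above can be sketched in SQLite, which only enforces foreign keys once PRAGMA foreign_keys is switched on. Declaring (student_id, cid) as the key of Enrolled is one plausible answer to the question on the slide, and is an assumption of this example; the student id '999' is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript("""
CREATE TABLE Students (sid TEXT PRIMARY KEY, name TEXT, gpa REAL);
CREATE TABLE Enrolled (
    student_id TEXT REFERENCES Students(sid),
    cid TEXT,
    grade TEXT,
    PRIMARY KEY (student_id, cid)  -- assumed key: one grade per student per course
);
INSERT INTO Students VALUES ('101', 'Bob', 3.2), ('123', 'Mary', 3.8);
INSERT INTO Enrolled VALUES ('123', '564', 'A'), ('123', '537', 'A+');
""")

# '999' does not appear in Students, so this enrollment is rejected.
try:
    conn.execute("INSERT INTO Enrolled VALUES ('999', '564', 'B')")
    ghost_rejected = False
except sqlite3.IntegrityError:
    ghost_rejected = True

enrolled_count = conn.execute("SELECT COUNT(*) FROM Enrolled").fetchone()[0]
```

Only the two enrollments of student 123 remain, so 'only real students may enroll in courses' holds.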

Summary of Schema Information
• Schema and Constraints are how databases understand the semantics (meaning) of data
• They are also useful for optimization

DATA MANIPULATION LANGUAGES (DML)
• How to store and retrieve information from a database.
• Procedural: The query specifies the (high-level) strategy the DBMS should use to find the desired result.
• We will see SQL and Relational Algebra
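
As a preview, the same question ("which products does GizmoWorks make?") can be written in a relational-algebra style and in SQL. The select and project helpers below are hypothetical Python stand-ins for the algebra operators, and SQLite is assumed for the SQL side.

```python
import sqlite3

product = [
    ("Gizmo", 19.99, "GizmoWorks"),
    ("Powergizmo", 29.99, "GizmoWorks"),
    ("SingleTouch", 149.99, "Canon"),
    ("MultiTouch", 203.99, "Hitachi"),
]

def select(rows, predicate):
    """Keep only the tuples that satisfy the predicate."""
    return [r for r in rows if predicate(r)]

def project(rows, *indexes):
    """Keep only the attributes at the given positions."""
    return [tuple(r[i] for i in indexes) for r in rows]

# Relational-algebra style: the steps are spelled out (select, then project).
algebra_result = project(select(product, lambda r: r[2] == "GizmoWorks"), 0)

# SQL: state the result wanted and let the DBMS choose how to compute it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Product (PName TEXT, Price REAL, Manufacturer TEXT)")
conn.executemany("INSERT INTO Product VALUES (?, ?, ?)", product)
sql_result = conn.execute(
    "SELECT PName FROM Product WHERE Manufacturer = 'GizmoWorks'"
).fetchall()
```

Both formulations return the same set of product names.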