CompSci 516 Data Intensive Computing Systems
Lecture 7: Storage and Index
Instructor: Sudeepa Roy
Duke CS, Fall 2017 CompSci 516: Database Systems
Announcements
• HW1 deadline this week:
– Due on 09/21 (Thurs), 11:55 pm, no late days
• Project proposal deadline:
– Preliminary idea and team members due by tonight 09/18 (Mon), 11:55 pm by email to the instructor
– Proposal due on Sakai by 09/25 (Mon), 11:55 pm
• Everyone should be in a group now
– otherwise let the instructor know asap
Reading Material
• [RG]
– Storage: Chapters 8.1, 8.2, 8.4, 9.4-9.7
– Index: 8.3, 8.5
– Tree-based index: Chapter 10.1-10.7
– Hash-based index: Chapter 11
Additional reading
• [GUW]
– Chapters 8.3, 14.1-14.4
Acknowledgement: The following slides have been created adapting the instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke.
Storage (contd. from Lecture 6)
Recap
• Typical DBMS hierarchy
• Disk and main memory/buffer pool
• Unit = page or block
– page replacement strategies
– dirty bit
– pin
Today
• How are pages stored in a file?
• How are records stored in a page?
– Fixed length records
– Variable length records
• How are fields stored in a record?
– Fixed length fields/records
– Variable length fields/records
Files of Records
• Page or block is OK when doing I/O, but higher levels of DBMS operate on records, and files of records
• FILE: A collection of pages, each containing a collection of records
• Must support:
– insert/delete/modify record
– read a particular record (specified using record id)
– scan all records (possibly with some conditions on the records to be retrieved)
File Organization
• File organization: Method of arranging a file of records on external storage
– One file can have multiple pages
– Record id (rid) is sufficient to physically locate the page containing the record on disk
– Indexes are data structures that allow us to find the record ids of records with given values in index search key fields
• NOTE: Several uses of “keys” in a database
– Primary/foreign/candidate/super keys
– Index search keys
Alternative File Organizations
Many alternatives exist, each ideal for some situations, and not so good in others:
• Heap (random order) files: Suitable when typical access is a file scan retrieving all records
• Sorted Files: Best if records must be retrieved in some order, or only a “range” of records is needed.
• Indexes: Data structures to organize records via trees or hashing
– Like sorted files, they speed up searches for a subset of records, based on values in certain (“search key”) fields
– Updates are much faster than in sorted files
Unordered (Heap) Files
• Simplest file structure contains records in no particular order
• As file grows and shrinks, disk pages are allocated and de-allocated
• To support record level operations, we must:
– keep track of the pages in a file
– keep track of free space on pages
– keep track of the records on a page
• There are many alternatives for keeping track of this
Heap File Implemented as a List
• The header page id and Heap file name must be stored someplace
• Each page contains 2 `pointers’ plus data
• Problem?
– to insert a new record, we may need to scan several pages on the free list to find one with sufficient space
[Figure: Header Page linked to two lists of Data Pages: Full Pages, and Pages with Free Space]
Heap File Using a Page Directory
• The entry for a page can include the number of free bytes on the page.
• The directory is a collection of pages
– linked list implementation of directory is just one alternative
– Much smaller than linked list of all heap file pages!
[Figure: DIRECTORY starting at the Header Page, with entries pointing to Data Page 1, Data Page 2, ..., Data Page N]
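A page directory can be sketched in a few lines. This is only an illustration under stated assumptions: the 4096-byte page size, the class names, and the first-fit choice are hypothetical and not from the slides; only the idea of storing the number of free bytes per page is.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed page size in bytes (not from the slides)


@dataclass
class DirectoryEntry:
    page_id: int
    free_bytes: int  # free bytes on the page, as the directory entry may record


@dataclass
class PageDirectory:
    entries: list = field(default_factory=list)

    def add_page(self) -> int:
        """Allocate a new, empty data page and register it in the directory."""
        page_id = len(self.entries)
        self.entries.append(DirectoryEntry(page_id, PAGE_SIZE))
        return page_id

    def find_page_for(self, record_size: int) -> int:
        """Return the first page with enough free space, allocating one if none fits."""
        if record_size > PAGE_SIZE:
            raise ValueError("record larger than a page")
        for entry in self.entries:  # scans directory entries, not data pages
            if entry.free_bytes >= record_size:
                return entry.page_id
        return self.add_page()

    def insert(self, record_size: int) -> int:
        """Reserve space for a record and return the id of the page chosen."""
        page_id = self.find_page_for(record_size)
        self.entries[page_id].free_bytes -= record_size
        return page_id
```

The point of the design is that an insert reads only the small directory, instead of walking a list of data pages to find one with room.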
How do we arrange a collection of records on a page?
• Each page contains several slots
– one for each record
• Record is identified by <page-id, slot-number>
• Fixed-Length Records
• Variable-Length Records
• For both, there are options for
– Record formats (how to organize the fields within a record)
– Page formats (how to organize the records within a page)
Page Formats: Fixed Length Records
• Record id = <page id, slot #>
• Packed: moving records for free space management changes rid; may not be acceptable
• Unpacked: use a bitmap
– scan the bit array to find an empty slot
• Each page also may contain additional info like the id of the next page (not shown)
[Figure: two page layouts. PACKED: Slot 1, Slot 2, ..., Slot N followed by Free Space, plus N = number of records. UNPACKED, BITMAP: Slot 1, Slot 2, ..., Slot M, plus a bitmap with one bit per slot and M = number of slots.]
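The unpacked format can be sketched as follows. The class and method names are hypothetical, the bitmap is modeled as a Python list of booleans rather than packed bits, and slots are numbered from 0 here (the figure numbers them from 1).

```python
class BitmapPage:
    """Unpacked page of fixed-length records; a bitmap marks occupied slots."""

    def __init__(self, num_slots: int, record_size: int):
        self.record_size = record_size
        self.occupied = [False] * num_slots            # one "bit" per slot
        self.data = bytearray(num_slots * record_size)

    def insert(self, record: bytes) -> int:
        """Scan the bit array for an empty slot, store the record, return the slot."""
        if len(record) != self.record_size:
            raise ValueError("record has the wrong length")
        for slot, used in enumerate(self.occupied):
            if not used:
                start = slot * self.record_size
                self.data[start:start + self.record_size] = record
                self.occupied[slot] = True
                return slot
        raise RuntimeError("page is full")

    def delete(self, slot: int) -> None:
        """Clear the bit; no record moves, so other rids stay valid."""
        self.occupied[slot] = False

    def read(self, slot: int) -> bytes:
        if not self.occupied[slot]:
            raise KeyError(slot)
        start = slot * self.record_size
        return bytes(self.data[start:start + self.record_size])
```

Because a delete only clears a bit, the rid <page id, slot #> of every other record is unchanged, which is the property the packed format gives up.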
Page Formats: Variable Length Records
• Need to find a page with the right amount of space
– Too small: cannot insert
– Too large: waste of space
• If a record is deleted, need to move the records so that all free space is contiguous
– need ability to move records within a page
• Can maintain a directory of slots (next slide)
– Slot contains <record-offset, record-length>
– deletion = set record-offset to -1
• Record-id rid = <page, slot-in-directory> remains unchanged
Page Formats: Variable Length Records
• Can move records on page without changing rid
– so, attractive for fixed-length records too
• Store (record-offset, record-length) in each slot
• rids unaffected by rearranging records in a page
[Figure: Page i holding records with Rid = (i,1), Rid = (i,2), ..., Rid = (i,N). The SLOT DIRECTORY has entries for slots N, ..., 2, 1 (example values 20, 16, 24), the number of slots (N), and a pointer to the start of free space.]
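A slotted page along these lines might look like the sketch below. It makes simplifying assumptions that are not in the slides: the slot directory is kept as a Python list rather than inside the page bytes, directory space is not counted against the page size, and all names are hypothetical.

```python
class SlottedPage:
    """Variable-length records addressed through a slot directory."""

    def __init__(self, size: int = 4096):
        self.data = bytearray(size)
        self.free_start = 0   # pointer to the start of free space
        self.slots = []       # slot directory: [record-offset, record-length]

    def insert(self, record: bytes) -> int:
        if self.free_start + len(record) > len(self.data):
            raise RuntimeError("not enough contiguous free space")
        offset = self.free_start
        self.data[offset:offset + len(record)] = record
        self.free_start += len(record)
        # Reuse the slot of a deleted record if one exists, else add a slot.
        for slot, (off, _length) in enumerate(self.slots):
            if off == -1:
                self.slots[slot] = [offset, len(record)]
                return slot
        self.slots.append([offset, len(record)])
        return len(self.slots) - 1

    def delete(self, slot: int) -> None:
        self.slots[slot][0] = -1   # deletion = set record-offset to -1

    def read(self, slot: int) -> bytes:
        offset, length = self.slots[slot]
        if offset == -1:
            raise KeyError(slot)
        return bytes(self.data[offset:offset + length])

    def compact(self) -> None:
        """Move live records together so free space is contiguous.

        Only the offsets stored in the slots change; slot numbers, and
        therefore rids, stay the same.
        """
        new_data = bytearray(len(self.data))
        pos = 0
        for entry in self.slots:
            offset, length = entry
            if offset == -1:
                continue
            new_data[pos:pos + length] = self.data[offset:offset + length]
            entry[0] = pos
            pos += length
        self.data = new_data
        self.free_start = pos
```

After `compact()`, a record read through its old slot number returns the same bytes even though its offset changed, which is why rids are unaffected by rearranging records in a page.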
Record Formats: Fixed Length
• Each field has a fixed length
– for all records
– the number of fields is also fixed
– fields can be stored consecutively
• Information about field types same for all records in a file
– stored in system catalogs
• Finding i-th field does not require scan of record
– given the address of the record, address of a field can be obtained easily
[Figure: record with fields F1, F2, F3, F4 of lengths L1, L2, L3, L4 starting at base address B; the address of F3 is B + L1 + L2]
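The address computation in the figure generalizes to any field: add the lengths of all preceding fields to the base address. A minimal sketch, with a hypothetical function name and made-up field lengths:

```python
def field_address(base, field_lengths, i):
    """Address of the i-th field (1-based): base + lengths of fields 1..i-1."""
    if not 1 <= i <= len(field_lengths):
        raise IndexError("no such field")
    return base + sum(field_lengths[:i - 1])


# Illustrative values only: B = 1000 and L1..L4 = 4, 8, 2, 16.
lengths = [4, 8, 2, 16]
third_field = field_address(1000, lengths, 3)   # B + L1 + L2
```

In a real system the field lengths would come from the system catalog, so they need not be stored in each record.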
Record Formats: Variable Length
• Cannot use fixed-length slots for records
• Two alternative formats (# fields is fixed):
1. use delimiters
[Figure: Field Count (4), then F1 $ F2 $ F3 $ F4 $; Fields Delimited by Special Symbols]
2. use offsets at the start of each record
[Figure: Array of Field Offsets, followed by F1 F2 F3 F4]
• Second offers direct access to i-th field, efficient storage of nulls (special don’t know value); small directory overhead
• Modification may be costly (may grow the field and not fit in the page)
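The second format (array of field offsets) can be sketched with Python's `struct` module. The layout is an assumption made for illustration: n + 1 two-byte little-endian offsets, followed by the field bytes. A null is encoded as a field whose start and end offsets are equal, so this sketch cannot distinguish a null from an empty string.

```python
import struct


def encode(fields):
    """Encode a list of fields (bytes, or None for null) as one record."""
    n = len(fields)
    header_size = 2 * (n + 1)          # n + 1 offsets, 2 bytes each
    offsets, body = [], b""
    for value in fields:
        offsets.append(header_size + len(body))
        if value is not None:
            body += value
    offsets.append(header_size + len(body))   # end of the last field
    return struct.pack(f"<{n + 1}H", *offsets) + body


def get_field(record, i):
    """Direct access to the i-th field (0-based) without scanning earlier fields."""
    start, end = struct.unpack_from("<2H", record, 2 * i)
    if start == end:
        return None                     # null takes no space in the body
    return record[start:end]
```

With delimiters, reaching field i means scanning past the i - 1 fields before it; here it takes one lookup of two adjacent offsets.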
Indexes
• An index on a file speeds up selections on the search key fields for the index
– Any subset of the fields of a relation can be the search key for an index on the relation.
– “Search key” is not the same as “key”
key = minimal set of fields that uniquely identify a tuple
• An index contains a collection of data entries, and supports efficient retrieval of all data entries k* with a given key value k
Remember Terminology
• Index search key (key): k
– Used to search a record
• Data entry: k*
– Pointed to by k (INDEX does this)
– Contains record id(s) or record itself
• Records or data
– Actual tuples
– Pointed to by record ids
Alternatives for Data Entry k* in Index k
• In a data entry k* we can store:
1. (Alternative 1) The actual data record with key value k, or
2. (Alternative 2) <k, rid>
• rid = record id of data record with search key value k, or
3. (Alternative 3) <k, rid-list>
• list of record ids of data records with search key k
• Choice of alternative for data entries is orthogonal to the indexing technique used to locate data entries with a given key value k
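To make the three alternatives concrete, the sketch below builds the data entries for a tiny made-up file indexed on age. The records, rids, and variable names are hypothetical; the data entries are kept in plain sorted lists, so no particular index structure is implied.

```python
from collections import defaultdict

# Hypothetical data file: rid (page, slot) -> record (name, age).
records = {
    (1, 1): ("Ann", 30),
    (1, 2): ("Bob", 25),
    (2, 1): ("Cat", 30),
}

# Alternative 1: the data entries are the data records themselves,
# organized by the search key (age).
alt1 = sorted(records.values(), key=lambda rec: rec[1])

# Alternative 2: one <k, rid> data entry per record.
alt2 = sorted((rec[1], rid) for rid, rec in records.items())

# Alternative 3: one <k, rid-list> data entry per distinct key value.
groups = defaultdict(list)
for rid, rec in sorted(records.items()):
    groups[rec[1]].append(rid)
alt3 = sorted(groups.items())
```

Here `alt2` has three entries and `alt3` has two, since both 30-year-olds share one entry whose rid-list has length two; this is also why Alternative 3 entries vary in size.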
Alternatives for Data Entries: Alternative 1
• Index structure is a file organization for data records
– instead of a Heap file or sorted file
• How many different indexes can use Alternative 1?
• At most one index can use Alternative 1
– Otherwise, data records are duplicated, leading to redundant storage and potential inconsistency
• If data records are very large, # pages with data entries is high
– Implies size of auxiliary information in the index is also large
Advantages/Disadvantages?
Alternatives for Data Entries: Alternative 2, 3
• Data entries typically much smaller than data records
– So, better than Alternative 1 with large data records
– Especially if search keys are small.
• Alternative 3 more compact than Alternative 2
– but leads to variable-size data entries even if search keys have fixed length.
Advantages/Disadvantages?
Index Classification
• Primary vs. secondary
• Clustered vs. unclustered
• Tree-based vs. Hash-based
Primary vs. Secondary Index
• If search key contains primary key, then called primary index, otherwise secondary
– Unique index: Search key contains a candidate key
• Duplicate data entries:
– if they have the same value of search key field k
– Primary/unique index never has a duplicate
– Other secondary index can have duplicates
Clustered vs. Unclustered Index
• If order of data records in a file is the same as, or `close to’, order of data entries in an index, then clustered, otherwise unclustered
– Alternative 1 implies clustered
– Alternative 2, 3 are typically unclustered
• unless sorted according to the search key
– Sometimes, clustered also implies Alternative 1
• since sorted files are rare
– A file can be clustered on at most one search key
– Cost of retrieving data records (range queries) through index varies greatly based on whether index is clustered or not
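The cost gap for range queries can be illustrated by counting the distinct data pages a query touches. Everything in this sketch is a made-up setup: 1000 records, 10 records per page, and a random placement standing in for an unclustered file. It also assumes each page is fetched at most once; without buffering, an unclustered index can cost up to one I/O per matching record.

```python
import random

RECORDS_PER_PAGE = 10          # assumed for illustration
keys = list(range(1000))

# Clustered: records are stored in search-key order, so key k is on page k // 10.
clustered = {k: (k // RECORDS_PER_PAGE, k % RECORDS_PER_PAGE) for k in keys}

# Unclustered: the same records, placed in an order unrelated to the key.
rng = random.Random(0)
positions = keys[:]
rng.shuffle(positions)
unclustered = {
    k: (pos // RECORDS_PER_PAGE, pos % RECORDS_PER_PAGE)
    for k, pos in zip(keys, positions)
}


def range_query(index, lo, hi):
    """Return the rids of all records with lo <= key <= hi."""
    return [index[k] for k in range(lo, hi + 1)]


def pages_touched(rids):
    """Number of distinct data pages needed to fetch these rids."""
    return len({page for page, _slot in rids})


clustered_pages = pages_touched(range_query(clustered, 100, 199))
unclustered_pages = pages_touched(range_query(unclustered, 100, 199))
```

In the clustered case the 100 matching records sit on 10 consecutive pages; in the unclustered case they are spread over many more pages.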
Clustered vs. Unclustered Index
• Suppose that Alternative (2) is used for data entries, and that the data records are stored in a Heap file
• To build clustered index, first sort the Heap file
– with some free space on each page for future inserts
– Overflow pages may be needed for inserts
– Thus, data records are `close to’, but not identical to, sorted
[Figure: CLUSTERED vs. UNCLUSTERED. In both, index entries direct the search for data entries (Index File), and data entries point to Data Records (Data file).]
Methods for indexing
• Tree-based
• Hash-based
• (in detail later)
System Catalogs
• For each index:
– structure (e.g., B+ tree) and search key fields
• For each relation:
– name, file name, file structure (e.g., Heap file)
– attribute name and type, for each attribute
– index name, for each index
– integrity constraints
• For each view:
– view name and definition
• Plus statistics, authorization, buffer pool size, etc.
• (described in [RG] 12.1)
Catalogs are themselves stored as relations!
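One way to see a catalog stored as a relation is to query it with ordinary SQL. The sketch below uses SQLite through Python's `sqlite3` module purely as an illustration; SQLite is not discussed in the lecture, and the table and index names are made up. SQLite keeps its schema in a table named `sqlite_master`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (sid INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("CREATE INDEX idx_students_age ON Students (age)")

# The catalog is itself a table, so it is queried like any other relation.
catalog = conn.execute(
    "SELECT type, name, tbl_name FROM sqlite_master ORDER BY name"
).fetchall()

for row in catalog:
    print(row)
```

Each row names a schema object (here one table and one index) and the table it belongs to, which mirrors the per-relation and per-index entries listed on the slide.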