Date post: | 12-May-2015 |
Category: |
Technology |
Upload: | mongodb |
Directory Layout
• Separate files per database
• Aggressive preallocation
• Files contain one or more extents

-rw------- 1 ben ben 64M May 1 19:14 test.0
-rw------- 1 ben ben 128M May 1 19:14 test.1
-rw------- 1 ben ben 256M May 1 18:25 test.2
-rw------- 1 ben ben 512M May 1 19:14 test.3
-rw------- 1 ben ben 1.0G May 1 19:14 test.4
-rw------- 1 ben ben 2.0G May 1 18:58 test.5
-rw------- 1 ben ben 16M May 1 19:14 test.ns
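The listing suggests that each data file is twice the size of the previous one, starting at 64MB. A minimal sketch of that pattern follows; `dataFileSize` is a hypothetical helper, and the cap at 2GB for files beyond test.5 is an assumption based on the 2GB maximum offset mentioned under Data Structures, not something the listing shows.

```javascript
const MB = 1024 * 1024;

// Hypothetical helper: expected size in bytes of data file <dbname>.<n>.
// Assumes sizes double from 64MB and never exceed 2GB (assumption).
function dataFileSize(n) {
  const doubled = 64 * MB * Math.pow(2, n);
  return Math.min(doubled, 2048 * MB);
}

// Print the sizes for test.0 through test.6 in megabytes.
const fileSizes = [0, 1, 2, 3, 4, 5, 6].map(dataFileSize);
console.log(fileSizes.map((s) => s / MB + "M").join(" "));
```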
Memory Mapping
[Figure: process virtual memory of mongod, from 0x0 (NULL) up to 0x7fffffffffff: MONGOD, HEAP, …, the memory-mapped files test.ns, test.0, test.1, …, LIBS, …, STACK. A document { … } on disk is mapped into process virtual memory.]
Data Structures
• DiskLoc
  • Stores file number and offset of data on disk
  • Record *r = mmap base + DiskLoc.offset
  • Max offset is 2^31 (2GB)
• NamespaceDetails
  • Stores collection metadata
• Extent
  • Stores contiguous blocks within a namespace
  • Max extent size is 2GB
• Record
  • Holds a BSON document or B-tree bucket
  • DeletedRecord overwrites a Record
  • Includes padding
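The "mmap base + DiskLoc.offset" rule can be sketched as follows. This is an illustration only: plain Buffers stand in for memory-mapped files, and `mappedFiles` and `resolve` are hypothetical names, not part of the server code.

```javascript
// Each Buffer stands in for one memory-mapped data file (test.0, test.1, ...).
const mappedFiles = [Buffer.alloc(64), Buffer.alloc(128)];

// Hypothetical DiskLoc: { fileNo, offset }. Resolving it means picking the
// mapping for fileNo and stepping offset bytes into it.
function resolve(diskLoc) {
  const base = mappedFiles[diskLoc.fileNo];
  if (base === undefined) {
    throw new RangeError("no mapped file " + diskLoc.fileNo);
  }
  if (diskLoc.offset < 0 || diskLoc.offset >= base.length) {
    throw new RangeError("offset out of range: " + diskLoc.offset);
  }
  // A view that starts at the record, sharing memory with the mapping.
  return base.subarray(diskLoc.offset);
}

// Place some bytes at file 1, offset 16, then find them through a DiskLoc.
mappedFiles[1].write("record", 16, "latin1");
const recordView = resolve({ fileNo: 1, offset: 16 });
console.log(recordView.toString("latin1", 0, 6));
```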
Namespace Details
• Holds metadata about a collection or index
• Stored in 1KB buckets in <dbname>.ns file
• .ns file has a fixed size of 16MB
• Maintains document count
• Contains heads of linked lists

NamespaceDetails fields: firstExtent, lastExtent, _indexes[], stats, freeList[]
Extent Structure
[Figure: two extents linked to each other through xNext and xPrev. Each extent header holds: length, xNext, xPrev, firstRecord, lastRecord.]
Extents
> db.foo.validate( { full : true } ).extents.forEach( function(z){ print( z.loc + "\t\t" + z.size ); } )
0:3000      20480
0:12000     81920
0:26000     327680
0:76000     1310720
0:1da000    5242880
0:76a000    6291456
0:d6a000    7553024
0:16de000   9064448
0:1f83000   10878976
0:29e3000   13058048
1:2000      15671296
1:ef4000    18808832
1:29e4000   22573056
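To read the output above, it can help to split each location into its file number and offset and to look at how extent sizes grow. The sketch below assumes the part after the colon is a hexadecimal byte offset (values such as 1da000 suggest this) and that the part before it is the data file number; `parseExtent` is a hypothetical helper.

```javascript
// Extent list copied from the validate() output, one "loc size" pair per line.
const validateOutput = `0:3000 20480
0:12000 81920
0:26000 327680
0:76000 1310720
0:1da000 5242880
0:76a000 6291456
0:d6a000 7553024
0:16de000 9064448
0:1f83000 10878976
0:29e3000 13058048
1:2000 15671296
1:ef4000 18808832
1:29e4000 22573056`;

// Hypothetical parser: "file:hexOffset size" -> { fileNo, offset, size }.
function parseExtent(line) {
  const [loc, size] = line.trim().split(/\s+/);
  const [fileNo, offset] = loc.split(":");
  return {
    fileNo: parseInt(fileNo, 10),
    offset: parseInt(offset, 16), // assumed to be hexadecimal
    size: parseInt(size, 10),
  };
}

const extents = validateOutput.split("\n").map(parseExtent);

// Ratio of each extent's size to the previous one, to show the growth pattern.
const growth = extents.slice(1).map((e, i) => e.size / extents[i].size);
growth.forEach((g) => console.log(g.toFixed(2)));
```

In this data the early extents grow by a factor of 4 and the later ones by roughly 1.2.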
Index Extents
> db.system.namespaces.find()
{ "name" : "test.foo" }
{ "name" : "test.system.indexes" }
{ "name" : "test.foo.$_id_" }
> db["foo.$_id_"].validate( { full : true } ).extents.forEach( function(z){ print( z.loc + "\t\t" + z.size ); } )
0:9000      36864
0:1b6000    147456
0:6da000    589824
0:149e000   2359296
1:20e4000   9437184
Extents and Records
[Figure, built up over three slides: an extent header (length, xNext, xPrev, firstRecord, lastRecord) followed by data records. Each Data Record holds length, rNext, rPrev, and a document such as { _id: "foo", ... }. The final slide shows two data records in the same extent.]
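The linked structure in these slides can be sketched as a collection scan: follow xNext from extent to extent and rNext from record to record. The objects below are hypothetical in-memory stand-ins that reuse the field names from the slides; the on-disk structures store disk locations rather than object references.

```javascript
// Build a chain of records for one extent, linked through rNext / rPrev.
function makeExtent(docs) {
  const records = docs.map((doc) => ({
    length: JSON.stringify(doc).length, // placeholder for the record length
    rNext: null,
    rPrev: null,
    doc,
  }));
  records.forEach((r, i) => {
    r.rPrev = records[i - 1] || null;
    r.rNext = records[i + 1] || null;
  });
  return {
    xNext: null,
    xPrev: null,
    firstRecord: records[0] || null,
    lastRecord: records[records.length - 1] || null,
  };
}

// Link extents into a doubly linked list and return the first one.
function linkExtents(list) {
  list.forEach((e, i) => {
    e.xPrev = list[i - 1] || null;
    e.xNext = list[i + 1] || null;
  });
  return list[0] || null;
}

// Scan: walk every extent, and within each extent every record.
function* scan(firstExtent) {
  for (let e = firstExtent; e !== null; e = e.xNext) {
    for (let r = e.firstRecord; r !== null; r = r.rNext) {
      yield r.doc;
    }
  }
}

const firstExtent = linkExtents([
  makeExtent([{ _id: "foo" }, { _id: "bar" }]),
  makeExtent([]),
  makeExtent([{ _id: "baz" }]),
]);
const scannedIds = Array.from(scan(firstExtent), (d) => d._id);
console.log(scannedIds.join(","));
```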
BSON Format
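{ hello: "world" }
\x16\x00\x00\x00 \x02 hello\x00 \x06\x00\x00\x00 world\x00 \x00
• \x16\x00\x00\x00: document length (22 bytes, little-endian)
• \x02: value type (string)
• hello\x00: field name, null-terminated
• \x06\x00\x00\x00: value length (6 bytes, including the trailing null)
• world\x00: value
• \x00: end of document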
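The byte layout above can be reproduced with a small encoder. This is a minimal sketch that handles only a document with a single string field; `encodeStringDoc` is a hypothetical helper, not a BSON library function.

```javascript
// Encode { name: value } where value is a string, following the layout above.
function encodeStringDoc(name, value) {
  const nameBytes = Buffer.from(name + "\0", "utf8"); // null-terminated field name
  const valueBytes = Buffer.from(value + "\0", "utf8"); // null-terminated string
  // 4 (doc length) + 1 (type) + name + 4 (value length) + value + 1 (terminator)
  const total = 4 + 1 + nameBytes.length + 4 + valueBytes.length + 1;
  const out = Buffer.alloc(total);
  let pos = 0;
  out.writeInt32LE(total, pos); // document length, little-endian
  pos += 4;
  out.writeUInt8(0x02, pos); // type 0x02 = string
  pos += 1;
  nameBytes.copy(out, pos);
  pos += nameBytes.length;
  out.writeInt32LE(valueBytes.length, pos); // string length incl. trailing null
  pos += 4;
  valueBytes.copy(out, pos);
  pos += valueBytes.length;
  out.writeUInt8(0x00, pos); // end-of-document marker
  return out;
}

const encoded = encodeStringDoc("hello", "world");
console.log(encoded.toString("hex"));
```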
Index Extents
[Figure, built up over two slides: an extent header (length, xNext, xPrev, firstRecord, lastRecord) followed by index records. Each Index Record holds length, rNext, rPrev, and a B-tree bucket with parent and numKeys; a key K in a bucket points to a { Document }.]

[Figure: example B-tree with root keys 4 and 9, and leaf buckets 1 3, 5 6 8, and A B.]
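A lookup in the example tree can be sketched as below. The bucket objects are hypothetical simplifications (keys plus child buckets), and the keys A and B are assumed to be the hexadecimal values 10 and 11.

```javascript
// Leaf bucket: keys only, no children.
const leaf = (keys) => ({ keys, children: [] });

// Example tree from the slide: root keys 4 and 9 with three leaf buckets.
const root = {
  keys: [4, 9],
  children: [leaf([1, 3]), leaf([5, 6, 8]), leaf([0xa, 0xb])], // A, B assumed hex
};

// Search a bucket, descending into the child between the surrounding keys.
function find(bucket, key) {
  let i = 0;
  while (i < bucket.keys.length && key > bucket.keys[i]) {
    i++;
  }
  if (i < bucket.keys.length && bucket.keys[i] === key) {
    return true; // found in this bucket
  }
  if (bucket.children.length === 0) {
    return false; // leaf reached without a match
  }
  return find(bucket.children[i], key);
}

console.log(find(root, 6), find(root, 7));
```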
Journaling
• Write-ahead logging
  • Operations are written to the journal before the memory-mapped regions
  • Private view
  • Shared view
• Once the journal is written, data is safe barring a hardware problem
• By default, the journal is flushed every 100ms, after 100MB of writes, or on a write concern of j=true
  • User configurable with --journalCommitInterval
• Each section contains a single group commit
  • Applied all-or-nothing
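The three default flush triggers listed above can be expressed as a single predicate. This is a sketch of the rule as stated on the slide, not of the server's implementation; the state fields are hypothetical names.

```javascript
const COMMIT_INTERVAL_MS = 100; // default interval from the slide
const COMMIT_BYTES = 100 * 1024 * 1024; // 100MB of pending writes

// Hypothetical state: time since the last group commit, bytes written since
// then, and whether some client is waiting on a j=true write concern.
function shouldCommit(state) {
  return (
    state.msSinceLastCommit >= COMMIT_INTERVAL_MS ||
    state.pendingBytes >= COMMIT_BYTES ||
    state.waitingForJournal
  );
}

console.log(shouldCommit({ msSinceLastCommit: 20, pendingBytes: 4096, waitingForJournal: false }));
console.log(shouldCommit({ msSinceLastCommit: 20, pendingBytes: 4096, waitingForJournal: true }));
```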
Journal Format
JHeader
JSectHeader [LSN 3]
  DurOp
  DurOp
  DurOp
JSectFooter
JSectHeader [LSN 7]
  DurOp
  DurOp
  DurOp
JSectFooter
…

DurOp types:
• Op_DbContext: sets the database context for subsequent operations
• Write Operation: length, offset, fileNo, data[length]
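Replaying the journal after a crash can be sketched as follows: each section is one group commit, and a section is applied only if it is complete. The `hasFooter` flag is a hypothetical stand-in for whatever check the server uses to decide that a section was fully written, and Buffers stand in for the data files.

```javascript
// Apply every complete section in order; stop at the first incomplete one.
function replay(sections, files) {
  let applied = 0;
  for (const section of sections) {
    if (!section.hasFooter) {
      break; // partial group commit: apply none of its operations
    }
    for (const op of section.ops) {
      // Write Operation: copy data[0..length) into file fileNo at offset.
      op.data.copy(files[op.fileNo], op.offset, 0, op.length);
    }
    applied++;
  }
  return applied;
}

const dataFiles = [Buffer.alloc(16), Buffer.alloc(16)];
const journal = [
  {
    lsn: 3,
    hasFooter: true,
    ops: [
      { length: 2, offset: 4, fileNo: 0, data: Buffer.from("ab") },
      { length: 3, offset: 0, fileNo: 1, data: Buffer.from("xyz") },
    ],
  },
  {
    lsn: 7,
    hasFooter: false, // the crash happened before this section was finished
    ops: [{ length: 2, offset: 8, fileNo: 0, data: Buffer.from("zz") }],
  },
];

const appliedSections = replay(journal, dataFiles);
console.log(appliedSections, dataFiles[0].toString("latin1", 4, 6));
```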
Journal Performance
• On 99.9% read systems, no impact
• Write performance degraded 5-30% when the journal is on the same drive
• On a separate drive, as low as 3%
Journal Admin
• Journal stored in /dbpath/journal folder
• If faster, three 1GB files may be preallocated
• Can symlink to a different spindle
• --journalCommitInterval (2ms - 300ms)
• When to journal
  • Single node: required for data integrity
  • Replica set: at least 1 node
  • All nodes: removes possible need to resync
Fragmentation
• Files may become fragmented over time if documents change size
• Free lists also contribute to fragmentation
• 2.0 reduced scanning to reasonable amounts
• 2.2 will change allocation strategy
• Need to re-write free list to do online compaction
Compaction
• 1.8 and previous: repairDatabase
• 2.0+: compact command
• Currently resets paddingFactor, but this can be changed
• Index (re)generation is now concurrent, so compaction can be N times faster
• Generally causes some extra allocation
• Does not delete or truncate files
Planned Changes
• Split data and indexes into different files
• Indexes could be symlinked to a different drive (SSD)
• Improved allocation strategy