MongoDB Journaling and the Storage Engine

Directory Layout

•  Separate files per database
•  Aggressive preallocation
•  Files contain one or more extents

-rw------- 1 ben ben  64M May 1 19:14 test.0
-rw------- 1 ben ben 128M May 1 19:14 test.1
-rw------- 1 ben ben 256M May 1 18:25 test.2
-rw------- 1 ben ben 512M May 1 19:14 test.3
-rw------- 1 ben ben 1.0G May 1 19:14 test.4
-rw------- 1 ben ben 2.0G May 1 18:58 test.5
-rw------- 1 ben ben  16M May 1 19:14 test.ns
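The listing suggests that each data file is twice the size of the previous one, starting at 64MB. Here is a minimal sketch of that pattern, assuming the doubling stops at 2GB; the function name is hypothetical and not part of MongoDB.

```python
MB = 1024 * 1024
GB = 1024 * MB

def data_file_size(file_no, first=64 * MB, cap=2 * GB):
    """Size in bytes of <dbname>.<file_no>, assuming sizes double up to a cap."""
    return min(first << file_no, cap)

sizes = [data_file_size(n) for n in range(6)]
for n, size in enumerate(sizes):
    print(f"test.{n}: {size // MB} MB")
```

Under this assumption, every file after test.5 would also be 2GB.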

Memory Mapping

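[Figure: process virtual memory of mongod, from 0x0 (NULL) up to 0x7fffffffffff: MONGOD, HEAP, …, the memory-mapped files test.ns, test.0, test.1, …, LIBS, …, STACK. A document { … } on disk appears at the corresponding address in process virtual memory.]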
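The idea of memory mapping can be tried out with Python's standard mmap module. This is only an illustration of the operating system mechanism, not of mongod's code: a temporary file stands in for a data file, and the offset 16 is arbitrary.

```python
import mmap
import os
import tempfile

# Create a small "data file" on disk, preallocated with zero bytes.
fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * 4096)

# Map the file into this process's virtual address space.
with mmap.mmap(fd, 4096) as view:
    view[16:21] = b"hello"   # a plain memory write, no write() call
    view.flush()             # ask the OS to write dirty pages back to the file

os.close(fd)

# Read the same bytes back through the ordinary file API.
with open(path, "rb") as f:
    f.seek(16)
    on_disk = f.read(5)
os.remove(path)
print(on_disk)
```

The bytes written through the mapping are the bytes later read from the file, which is why an address in the mapped region can be treated as a location on disk.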

Data Structures

•  DiskLoc
   •  Stores file number and offset of data on disk
   •  Record *r = mmap base + DiskLoc.offset
   •  Max offset is 2^31 (2GB)
•  NamespaceDetails
   •  Stores collection metadata
•  Extent
   •  Stores contiguous blocks within a namespace
   •  Max extent size is 2GB
•  Record
   •  Holds a BSON document or B-tree bucket
   •  DeletedRecord overwrites a Record
   •  Includes padding
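The DiskLoc idea (a file number plus an offset, resolved against the mmap base of that file) can be sketched as follows. The layout of two little-endian signed 32-bit integers is an assumption made for this example, and bytearrays stand in for the mapped files.

```python
import struct

# Assumed layout: two little-endian signed 32-bit integers (file number, offset).
# A signed 32-bit offset is what would limit a file to 2GB.
DISKLOC = struct.Struct("<ii")

def pack_diskloc(file_no, offset):
    return DISKLOC.pack(file_no, offset)

def resolve(diskloc_bytes, mapped_files):
    """Return a view starting at the location, like mmap base + offset."""
    file_no, offset = DISKLOC.unpack(diskloc_bytes)
    base = mapped_files[file_no]          # stands in for the file's mmap base
    return memoryview(base)[offset:]

# Two fake "mapped files"; real ones would be mmap objects.
files = {0: bytearray(64), 1: bytearray(64)}
files[1][8:13] = b"hello"

loc = pack_diskloc(1, 8)
record = resolve(loc, files)
print(len(loc), bytes(record[:5]))
```

Because a location is just two small integers, it can be stored inside other on-disk structures and followed without any lookup table.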

Namespace Details

•  Holds metadata about a collection or index
•  Stored in 1KB buckets in <dbname>.ns file
•  .ns file fixed size of 16MB
•  Maintains document count
•  Contains heads of linked lists

[Figure: NamespaceDetails, with fields firstExtent, lastExtent, _indexes[], stats, freeList[]]

Extent Structure

[Figure: two extents linked to each other; each extent header holds length, xNext, xPrev, firstRecord, lastRecord]

Extents

> db.foo.validate( { full : true } ).extents.forEach( function(z){ print( z.loc + "\t\t" + z.size ); } )
0:3000      20480
0:12000     81920
0:26000     327680
0:76000     1310720
0:1da000    5242880
0:76a000    6291456
0:d6a000    7553024
0:16de000   9064448
0:1f83000   10878976
0:29e3000   13058048
1:2000      15671296
1:ef4000    18808832
1:29e4000   22573056
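The extent sizes in this output grow in a visible pattern. A short sketch that computes the ratio between consecutive extents, using only the numbers printed above (the interpretation as a growth policy is an observation about this one collection, not a statement of the allocator's rules):

```python
# Extent sizes in bytes, copied from the validate() output above.
sizes = [20480, 81920, 327680, 1310720, 5242880, 6291456, 7553024,
         9064448, 10878976, 13058048, 15671296, 18808832, 22573056]

# Ratio of each extent to the one before it, rounded to two decimals.
ratios = [round(b / a, 2) for a, b in zip(sizes, sizes[1:])]
total = sum(sizes)

print(ratios)
print(f"{len(sizes)} extents, {total} bytes allocated")
```

For this collection the first few extents quadruple in size, after which each new extent is roughly 1.2 times the previous one.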

Index Extents

> db.system.namespaces.find()
{ "name" : "test.foo" }
{ "name" : "test.system.indexes" }
{ "name" : "test.foo.$_id_" }
> db["foo.$_id_"].validate( { full : true } ).extents.forEach( function(z){ print( z.loc + "\t\t" + z.size ); } )
0:9000      36864
0:1b6000    147456
0:6da000    589824
0:149e000   2359296
1:20e4000   9437184
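Putting the data extents of test.foo and the extents of its _id index side by side shows how the two namespaces share the same files. The sketch below assumes that each loc reads as file number, colon, hexadecimal offset, and checks whether every extent starts where the previous one in the same file ends.

```python
# (loc, size) pairs copied from the two validate() listings.
data = [("0:3000", 20480), ("0:12000", 81920), ("0:26000", 327680),
        ("0:76000", 1310720), ("0:1da000", 5242880), ("0:76a000", 6291456),
        ("0:d6a000", 7553024), ("0:16de000", 9064448), ("0:1f83000", 10878976),
        ("0:29e3000", 13058048), ("1:2000", 15671296), ("1:ef4000", 18808832),
        ("1:29e4000", 22573056)]
index = [("0:9000", 36864), ("0:1b6000", 147456), ("0:6da000", 589824),
         ("0:149e000", 2359296), ("1:20e4000", 9437184)]

def parse(loc, size, kind):
    """Split 'file:offset' (offset assumed to be hex) into numbers."""
    file_no, offset = loc.split(":")
    return int(file_no), int(offset, 16), size, kind

# Sort by file number, then by offset within the file.
extents = sorted([parse(l, s, "data") for l, s in data] +
                 [parse(l, s, "index") for l, s in index])

# Report every place where an extent does not start where the previous one ended.
gaps = []
for prev, cur in zip(extents, extents[1:]):
    prev_end = prev[1] + prev[2]
    if prev[0] == cur[0] and prev_end != cur[1]:
        gaps.append((cur[0], hex(prev_end), hex(cur[1])))

print(gaps)
```

Data and index extents turn out to be interleaved back to back in the same files. The single 4KB gap in file 0 presumably belongs to another namespace, but that is a guess; the listings do not say.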

Extents and Records

[Figure: an extent header (length, xNext, xPrev, firstRecord, lastRecord) whose firstRecord and lastRecord point to data records. Each data record has a header (length, rNext, rPrev) followed by a document such as { _id: "foo", ... }; records are chained to each other through rNext and rPrev.]
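The linked structure of extents and records can be modelled in a few lines. This is an in-memory toy that borrows the field names from the figure: Python object references stand in for on-disk locations, and the append and scan helpers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    doc: dict
    rNext: Optional["Record"] = None
    rPrev: Optional["Record"] = None

@dataclass
class Extent:
    firstRecord: Optional[Record] = None
    lastRecord: Optional[Record] = None
    xNext: Optional["Extent"] = None
    xPrev: Optional["Extent"] = None

    def append(self, doc):
        """Link a new record at the tail of this extent's record list."""
        rec = Record(doc, rPrev=self.lastRecord)
        if self.lastRecord is None:
            self.firstRecord = rec
        else:
            self.lastRecord.rNext = rec
        self.lastRecord = rec

def scan(first_extent):
    """Yield documents by following xNext across extents and rNext within each."""
    extent = first_extent
    while extent is not None:
        rec = extent.firstRecord
        while rec is not None:
            yield rec.doc
            rec = rec.rNext
        extent = extent.xNext

e1, e2 = Extent(), Extent()
e1.xNext, e2.xPrev = e2, e1
e1.append({"_id": "foo"})
e1.append({"_id": "bar"})
e2.append({"_id": "baz"})
ids = [d["_id"] for d in scan(e1)]
print(ids)
```

A full scan in this model never consults an index; it only follows the two levels of linked lists.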


BSON Format

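{ hello: "world" }

\x16\x00\x00\x00     Doc length (22 bytes, including these 4)
\x02                 Value type (string)
hello\x00            Field name
\x06\x00\x00\x00     Value length (6 bytes, including the trailing \x00)
world\x00            Value
\x00                 End of document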
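The byte layout above can be reproduced by hand with Python's struct module. This sketch handles only a single string field; it is meant to make the layout concrete and is not a BSON library.

```python
import struct

def encode_string_doc(name, value):
    """Encode a one-field BSON document whose value is a UTF-8 string."""
    data = value.encode("utf-8") + b"\x00"          # string bytes plus NUL
    element = (b"\x02"                               # type 0x02 = string
               + name.encode("utf-8") + b"\x00"      # field name as a C string
               + struct.pack("<i", len(data))        # value length, little-endian int32
               + data)
    body = element + b"\x00"                         # document terminator
    return struct.pack("<i", len(body) + 4) + body   # total length includes itself

doc = encode_string_doc("hello", "world")
print(doc)
```

The leading length field is what lets a reader skip over a whole document without parsing its contents.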

Index Extents

[Figure: an index extent header (length, xNext, xPrev, firstRecord, lastRecord) pointing to index records. Each index record has a header (length, rNext, rPrev) followed by a B-tree bucket (parent, numKeys, keys K); each key refers to a { Document }. Example B-tree: a root bucket with keys 4 and 9 above leaf keys 1 3 5 6 8 A B.]
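A lookup in the example B-tree can be sketched as below. The split of the leaf keys into three buckets (1 3, 5 6 8, A B) is inferred from the root keys 4 and 9, the keys are read as hexadecimal digits, and the Bucket class is a hypothetical simplification that stores keys only.

```python
from bisect import bisect_left

class Bucket:
    def __init__(self, keys, children=None):
        self.keys = keys                  # sorted keys in this bucket
        self.children = children or []    # empty for a leaf bucket

def contains(bucket, key):
    """Descend from the root, choosing one child bucket per level."""
    while bucket is not None:
        i = bisect_left(bucket.keys, key)
        if i < len(bucket.keys) and bucket.keys[i] == key:
            return True
        bucket = bucket.children[i] if bucket.children else None
    return False

# Keys from the figure, read as hexadecimal digits (A = 10, B = 11).
root = Bucket([0x4, 0x9], [Bucket([0x1, 0x3]),
                           Bucket([0x5, 0x6, 0x8]),
                           Bucket([0xA, 0xB])])
print(contains(root, 0x8), contains(root, 0x7))
```

Each step touches one bucket, so a lookup reads only as many buckets as the tree is deep.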

Journaling

•  Write-ahead logging
•  Operations are written to the journal before the memory-mapped regions
   •  Private view
   •  Shared view
•  Once the journal is written, data is safe unless there is a hardware problem
•  By default, the journal is flushed every 100ms, after 100MB of writes, or on a write concern of j=true
   •  User configurable with --journalCommitInterval
•  A section contains a single group commit
   •  Applied all-or-nothing
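The write-ahead and group-commit behaviour can be illustrated with a toy model. It does not model the private and shared views separately, and every name in it is hypothetical; it only shows that writes reach the journal as whole sections, and that recovery replays committed sections and nothing else.

```python
class ToyJournaledStore:
    """Toy write-ahead log: journal first, data file second."""

    def __init__(self, size):
        self.data_file = bytearray(size)   # stands in for the data file
        self.journal = []                  # list of committed sections
        self.pending = []                  # writes since the last group commit

    def write(self, offset, data):
        self.pending.append((offset, bytes(data)))

    def group_commit(self):
        """Append all pending writes to the journal as one section."""
        if self.pending:
            self.journal.append(list(self.pending))
            self.pending.clear()

    def apply_journal(self):
        """Replay every committed section onto the data file."""
        for section in self.journal:
            for offset, data in section:
                self.data_file[offset:offset + len(data)] = data

store = ToyJournaledStore(16)
store.write(0, b"abc")
store.write(4, b"def")
store.group_commit()          # both writes become durable together
store.write(8, b"xyz")        # never committed: simulates a crash before the next flush

recovered = ToyJournaledStore(16)
recovered.journal = store.journal   # only the journal survives the crash
recovered.apply_journal()
print(bytes(recovered.data_file))
```

After recovery the two committed writes are present and the uncommitted one is absent, which is the all-or-nothing property at the granularity of a section.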

Journal Format

JHeader
JSectHeader [LSN 3]
    DurOp
    DurOp
    DurOp
JSectFooter
JSectHeader [LSN 7]
    DurOp
    DurOp
    DurOp
JSectFooter
…

•  Op_DbContext: set database context for subsequent operations
•  Write operation: length, offset, fileNo, data[length]
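A write operation entry with the fields length, offset, fileNo and data[length] could be serialized as in the sketch below. The field widths and byte order are assumptions for illustration and are not taken from the journal's actual on-disk format; the function names are hypothetical.

```python
import struct

# Assumed widths: unsigned 32-bit length and offset, signed 32-bit file number.
ENTRY_HEADER = struct.Struct("<IIi")

def pack_write_op(offset, file_no, data):
    """Serialize one write entry: fixed header followed by the raw bytes."""
    return ENTRY_HEADER.pack(len(data), offset, file_no) + data

def unpack_write_ops(buf):
    """Parse back-to-back write entries from a section body."""
    ops, pos = [], 0
    while pos < len(buf):
        length, offset, file_no = ENTRY_HEADER.unpack_from(buf, pos)
        pos += ENTRY_HEADER.size
        ops.append((file_no, offset, buf[pos:pos + length]))
        pos += length
    return ops

section = pack_write_op(0x3000, 0, b"hello") + pack_write_op(0x2000, 1, b"world!")
ops = unpack_write_ops(section)
print(ops)
```

Each entry says where the bytes belong (file and offset) and carries the bytes themselves, so replaying a section needs no other context than the target files.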

Journal Performance

•  On 99.9% read systems, no impact
•  Write performance degraded 5-30% when the journal is on the same drive
•  On a separate drive, as low as 3%

Journal Admin

•  Journal stored in /dbpath/journal folder
•  Three 1GB files may be preallocated if that is faster
•  Can symlink to a different spindle
•  --journalCommitInterval (2ms - 300ms)
•  When to journal
   •  Single node: required for data integrity
   •  Replica set: at least 1 node
   •  All nodes: removes possible need to resync

Fragmentation

•  Files may become fragmented over time if documents change size
•  Free lists also contribute to fragmentation
•  2.0 reduced scanning to reasonable amounts
•  2.2 will change allocation strategy
•  Need to re-write free list to do online compaction

Compaction

•  1.8 and previous: repairDatabase
•  2.0+: compact command
•  Currently resets paddingFactor, but can be changed
•  Index (re)generation is now concurrent, so compaction can be N times faster
•  Generally causes some extra allocation
•  Does not delete or truncate files

Planned Changes

•  Split data and indexes into different files
•  Indexes could be symlinked to a different drive (SSD)
•  Improved allocation strategy

Download MongoDB

http://www.mongodb.org/downloads

Ben Becker

Date post:	12-May-2015