MSN 2004
Network Memory Servers: An idea whose time has come
Glenford Mapp
David Silcott
Dhawal Thakker
Motivation
• Networks are now much faster than disks
• It should be quicker to get data from the memory of another computer than from a local disk
• Not a new idea - so what’s different?
What’s different?
• Networks are faster and cheaper
  – Gigabit NICs are £35.00
  – We could also see 10G NICs in the near future
• Memory is also cheaper
  – 1GB = £100.00
  – Likely to remain stable
• Availability of good “free” OSes
  – Linux and FreeBSD
Our approach is also different
• Previous approaches
  – Dominated by the Distributed Shared Memory (DSM) crowd (Apollo System)
  – DSM never became mainstream
    • Lots of fundamental changes to the OS platform required
    • Exotic hardware (e.g. Scalable Coherent Interface, or SCI)
• Network Memory became a casualty of this failure
Previous Approach cont’d
• Remote paging was also one of the key areas (SAMSON project, NYU)
• Idle machines approach
  – Use memory of other machines in the network when no one is logged on, but get off when the person returns
  – Very complex
    • How do you give guarantees to everyone?
Our Approach
• Applied Engineering Approach
  – What are the real numbers in this area?
• Use the power of the Network
  – Use standard networking approach
  – No DSM, no virtual memory plug-ins
• Client-Server approach
  – Dedicated servers with loads of memory
Design of the Network Memory Server (NMS)
• NMS has an independent interface
  – Can interface with any OS
    • Not like Network Block Device (NBD) in Linux
• NMS is stateless
  – Does not keep track of previous interactions
• Actions of the NMS are regarded as atomic
  – Either complete success or total failure
Design of NMS cont’d
• NMS deals with blocks of data
  – Has no idea how the blocks are being used
    • Not like NFS
• Each block is uniquely identified by a block_id allocated by the NMS
• Each client is uniquely identified by a client_id
Block_ids
• 64-bit entities
  – 32-bit minor index
  – 16-bit major index
  – 16-bit security tag
    • Generated when the blocks are created
    • Checked before any read/write operation on a block
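As a minimal sketch of the 64-bit layout above, the following Python packs and unpacks a block_id. The slide gives only the field widths; the ordering of the fields within the 64 bits (security tag in the top 16 bits, then major index, then minor index) and the helper names `pack_block_id` and `unpack_block_id` are assumptions made for illustration.

```python
MINOR_BITS = 32   # minor index width (from the slide)
MAJOR_BITS = 16   # major index width (from the slide)
TAG_BITS = 16     # security tag width (from the slide)


def pack_block_id(major, minor, tag):
    """Combine the three fields into one 64-bit integer.

    Assumed layout (not stated in the slides): tag | major | minor.
    """
    if not 0 <= minor < (1 << MINOR_BITS):
        raise ValueError("minor index does not fit in 32 bits")
    if not 0 <= major < (1 << MAJOR_BITS):
        raise ValueError("major index does not fit in 16 bits")
    if not 0 <= tag < (1 << TAG_BITS):
        raise ValueError("security tag does not fit in 16 bits")
    return (tag << (MAJOR_BITS + MINOR_BITS)) | (major << MINOR_BITS) | minor


def unpack_block_id(block_id):
    """Split a 64-bit block_id back into (major, minor, tag)."""
    minor = block_id & ((1 << MINOR_BITS) - 1)
    major = (block_id >> MINOR_BITS) & ((1 << MAJOR_BITS) - 1)
    tag = block_id >> (MAJOR_BITS + MINOR_BITS)
    return major, minor, tag
```

A server built this way could compare the tag extracted from an incoming block_id with the tag stored at creation time before allowing a read or write.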
NMS calls
• GetblockMemory(client_id, size, nblocks, options)
  – Creates a number of blocks of a certain size with consecutive block_ids
    • Returns the starting block_id
    • options - backup
• Release(client_id, block_id, nblocks)
  – Releases a number of consecutive block_ids
NMS calls cont’d
• WriteBlockMemory(client_id, block_id, offset, length, *buf)
  – Writes data in buffer to a block on the server
• ReadBlockMemory(client_id, block_id, offset, length, *buf)
  – Reads data from a block on the server into a buffer
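The allocation, release, read and write calls can be illustrated with a toy in-memory server. This is only a sketch under stated assumptions, not the kernel-module implementation: the class name `ToyNMS`, the snake_case method names, the `NMSError` exception and the use of plain consecutive integers as block_ids are all hypothetical, and the `options` argument (backup) is left out. Each call validates its arguments before changing anything, so it either succeeds completely or leaves the server untouched.

```python
class NMSError(Exception):
    """Raised when a call fails; server state is left unchanged."""


class ToyNMS:
    def __init__(self):
        self._blocks = {}    # block_id -> (owner client_id, bytearray)
        self._next_id = 1    # next unused block_id (plain integers here)

    def get_block_memory(self, client_id, size, nblocks):
        """Create nblocks zero-filled blocks; return the starting block_id."""
        if size <= 0 or nblocks <= 0:
            raise NMSError("size and nblocks must be positive")
        start = self._next_id
        for block_id in range(start, start + nblocks):
            self._blocks[block_id] = (client_id, bytearray(size))
        self._next_id = start + nblocks
        return start

    def _check(self, client_id, block_id):
        """Return the block's storage, or fail if it is unknown or not owned."""
        entry = self._blocks.get(block_id)
        if entry is None or entry[0] != client_id:
            raise NMSError("unknown block or wrong client")
        return entry[1]

    def release(self, client_id, block_id, nblocks):
        """Release nblocks consecutive blocks, all or nothing."""
        ids = range(block_id, block_id + nblocks)
        for b in ids:              # validate every block first
            self._check(client_id, b)
        for b in ids:              # only then modify state
            del self._blocks[b]

    def write_block_memory(self, client_id, block_id, offset, buf):
        data = self._check(client_id, block_id)
        if offset < 0 or offset + len(buf) > len(data):
            raise NMSError("write outside block")
        data[offset:offset + len(buf)] = buf

    def read_block_memory(self, client_id, block_id, offset, length):
        data = self._check(client_id, block_id)
        if offset < 0 or length < 0 or offset + length > len(data):
            raise NMSError("read outside block")
        return bytes(data[offset:offset + length])
```

For example, a client could allocate four 1 kB blocks with `first = nms.get_block_memory(1, 1024, 4)` and then address them as `first`, `first + 1`, and so on.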
NMS calls cont’d
• GetClientid(password)
  – Creates a new client
• GetMasterBlock(password, client_id)
  – Returns a number of blocks of sector/block_id mappings
• StoreMasterBlock(block_id, client_id, password, nblocks)
  – Stores a number of sector/block_id mappings
NMS Client
• How does a client use the NMS?
  – What interface is presented to the OS?
• The interface is the one used to support hard disks. In Linux, we use the block device interface
• So the OS thinks of the NMS service as a fast hard disk
NMS Client cont’d
• So the OS tells the NMS client to read and write sectors.
• NMS client will take sectors and map them onto blocks which it gets from the NMS
• When the block device is unmounted, we must store the sector/block_id mappings on the NMS
NMS Client cont’d
• The StoreMasterBlock call stores these mappings on the NMS
• When the device is remounted, it must first get the sector/block_id mappings from the NMS and rebuild the sector table.
• The GetMasterBlock call retrieves the mappings from the NMS
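Saving the sector table at unmount and rebuilding it at remount could look like the sketch below. The on-block record format is not given in the slides: the count header, the pair of little-endian 64-bit integers per mapping, the 1 kB block size and the function names `pack_mappings` and `unpack_mappings` are all assumptions for illustration.

```python
import struct

COUNT = struct.Struct("<I")     # assumed header: number of records in this block
RECORD = struct.Struct("<QQ")   # assumed record: (sector, block_id)


def pack_mappings(sector_table, block_size=1024):
    """Serialise a {sector: block_id} table into fixed-size blocks."""
    per_block = (block_size - COUNT.size) // RECORD.size
    items = sorted(sector_table.items())
    blocks = []
    for i in range(0, len(items), per_block):
        chunk = items[i:i + per_block]
        raw = COUNT.pack(len(chunk))
        raw += b"".join(RECORD.pack(sector, block_id) for sector, block_id in chunk)
        blocks.append(raw.ljust(block_size, b"\x00"))   # pad to a full block
    return blocks


def unpack_mappings(blocks):
    """Rebuild the {sector: block_id} table from blocks made by pack_mappings."""
    table = {}
    for raw in blocks:
        (count,) = COUNT.unpack_from(raw, 0)
        for i in range(count):
            offset = COUNT.size + i * RECORD.size
            sector, block_id = RECORD.unpack_from(raw, offset)
            table[sector] = block_id
    return table
```

In this sketch the blocks returned by `pack_mappings` are what a client would hand to StoreMasterBlock, and the blocks returned by GetMasterBlock are what it would pass to `unpack_mappings`.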
NMS Client Cache
• Client also has a cache of blocks that are used to store recently used sectors
  – This is a secondary cache, as the main caching is really done by the Unix Buffer Cache
• Design decision to keep our cache as a simple round-robin cache
  – Replace the next item pointed to in the cache
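A round-robin cache of this kind can be sketched in a few lines. The class name `RoundRobinCache` and its methods are hypothetical; the point is only that the replacement pointer advances one slot at a time and ignores how recently an entry was used.

```python
class RoundRobinCache:
    def __init__(self, capacity):
        self._slots = [None] * capacity   # each slot: (sector, data) or None
        self._index = {}                  # sector -> slot number
        self._next = 0                    # slot that will be replaced next

    def get(self, sector):
        """Return cached data for the sector, or None on a miss."""
        slot = self._index.get(sector)
        return None if slot is None else self._slots[slot][1]

    def put(self, sector, data):
        """Store data; return the evicted (sector, data) pair, or None."""
        slot = self._index.get(sector)
        if slot is not None:              # already cached: update in place
            self._slots[slot] = (sector, data)
            return None
        slot = self._next                 # otherwise replace the next slot
        evicted = self._slots[slot]
        if evicted is not None:
            del self._index[evicted[0]]
        self._slots[slot] = (sector, data)
        self._index[sector] = slot
        self._next = (slot + 1) % len(self._slots)
        return evicted
```

Unlike an LRU cache, a hit in `get` does not protect an entry from replacement, which keeps the bookkeeping to a single pointer.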
NMS Client Operations
• Since we are not a normal disk, we do not need to rearrange read and write operations
• So we attempt to read and write blocks as the requests come in
• Also developed a write-out operation: a special thread, called the Write-out thread, writes modified blocks to the NMS
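The write-out thread can be pictured as a consumer draining a queue of modified blocks. The sketch below uses Python's standard `queue` and `threading` modules in user space, whereas the actual client is a Linux kernel module; the function name `start_write_out_thread`, the `write_block` callback and the `None` stop sentinel are assumptions for illustration.

```python
import queue
import threading


def start_write_out_thread(write_block):
    """Start a thread that sends queued (block_id, data) pairs via write_block.

    Returns the queue to put work on and the thread object.
    Putting None on the queue asks the thread to stop.
    """
    pending = queue.Queue()

    def worker():
        while True:
            item = pending.get()
            if item is None:          # stop sentinel
                pending.task_done()
                break
            block_id, data = item
            write_block(block_id, data)   # e.g. a WriteBlockMemory call
            pending.task_done()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return pending, thread
```

The request path only has to enqueue a modified block and can return immediately, while the network write happens in the background.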
NMS Client Implementation
[Diagram. Components shown: Programs; Operating System; Unix Buffer Cache; Block Device Interface; NMS Block Device; Sector / Block_id Hash Table; Cache; Write-Out Queue (Two levels).]
Getting a sector
[Flowchart. Decisions: Is sector in hash table? Is it in the cache? Is it a read? Is the cache full? Has replaced entry been modified? Actions: Return rubbish; Get block_id from NMS and put entry in hash table; Get data from NMS server and put in cache entry; Replace entry; Put it on Write-Out Queue; Get new cache entry; Read from / write to cache entry; Write data to cache entry; OK.]
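One reading of the "Getting a sector" flowchart can be sketched as follows. The exact branch order of the original chart is not fully recoverable, so this is an interpretation: reads of sectors that were never written return arbitrary data (zeros here), writes to new sectors first obtain a block_id, and a modified entry that is replaced goes onto the write-out queue. The class `SectorClient` and the `alloc_block` and `fetch_block` callbacks are hypothetical stand-ins for the NMS calls.

```python
class SectorClient:
    def __init__(self, alloc_block, fetch_block, cache_slots, sector_size=512):
        self.alloc_block = alloc_block      # () -> new block_id (hypothetical)
        self.fetch_block = fetch_block      # block_id -> bytes (hypothetical)
        self.sector_size = sector_size
        self.table = {}                     # sector -> block_id hash table
        self.cache = [None] * cache_slots   # entries: [sector, data, modified]
        self.where = {}                     # sector -> cache slot
        self.next_slot = 0                  # round-robin replacement pointer
        self.write_out = []                 # (block_id, data) waiting to be sent

    def _new_entry(self, sector, data, modified):
        slot = self.next_slot
        old = self.cache[slot]
        if old is not None:                 # slot occupied: replace the entry
            old_sector, old_data, old_modified = old
            if old_modified:                # modified entries must be written out
                self.write_out.append((self.table[old_sector], old_data))
            del self.where[old_sector]
        self.cache[slot] = [sector, data, modified]
        self.where[sector] = slot
        self.next_slot = (slot + 1) % len(self.cache)

    def read(self, sector):
        if sector not in self.table:        # never written: contents are undefined
            return bytes(self.sector_size)  # "rubbish"; zeros in this sketch
        if sector in self.where:            # cache hit
            return bytes(self.cache[self.where[sector]][1])
        data = self.fetch_block(self.table[sector])   # miss: go to the server
        self._new_entry(sector, data, modified=False)
        return data

    def write(self, sector, data):
        if sector not in self.table:        # first write: get a block_id
            self.table[sector] = self.alloc_block()
        if sector in self.where:            # cache hit: overwrite in place
            entry = self.cache[self.where[sector]]
            entry[1], entry[2] = data, True
        else:
            self._new_entry(sector, data, modified=True)
```

This sketch does not consult the write-out queue on a cache miss, so it assumes queued blocks reach the server before they are read again; a real client would have to handle that case.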
Structures on NMS Server
[Diagram. Components shown: Client_id Hash Table; Block_id Hash Table (Two-level); Allocated Memory; Memory for Clients; Memory for Internal Use by the NMS.]
Testing and Evaluation
• What do we really want to know?
• What does it take to operate faster than a hard disk?
  – Can you use standard hardware? (Middlesex)
  – Do you need special hardware? (Cambridge)
    • Level 5 Networks
• What are the key parameters in this space?
What do you measure?
• What happens if we change the block size of the data transfer?
• What happens if we change the number of units transferred in one transfer?
  – Added multi-write operation
• Is local caching any good?
• What is the network traffic like?
Using Iozone
• Iozone is quite popular
  – Measures the memory hierarchy
• Disk particulars
  – 60 GB, 2 MB buffer, 7200 RPM, seek time 9.0 ms, average latency 4.16 ms
• Network
  – Using Intel E1000 NICs and Netgear Gigabit Switch (GS 104); using UDP port 6111
• NMS client and server implemented as Linux kernel modules
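A rough back-of-envelope calculation shows why these disk figures leave room for a network memory server. The sketch below derives the average rotational latency from the 7200 RPM figure and compares one random disk access with the raw wire time of a 4 kB message on a Gigabit link. It deliberately ignores protocol headers, switch latency and software overhead, and it assumes 4 kB means 4096 bytes, so the result is a lower bound on network cost rather than a measurement.

```python
RPM = 7200          # from the disk particulars
SEEK_MS = 9.0       # from the disk particulars

# On average the platter turns half a revolution before the sector arrives.
rotational_latency_ms = 0.5 * 60_000 / RPM           # about 4.17 ms (slide quotes 4.16)
disk_access_ms = SEEK_MS + rotational_latency_ms     # about 13.2 ms per random access

LINK_BITS_PER_SEC = 1_000_000_000    # Gigabit Ethernet line rate
MSG_BYTES = 4 * 1024                 # assumed size of a 4 kB message
wire_time_ms = MSG_BYTES * 8 / LINK_BITS_PER_SEC * 1000   # about 0.033 ms

print(f"rotational latency: {rotational_latency_ms:.2f} ms")
print(f"random disk access: {disk_access_ms:.2f} ms")
print(f"4 kB on the wire:   {wire_time_ms:.4f} ms")
```

Even with generous allowances for software and switching delays, the positioning time of the disk is several orders of magnitude larger than the wire time of a single message.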
Read Performance
[Figure. x-axis: kB file (0 to 350000); y-axis: kB/sec (0 to 1400000). Series: mw4, 2MB_cache, 1kB_msg; disk; sw, 2MB_cache, 4kB_msg.]
Record Rewrite Performance
[Figure. x-axis: kB file (0 to 300000); y-axis: kB/sec (0 to 1200000). Series: MW4 2mb cache, 1k; disk system.]
Write Performance for Different Transfer sizes
[Figure. x-axis: kB file (0 to 350000); y-axis: kB/sec (0 to 300000). Series: sw, 2MB_cache, 4kB_msg; disk; sw, 2MB_cache, 1kB_msg; sw, 2MB_cache, 2kB_msg.]
Write Performance for Multiples of 1K blocks
[Figure. x-axis: kB file (0 to 300000); y-axis: kB/sec (0 to 300000). Series: mw4, 2MB_cache, 1kB_msg; disk; mw12, 2MB_cache, 1kB_msg; mw8, 2MB_cache, 1kB_msg; mw16, 2MB_cache, 1kB_msg.]
Write Performance for extreme configurations
[Figure. x-axis: kB file (0 to 350000); y-axis: kB/sec (0 to 300000). Series: disk; mw17k, 2MB_cache, 4kB_msg; mw32k, 8MB_cache, 4kB_msg; sw, 2MB_cache, 1kB_msg.]
Maximum data transfer rate
[Figure. x-axis: Filesize (MB) (50 to 250); y-axis: Rate (Mb/sec) (82 to 88). Series: Received; Sent.]
Buffer cache Hits
[Figure. x-axis: Filesize (MB) (100 to 250); y-axis: % block cache hits (0 to 120). Series: BCH MAX; BCH MIN.]
Conclusions and Future
• We can beat the disk
• Will compare these results with those using Level 5 hardware (Rip Sohan, LCE)
• Open source release planned
• Developing a Network Storage Server
• Building prototypes – running Linux and Windows using NMS