Copyright © 2013 Fusion-io, Inc. All rights reserved.
Running NoSQL Natively on Flash
Fusion-io SDK
Torben Mathiasen & Salvatore Buccoliero
The Future of High Performance Storage?
Everyone seems to agree.
ioMemory
▸ A new memory tier called ioMemory
• Leverages the best advantages of DRAM and rotating drives
▸ High speed like DRAM
▸ Persistence and large capacity like spinning hard drives
▸ PCIe-based NAND flash storage
▸ Microsecond-level access latency – 15 µs
▸ Very high data throughput – 1.5 GB/s
▸ Very high IOPS – 400,000 random writes/s
▸ Scalable – stay ahead of data and performance demands
▸ Advanced wear-leveling algorithm
▸ N+1 chip-level redundancy (think RAID protection on the card)
▸ 100% data integrity protection in case of power loss
▸ Endurance measured in PBW (petabytes written) – terabytes written daily for more than 8 years!
Manufactured by Fusion-io - OEM’ed by
ioMemory vs Disk
800,000 IOPS vs. 150–200 IOPS = 4,000x
Where to use ioMemory
▸ Analytics / Search (Oracle Text)
▸ Messaging (MQ)
▸ Workstation
▸ Databases (Informix)
▸ Virtualization (KVM)
▸ HPC (GPFS)
▸ Big Data (HDInsight)
▸ Security / Logging
▸ Backup
▸ Development
▸ Web (LAMP)
▸ Caching
The compute performance problem
▸ “Compute power continues to outpace performance delivered by Storage.”
▸ “The problem is not getting better, it’s getting worse.”
▸ Processor: multi-core, higher bandwidth
▸ Memory: larger footprint, higher bandwidth
▸ Storage: minor throughput improvements, currently solved with more disks
(Chart: performance over time for processor, memory, and storage.)
Legacy solutions to the data supply problem:
• Add more disk
• Add more memory
• Add more servers
• Optimize the application
Each option requires a significant increase in CAPEX and OPEX, and does not fully address the problem.
Networked storage data supply chain from application to flash
▸ 9 intermediary components required
▸ All adding access delay, cost, and complexity, and lowering reliability (especially the super capacitors)
▸ Requests must make a round trip, touching everything TWICE…
Application → Server Processor → Network Adapter → Network Switch → Network Adapter → Storage Appliance Processor → Disk RAID Controller → SAS/SATA Bus and Protocol → SSD Embedded CPU → SSD RAM → Battery/Super Capacitors → NAND Flash
SSD data supply chain from application to flash
▸ 5 intermediary components required
▸ All adding access delay, cost, and complexity, and lowering reliability (especially the super capacitors)
Application → Server Processor → Disk RAID Controller → SAS/SATA Bus and Protocol → SSD Embedded CPU → SSD RAM → Battery/Super Capacitors → NAND Flash
A horse in front of a Ferrari?
Fusion-io Approach: From Application… to Flash
▸ 0 intermediary components required
▸ No need for super capacitors because data is not “buffered” in DRAM
Application → Server Processor → NAND Flash
The landscape of sub second Timings
▸ L1–L3 cache: 10 ns – the blink of an eye (1/10 second)
▸ DRAM: 100 ns – a heartbeat (1 second)
▸ Fusion-io: 15 µs – getting a cup of coffee (2.5 minutes)
▸ SAN: 4 ms – flying to Thailand (11.1 hours)
Cache is very small (KB) and very expensive; cache and DRAM are volatile, Fusion-io and SAN are non-volatile.
HOW FAST DO YOU GET DATA TO THE FACTORY?
(The multiplier from the real latencies to the everyday analogies is 10M – 10,000,000.)
Direct Cut Through Architecture
LEGACY APPROACH: App / OS → Host CPU (DRAM) → PCIe → RAID Controller → SAS → SSD data path controller (DRAM, super capacitors) → NAND
FUSION DIRECT APPROACH: App / OS → Host CPU (DRAM) → PCIe → NAND
The goal of every I/O operation is to move data to and from DRAM and the device.
Fusion-io is not an SSD device
Usage Models – Baby Steps
• Moving specific components of the database to the ioDrives:
▸ Tempdb database
▸ Indexes
▸ Frequently accessed tables
▸ Transaction logs
▸ Partition tables
All In
• If database size permits, placing the entire database system on Fusion ioDrives provides the maximum performance benefit
n Node Cluster
Clustering / HA with No Shared Storage!
▸ Perfect NoSQL Model
▸ MSFT SQL Server AlwaysOn
▸ Oracle Data Guard
▸ SIOS DataKeeper
▸ Advantages
• Fast replication
• Just another block storage device
(Diagram: application-level mirroring across nodes, each with individual storage.)
Getting Performance To ESXi
▸ External storage for virtual machines too costly
▸ Fusion-io delivers IOPS to hosts and virtual machines
▸ ioTurbine software caches frequently used data on ioDrives
▸ IOPS are served locally instead of from SAN/NFS storage
▸ Reduce SAN/NFS costs
Fusion-io as a storage appliance server
▸ Standard HP, IBM, DELL Servers
▸ Rack or Blades…
▸ SHARED STORAGE
▸ FC
▸ iSCSI
▸ HA MIRROR
ioTurbine with ION cache store (distributed cache)
▸ Best for:
• Enterprise-class shared caching
• Large-scale server farms
• Ideal where servers cannot accommodate local ioMemory
• Each ESX(i) host will have a unique ION LUN presented
(Diagram: cache management with cache store LUNs delivered from ION, backed by primary storage.)
OS Support
Linux:
• RHEL 5.6, 5.7, 5.8, 6.0, 6.1, 6.2
• SLES 10.4, 11, 11.1
• OEL 5.7, 6.0, 6.1, 6.2
• CentOS 5.6, 5.7, 6.0, 6.1, 6.2
• Debian Squeeze
• Fedora 15, 16
• openSUSE 12.1
• Ubuntu 10.04, 11.10
Unix:
• Solaris 10 x64 U8, U9, U10
• OpenSolaris 2009.06 x64
• OS X 10.6 and later
• FreeBSD 8, 9
Windows:
• Windows Server 2003 SP2
• Windows 7 64-bit
• Windows 8 64-bit (in Oct)
• Windows Server 2008 R1 SP2
• Windows Server 2008 R2
• Windows Server 2012 (in Oct)
Hypervisors:
• VMware ESX 4.0, 4.1
• VMware ESXi 4.0, 4.1
• VMware ESXi 5.0, 5.1
• Windows 2008 R2 with Hyper-V
support.fusionio.com
Flash Offers A New Architectural Choice
Disk drives: milliseconds (10^-3) · Server-based flash: microseconds (10^-6) · CPU cache / DRAM: nanoseconds (10^-9)
Evolution of Flash Performance
(Chart: from FLASH AS DISK to FLASH AS MEMORY.)
Let’s look at some charts
Adding 3x the DRAM does not really improve things
HBase Server
▸ A typical server…
CPU Cores: 32 with HT
Memory: 128 GB
Is your working set larger than 128GB?
HBase Cluster
▸ With NoSQL databases, we tend to scale out for DRAM
Combined Resources
CPU Cores: 96
Memory: 384 GB
More cores than needed to serve reads and writes.
The HBase BucketCache (HBASE-7404)
Committed to HBase trunk. Will be in the 0.96 release; a backport patch for 0.94 is available.
Victim cache for the LRUBlockCache – moves fast ioMemory close to the DRAM cache.
https://issues.apache.org/jira/browse/HBASE-7404
BucketCache Configuration
▸ In hbase-site.xml
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>file:/path/to/bucketcache.dat</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <!-- 2 TB: unit is MB -->
  <value>2097152</value>
</property>
BucketCache Warm-up
(Chart: read ops/sec during cache warm-up; y-axis 0–50,000 ops/sec.)
Fusion-io Software Development Kit
Traditional storage: Applications → Block I/O → Proprietary Storage OS → Storage Media
Software Defined Storage: Applications → Block I/O | Enhanced I/O (Atomic Writes / directFS) | Key-Value Store API | Memory Access (Extended Memory, Auto Commit Memory) → Native Flash Translation Layer → Storage Media
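To make the “Memory Access” row more concrete: instead of issuing block reads and writes, an application can map flash-backed storage into its address space and use plain loads and stores. The sketch below uses standard POSIX mmap on a hypothetical ioMemory-backed file path, not the SDK’s Extended Memory or Auto Commit Memory API, purely to illustrate the programming model.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* /mnt/iomemory/region.dat is a placeholder path for a file that
     * lives on an ioMemory-backed file system. */
    int fd = open("/mnt/iomemory/region.dat", O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return 1;
    if (ftruncate(fd, 1 << 20) < 0)        /* reserve a 1 MiB region */
        return 1;

    char *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    /* Memory semantics: ordinary stores instead of write() system calls. */
    strcpy(p, "flash accessed as memory");
    msync(p, 1 << 20, MS_SYNC);            /* flush dirty pages back to flash */

    munmap(p, 1 << 20);
    close(fd);
    return 0;
}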
DirectFS Linux file system
Application (user-space)
Linux VFS (virtual file system) abstraction layer (kernel-space)
▸ directFS path: directFS handles only file metadata mgmt; the Native Flash Translation Layer handles block allocation, mapping, recycling, ACID updates, logging/journaling, and crash-recovery.
▸ Ext3 path: Ext3 handles file metadata mgmt, block allocation, mapping, recycling, ACID updates, logging/journaling, and crash-recovery, on top of the kernel block layer and its primitive interfaces.
directFS: Speed Through Simplicity
(Bar chart: lines of code for directFS, ReiserFS, Ext4, Btrfs, and XFS; scale 0–70,000.)
Atomic writes – Transactional I/O
▸ System call tells DirectFS that all I/O to this file should be treated as atomic
▸ Avoids the partial page write problem
▸ Accepted by T10 technical committee for SCSI standard
▸ Minimal application changes required
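As a rough illustration of the “minimal application changes” point: the application opens its data file on a directFS mount and makes one call asking that all subsequent I/O to the file be treated as atomic. The ioctl request code below is a made-up placeholder for this sketch, not the actual directFS or SDK interface.

#define _GNU_SOURCE                /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define DFS_SET_ATOMIC_WRITES 1    /* hypothetical request code, illustration only */

int open_atomic(const char *path)
{
    int fd = open(path, O_RDWR | O_DIRECT);   /* data file on a directFS mount */
    if (fd < 0)
        return -1;

    /* One call marks the file: from here on, each write either lands
     * completely or not at all, avoiding partial (torn) page writes. */
    if (ioctl(fd, DFS_SET_ATOMIC_WRITES, 1) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}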
Percona Server, MariaDB, MySQL 5.6
▸ Efficient XtraDB/InnoDB storage engine
▸ Well optimized for seek-less storage like flash
▸ Many config parameters to fine-tune performance
▸ What else can be done?
• Lock contention can still be improved, as seen by using multiple instances with the same storage device
• Tapping into the native performance of flash by exposing key FTL features to the application
MySQL Writes Comparison
Traditional MySQL Writes (SSD or HDD):
1. Application initiates updates to pages A, B, and C.
2. MySQL copies the updated pages to the memory buffer.
3. MySQL writes to the double-write buffer on the media.
4. Once step 3 is acknowledged, MySQL writes the updates to the actual tablespace.
MySQL with Atomic Writes (ioMemory):
1. Application initiates updates to pages A, B, and C.
2. MySQL copies the updated pages to the memory buffer.
3. MySQL writes to the actual tablespace, bypassing the double-write buffer step due to inherent atomicity guaranteed by the intelligent device.
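The difference between the two flows above is easiest to see as I/O volume: the traditional path persists every page twice (double-write buffer, then tablespace), while the atomic path writes each page once. The following is a minimal C sketch of that contrast, assuming a device or file system that makes each pwrite() all-or-nothing; it is not MySQL's actual code.

#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 16384                   /* InnoDB-style page */
#define NPAGES    3                       /* pages A, B and C */

/* Traditional path: write the pages to the double-write area first, make
 * them durable, then write them again to their real tablespace locations. */
static int write_with_doublewrite(int fd, const char *pages,
                                  off_t dw_off, const off_t *real_off)
{
    if (pwrite(fd, pages, NPAGES * PAGE_SIZE, dw_off) < 0)
        return -1;
    if (fsync(fd) < 0)                    /* step 3: double-write buffer durable */
        return -1;
    for (int i = 0; i < NPAGES; i++)      /* step 4: real tablespace pages */
        if (pwrite(fd, pages + i * PAGE_SIZE, PAGE_SIZE, real_off[i]) < 0)
            return -1;
    return fsync(fd);
}

/* Atomic path: each page is written once, directly to the tablespace; the
 * device guarantees a crash never exposes a half-written page. */
static int write_atomic(int fd, const char *pages, const off_t *real_off)
{
    for (int i = 0; i < NPAGES; i++)      /* tablespace only, half the I/O */
        if (pwrite(fd, pages + i * PAGE_SIZE, PAGE_SIZE, real_off[i]) < 0)
            return -1;
    return fsync(fd);
}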
Atomic benchmarks
First, let’s sum up the MySQL benefit here:
• Writing only 50% of the data otherwise required for ACID compliance
That’s pretty much it… but it gives us:
▸ Twice the flash endurance
▸ Much better latency because of fewer syscalls
▸ Much better application throughput due to less I/O
▸ Better concurrency due to fewer locks
Atomics: 50% more TPC-C throughput
Fusion-io advanced development: Storage Class Memory
(Diagram: server virtual memory backed by three tiers.)
▸ DRAM: small capacity, volatile, $$/GB
▸ Flash: big capacity, $/GB
▸ SCM: small capacity, persistent – memory-speed persistence, byte-addressable vs. block-addressable
SCM research
▸ Let’s look at keeping a database log using memory semantics
▸ Goal is to reduce latency and the cost of flushing data to a persistent state, and to further minimize writes
▸ SCM testing using a modified Innosim tool
SCM logger interface
▸ logger_open()
Open and initialize logging infrastructure within the FTL
▸ logger_close()
Clean-up
▸ logger_append()
Append to head of log at memory speeds. This basically translates to a memcpy()
▸ logger_sync()
Serialize data using the assembler ‘mfence’ instruction
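A minimal sketch of how these calls could reduce to memory operations, assuming the log head lives in a persistent, memory-mapped region exposed by the FTL (for example through Auto Commit Memory). The struct and setup details below are illustrative, not the actual SDK implementation.

#include <stdint.h>
#include <string.h>

/* Illustrative state set up by logger_open() and torn down by logger_close():
 * 'buf' is assumed to be a persistent region mapped into the address space. */
struct scm_logger {
    uint8_t *buf;     /* mapped persistent log region */
    size_t   head;    /* current append offset        */
    size_t   size;    /* region capacity              */
};

/* logger_append(): append at memory speed - essentially a memcpy into the
 * mapped region; no syscall and no block I/O on the hot path. */
static int logger_append(struct scm_logger *lg, const void *rec, size_t len)
{
    if (lg->head + len > lg->size)
        return -1;                         /* out of log space */
    memcpy(lg->buf + lg->head, rec, len);
    lg->head += len;
    return 0;
}

/* logger_sync(): serialize the preceding stores with an mfence, as described
 * on the slide, so the appended records are ordered before completion. */
static void logger_sync(void)
{
    __asm__ __volatile__("mfence" ::: "memory");
}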
Practical Database Use Case: MySQL
(Chart: Innosim ops/sec)
▸ Baseline – log transactions through block I/O: 8,000 ops/sec
▸ Scenario 1 – no logging: 16,000 ops/sec
▸ Scenario 2 – log to Fusion-io ACM: 15,750 ops/sec
Nearly as fast as disabling the transaction log completely.
The coming shift in software development
▸ As an SSD, flash accelerates applications. At full maturity, Non-Volatile Memory will transform software development.
Native Flash API availability
▸ Percona Server: 5.5.31
▸ MariaDB mainline: 5.5.31
▸ Oracle MySQL:
• https://code.launchpad.net/~tmathiasen/mysql-server/mysql-5.5-fio
▸ Cassandra atomics implementation in progress
▸ DirectFS public beta expected July 22nd
fusionio.com | REDEFINE WHAT’S POSSIBLE
THANK YOU