Running ZFS without a system pool
Bill Pijewski, Software Engineer, Joyent, @pijewski
Tuesday, October 2, 2012
Agenda
• Why ZFS is important to Joyent
• Evolution of USB and PXE boot architectures
• Running with no system pool
ZFS at Joyent
• We run a production cloud with many servers in datacenters worldwide
• Two kinds of zones (covered in detail in other talks):
• Zones: sparse zones share libraries with the platform
• VMs: fully virtualized GNU/Linux, Windows, FreeBSD, etc. machines
• Use a small number of NFS machines to provide additional storage capacity in each datacenter
ZFS for Zones and VMs
• Zones are allocated two ZFS datasets
• One dataset for data in that zone
• Another for core files -- to prevent cores from exceeding quota
• VMs have a ZFS volume into which the VM image is installed, plus one or more additional volumes presented to guest as disks
• Guest filesystems are installed into volumes
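The layout above can be sketched as the `zfs create` invocations it implies. This is a minimal sketch that only builds the command lines and never runs them. The pool name `zones`, the `cores/` prefix, the `-disk0` style suffix, and the sizes are assumptions for illustration, not necessarily the naming Joyent uses.

```python
def zone_dataset_cmds(pool, zone, quota, core_quota):
    """Commands for a zone's two datasets: one for data, one for core files.

    Assumes the parent dataset <pool>/cores already exists.
    """
    return [
        ["zfs", "create", "-o", f"quota={quota}", f"{pool}/{zone}"],
        # A separate dataset with its own quota keeps core files from
        # counting against the quota on the zone's data.
        ["zfs", "create", "-o", f"quota={core_quota}", f"{pool}/cores/{zone}"],
    ]


def vm_volume_cmds(pool, vm, disk_sizes):
    """Commands for a VM's volumes: the first holds the image, the rest are extra disks."""
    return [
        ["zfs", "create", "-V", size, f"{pool}/{vm}-disk{i}"]
        for i, size in enumerate(disk_sizes)
    ]


if __name__ == "__main__":
    for cmd in zone_dataset_cmds("zones", "zone-a", "10G", "1G"):
        print(" ".join(cmd))
    for cmd in vm_volume_cmds("zones", "vm-a", ["10G", "50G"]):
        print(" ".join(cmd))
```

Returning argument lists rather than running anything keeps the sketch safe to try on a machine without ZFS.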
ZFS in different contexts
• For Joyent, two main contexts: SmartOS and SDC
• SmartOS: community distribution, illumos + lightweight virtualization tools
• SmartDataCenter (SDC): SmartOS + full cloud management and orchestration stack
Important ZFS features
• As with all ZFS users, we rely on end-to-end data integrity
• Copy-on-write architecture: snapshots, clones
• Compression
• Space management tools: quotas and reservations
• Replication to move customers around between different machines
Delegated administration!
• In our next SDC release, we enable delegated administration
• Allows customers to:
• Take snapshots outside of Joyentʼs API
• Create child datasets
• Snapshot and clone datasets
• Replicate or migrate data between instances
• Open work: basic limits on delegated activity to avoid denial of service (DoS)
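As a rough illustration of what a customer could do inside a delegated dataset, the sketch below builds the standard `zfs` commands for creating a child dataset, snapshotting it, and cloning the snapshot. The dataset, snapshot, and clone names are hypothetical, and the commands are only constructed, not executed.

```python
def tenant_cmds(delegated, child, snap, clone):
    """Commands a tenant might run inside a dataset delegated to their zone."""
    base = f"{delegated}/{child}"
    return [
        # Create a child dataset under the delegated dataset.
        ["zfs", "create", base],
        # Snapshot it without going through the provider's API.
        ["zfs", "snapshot", f"{base}@{snap}"],
        # Clone the snapshot into a new writable dataset.
        ["zfs", "clone", f"{base}@{snap}", f"{delegated}/{clone}"],
    ]


if __name__ == "__main__":
    for cmd in tenant_cmds("zones/zone-a/data", "db", "before-upgrade", "db-test"):
        print(" ".join(cmd))
```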
ZFS Performance
• SSDs for ZIL
• ARC
• We hold back some portion of a serverʼs total memory, knowing that a good portion of this memory will be consumed by the ARC
• Committing this memory to the ARC achieves greater I/O performance
• ZFS I/O throttle for QoS controls
• For more information, check out Brendan Greggʼs excellent talk next door
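The memory hold-back is simple arithmetic. The sketch below assumes the reserve is expressed as a fraction of total memory; the slides do not say how large the reserve is, so the 25% in the usage line is purely a made-up figure.

```python
def provisionable_memory_gib(total_gib, holdback_fraction):
    """Memory left for zones and VMs after reserving a share for the ARC and the system.

    holdback_fraction is a hypothetical tunable, not a value from the talk.
    """
    if not 0 <= holdback_fraction < 1:
        raise ValueError("holdback_fraction must be in [0, 1)")
    return total_gib * (1 - holdback_fraction)


if __name__ == "__main__":
    # Hypothetical server: 256 GiB of RAM with a quarter held back.
    print(provisionable_memory_gib(256, 0.25))
```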
Read-only system pool
• At Fishworks, we decided to have a read-only system pool
• Necessary for OS install as well as analytics data
• Simplified some things:
• No unnecessary customizations from customers
• Discouraged hot patching
• But there were disadvantages:
• Upgrade, rollback, and factory reset were tricky
SmartOS USB Boot
• Instead of installing OS to root disks, SmartOS boots from a USB key
• Entire kernel and userland fit in about 200 MB (compressed)
• Other software can be installed from pkgsrc
• Single ZFS pool for all zones
USB Boot Advantages
• All disks are available for zone/VM storage, thereby increasing both performance and capacity
• Encourages users to provision a zone for each application rather than using the global zone
• Discourages customization and one-off patching
• Fast to get up and running
• Easy to “bring your OS with you”
SmartDataCenter (SDC) Architecture
• Two kinds of servers: head nodes and compute nodes
• Head nodes run management, provisioning, monitoring, and boot services
• Compute nodes contain customer zones
• Head nodes are similar to SmartOS installs
• Each compute node PXE boots its platform from the head node
• Both head nodes and compute nodes have a single ZFS pool
SDC Diagram
[Diagram: three datacenters (DC 0, DC 1, DC 2). Each has one headnode that PXE boots its compute nodes: CN 0, CN 1, CN 2, ... in DC 0; CN 10, CN 11, CN 12, ... in DC 1; CN 20, CN 21, CN 22, ... in DC 2.]
PXE Boot Advantages
• Ben Rockwood, 10/1/2012: “Apparently other people spend time installing software. I think that's stupid.”
• As with SmartOS, operators are encouraged to put applications in zones instead of the global zone
• Upgrade = rollback = reboot, nothing more
• Newer platforms can be staged and machines rebooted later
• Any machine which hits a known, already-fixed problem will automatically boot onto a fresh platform
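The "upgrade = rollback = reboot" idea can be modeled in a few lines. This is a toy model under stated assumptions, not SDC's actual boot service: the class names, method names, and platform version strings are all hypothetical. The point it illustrates is that a compute node holds no installed OS, so whichever platform the head node has staged for it is what it runs after the next reboot.

```python
class HeadNode:
    """Toy boot server: decides which platform image each compute node gets."""

    def __init__(self, default_platform):
        self.default_platform = default_platform
        self.staged = {}  # compute node name -> platform to serve on next boot

    def stage(self, cn_name, platform):
        # Staging changes nothing on the running machine.
        self.staged[cn_name] = platform

    def platform_for(self, cn_name):
        return self.staged.get(cn_name, self.default_platform)


class ComputeNode:
    """Toy compute node: has no local OS install, only what it PXE boots."""

    def __init__(self, name, head):
        self.name = name
        self.head = head
        self.running = None

    def reboot(self):
        # Upgrade and rollback are the same operation: boot what is staged.
        self.running = self.head.platform_for(self.name)
        return self.running


if __name__ == "__main__":
    head = HeadNode("platform-old")
    cn = ComputeNode("cn0", head)
    print(cn.reboot())
    head.stage("cn0", "platform-new")  # staged now, takes effect later
    print(cn.running)
    print(cn.reboot())
```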
Storage pools!
• Most OSes assume the existence of a “system” pool -- a pool onto which the OS, applications, and configuration information are installed
• Joyent is moving away from single-vdev pools backed by hardware RAID
• Embracing the hybrid storage pool (HSP), using an SSD for the ZFS intent log (ZIL)
• Everything else worked on RAID-Z pools except for saving a crash dump
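A pool of the kind described, RAID-Z across the data disks plus an SSD log device, could be created with a single `zpool create`. The sketch below only assembles that command; the pool name and device names are placeholders, and the choice of `raidz2` in the usage line is an assumption rather than a statement of Joyent's layout.

```python
def zpool_create_cmd(pool, layout, disks, log_device=None):
    """Build a `zpool create` command for one data vdev plus an optional log device."""
    if layout not in ("mirror", "raidz", "raidz2", "raidz3"):
        raise ValueError(f"unsupported layout: {layout}")
    cmd = ["zpool", "create", pool, layout, *disks]
    if log_device:
        # A separate log vdev (typically an SSD) holds the ZFS intent log.
        cmd += ["log", log_device]
    return cmd


if __name__ == "__main__":
    disks = ["c0t0d0", "c0t1d0", "c0t2d0", "c0t3d0"]  # placeholder device names
    print(" ".join(zpool_create_cmd("zones", "raidz2", disks, log_device="c0t4d0")))
```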
RAID-Z Crash Dump
• Problem: the machine has only one RAID-Z or mirrored pool, but a crash dump cannot be saved on that pool
• Solution: implement crash dumps on RAID-Z (the majority of the work) and on pools with multiple vdevs
• Not necessary to save parity bits for crash dump data:
• Crash dump is immediately saved upon reboot
• Needs to be reliable, simple, and (hopefully) fast
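For context on what is being skipped: single-parity RAID-Z keeps an XOR of the data columns in each stripe, so that any one lost column can be rebuilt. The toy sketch below shows only that idea on small byte strings; it is not the illumos implementation, and the function names are made up for illustration.

```python
from functools import reduce


def xor_parity(columns):
    """XOR equal-length byte columns together into a single parity column."""
    return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*columns))


def rebuild_missing(surviving_columns, parity):
    """Recover the one missing data column from the survivors and the parity."""
    return xor_parity([*surviving_columns, parity])


if __name__ == "__main__":
    data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
    parity = xor_parity(data)
    # Pretend the middle column was lost and rebuild it.
    print(rebuild_missing([data[0], data[2]], parity) == data[1])
```

Because a crash dump is saved off the dump device right after reboot, it does not need this long-term protection, which is the trade-off the slide describes.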
Why no parity bits?
• Since DVAs on the dump device are preallocated, use those 128K blocks for each write
• Most calls into the dump entry point are not block aligned
• Rather than write variable size, use original 128K
• I first calculated parity bits, but my test machine took three hours to save a crash dump
• No parity calculated -- otherwise, on a pool with n vdevs, each write could require n-1 (synchronous) reads
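The alignment point can be made concrete with a little arithmetic. The slides do not show the dump code, so the sketch below is only an assumption-labeled illustration of how an arbitrary (offset, size) write maps onto fixed 128K blocks; the function name and return shape are hypothetical.

```python
BLOCK = 128 * 1024  # the fixed 128K block size mentioned above


def blocks_touched(offset, size):
    """Split an unaligned write into pieces, one per 128K block it touches.

    Returns a list of (block_index, start_within_block, length) tuples.
    """
    pieces = []
    end = offset + size
    while offset < end:
        index, start = divmod(offset, BLOCK)
        # Stop at the end of this block or the end of the write, whichever is first.
        length = min(BLOCK - start, end - offset)
        pieces.append((index, start, length))
        offset += length
    return pieces


if __name__ == "__main__":
    # A 1 KiB write that straddles the boundary between blocks 0 and 1.
    print(blocks_touched(BLOCK - 512, 1024))
```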
Other system components
• Swap device (thankfully) supports RAID-Z pools
• /var, /opt have their own datasets
• /etc not persistent
• /root also not persistent, again incentivizing people to configure applications in zones rather than using the global zone (GZ)
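The persistence rules above amount to a small lookup. The sketch below covers only the four directories this slide names and returns `None` for anything else, since the slide says nothing about other paths; the function name is hypothetical.

```python
# Directories named on the slide, mapped to whether they survive a reboot.
PERSISTENCE = {"/var": True, "/opt": True, "/etc": False, "/root": False}


def survives_reboot(path):
    """True or False for paths under a directory the slide names, None otherwise."""
    for top, persistent in PERSISTENCE.items():
        if path == top or path.startswith(top + "/"):
            return persistent
    return None


if __name__ == "__main__":
    for p in ["/var/log/messages", "/etc/passwd", "/root/.profile", "/opt/custom"]:
        print(p, survives_reboot(p))
```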
Summary
• The single ZFS pool has simplified Joyentʼs deployment
• Delegated administration has given customers more power
• ZFS has been and will continue to be a crucial component of our architecture for many years
Questions?
Backup slides
ZFS 101
• ZFS is a copy-on-write filesystem from Sun, originally shipped with Solaris 10
• Many innovative features: data compression, snapshot/rollback, ZFS send/receive, SSD integration
• Enterprise-grade reliability and data integrity
• Two main components relevant here:
• ZFS pools
• ZFS datasets
ZFS Pools
• Aggregate disks into a single storage pool from which “datasets” are allocated
• No parted/LVM needed
• Mix both spinning disks and SSDs:
• L2ARC: extends filesystem buffer cache
• ZIL: absorbs synchronous write activity
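SSDs in either role can be attached to an existing pool with `zpool add`. A minimal sketch that builds the command, with placeholder pool and device names:

```python
def zpool_add_cmd(pool, role, device):
    """Build a `zpool add` command attaching an SSD as a cache (L2ARC) or log (ZIL) device."""
    if role not in ("cache", "log"):
        raise ValueError("role must be 'cache' or 'log'")
    return ["zpool", "add", pool, role, device]


if __name__ == "__main__":
    print(" ".join(zpool_add_cmd("tank", "cache", "c0t5d0")))  # L2ARC
    print(" ".join(zpool_add_cmd("tank", "log", "c0t6d0")))    # intent log
```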
ZFS Datasets
• Datasets are a tree of blocks within the storage pool, presented as either:
• A filesystem (file interface)
• A volume (block interface)
• Datasets can be flexibly resized, and volumes can even be thinly provisioned
• Administrative controls on datasets
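The two dataset flavors and their resizing map onto a handful of `zfs` commands. The sketch below builds them without running anything; the dataset names and sizes passed in are up to the caller and are placeholders in the usage lines.

```python
def create_filesystem_cmd(name, quota=None):
    """`zfs create` for a filesystem, optionally capped with a quota."""
    cmd = ["zfs", "create"]
    if quota:
        cmd += ["-o", f"quota={quota}"]
    return cmd + [name]


def create_volume_cmd(name, size, thin=False):
    """`zfs create -V` for a volume; -s makes it sparse (thinly provisioned)."""
    cmd = ["zfs", "create"]
    if thin:
        cmd.append("-s")
    return cmd + ["-V", size, name]


def resize_cmd(name, prop, value):
    """Resize by changing a property: quota for filesystems, volsize for volumes."""
    if prop not in ("quota", "volsize"):
        raise ValueError("prop must be 'quota' or 'volsize'")
    return ["zfs", "set", f"{prop}={value}", name]


if __name__ == "__main__":
    print(" ".join(create_filesystem_cmd("tank/fs", quota="10G")))
    print(" ".join(create_volume_cmd("tank/vol", "20G", thin=True)))
    print(" ".join(resize_cmd("tank/vol", "volsize", "40G")))
```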
Zones and VMs
• A zone is a lightweight software-virtualized container
• Uses the systemʼs OS platform
• Allocated its own ZFS filesystem (more in a sec)
• A VM is a hardware-virtualized container for GNU/Linux, Windows, BSD, etc.
• Uses its own ZFS volume
• VMʼs filesystem installed into ZFS volume
• Both kinds of machine have resource controls for CPU, memory, and disk I/O
Advantages of ZFS
• Snapshots: zone/VM backup and recovery
• Space management: reservations and quota flexibly allocate space between zones
• Delegated administration: each tenant can administer their own dataset:
• Set compression level and other properties
• Take snapshots of application data
• Generate send streams for replication/backup
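Two of the tenant operations listed above, setting compression and generating a send stream, look roughly like this as `zfs` commands. The sketch only builds the argument lists; dataset and snapshot names are placeholders, and the snapshots passed to `send_cmd` are assumed to exist already.

```python
def set_compression_cmd(dataset, level="on"):
    """`zfs set compression=...`; values such as on, off, lzjb, gzip, gzip-9 are accepted by ZFS."""
    return ["zfs", "set", f"compression={level}", dataset]


def send_cmd(dataset, snap, since=None):
    """`zfs send` for a full stream, or an incremental one when `since` is given."""
    cmd = ["zfs", "send"]
    if since:
        # -i sends only the changes between the two snapshots.
        cmd += ["-i", f"{dataset}@{since}"]
    return cmd + [f"{dataset}@{snap}"]


if __name__ == "__main__":
    print(" ".join(set_compression_cmd("zones/zone-a/data", "gzip-9")))
    print(" ".join(send_cmd("zones/zone-a/data", "tuesday", since="monday")))
```

The output of `zfs send` is a byte stream, so in practice it is piped to a file or to `zfs receive` on the destination machine.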
Advantages of ZFS (2)
• Data integrity: verifies data of VM guest filesystems (ext4, XFS, NTFS, etc.)
• Multiple storage configurations available: mirrored, RAID-Z2, and others
• System fully supported on any storage configuration; can even take a crash dump to a RAID-Z pool