Introduction to Stacki
Greg Bruno, PhD
VP Engineering, StackIQ
Open Source Stack Installer
Stacki is a very fast, ultra-reliable Linux server provisioning tool … at scale, with zero prerequisites for taking systems from bare metal to a ping and a prompt.
PayPal
Hadoop @ PayPal
Data Science Infrastructure Footprint
• 3,000 nodes and growing
• 60+ initial server racks
• Heterogeneous HW across multiple DCs
DATACENTERS: Bay Area, Salt Lake City, Las Vegas
Rack configurations:
• 12 x 2TB SATA data drives, 48 nodes each rack, 1GbE-10GbE NICs
• 24 x 900GB 6G SAS 10K data drives, 24 nodes each rack, 10GbE NIC
• 8 x 4TB NR-SAS data drives, 48 nodes each rack, 10GbE NIC
Automation Challenge
Spinout creates some datacenter automation challenges …
• Smaller team but even more to do
• Rethink automation
• Distributed systems have tons of local drives, which require time-consuming disk formatting and partitioning, and hardware RAID config on master nodes
• New provisioning solution needs to easily, flexibly integrate w/ other commercial, open source, and homegrown management tools
• Can 100s or 1000s of nodes be (re)provisioned as quickly as one or a few? (e.g., drive failures mean replacing entire host from O/S to disk to network to firmware to … etc.)
Stacki @ PayPal
Ambari, HDP, Health Detection
Integration
IPMI/iLO, OS, Disk, Network, DHCP / DNS / TFTP
Ansible
- Disk Array Controller Configuration
- Disk Partitioning Configuration
“Stacki + Ansible = Happiness. :D” – Stacki mailing list 8/11/15
Quick, Early Success
14 Minutes* to Fully Provision 6 Racks of Bare Metal (288 Servers)
Includes wiping all disks, then fully partitioning & formatting ~3500 drives
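The figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes the six racks are of the 48-node, 12-data-drive type listed on the footprint slide; the deck does not say which hardware was used, so the per-node drive count is an assumption. The rates are aggregates, since hosts install in parallel.

```python
# Back-of-the-envelope check of the "14 minutes for 6 racks" figures.
# Assumption (not stated on the slide): each rack holds 48 nodes with
# 12 data drives per node, as in one of the rack types listed earlier.
RACKS = 6
NODES_PER_RACK = 48
DRIVES_PER_NODE = 12
MINUTES = 14

servers = RACKS * NODES_PER_RACK
data_drives = servers * DRIVES_PER_NODE

# Aggregate throughput across the whole installation run.
servers_per_minute = servers / MINUTES
drives_per_minute = data_drives / MINUTES

print(f"servers: {servers}")
print(f"data drives: {data_drives}")
print(f"servers per minute: {servers_per_minute:.1f}")
print(f"drives per minute: {drives_per_minute:.0f}")
```

Under that assumption the server count matches the slide's 288, and the drive count comes to 3,456, in line with the slide's "~3500 drives".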
And Now…
Upgrades all firmware automatically
Executes Ansible scripts on all hosts
Hadoop packages installed
* Versus hours with other hyperscale management tools, or days to weeks with traditional tools and processes
How We Solve the Problem
History
• San Diego Supercomputer Center
  ◦ 1986 - National Science Foundation
  ◦ Along with NCSA, one of only two non-classified centers
  ◦ Mission: serve computational scientists
• Rocks
  ◦ 2000 - First cluster group inside SDSC
  ◦ Version 1.0 released that November as open source
  ◦ 10k+ clusters world-wide
• StackIQ
  ◦ 2006 - Commercial support for Rocks
  ◦ 2011 - Venture backed
  ◦ Focus on next generation clustered systems (Data, Cloud)
• Stacki - 2015
  ◦ June – released as open source
  ◦ July – first hyper-scale user
Must Haves
Make it – Automatic
◦ Think about it, test it, deploy it.
◦ People don’t scale, software does. Free your people: allow ops guys to be ops/analysis guys, and move them from a single-machine view to a global-machine view.
Make it – Repeatable
◦ State of the environment is guaranteed. Does not require homogeneity of hardware or functionality. Make compute environments homogeneous on heterogeneous hardware and software.
◦ Really, nothing is homogeneous. The environment may be, but the behavior of that environment on different machines, while predictable, will not be the same across all hardware. Stacki gets you flexibility and predictability.
Make it – Reliable
◦ You always get what you want when you want it. You can make reasonable estimates of need because you’ve made the environment predictable and repeatable. Just like science!
Make it – Comprehensive
◦ Manage application layer(s) down to kernels and device configuration with one tool. Never hit the network unconfigured.
◦ Provide turn-key deployment with reasonable default settings and the ability to customize / re-wire as desired.
Stacki Positioning
DevOps / Configuration Tool
In-house developed deployment tools
DHCP / DNS / TFTP
Network
Disk
OS
- Disk Array Controller Configuration
- Disk Partitioning Configuration
Datacenter Architecture
[Diagram: a frontend and four backend hosts, each connected to the network through its em1 interface]
Download and Boot the ISO
Go to www.stacki.com and download the ISO
◦ It’s 1.8 GB
◦ “stacki” pallet plus stripped down CentOS 6.6
Boot the ISO on the host that will be your frontend
Frontend Services
Services to build backend nodes
◦ DHCP
◦ TFTP
◦ Named (optional)
Services to access backend nodes
◦ SSH key management
◦ Parallel execution shell
Host Configuration Spreadsheet
[Diagram: a frontend and four backend hosts, each connected to the network through its em1 interface]
Backend Installation
Save your Host Configuration spreadsheet as a CSV
Import CSV on frontend
◦ “stack load hostfile file=hosts.csv”
Tell backend nodes to install on their next PXE boot
◦ “stack set host boot backend action=install”
PXE boot all backend nodes
Done!
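As a rough illustration of the first step, the sketch below generates CSV text for four backend hosts. The column names, host names, IP addresses, and MAC addresses are all hypothetical placeholders chosen for this example; the real Host Configuration spreadsheet layout should be taken from the Stacki documentation.

```python
import csv
import io

# Hypothetical column layout: the real Stacki host spreadsheet may use
# different or additional columns, so treat these names as placeholders.
COLUMNS = ["NAME", "APPLIANCE", "RACK", "RANK", "IP", "MAC", "INTERFACE"]


def make_backend_rows(count, rack=0, subnet="10.1.1", first_octet=101):
    """Generate one row per backend host with made-up addresses."""
    rows = []
    for rank in range(count):
        rows.append({
            "NAME": f"backend-{rack}-{rank}",
            "APPLIANCE": "backend",
            "RACK": rack,
            "RANK": rank,
            "IP": f"{subnet}.{first_octet + rank}",
            # Placeholder MAC; a real file needs each host's actual MAC.
            "MAC": f"00:00:00:00:{rack:02x}:{rank:02x}",
            "INTERFACE": "em1",
        })
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


hosts_csv = to_csv(make_backend_rows(4))
print(hosts_csv)
```

Saving `hosts_csv` as `hosts.csv` would give a file of the general kind that the `stack load hostfile file=hosts.csv` command on the slide imports, provided the columns are adjusted to the real format.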
BitTorrent-Inspired Package Installation
Stacki
Customizing Your Hosts
Advanced Networking
Via Host Configuration spreadsheet, you can configure:
◦ Bonded interfaces
◦ VLANs
◦ Bridging
◦ Any combo of the above
Manage hosts in multiple subnets
◦ Build a single cluster from hosts in multiple subnets
◦ Manage hosts in multiple datacenters
Host Configuration Spreadsheet
Disk Controller Configuration Spreadsheet
Disk Partition Configuration Spreadsheet
Multiple Distributions
A frontend houses a default distribution
◦ Based on stripped down CentOS 6.6 or 7.1
◦ Used to build backend nodes
Can add any number of new distributions to a frontend
◦ E.g., RHEL 6.x based distro, CentOS 6.5, etc.
Assign any backend node to any distro
Why is this hard and important?
Datacenter Architecture
[Diagram: a frontend and four backend hosts, each connected to the network through its em1 interface]
Datacenter Host Software Stack
DevOps / Configuration Tool
In-house developed deployment tools
DHCP / DNS / TFTP
Network
Disk
OS
- Disk Array Controller Configuration
- Disk Partitioning Configuration
The “Step 0” Problem
• Check namenodes are empty
• Format/start HDFS
• Create all directories
• Create all metastores
• Start services (HBase, Hive, Oozie, Sqoop, Impala, etc.)
• Deploy client configuration
• Configure database
• Setup/assign monitors (activity, services, and host)
• Test database connections
• Validate/resolve hostnames
• Consistent host timezones
• No bad kernel versions running
• (CDH) version consistency
• Java version consistency
• Daemons versions consistency
• Mgmt Agents versions consistency
• Host specification/SSH ports
• MUCH MORE …

• DHCP Server/Client setup
• TFTP/PXE configuration
• Server OS installation
• Node OS Install
• RAID configuration
• Boot configuration
• System/data disk partitioning
• Monitoring system setup and config
• Lights Out/IPMI setup
• User accounts added and synced
• SSH keys on all hosts
• Network node configuration
• Config Mgmt install and configuration
• Route configuration
• OS upgrades/updates
• Site specific software and configuration
• Host specification/SSH ports
• Security
• Firewall setup
• Cluster Mgmt utility
• Database install and config
• Multiple network config
• Package installation
• MUCH MORE …
Clusters are Different
Adding new servers does require coordination
Newly added servers must:
• Have same software stack as original servers
• Have same configuration as original servers
• Know about original servers
And, original servers must:
• Know about new servers
Result: The management complexity added to the Operations staff is “exponential”
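One hedged way to put numbers on this is to count the "knows about" updates needed when servers join a cluster. The model below is a simplification introduced for illustration, not something from the slides: every new server learns about every original server and vice versa, while in a general data center only the new servers are touched. Under this model the cluster cost grows with the product of the two counts, which is much faster than linear, though not literally exponential.

```python
def coordination_updates(original, added):
    """Count 'knows about' updates when `added` servers join `original` ones.

    Simplified model (an assumption, not from the slides): every new server
    must learn about every original server, and every original server must
    learn about every new server.
    """
    new_learn_original = added * original
    original_learn_new = original * added
    return new_learn_original + original_learn_new


def independent_updates(original, added):
    """General data center model: only the new servers are touched."""
    return added


for original in (10, 100, 1000):
    added = original  # double the cluster each time
    print(original,
          independent_updates(original, added),
          coordination_updates(original, added))
```

Doubling a 10-node cluster costs 200 updates in this model, while doubling a 1,000-node cluster costs 2,000,000, which is the kind of gap the complexity curves on the following slides depict.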
Exponential Complexity
[Chart: Management Complexity (y-axis) vs. Number of Servers (x-axis); curves: General Data Center, Clusters]
The Pain Curve
[Chart: Management Complexity (y-axis) vs. Number of Servers (x-axis); curves: General Data Center, Clusters; region labeled PAIN]
The Pain Threshold
The pain threshold differs for every organization
Function of:
• cluster(s) size
• number of people in Operations
• Operations staff cluster expertise
Moore’s Law
[Chart: Density (y-axis, 1 to 8) vs. Time in Years (x-axis, 0 to 5); annotation: 18 month doubling]
Moore’s Law and Infrastructure Value
What it Means for You
[Chart: Value (%) (y-axis, 0 to 100) vs. Time in Years (x-axis, 0 to 5); annotations: 3 months, 90% value; 18 months, 50% value]
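The two annotated points are consistent with a simple half-life model. The sketch below assumes that infrastructure value decays smoothly and halves every 18 months, mirroring the 18-month doubling on the previous slide; the exact curve used in the original chart is not given, so this is an interpretation.

```python
HALF_LIFE_MONTHS = 18  # from the slide: 50% value at 18 months


def remaining_value(months_idle):
    """Fraction of original value left after `months_idle` months,
    assuming smooth exponential decay with an 18-month half-life."""
    return 0.5 ** (months_idle / HALF_LIFE_MONTHS)


for months in (0, 1, 3, 6, 12, 18):
    print(f"{months:2d} months: {remaining_value(months):.0%}")
```

With this model, hardware that sits idle for 3 months retains roughly 89% of its value, close to the 90% marked on the chart, and exactly 50% at 18 months.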
Time is Money
The clock starts ticking when hosts land on your loading dock
Without your applications online, you have a paperweight that consumes power, cooling, and management’s attention
Try It Out
stacki.com
Download - www.stacki.com
Source & Docs - github.com/StackIQ/stacki/wiki
Discuss - groups.google.com/forum/#!forum/stacki
PayPal’s Options
• Bring what we used at former parent company eBay with us: Not Possible
• Build our own soup-to-nuts bespoke bare metal provisioning tool: Not Optimal
• Find the perfect open source tool that we can use and grow with: Not Likely
Quick, Early Success
2 Weeks Instead of 2 Years to Build a Scale-out Management Solution
PayPal development with Stacki includes:
1. Installed Stacki Frontend (base management server)
   Ran test installations of backend servers
   1. Single Server test
   2. Full Rack test (48 nodes)
2. Updated distribution (CentOS 6.6) to install additional packages
3. Integrated IPMI information into Stacki
   1. Can now ssh into all IPMI consoles from the Stacki frontend host using <hostname>.ipmi
4. Re-ran with PayPal kickstart changes/additions and was able to image 6 racks in 14 minutes, including:
   1. Nuking disks/partitions and running a full format of all data drives
5. Updated the Stacki post-boot piece to do the following:
   1. Upgrade firmware if host needs it
   2. Run PayPal Ansible playbook, which:
      1. Installs additional packages
      2. Creates user accounts
      3. Disables unused services
      4. Sets up resolver/ntp/syslog-ng/sudoers/limits.d/sysctl/etc.
      5. Installs/configures Ambari agents
      6. Checks data drive mounts, fstab
      7. Prepares the rack to be added to a Hadoop cluster
DevOps Agnostic
DevOps / Configuration Tool
In-house developed deployment tools
DHCP / DNS / TFTP
Network
Disk
OS
- Disk Array Controller Configuration
- Disk Partitioning Configuration
The “Step 0” Problem
App Config
Site Config
HW Install
System Performance Validation
Bare Metal Installers
Hadoop Mgmt Tool
Upgrades/Patching
Disk Configuration
Monitoring Tool
Configuration Tool
Network/Site Config Tools
Systems Mgmt Tool
Others …
MANUAL
SEMI-AUTOMATED TOOLCHAIN (w/o StackIQ)
FULLY AUTOMATED (w/ StackIQ)
StackIQ Boss
Configuration Database
Server appliance types (e.g. data, namenode, tomcat, …)
Number of CPUs
Disk partitioning
Hardware RAID config
PCI bus information …
And other System Attributes
Attributes
Global
◦ stack set attr
Appliance
◦ stack set appliance attr
OS
◦ stack set os attr
Host
◦ stack set host attr
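The four scopes suggest a layered lookup. The sketch below is a toy model of that idea, not Stacki's implementation: the attribute names and values are hypothetical, and the precedence (host over appliance over OS over global) is an assumption, since the slide lists the scopes without saying which one wins.

```python
from collections import ChainMap

# Hypothetical attribute values at each scope; names are illustrative only.
global_attrs = {"timezone": "UTC", "nukedisks": "false"}
os_attrs = {"redhat": {"pkg.manager": "yum"}}
appliance_attrs = {"backend": {"nukedisks": "true"}}
host_attrs = {"backend-0-0": {"timezone": "America/Los_Angeles"}}


def resolve(host, appliance, os_name):
    """Merge the four scopes for one host.

    Assumed precedence (not stated on the slide): host overrides
    appliance, which overrides OS, which overrides global.
    ChainMap searches its mappings left to right.
    """
    return ChainMap(
        host_attrs.get(host, {}),
        appliance_attrs.get(appliance, {}),
        os_attrs.get(os_name, {}),
        global_attrs,
    )


attrs = resolve("backend-0-0", "backend", "redhat")
print(dict(attrs))
```

A host with no overrides of its own, such as a hypothetical `backend-0-1`, would fall through to the appliance, OS, and global values in turn.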
Kickstart Profiles
Zoom In
Starting from the Empty Set
{ }
{ os }
© 2009 UC Regents
{ os, core }
{ os, core, kernel }
{ os, core, kernel, mapr }
Manage the Deltas
{os, core, kernel, mapr} {os, core, kernel, horton}
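The two sets on this slide differ in a single member, which is the point of managing deltas. A minimal sketch using Python sets, purely as an illustration of the idea and not of how Stacki stores its configuration:

```python
# Sets copied from the slide; modeling them as Python sets is only an
# illustration of "manage the deltas", not Stacki's internal representation.
mapr_stack = {"os", "core", "kernel", "mapr"}
horton_stack = {"os", "core", "kernel", "horton"}

shared = mapr_stack & horton_stack        # common base, managed once
only_mapr = mapr_stack - horton_stack     # delta for the first stack
only_horton = horton_stack - mapr_stack   # delta for the second stack

print("shared:", sorted(shared))
print("only in first:", sorted(only_mapr))
print("only in second:", sorted(only_horton))
```

Three of the four members are shared, so only one element per stack has to be managed separately.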
stacki.com
@masonkatz