Riding the Dot-Com Wave: A Case Study in Extreme VMScluster Scalability
CETS2001 Session 1255
Wednesday, Sept. 12, 2:45 pm, 303B
Keith Parris
Topics
• Scale of workload and configuration growth
• Changes made to scale the cluster
• Challenges to extreme scalability and high availability
• Surprises along the way
• Lessons learned
Hardware Growth
• 1981: 1 Alpha Microsystems PC
• 1983: 1 VAX 11/750
• 1993: 1 MicroVAX 4100
• 1996: 4 VAX 7700s, 1 8400
• 1997: 6 VAX 7700s, 2 VAX 7800s, 2 8400s
• 1999: 18 GS-140s
• 2001: 2 clusters; one with 12 GS-140s, the other with 3 GS-140s and 2 GS-160s
Workload Growth Rate
• As measured in yearly peak transaction counts
– 1996-1997: 2X
– 1997-1998: 2X
– 1998-1999: 2X
– 1999-2000: 3X
• We’ll focus on these years
Scaling the Cluster: Memory
• Went from 1 GB to 20 GB on systems
Scaling the Cluster: CPU
• Upgraded VAX 7700 nodes by adding CPUs
• Upgraded key nodes from VAX 7700 to VAX 7800 CPUs
• Ported application from VAX to Alpha
• Went from 2-CPU 8400s to 6-CPU GS-140s, then added 12-CPU GS-160s
– From 200 MHz EV4 chips to 731 MHz EV67
Scaling the Cluster: I/O
• Went from JBOD to RAID, and raised the number of members in RAID arrays over time
• Added RMS Global Buffers (sketch below)
• Went from 3600 RPM magnetic disks to 5400 RPM, then 7200 RPM, then 10K RPM
• Put hot files onto large arrays of Solid-State Disks
• Upgraded from CMD controllers to HSJ40s; added writeback cache; upgraded to HSJ50s and doubled the number of controllers; upgraded to HSJ80s
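As a sketch of the RMS Global Buffers item above: enabling a shared pool of global buffers on a hot indexed file is a one-line DCL change. The file name and buffer count shown here are hypothetical, not values from this cluster.

$ ! Give a heavily-read indexed file a pool of global buffers so all
$ ! processes on a node share cached buckets instead of holding
$ ! private copies (file name and count are illustrative only)
$ SET FILE /GLOBAL_BUFFERS=500 DISK$DATA:[APP]HOTFILE.IDX
$ ! Verify the new global buffer count
$ DIRECTORY /FULL DISK$DATA:[APP]HOTFILE.IDX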
Scaling the Cluster: I/O
• Changed from shadowsets of controller-based stripesets to host-based RAID 0+1 arrays of controller-based mirrorsets
– Avoided any single controller becoming a bottleneck by spreading RAID array members across multiple controllers
– Provided faster time-to-repair for shadowset member failures
– Provided faster cross-site shadow copy times
Shadowsets of Stripesets
• Volume shadowing thinks it sees large disks
• Shadow copies and merges occur sequentially across entire surface
• Failure of 1 member implies full shadow copy of stripeset to fix
[Diagram: a host-based shadowset whose members are two controller-based stripesets]
Host-based RAID 0+1 arrays
• Individual disks are combined first into shadowsets (DCL sketch below)
• Host-based RAID software combines the shadowsets into a RAID 0 array
• Shadowset members can be spread across multiple controllers
[Diagram: a host-based RAID 0+1 array striped across three host-based shadowsets]
Host-based RAID 0+1 arrays
• Shadow copies and merges occur in parallel on all shadowsets at once
• Failure of 1 member requires full shadow copy of only that member to fix
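A minimal DCL sketch of how one two-member host-based shadowset in such a layout is formed; the device names, virtual-unit number, and volume label are hypothetical. The step that then stripes several DSA virtual units into a RAID 0 array belongs to the host-based RAID software and is not shown here.

$ ! Form one two-member host-based shadowset, with the members reached
$ ! through different controllers (device names and label are hypothetical)
$ MOUNT /SYSTEM DSA101: /SHADOW=($1$DUA101:, $2$DUA101:) DATA101
$ ! Repeat for DSA102:, DSA103:, ... ; the host-based RAID software then
$ ! stripes these DSA virtual units into a single RAID 0 array (its
$ ! command syntax is version-specific and omitted rather than guessed at)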
Scaling the Cluster: I/O & Locking
• Implemented Fast_Path on CI (sketch below)
• Tried Memory Channel and failed
– CPU 0 saturation in interrupt state occurred when lock traffic moved from CI (with Fast_Path) to MC (without Fast_Path)
• Went from 2 CI star couplers to 6
– Distributed lock traffic across CIs
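A sketch of how Fast_Path is normally enabled on OpenVMS Alpha via the FAST_PATH system parameter; the exact settings used on this cluster are not given in the talk, so treat this as illustrative.

$ ! In SYS$SYSTEM:MODPARAMS.DAT, request Fast_Path so that CI port
$ ! interrupt and I/O-completion work can be handled off CPU 0:
$ !     FAST_PATH = 1
$ ! Regenerate parameters so the setting persists across reboots
$ @SYS$UPDATE:AUTOGEN GETDATA SETPARAMS NOFEEDBACK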
Scaling the Cluster: Disaster Tolerance
• Implemented Disaster-Tolerant Cluster
– Effectively doubled hardware: CPUs, I/O subsystems, memory
– Significantly improved availability
– But relatively long inter-site distance added inter-site latency as a new potential factor in performance issues
Scaling the Cluster: Datacenter space
• Multi-site clustering and Volume Shadowing provided the opportunity to move to larger datacenters, without downtime, on 3 separate occasions
Challenges:
• Downtime cost $Millions per event
• Longer downtime meant even larger risk
– Had to resist initial pressure to favor quick fixes over any diagnostic efforts that would lengthen downtime
• e.g. crash dump files
Challenges:
• Network focus in application design rather than Cluster focus
– Triggered by history of adding node after node, connected by DECnet, rather than forming a VMS Cluster early on
– Important functions assigned to specific nodes
• Failover and load balancing problematic
– Systems had to boot/reboot in specific order
Challenges:
• Web interface design provided quick time-to-market using screen scraping, but had a fragile 3-process chain with a link to Unix
Challenges:
• Fragile 3-process chain with link to Unix
– Failure of Unix program, TCP/IP link, or any of the 3 processes on VMS caused all 3 VMS processes to die, incurring:
• Process run-down and cleanup
• Process creation and image activations for 3 new processes to replace the 3 which just died
– Slowing response times could cause time-outs and initiate “meltdowns”
Challenges:
• Interactive system capacity requirements in an industry with historically batch-processing mentality:
– Can’t run CPUs to 100% with interactive users like you can with overnight batch jobs
Challenges:
• Adding Solid-State Disks
– Hard to isolate hot blocks
• Ended up moving entire hot RMS files to SSD array
Challenges:
• Application program techniques which worked fine under low workloads failed at higher workloads
– Closing files for backups
– ‘Temporary’ infinite loop
Challenges:
• Standardization on Cisco network hardware and SNMP monitoring
– Even on GIGAswitch-based inter-site cluster link
Challenges:
• Constant pressure to port to Unix
– Sun proponents continually told management:
• “We will be off the VAX in 6 months”
– Adversely affected VMS investments at critical times
• e.g. RZ28D disks, star couplers
Surprises Along the Way:
• As more Alpha nodes were added, lock tree remastering activity caused “pauses” of 10 to 50 seconds every 2-3 minutes
– Controlled with PE1=100 during workday
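PE1 is a dynamic SYSGEN parameter, so it can be raised during the online day and cleared again at night without a reboot; a positive value limits remastering to lock trees smaller than that many locks. A sketch of the command sequence, using the PE1=100 value quoted above:

$ ! Limit dynamic lock-tree remastering during the workday so large,
$ ! hot lock trees stop bouncing between nodes
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SET PE1 100
SYSGEN> WRITE ACTIVE
SYSGEN> EXIT
$ ! Setting PE1 back to 0 (e.g. overnight) re-enables remastering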
Surprises Along the Way:
• Shadowing patch c. 1997 changed algorithm for selecting disk for read operations, suddenly sending ½ of the read requests to the other site, 130 miles (4-5 milliseconds) farther away
– A subsequent patch kit allowed this behavior to be controlled via the SHADOW_SYS_DISK SYSGEN parameter
Surprises Along the Way:
• VMS may allow a lock master node to take on so much workload that CPU 0 later ends up saturated in interrupt state
– Caused CLUEXIT bugchecks and performance anomalies
• With the help of VMS Source Listings and advice from VMS Engineering, wrote programs to spread lock mastership of the hot files across a set of several nodes, and held them there using PE1
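The remastering programs themselves are not part of the talk, but the "held them there using PE1" step can be sketched: once a hot tree has been mastered on the chosen node, PE1 can be set cluster-wide from a single node with SYSMAN so that no node moves it again. This shows the mechanism only, not the site's actual procedure.

$ ! Freeze lock mastership in place cluster-wide after the hot trees
$ ! have been pushed to their designated master nodes
$ RUN SYS$SYSTEM:SYSMAN
SYSMAN> SET ENVIRONMENT /CLUSTER
SYSMAN> PARAMETERS USE ACTIVE
SYSMAN> PARAMETERS SET PE1 100
SYSMAN> PARAMETERS WRITE ACTIVE
SYSMAN> EXIT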
Surprises Along the Way:
• CI Load Sharing code never got ported from VAX to Alpha
• Nodes crashing and rebooting changed assignments of which star couplers were used for lock traffic between pairs of nodes
– Caused unpredictable performance anomalies
• CSC and VMS Engineering came to the rescue with a program called MOVE_REMOTENODE_CONNECTIONS to set up a (static) load-balanced configuration
Surprises Along the Way:
• As disks grew larger, default extent sizes and RMS bucket sizes also grew as files were CONVERTed onto the larger disks using the default optimization script
– Data transfer sizes gradually grew by a factor of 14X over 4 years
– Solid-state disks don’t benefit from increased RMS bucket sizes like magnetic disks do
• Fixed by manually selecting RMS bucket sizes for hot files on solid-state disks
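A sketch of that manual fix, with hypothetical file names: capture the file's attributes in an FDL, set the bucket size deliberately instead of letting the optimizer scale it up with the disk size, then CONVERT the file onto the solid-state disk volume.

$ ! Describe the existing file's structure in an FDL file
$ ANALYZE /RMS_FILE /FDL=HOTFILE.FDL DISK$DATA:[APP]HOTFILE.IDX
$ ! In the FDL editor, set BUCKET_SIZE in each AREA section to a small,
$ ! deliberate value rather than the auto-scaled default
$ EDIT /FDL HOTFILE.FDL
$ ! Rebuild the file on the solid-state disk volume using that FDL
$ CONVERT /FDL=HOTFILE.FDL /STATISTICS DISK$DATA:[APP]HOTFILE.IDX DISK$SSD:[APP]HOTFILE.IDX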
Challenges Left in VMScluster Scalability and High Availability
• Can’t enlarge disks or RAID arrays on-line
• Can’t re-pack RMS files on-line
• Can’t de-fragment open files (with DFO)
• Disks are getting a lot bigger, but not as much faster
– I/Os per second per gigabyte is actually falling
Lessons Learned:
• To provide good system performance one must gain knowledge of application behavior
Lessons Learned:
• High-availability systems require:
– Best possible people to run them
– Best available vendor support:
• Remedial
• Engineering
Lessons Learned:
• Many problems can be avoided entirely (or at least deferred) by providing “reserve” computing capacity
• Avoids saturation conditions
– Avoids error paths and other seldom-exercised code paths
• Provides headroom for peak loads, and to accommodate rapid workload growth when procurement efforts have long lead times
Lessons Learned:
• Staff size must grow with workload growth and cluster size, but with VMS clusters, not at as high a rate
• Staff size went from 1 person to 8 people (plus vendor HW/SW support) with 24X workload growth
Lessons Learned:
• Visibility of system workload and system performance is key, to:
– Spot surges in workload
– Identify bottlenecks as each new one arises (example below)
• Provide quick turn-around of performance info into changes and optimizations
– Overnight, and even mid-day
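The talk does not say which tools provided this visibility; as one standard OpenVMS way to watch the relevant subsystems, MONITOR can be used, for example:

$ ! Cluster-wide CPU, memory, disk, and lock activity at a glance
$ MONITOR CLUSTER
$ ! Distributed lock manager traffic, to watch for remastering trouble
$ MONITOR DLOCK
$ ! Per-disk I/O rates and queue lengths, to spot the next hot volume
$ MONITOR DISK /ITEM=QUEUE_LENGTH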
Lessons Learned:
• With present technology, some scheduled downtime will be needed:
– to optimize performance
– to do hardware upgrades & maintenance
• You’re going to have some downtime: do you want to schedule some, or just deal with it when it happens on its own?
– Scheduled downtime helps prevent or minimize unscheduled downtime
Lessons Learned:
• Despite the redundancy within a cluster, a VMS Cluster viewed as a whole can be a Single Point of Failure
– Solution: Use multiple clusters, with the ability to shift customer data quickly between them if needed
– Can hide scheduled downtime from users
Lessons Learned:
• It was impossible to optimize system performance by system tuning alone
– Deep knowledge of application program behavior had to be gained, by:
• Code examination
• Constant discussions with development staff
• Observing system behavior under load
Lessons Learned:
• Application code improvements are often sorely needed, but their effect on performance can be hard to predict; they may actually hurt things, or make dramatic order-of-magnitude improvements
– They are also often found due to serendipity or sudden inspiration, so it’s also hard to plan them or to predict when they might occur
Lessons Learned:
• Effect of hardware upgrades is easier to predict: double the hardware will double the cost, and will generally provide close to double the performance
– Order-of-magnitude improvements are harder to obtain, and more expensive
Success Factors:
• Excellent people
• Best technology
• Quick procurement, preferably proactive
• Top-notch vendor support
– Services (CSC, MSE)
– VMS Engineering; Storage Engineering
Speaker Contact Info
Keith Parris
E-mail: [email protected] or [email protected]
Web: http://www.geocities.com/keithparris/ and http://encompasserve.org/~kparris/
Integrity Computing, Inc.
2812 Preakness Way
Colorado Springs, CO 80916-4375
(719) 392-6696