Evolution of IT Infrastructure For Fusion Control Systems
Presentation to
14th International Conference on Accelerator & Large Experimental Physics Control Systems (ICALEPCS)
October 6-11, 2013
Tim Frazier, Chief Information Officer
NIF & Photon Science
LLNL-PRES-644303
NIF’s IT architecture is based on four principles
• Individual component failure should not cause infrastructure failure
  - Separate workloads & have more than one running at all times
• Technology should be easily replaceable
  - Avoid an attachment to physical things
• Achieving self-similarity should guide the selection of technology
  - Like model numbers wherever possible
• Use data to forecast resource consumption
  - Create repositories of long-term metrics for analysis
Frazier - ICALEPCS Conference San Francisco, October 6-11, 2013 2 NIF-0000-00000s2.ppt
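The fourth principle (use long-term metrics to forecast resource consumption) can be illustrated with a minimal sketch. It assumes evenly spaced samples and a simple linear trend. The function names and the sample figures are hypothetical and are not taken from the NIF repositories.

```python
from statistics import mean


def fit_line(samples):
    """Least-squares slope and intercept for evenly spaced samples."""
    xs = range(len(samples))
    x_bar, y_bar = mean(xs), mean(samples)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def periods_until(samples, capacity):
    """Periods after the last sample until the trend line reaches capacity.

    Returns None when the trend is flat or falling.
    """
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None
    last_x = len(samples) - 1
    return max(0.0, (capacity - intercept) / slope - last_x)


# Hypothetical monthly storage consumption in TB (illustrative only).
monthly_tb = [40, 42, 44, 46, 48, 50]
months_left = periods_until(monthly_tb, capacity=60)
print(f"Trend reaches 60 TB in about {months_left:.1f} months")
```

A straight-line fit is the simplest possible model. Seasonal or step-change workloads would need a different approach.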
NIF has consolidated its server footprint by 40%
Partnership with key technology providers has been integral to our success
330 servers
423 servers
228 blade servers
123 blade servers
Our physical footprint has been reduced by 50%
SPARC-to-Intel migration made possible by port from Ada to Java
High-density, virtualized servers
Single-purpose to multi-purpose servers made possible by Virtualization (Xen)
We have kept up with customer demand despite the 40% consolidation in footprint
Virtual machines with a low cost of ownership enable single-purpose hosts
Virtualization of our Integrated Computer Control System (ICCS) is nearly complete
[Figure: ICCS virtualization status diagram]
Layers shown: Shot Director and consoles; Supervisory Layer (Collaboration Servers, Subsystem Shot Supervisors); Device Control Layer
Groupings shown: Bundle 1, Bundle 2, Bundles 3-24, NIF
Per-bundle subsystems: Injection Laser, Beam Control, Laser Diagnostics
NIF subsystems: Target Diagnostics, Alignment, LPOM, Industrial Controls
Common components: Collaboration Server, Framework Servers, Analysis Servers, Inspection Systems
Count labels in the diagram: [1], [50], [450], [1300]
Legend: Complete, In Progress
To build an infrastructure, you need building blocks
Ethernet Switch Cisco 6509
Ethernet Switch Cisco 5548
Filer with disks NetApp 3250
Diskless Blade Servers HP BL460c
Fiber DCX Switch
Prototype a single environment
Ethernet Switch Cisco 6509
Ethernet Switch Cisco 5548
Filer with disks NetApp 3250
Blade Servers HP BL460c
Fiber DCX Switch
8 Gb Fiber network
10 Gb Ethernet network
Diskless blades
Create a segmented infrastructure
Ethernet Switch Cisco 6509
Fiber DCX Switch
Segments: Sandbox, Dev - Int - QA, Production, Controls, Services
Computational workload is very densely packed onto hypervisors
6-to-1 @ 27% utilization
~3-to-1 @ 17% utilization
Controls2: Control system virtual machines
ShotProd2: Production, non-control system virtual machines
General02: Development/Integration/QA virtual machines
Production02: Production, non-control system virtual machines
9-to-1 @ 33% utilization
4-to-1 @ 17% utilization
Memory limits the packing factor for hypervisors
Controls2: Control system virtual machines
ShotProd2: Production, non-control system virtual machines
General02: Development/Integration/QA virtual machines
Production02: Production, non-control system virtual machines
Non-controls environments run with less margin
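As a rough illustration of why memory bounds the packing factor, the sketch below computes how many virtual machines fit on one hypervisor once a share of memory is held back as margin. All numbers, parameter names, and the reserve policy are assumptions chosen for the example, not NIF's actual configuration.

```python
import math


def max_vms_per_host(host_mem_gb, vm_mem_gb, reserve_fraction=0.25,
                     hypervisor_mem_gb=4):
    """VMs that fit in memory after hypervisor overhead and a safety margin."""
    usable = (host_mem_gb - hypervisor_mem_gb) * (1 - reserve_fraction)
    return int(usable // vm_mem_gb)


def hosts_needed(vm_count, per_host):
    """Hypervisors required to place vm_count VMs at the given packing factor."""
    if per_host < 1:
        raise ValueError("a host must fit at least one VM")
    return math.ceil(vm_count / per_host)


# Hypothetical 96 GB blade running 8 GB virtual machines.
tight = max_vms_per_host(96, 8, reserve_fraction=0.25)   # less margin
roomy = max_vms_per_host(96, 8, reserve_fraction=0.50)   # more margin
print(f"packing factor: {tight}-to-1 with 25% reserve, {roomy}-to-1 with 50%")
print(f"hosts for 54 VMs: {hosts_needed(54, tight)} vs {hosts_needed(54, roomy)}")
```

An environment that runs with less margin gets a higher packing factor from the same hardware, which is the trade-off the slide describes for non-controls environments.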
We rely on many tools to manage our infrastructure
Asset
3PAR desk manager
AssetDB
DNS
Active Directory
F5 Manager
NetApp Manager
Splunk
Enterprise Manager
IPAM (netping)
Brocade Manager
Statseeker
Storage Infrastructure
Network Infrastructure
Server Infrastructure
We have developed metrics to measure discrepancies between tools & performance outliers
Systemic problems can be revealed by trending metrics over time
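The two kinds of metric mentioned above, discrepancies between tools and performance outliers, can be sketched as follows. This is a minimal illustration under stated assumptions: each tool is assumed to export a set of host names, outliers are flagged with a median-based modified z-score, and the host names and values are hypothetical.

```python
from statistics import median


def inventory_discrepancies(inventories):
    """Map each tool to the hosts that other tools list but it does not."""
    all_hosts = set().union(*inventories.values())
    return {
        tool: sorted(all_hosts - hosts)
        for tool, hosts in inventories.items()
        if all_hosts - hosts
    }


def outliers(values, threshold=3.5):
    """Hosts far from the median, using the modified z-score (MAD based)."""
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        return []  # no spread, so nothing can be called an outlier
    return sorted(
        host for host, v in values.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


# Hypothetical exports from three tools (host names are made up).
inventories = {
    "assetdb": {"ctl01", "ctl02", "app01"},
    "dns": {"ctl01", "ctl02", "app01", "app02"},
    "monitoring": {"ctl01", "app01", "app02"},
}
print(inventory_discrepancies(inventories))

# Hypothetical CPU utilization (%) per hypervisor.
cpu = {"h1": 17, "h2": 18, "h3": 16, "h4": 19, "h5": 75}
print(outliers(cpu))
```

Storing the size of each discrepancy list over time gives a trend line, which is one way a systemic problem (for example, hosts routinely missing from one tool) would become visible.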
Two tools are used to monitor & manage our server infrastructure
Oracle Enterprise Manager: Performance, Configuration Management & Incident Management
Splunk: Log file mining & alerting
See John Fisher’s poster “Monitoring of the National Ignition Facility Control System” (THPCC082)
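The idea of log file mining with alerting can be shown with a small generic sketch. It does not use Splunk or its query language. The log line layout, the host names, and the alert threshold are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed line layout: "<timestamp> <host> <LEVEL> <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<host>\S+) (?P<level>ERROR|WARN|INFO) (?P<msg>.*)$"
)


def error_counts(lines):
    """Count ERROR lines per host, ignoring lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("level") == "ERROR":
            counts[match.group("host")] += 1
    return counts


def alerts(lines, max_errors=2):
    """Hosts whose ERROR count exceeds max_errors."""
    return sorted(h for h, n in error_counts(lines).items() if n > max_errors)


# Hypothetical log excerpt.
sample = [
    "2013-10-07T09:00:01 hostA ERROR supervisor heartbeat missed",
    "2013-10-07T09:00:02 hostB INFO shot sequence started",
    "2013-10-07T09:00:05 hostA ERROR supervisor heartbeat missed",
    "2013-10-07T09:00:09 hostA ERROR supervisor heartbeat missed",
    "2013-10-07T09:00:11 hostB ERROR disk latency high",
]
print(alerts(sample))
```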
Agent-based monitoring provides a wealth of information beyond host performance management
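[Figure: agent-based monitoring architecture]
Components: Management Service, Management Database, Management Agent
Monitored targets: Database, Application, Host
Data collected: Database metrics, Application metrics, Configuration Management, User-Defined Metrics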
Final thoughts (informed by hindsight)
• Self-similarity, more than any other quality, has enabled us to grow our infrastructure by 100% without a corresponding increase in staff
• Segmentation, more than any other design principle, has enabled us to increase reliability by isolating performance degradation & component failures
• Virtualization, more than any other technology, has enabled us to grow our infrastructure by 100% while at the same time consolidating our physical footprint by 50%
• Data from our tools, specifically agent-based collection, more than any other asset, has given us the knowledge needed to manage our infrastructure
• It is time for your questions!