Brian Ropers-Huilman, CCT at LSU, 5th LCI at TACC, “HPC Revolution:
Center for Computation & Technology
What is the CCT?
Research
Education
Economic Development
Mission
Resources
Linux Clusters Institute: The HPC Revolution 2004
“HPC Revolution: Growing Pains”
Brian Ropers-Huilman
Manager, HPC Operations
Center for Computation and Technology at Louisiana State University
Agenda
History of CCT
Clusters at CCT
The Upgrade
The Future
Lessons Learned
Vision 20/20
Governor M. J. “Mike” Foster
Vision 20/20 Plan
$22 Million annually to the state for five years
Focus on Information Technology to drive Economic Development
Focus on Higher Education
LSU received $6.975 Million annually
LSU CAPITAL
LSU CAPITAL (Center for Applied Information Technology and Learning) – Fiscal agency
Education
Graham Hall IT-intensive Residential College
Digital Securities Trading Room
Research
Faculty appointments
IT Apprenticeship program
Infrastructure
Campus infrastructure upgraded with new monies
Complete 1 Gb backbone upgrade
Redundant backbone
Increase Internet bandwidth
H.323 video-conferencing
VoIP
A Center is Born
Dr. Ed Seidel recruited to direct a new research center
Collaborative, interdisciplinary research center that uses technology to:
Manage Facilities
Educate
Research
Develop the Economy
CCT Vision
Ed's Vision:
To create a multidisciplinary research center and computational facilities on par with the national laboratories
HPC plays a central role
Grid technologies
High speed networks
Quality people
CCT Currently
Director's office
HPC
Clusters, grids, storage
Focus Areas
Numerical relativity, grids, LCAT, frameworks, visualization
Programs
Post-doc, visitor, faculty, scholarships
Agenda
History of CCT
Clusters at CCT
The Upgrade
The Future
Lessons Learned
Clusters at CCT
SuperMike
26 October 2001 initial consideration
18 December 2001 invitation to bid
8 March 2002 Atipa Technologies wins the bid
23 May 2002 full machine on-site
10 July “nearing completion”
3 August 2002 first HPL
Fall 2002 first users
Clusters at CCT
Time Lines:
2 months to conceive
3 months for bid
2 months to assemble and ship
2 months to integrate
1 month for benchmark
10 months from conception to reality
6th to 17th place on Top500
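The phase lengths above can be checked against the milestone dates on the SuperMike slide. The following is a minimal sketch, not part of the original talk: the phase labels are shorthand, and the year of the "10 July" milestone is not given on the slide and is assumed here to be 2002.

```python
from datetime import date

# Milestone dates taken from the SuperMike slide.
# ASSUMPTION: the slide gives "10 July" without a year; 2002 is assumed.
MILESTONES = [
    ("initial consideration", date(2001, 10, 26)),
    ("invitation to bid", date(2001, 12, 18)),
    ("bid awarded", date(2002, 3, 8)),
    ("full machine on-site", date(2002, 5, 23)),
    ("nearing completion", date(2002, 7, 10)),
    ("first HPL", date(2002, 8, 3)),
]

# Average month length, used only to turn day counts into rough months.
DAYS_PER_MONTH = 365.25 / 12


def phase_lengths(milestones):
    """Return (start, end, days, approx_months) for each consecutive pair."""
    phases = []
    for (a_name, a_date), (b_name, b_date) in zip(milestones, milestones[1:]):
        days = (b_date - a_date).days
        phases.append((a_name, b_name, days, days / DAYS_PER_MONTH))
    return phases


def total_days(milestones):
    """Days between the first and the last milestone."""
    return (milestones[-1][1] - milestones[0][1]).days


if __name__ == "__main__":
    for start, end, days, months in phase_lengths(MILESTONES):
        print(f"{start} -> {end}: {days} days (~{months:.1f} months)")
    total = total_days(MILESTONES)
    print(f"total: {total} days (~{total / DAYS_PER_MONTH:.1f} months)")
```

The slide rounds each phase to whole months, so the per-phase figures printed here will not match it exactly.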
Clusters at CCT
SuperMike
512 nodes
Dual Xeon 1.8 GHz
2 GB RAM
80 GB HD
Myrinet 'B'
Fast Ethernet
RedHat 7.2
Clusters at CCT
SuperMike benchmarks
12 March 2002 initial benchmarks estimate 2.1 TFLOPS (6th on November 2001 Top500)
3 August 2002 HPL at 2.101 TF
13th on 3 August 2002 Top500
2nd for academic institutions
17th on November 2002 Top500 at 2.207 TF
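To put the HPL numbers in context, they can be compared with a theoretical peak derived from the node counts and clock speed listed for SuperMike. This is a rough sketch rather than a figure from the talk: the value of 2 double-precision floating-point operations per cycle per CPU is an assumption about Xeons of that era, and the function names are illustrative.

```python
def peak_gflops(nodes, cpus_per_node, clock_ghz, flops_per_cycle=2):
    """Theoretical peak in GFLOPS.

    ASSUMPTION: flops_per_cycle=2 (double precision per CPU) is not stated
    in the slides; adjust it if a different figure applies.
    """
    return nodes * cpus_per_node * clock_ghz * flops_per_cycle


def hpl_efficiency(rmax_gflops, rpeak_gflops):
    """Fraction of theoretical peak achieved by an HPL run."""
    return rmax_gflops / rpeak_gflops


# SuperMike as described in the slides: 512 nodes, dual Xeon 1.8 GHz.
SUPERMIKE_PEAK = peak_gflops(nodes=512, cpus_per_node=2, clock_ghz=1.8)

if __name__ == "__main__":
    for label, rmax in [("3 August 2002", 2101.0), ("November 2002", 2207.0)]:
        eff = hpl_efficiency(rmax, SUPERMIKE_PEAK)
        print(f"{label}: {rmax:.0f} GF of {SUPERMIKE_PEAK:.1f} GF peak = {eff:.1%}")
```

Under these assumptions both runs land somewhere between 55% and 60% of peak.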
Clusters at CCT
Initially
1 Systems Administrator
1 Scientific Computing Support
5 major users
Maintained for almost a year
User base grows to 20+ core users, over 100 accounts
Clusters at CCT
SuperHelix
128 nodes
Dual Xeon 1.8 GHz
2 GB RAM
80 GB HD
Myrinet 'B'
Fast Ethernet
RedHat 7.2
MiniMike
16 nodes
Dual Xeon 1.8 GHz
2 GB RAM
80 GB HD
Myrinet 'B'
Fast Ethernet
RedHat 7.2
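Because all three clusters share the same per-node configuration, the aggregate resources follow directly from the node counts. A small sketch of that arithmetic is shown here. The `Cluster` class and `totals` helper are hypothetical names introduced for illustration, and the SuperMike node count comes from its own slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Per-node configuration as listed on the slides (hypothetical helper)."""
    name: str
    nodes: int
    cpus_per_node: int = 2      # "Dual Xeon"
    ram_gb_per_node: int = 2    # "2 GB RAM"
    disk_gb_per_node: int = 80  # "80 GB HD"


CLUSTERS = [
    Cluster("SuperMike", 512),
    Cluster("SuperHelix", 128),
    Cluster("MiniMike", 16),
]


def totals(clusters):
    """Sum nodes, CPUs, RAM and local disk across all clusters."""
    return {
        "nodes": sum(c.nodes for c in clusters),
        "cpus": sum(c.nodes * c.cpus_per_node for c in clusters),
        "ram_gb": sum(c.nodes * c.ram_gb_per_node for c in clusters),
        "disk_gb": sum(c.nodes * c.disk_gb_per_node for c in clusters),
    }


if __name__ == "__main__":
    for key, value in totals(CLUSTERS).items():
        print(f"{key}: {value}")
```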
Clusters at CCT
Cluster Issues:
Component failures
Hard-drives
Interconnect
Stability
Interconnect
Parallel filesystem
Software compatibilities
compilers and libraries
Agenda
History of CCT
Clusters at CCT
The Upgrade
The Future
Lessons Learned
Upgrade Decision
SuperComputer 2003
Discussions with vendors
Snowball effect
Scale? - three distinct clusters
Biggest decision:
Local or Remote?
Answer: Remote
SuperMike Upgrade
Complete network change
Private to public
One full subnet
542 nodes
Xeon 1.8 GHz -> Xeon 3.06 GHz (except test cluster)
Tyan 2720 -> Tyan 2723 (except head and storage)
Myricom B -> Myricom D
SuperMike Upgrade
Storage
Integration of:
local SCSI
enterprise SAN
new “local” SAN
Move from an open system to an allocations policy
Support
Documentation
SuperMike Upgrade
The “understated” issues:
Power (16 additional 20A lines)
Power control (old units not capable of handling the increased power consumption)
Heat (complete redesign of floor tiles)
Firmware upgrades (automatic is not “automagic”)
Full software reconfiguration
Data migration
SuperMike Upgrade
Was it worth it?
256 nodes shipped on 26 Jan
all nodes not back until 19 Mar
Collapsing 512-node work into 128-node cluster
128 node HPL run (gcc):
976 GF -> 3.9 TF (16th on November 2003 list)
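One way to read the 976 GF to 3.9 TF figure is as a linear extrapolation of the 128-node HPL result to the full machine. That reading is an interpretation rather than something the slides state, and the 512-node target is assumed from the "512-node work" mentioned above. A minimal sketch of the arithmetic:

```python
def extrapolate_linear(measured_gflops, measured_nodes, target_nodes):
    """Scale a measured HPL result to a larger node count.

    ASSUMPTION: perfect linear scaling; real HPL runs usually lose some
    efficiency as the node count grows.
    """
    return measured_gflops * target_nodes / measured_nodes


if __name__ == "__main__":
    # 128-node gcc-built HPL run from the slide, scaled to an assumed 512 nodes.
    projected = extrapolate_linear(976, measured_nodes=128, target_nodes=512)
    print(f"projected: {projected:.0f} GF (~{projected / 1000:.1f} TF)")
```

Since linear scaling is the best case, the projected value is better treated as an upper bound than as a prediction.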
Agenda
History of CCT
Clusters at CCT
The Upgrade
The Future
Lessons Learned
National Lambda Rail
LSU one of 16 original university members
4 x 10 Gbps Ethernet lambdas
Controlled provision
Experimental use
Growing fast
Access point next month
LONI / LARN
Louisiana Optical Network Initiative
Louisiana Advanced Research Network
State-wide deployment between major universities
“Extremely optimistic” about state funding
Deployment over FY '04-'05
LONI / LARN
Intent is to:
Deploy clusters at each member university
Clusters of clusters
Grid
Possible evaluation of MPICH-G2 across state-wide network
Push cluster computing further into the state
Agenda
History of CCT
Clusters at CCT
The Upgrade
The Future
Lessons Learned
Lessons Learned
Commodity clusters are not commodities
Deployments always take longer
Consider upgrades very carefully
Lay everything out with the vendor
Plan, plan, plan
“IT Happens”