Brian Ropers-Huilman, CCT at LSU; 5th LCI at TACC, “HPC Revolution: Growing Pains”

Center for Computation & Technology

What is the CCT?

Research

Education

Economic Development

Mission

Resources

Linux Clusters Institute: The HPC Revolution 2004

“HPC Revolution: Growing Pains”

Brian Ropers-Huilman

Manager, HPC Operations

Center for Computation and Technology at Louisiana State University

Agenda

History of CCT

Clusters at CCT

The Upgrade

The Future

Lessons Learned

History of CCT

Vision 20/20

Governor M. J. “Mike” Foster

Vision 20/20 Plan

$22 Million annually to the state for five years

Focus on Information Technology to drive Economic Development

Focus on Higher Education

LSU received $6.975 Million annually

LSU CAPITAL

LSU CAPITAL (Center for Applied Information Technology and Learning) – Fiscal agency

Education

Graham Hall IT-intensive Residential College

Digital Securities Trading Room

Research

Faculty appointments

IT Apprenticeship program

Infrastructure

Campus infrastructure upgraded with new monies

Complete 1 Gb backbone upgrade

Redundant backbone

Increase Internet bandwidth

H.323 video-conferencing

VoIP

A Center is Born

Dr. Ed Seidel recruited to direct a new research center

Collaborative, interdisciplinary research center that uses technology to:

Manage Facilities

Educate

Research

Develop the Economy

CCT Vision

Ed's Vision:

To create a multidisciplinary research center and computational facilities on par with the national laboratories

HPC plays a central role

Grid technologies

High speed networks

Quality people

CCT Currently

Director's office

HPC

Clusters, grids, storage

Focus Areas

Numerical relativity, grids, LCAT, frameworks, visualization

Programs

Post-doc, visitor, faculty, scholarships

Clusters at CCT

SuperMike

26 October 2001 initial consideration

18 December 2001 invitation to bid

8 March 2002 Atipa Technologies wins the bid

23 May 2002 full machine on-site

10 July 2002 “nearing completion”

3 August 2002 first HPL

Fall 2002 first users

Clusters at CCT

Time Lines:

2 months to conceive

3 months for bid

2 months to assemble and ship

2 months to integrate

1 month for benchmark

10 months from conception to reality

Top500 placement: from 6th (estimate against the November 2001 list) to 17th (November 2002 list)
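
The phase durations on this slide add up to the stated ten months. A minimal Python sketch of that arithmetic follows; the phase labels and the two calendar dates come from the slides (first consideration and first HPL run), while the variable names are chosen here purely for illustration.

```python
from datetime import date

# Phase durations in months, as listed on the "Time Lines" slide.
phases = {
    "conceive": 2,
    "bid": 3,
    "assemble and ship": 2,
    "integrate": 2,
    "benchmark": 1,
}

total_months = sum(phases.values())
print(f"{total_months} months from conception to reality")

# Cross-check against two milestones quoted on the SuperMike slide:
# initial consideration (26 October 2001) and first HPL run (3 August 2002).
elapsed_days = (date(2002, 8, 3) - date(2001, 10, 26)).days
print(f"{elapsed_days} days between first consideration and first HPL")
```

The per-phase figures are rounded to whole months, so the sum is only expected to agree roughly with the calendar span.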

Clusters at CCT

SuperMike

512 nodes

Dual Xeon 1.8 GHz

2 GB RAM

80 GB HD

Myrinet 'B'

Fast Ethernet

RedHat 7.2

Clusters at CCT

SuperMike benchmarks

12 March 2002 initial benchmarks estimate 2.1 TFLOPS (6th on November 2001 Top500)

3 August 2002 HPL at 2.101 TF

13th on 3 August 2002 Top500

2nd for academic institutions

17th on November 2002 Top500 at 2.207 TF
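
To put the HPL figures in context, they can be compared with a theoretical peak computed from the hardware listed on the specification slide (512 nodes, dual Xeon 1.8, read here as 1.8 GHz). The sketch below is illustrative only: the talk does not state a peak figure, and the value of 2 double-precision floating-point operations per cycle per processor is an assumption made for this example.

```python
# Hardware figures from the SuperMike specification slide.
nodes = 512
cpus_per_node = 2      # "Dual Xeon"
clock_ghz = 1.8        # read as GHz

# ASSUMPTION (not stated in the talk): each processor completes at most
# 2 double-precision floating-point operations per clock cycle.
flops_per_cycle = 2

peak_gflops = nodes * cpus_per_node * clock_ghz * flops_per_cycle

# Measured HPL results quoted on the benchmark slide, in GFLOPS.
hpl_results = {
    "3 August 2002": 2101.0,
    "November 2002 Top500": 2207.0,
}

efficiencies = {}
for label, gflops in hpl_results.items():
    # HPL efficiency = measured performance / theoretical peak.
    efficiencies[label] = gflops / peak_gflops
    print(f"{label}: {gflops:.0f} GF of {peak_gflops:.1f} GF peak "
          f"= {efficiencies[label]:.1%}")
```

Under that assumption, both runs land somewhat below two thirds of peak; a different flops-per-cycle value would shift the percentages proportionally.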

Clusters at CCT

Initially

1 Systems Administrator

1 Scientific Computing Support

5 major users

Maintained for almost a year

User base grows to 20+ core users, over 100 accounts

Clusters at CCT

SuperHelix

128 nodes

Dual Xeon 1.8 GHz

2 GB RAM

80 GB HD

Myrinet 'B'

Fast Ethernet

RedHat 7.2

MiniMike

16 nodes

Dual Xeon 1.8 GHz

2 GB RAM

80 GB HD

Myrinet 'B'

Fast Ethernet

RedHat 7.2

Clusters at CCT

Cluster Issues:

Component failures: hard drives, interconnect

Stability: interconnect, parallel filesystem

Software compatibilities: compilers and libraries

The Upgrade

Upgrade Decision

SuperComputer 2003

Discussions with vendors

Snowball effect

Scale? - three distinct clusters

Biggest decision:

Local or Remote?

Answer: Remote

SuperMike Upgrade

Complete network change

Private to public

One full subnet

542 nodes

Xeon 1.8 GHz -> Xeon 3.06 GHz (except test cluster)

Tyan 2720 -> Tyan 2723 (except head and storage)

Myricom B -> Myricom D

SuperMike Upgrade

Storage

Integration of:

local SCSI

enterprise SAN

new “local” SAN

Open system to allocations policy

Support

Documentation

SuperMike Upgrade

The “understated” issues:

Power (16 additional 20A lines)

Power control (old units could not handle the increased power consumption)

Heat (complete redesign of floor tiles)

Firmware upgrades (automatic is not “automagic”)

Full software reconfiguration

Data migration

SuperMike Upgrade

Was it worth it?

256 nodes shipped on 26 Jan; all nodes not back until 19 Mar

Collapsing 512-node work into 128-node cluster

128 node HPL run (gcc):

976 GF -> 3.9 TF (16th on November 2003 list)
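
The slide does not explain how the 976 GF and 3.9 TF figures relate. One plausible reading, offered here only as an assumption, is that the 128-node HPL result was scaled linearly to the 512 nodes mentioned on the same slide. A minimal sketch of that extrapolation:

```python
# Figures quoted on the slide.
measured_gflops = 976.0   # 128-node HPL run (gcc)
measured_nodes = 128

# ASSUMPTION: the projection targets the 512 nodes mentioned on the slide.
full_nodes = 512

# ASSUMPTION: perfect linear scaling with node count. Real HPL runs
# usually lose some efficiency as the machine grows.
scale = full_nodes / measured_nodes
projected_tflops = measured_gflops * scale / 1000.0

print(f"Scale factor: {scale:.0f}x")
print(f"Projected HPL: {projected_tflops:.1f} TF")
```

If the 3.9 TF figure was obtained some other way, the agreement with this projection is coincidental and the sketch should be disregarded.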

The Future

National Lambda Rail

LSU one of 16 original university members

4 x 10 Gbps Ethernet lambdas

Controlled provision

Experimental use

Growing fast

Access point next month

LONI / LARN

Louisiana Optical Network Initiative / Louisiana Advanced Research Network

State-wide deployment between major universities

“Extremely optimistic” about state funding

Deployment over FY '04-'05

LONI / LARN

Intent is to:

Deploy clusters at each member university

Clusters of clusters

Grid

Possible evaluation of MPICH-G2 across state-wide network

Push cluster computing further into the state

Lessons Learned

Commodity clusters are not commodities

Deployments always take longer

Consider upgrades very carefully

Lay everything out with the vendor

Plan, plan, plan

“IT Happens”
