NERSC User Group Meeting
NERSC Status and Plans for the NERSC User Group Meeting
February 22, 2001

BILL KRAMER
DEPUTY DIVISION DIRECTOR and DEPARTMENT HEAD, HIGH PERFORMANCE COMPUTING
[email protected]
510-486-7577
Agenda
• Update on NERSC activities
• IBM SP Phase 2 status and plans
• NERSC-4 plans
• NERSC-2 decommissioning
ACTIVITIES AND ACCOMPLISHMENTS
NERSC Facility Mission
To provide reliable, high-quality, state-of-the-art computing resources and client support in a timely manner—independent of client location—while wisely advancing the state of computational and computer science.
2001 GOALS
• PROVIDE RELIABLE AND TIMELY SERVICE
— Systems: Gross Availability, Scheduled Availability, MTBF/MTBI, MTTR
— Services: Responsiveness, Timeliness, Accuracy, Proactivity
• DEVELOP INNOVATIVE APPROACHES TO HELP THE CLIENT COMMUNITY USE NERSC SYSTEMS EFFECTIVELY
• DEVELOP AND IMPLEMENT WAYS TO TRANSFER RESEARCH PRODUCTS AND KNOWLEDGE INTO PRODUCTION SYSTEMS AT NERSC AND ELSEWHERE
• NEVER BE A BOTTLENECK TO MOVING NEW TECHNOLOGY INTO SERVICE
• ENSURE ALL NEW TECHNOLOGY AND CHANGES IMPROVE (OR AT LEAST DO NOT DIMINISH) SERVICE TO OUR CLIENTS
GOALS (CONT'D)
• NERSC AND LBNL WILL BE LEADERS IN LARGE-SCALE SYSTEMS MANAGEMENT & SERVICES
• EXPORT KNOWLEDGE, EXPERIENCE, AND TECHNOLOGY DEVELOPED AT NERSC, PARTICULARLY TO AND WITHIN NERSC CLIENT SITES
• NERSC WILL THRIVE AND IMPROVE IN AN ENVIRONMENT WHERE CHANGE IS THE NORM
• IMPROVE THE EFFECTIVENESS OF NERSC STAFF BY IMPROVING INFRASTRUCTURE, CARING FOR STAFF, AND ENCOURAGING PROFESSIONALISM AND PROFESSIONAL IMPROVEMENT
[Diagram: goal themes surrounding "SUCCESS FOR CLIENTS AND FACILITY": consistent service & system architecture, technology transfer, reliable service, timely information, innovative assistance, large-scale leader, change, research flow, staff effectiveness, new technology, wise integration, mission, excellent staff.]
Major Accomplishments Since Last Meeting (June 2000)
• IBM SP placed into full service April 4, 2000 (more later)
— Augmented the allocations by 1M hours in FY 2000
— Contributed 11M PE hours in FY 2000, more than doubling the FY 2000 allocation
— SP is fully utilized
• Moved the entire facility to Oakland (more later)
• Completed the second PAC allocation process, with lessons learned from the first year
Activities and Accomplishments
• Improved Mass Storage System
— Upgraded HPSS
— New versions of HSI
— Implementing Gigabit Ethernet
— Two STK robots added
— Replaced 3490 tape drives with higher-density, higher-speed 9840 drives
• Formed Network and Security Group
• Succeeded in external reviews
— Policy Board
— SCAC
Activities and Accomplishments
• Implemented new accounting system (NIM)
— The old system was:
• Difficult to maintain
• Difficult to integrate with new systems
• Limited by 32 bits
• Not Y2K compliant
— The new system is:
• Web focused
• Built on available database software
• Able to work with any type of system
• Thrived in a state of increased security
— Open model
— Audits, tests
2000 Activities and Accomplishments
• NERSC firmly established as a leader in system evaluation
— Effective System Performance (ESP) recognized as a major step in system evaluation and is influencing a number of sites and vendors
— Sustained System Performance measures
— Initiated a formal benchmarking effort, the NERSC Application Performance Simulation Suite (NAPs), which may become the next widely recognized parallel evaluation suite
Activities and Accomplishments
• Formed the NERSC Cluster Team to investigate the impact of commodity SMP clusters on high-performance parallel computing and to assure the most effective use of division resources for cluster computing
— Coordinates all NERSC Division cluster computing activities (research, development, advanced prototypes, pre-production, production, and user support)
— Initiated a formal procurement for a mid-range cluster
• In consultation with DOE, decided not to award it as part of NERSC program activities
NERSC Division (Rev: 02/01/01)

HORST SIMON, Division Director
WILLIAM KRAMER, Deputy Director

Division-level groups:
— DIVISION ADMINISTRATOR & FINANCIAL MANAGER: William Fortney
— CHIEF TECHNOLOGIST: David Bailey
— CENTER FOR BIOINFORMATICS & COMPUTATIONAL GENOMICS: Manfred Zorn
— SCIENTIFIC DATA MGMT RESEARCH: Arie Shoshani
— ADVANCED SYSTEMS: Tammy Welcome

HIGH PERFORMANCE COMPUTING DEPARTMENT (William Kramer, Department Head)
— COMPUTATIONAL SYSTEMS: Jim Craw
— COMPUTER OPERATIONS & NETWORKING SUPPORT: William Harris
— MASS STORAGE: Nancy Meyer
— USER SERVICES: Francesca Verdier
— FUTURE INFRASTRUCTURE NETWORKING & SECURITY: Howard Walter
— HENP COMPUTING: David Quarrie

HIGH PERFORMANCE COMPUTING RESEARCH DEPARTMENT (Robert Lucas, Department Head)
— APPLIED NUMERICAL ALGORITHMS: Phil Colella
— CENTER FOR COMPUTATIONAL SCIENCE & ENGR.: John Bell
— FUTURE TECHNOLOGIES: Robert Lucas (acting)
— IMAGING & COLLABORATIVE COMPUTING: Bahram Parvin
— SCIENTIFIC COMPUTING: Esmond Ng
— VISUALIZATION: Robert Lucas (acting)
— SCIENTIFIC DATA MANAGEMENT: Arie Shoshani

DISTRIBUTED SYSTEMS DEPARTMENT (William Johnston, Department Head; Deb Agarwal, Deputy)
— COLLABORATORIES: Deb Agarwal
— DATA INTENSIVE DIST. COMPUTING: Brian Tierney (CERN); William Johnston (acting)
— DISTRIBUTED SECURITY RESEARCH: Mary Thompson
— NETWORKING: William Johnston (acting)
HIGH PERFORMANCE COMPUTING DEPARTMENT (William Kramer, Department Head) (Rev: 02/01/01)

USER SERVICES (Francesca Verdier): Mikhail Avrekh, Harsh Anand, Majdi Baddourah, Jonathan Carter, Tom DeBoni, Jed Donnelley, Therese Enright, Richard Gerber, Frank Hale, John McCarthy, R.K. Owen, Iwona Sakrejda, David Skinner, Michael Stewart (C), David Turner, Karen Zukor

ADVANCED SYSTEMS (Tammy Welcome): Greg Butler, Thomas Davis, Adrian Wong

COMPUTATIONAL SYSTEMS (James Craw): Terrence Brewer (C), Scott Burrow (I), Tina Butler, Shane Canon, Nicholas Cardo, Stephan Chan, William Contento (C), Bryan Hardy (C), Stephen Luzmoor (C), Ron Mertes (I), Kenneth Okikawa, David Paul, Robert Thurman (C), Cary Whitney

COMPUTER OPERATIONS & NETWORKING SUPPORT (William Harris): Clayton Bagwell Jr., Elizabeth Bautista, Richard Beard, Del Black, Aaron Garrett, Mark Heer, Russell Huie, Ian Kaufman, Yulok Lam, Steven Lowe, Anita Newkirk, Robert Neylan, Alex Ubungen

MASS STORAGE (Nancy Meyer): Harvard Holmes, Wayne Hurlbert, Nancy Johnston, Rick Un (V)

HENP COMPUTING (David Quarrie; Craig Tull*, Deputy): Paolo Calafiura, Christopher Day, Igor Gaponenko, Charles Leggett (P), Massimo Marino, Akbar Mokhtarani, Simon Patton

FUTURE INFRASTRUCTURE NETWORKING & SECURITY (Howard Walter): Eli Dart, Brent Draney, Stephen Lau

Key: (C) Cray, (FB) Faculty UC Berkeley, (FD) Faculty UC Davis, (G) Graduate Student Research Assistant, (I) IBM, (M) Mathematical Sciences Research Institute, (MS) Masters Student, (P) Postdoctoral Researcher, (SA) Student Assistant, (V) Visitor, * On leave to CERN
HIGH PERFORMANCE COMPUTING RESEARCH DEPARTMENT (Robert Lucas, Department Head) (Rev: 02/01/01)

APPLIED NUMERICAL ALGORITHMS (Phillip Colella): Susan Graham (FB), Anton Kast, Peter McCorquodale (P), Brian Van Straalen, Daniel Graves, Daniel Martin (P), Greg Miller (FD)

CENTER FOR BIOINFORMATICS & COMPUTATIONAL GENOMICS (Manfred Zorn): Donn Davy, Inna Dubchak, Sylvia Spengler*

SCIENTIFIC COMPUTING (Esmond Ng): Julian Borrill, Xiaofeng He (V), Jodi Lamoureux (P), Lin-Wang Wang, Andrew Canning, Yun He, Sherry Li, Michael Wehner (V), Chris Ding, Parry Husbands (P), Osni Marques, Chao Yang, Tony Drummond, Niels Jensen (FD), Peter Nugent, Woo-Sun Yang (P), Ricardo da Silva (V), Plamen Koev (G), David Raczkowski (P)

CENTER FOR COMPUTATIONAL SCIENCE & ENGINEERING (John Bell): Ann Almgren, William Crutchfield, Michael Lijewski, Charles Rendleman, Vincent Beckner, Marcus Day

FUTURE TECHNOLOGIES (Robert Lucas, acting): David Culler (FB), Paul Hargrove, Eric Roman, Michael Welcome, James Demmel (FB), Leonid Oliker, Erich Strohmaier, Katherine Yelick (FB)

VISUALIZATION (Robert Lucas, acting): Edward Bethel, James Hoffman (M), Terry Ligocki, Soon Tee Teoh (G), James Chen (G), David Hoffman (M), John Shalf, Gunther Weber (G), Bernd Hamann (FD), Oliver Kreylos (G)

IMAGING & COLLABORATIVE COMPUTING (Bahram Parvin): Hui H. Chan (MS), Gerald Fontenay, Sonia Sachs, Qing Yang, Ge Cong (V), Masoud Nikravesh (V), John Taylor

SCIENTIFIC DATA MANAGEMENT (Arie Shoshani): Carl Anderson, Andreas Mueller, Ekow Etoo, M. Shinkarsky (SA), Mary Anderson, Vijaya Natarajan, Elaheh Pourabbas (V), Alexander Sim, Junmin Gu, Frank Olken, Arie Segev (FB), John Wu, Jinbaek Kim (G)

Key: (FB) Faculty UC Berkeley, (FD) Faculty UC Davis, (G) Graduate Student Research Assistant, (M) Mathematical Sciences Research Institute, (MS) Masters Student, (P) Postdoctoral Researcher, (S) SGI, (SA) Student Assistant, (V) Visitor, * Life Sciences Div., on assignment to NSF
FY00 MPP Users/Usage by Discipline
FY00 PVP Users/Usage by Discipline
NERSC FY00 MPP Usage by Site
NERSC FY00 PVP Usage by Site
FY00 MPP Users/Usage by Institution Type
FY00 PVP Users/Usage by Institution Type
NERSC System Architecture

[Diagram: systems interconnected by HIPPI and FDDI/Ethernet (10/100/Gigabit): CRI T3E-900 (644/256), IBM SP NERSC-3 (604 processors, 304 GB memory), IBM SP NERSC-3 Phase 2a (2532 processors, 1824 GB memory), CRI SV1s, Vis Lab, Millennium, symbolic manipulation server, remote visualization server, SGI, Max Strat, IBM and STK robots, HPSS, DPSS, PDSF, Research Cluster, LBNL Cluster, and ESnet.]
Current Systems
• IBM SP RS/6000
— NERSC-3 Phase 2a (Seaborg): a 2,532-processor system using 16-CPU SMP nodes with the "Colony" Double/Double switch. Peak performance is ~3.8 Tflop/s, with 12 GB of memory per computational node and 20 TB of usable, globally accessible parallel disk.
— NERSC-3 Phase 1 (Gseaborg): a 608-processor system with a peak performance of 410 Gflop/s, 256 GB of memory, and 10 TB of disk storage. NERSC production machine.
• Cray T3E
— NERSC-2 (Mcurie): a 696-processor MPP system with a peak speed of 575 Gflop/s (900 Mflop/s peak per processor), 256 MB of memory per processor, and 1.5 TB of disk storage. NERSC production machine.
• Cray Vector Systems
— NERSC-2: three Cray SV1 machines with a total of 96 vector processors, 4 gigawords of memory, and a peak performance of 83 Gflop/s. The SV1 Killeen is used for interactive computing; the remaining two (Bhaskara and Franklin) are batch-only machines.
• HPSS (High Performance Storage System)
— A modern, flexible, performance-oriented mass storage system designed and developed by a consortium of government and commercial entities. It is deployed at a number of sites and centers and is used at NERSC for archival storage.
• PDSF (Parallel Distributed Systems Facility)
— A networked distributed computing environment (a cluster of workstations) used by six large-scale high energy and nuclear physics investigations for detector simulation, data analysis, and software development. The PDSF includes 281 processors in compute nodes and eight disk vaults with file servers providing 7.5 TB of storage.
• PC Cluster Projects
— A 36-node PC cluster testbed available to NERSC users for trial use, and a 12-node Alpha cluster ("Babel") used for Modular Virtual Interface Architecture (M-VIA) development and Berkeley Lab collaborations.
— Associated with the LBNL 160-CPU (80-node) mid-range cluster: Myrinet 2000 interconnect, 1 GB of memory per node, and 1 TB of shared, globally accessible disk.
Major Systems
• MPP
— IBM SP, Phase 2a
• 158 16-way SMP nodes
• 2144 parallel application CPUs; 12 GB per node
• 20 TB shared GPFS
• 11,712 GB of swap space, local to nodes
• ~8.6 TB of temporary scratch space
• 7.7 TB of permanent home space (4-20 GB home quotas)
• ~240 Mbps aggregate I/O measured from user nodes (6 HiPPI, 2 GE, 1 ATM)
— T3E-900 LC with 696 PEs, running UNICOS/mk
• 644 application PEs; 256 MB per PE
• 383 GB of swap space; 582 GB checkpoint file system
• 1.5 TB /usr/tmp temporary scratch space; 1 TB permanent home space (7-25 GB home quotas, DMF managed)
• ~35 MBps aggregate I/O measured from user nodes (2 HiPPI, 2 FDDI)
• 1.0 TB local /usr/tmp
• Serial
— PVP: three J90/SV1 systems running UNICOS
• 64 CPUs total; 8 GB of memory per system (24 GB total)
• 1.0 TB local /usr/tmp
• Clusters
— PDSF, Linux cluster
• 281 IA-32 CPUs
• 3 Linux and 3 Solaris file servers
• DPSS integration
• 7.5 TB aggregate disk space
• 4 striped Fast Ethernet connections to HPSS
— LBNL mid-range cluster
• 160 IA-32 CPUs
• Linux with enhancements
• 1 TB aggregate disk space
• Myrinet 2000 interconnect
• Gigabit Ethernet connections to HPSS
• Storage
— HPSS
• 8 STK tape libraries
• 3490 tape drives
• 7.4 TB of cache disk
• 20 HiPPI, 12 FDDI, and 2 GE connections
• Total capacity ~960 TB; ~160 TB in use
— HPSS Probe
T3E Utilization: 95% Gross Utilization

[Chart: MPP Charging and Usage, FY 1998-2001. 30-day moving averages of CPU hours (0-16,000) for Lost Time, Pierre Free, Pierre, GC0, Mcurie, and Overhead, plotted against Max CPU Hours and 80%/85%/90%-of-peak goal lines. Annotations: systems merged; allocation starvation (two periods); checkpoint, start of capability jobs; full scheduling functionality; 4.4% improvement per month.]
SP Utilization
• Utilization is in the 80-85% range, above original expectations for the first year
• More variation than the T3E
T3E Job Size

[Chart: Mcurie MPP time by job size, 30-day moving average, 4/29/2000 through 11/25/2000. Hours (0-16,000) stacked by job-size bin: <16, 17-32, 33-64, 65-96, 97-128, 129-256, and 257-512 processors. More than 70% of the jobs are "large."]
SP Job Size
~60% of the jobs are larger than ¼ the maximum size

Full-size jobs account for more than 10% of usage
Storage: HPSS

[Chart: storage capacity and usage, Apr 1996 to Jul 2000, in terabytes (log scale, 10-1000). Milestones: CFS-to-Archive migration; UniTree-to-HPSS migration; four IBM 3494 libraries replaced by two STK libraries; IBM 3590 tapes, then STK 9840 tapes, in the STK libraries. Capacity steps: 34+33 TB, 100+33 TB, 100+330 TB, 100+660 TB, 880 TB.]
NERSC Network Architecture

[Diagram: (Future) NERSC Network Logical Diagram, Phase III (OC-12 ESnet connection; nersc.gov addresses only in Oakland). Shows the Oakland Scientific Facility and the LBNL Berkeley site: Cisco Catalyst 8540/6509/5513/5505/5K/29xx switches and a DEC GigaSwitch; FDDI, HIPPI, jumbo-frame Gigabit Ethernet, and VPN segments; public, production, restricted (firewalled), and office subnets (128.55.x.x blocks plus 10.0.0.x/22); the ESnet router via 2x OC-3 (155 Mbps) packet-over-ATM (temporary microwave) and OC-12 (655 Mbps); LBLnet routers and the LBLnet external gateway; name servers ns1.nersc.gov (128.55.6.10) and ns2.nersc.gov (128.55.6.11). Note: maximum single flow rate is 155 Mbps. Rev. 3.21, 12/06/2000, drawn by Ted G. Sopher Jr. and Howard Walter.]
CONTINUE NETWORK IMPROVEMENTS
LBNL Oakland Scientific Facility
Oakland Facility
• 20,000 sf computer room; 7,000 sf office space
— 16,000 sf of computer space built out
— NERSC occupying 12,000 sf
• Ten-year lease with three five-year options
• $10.5M computer room construction cost
• Option for an additional 20,000+ sf computer room
LBNL Oakland Scientific Facility
Move accomplished between Oct 26 and Nov 4

System         Scheduled      Actual
SP             10/27, 9 am    no outage
T3E            11/3, 10 am    11/3, 3 am
SV1s           11/3, 10 am    11/2, 3 pm
HPSS           11/3, 10 am    10/31, 9:30 am
PDSF           11/6, 10 am    11/2, 11 am
Other systems  11/3, 10 am    11/1, 8 am
Computer Room Layout
Up to 20,000 sf of computer space
Direct ESnet node at OC-12
2000 Activities and Accomplishments
• PDSF Upgrade in conjunction with building move
2000 Activities and Accomplishments
• netCDF parallel support developed by NERSC staff for the Cray T3E.
— A similar effort is being planned to port netCDF to the IBM SP platform.
• Communication for Clusters: M-VIA and MVICH
— M-VIA and MVICH are VIA-based software for low-latency, high-bandwidth, inter-process communication.
— M-VIA is a modular implementation of the VIA standard for Linux.
— MVICH is an MPICH-based implementation of MPI for VIA.
FY 2000 User Survey Results
• Areas of most importance to users
— Available hardware (cycles)
— Overall running of the center
— Network access to NERSC
— Allocations process
• Highest satisfaction (score > 6.4)
— Problem reporting/consulting services (timely response, quality, follow-up)
— Training
— Uptime (SP and T3E)
— Fortran (T3E and PVP)
• Lowest satisfaction (score < 4.5)
— PVP batch wait time
— T3E batch wait time
• Largest increases in satisfaction from FY 1999
— PVP cluster (we introduced interactive SV1 services)
— HPSS performance
— Hardware management and configuration (we monitor and improve this continuously)
— HPCF website (all areas are continuously improved, with a special focus on topics highlighted as needing improvement in the surveys)
— T3E Fortran compilers
Client Comments from Survey
"Very responsive consulting staff that makes the user feel that his problem, and its solution, is important to NERSC"
"Provide excellent computing resources with high reliability and ease of use."
"The announcement managing and web-support is very professional."
"Manages large simulations and data. The oodles of scratch space on mcurie and gseaborg help me process large amounts of data in one
go."
"NERSC has been the most stable supercomputer center in the country particularly with the migration from the T3E to the IBM SP".
"Makes supercomputing easy."
NERSC 3 Phase 2a/b
Result: NERSC-3 Phase 2a
• System built and configured
• Started factory tests 12/13
• Expected delivery 1/5
• Undergoing acceptance testing
• General production April 2001
• What is different and needs testing:
— New processors, new nodes, new memory system
— New switch fabric
— New operating system
— New parallel file system software
IBM Configuration
                                     Phase 1        Phase 2a/b
Compute Nodes                        256            134*
Processors                           256x2 = 512    134x16 = 2144*
Networking Nodes                     8              2
Interactive Nodes                    8              2
GPFS Nodes                           16             16
Service Nodes                        16             4
Total Nodes (CPUs)                   304 (604)      158 (2528)
Total Memory (compute nodes)         256 GB         1.6 TB
Total Global Disk (user accessible)  10 TB          20 TB
Peak (compute nodes)                 409.6 GF       3.2 TF*
Peak (all nodes)                     486.4 GF       3.8 TF*
Sustained System Performance         33 GF          235+ GF / 280+ GF
Production Dates                     April 1999     April 2001 / Oct 2001

* Minimum; may increase due to the sustained system performance measure.
What has been completed
• 6 nodes added to the configuration
• Memory per node increased to 12 GB for 140 compute nodes
• "Loan" of full memory for Phase 2
• System installed and braced
• Switch adapters and memory added to the system
• System configuration
• Security audit
• System testing for many functions
• Benchmarks being run and problems being diagnosed
Current Issues
• Failures of two benchmarks need to be resolved
— Best case: they indicate broken hardware, most likely the switch adapters
— Worst case: they indicate fundamental design and load issues
• Variation
• Loading and switch contention
• Remaining tests
— Throughput, ESP
— Full-system tests
— I/O
— Functionality
General Schedule
• Complete testing: TBD, based on problem correction
• Production configuration set up
— 3rd-party software, local tools, queues, etc.
• Availability test
— Add early users ~10 days after successful testing completes
— Gradually add other users; complete ~40 days after successful testing
• Shut down Phase 1 ~10 days after the system is open to all users
— Move 10 TB of disk space; reconfiguration will require Phase 2 downtime
• Upgrade to Phase 2b in late summer or early fall
NERSC-3 Sustained System Performance Projections
• Estimates the amount of scientific computation that can really be delivered
— Depends on delivery of Phase 2b functionality
— The higher the final number, the better, since the system remains at NERSC for 4 more years
[Chart: peak rating for the entire system (TFlop/s, 0-4) vs. sustained system performance on compute nodes (NPB GFlop/s, 0-350), plotted over months 1-34 from installation. Annotations: software lags hardware; test/configuration, acceptance, etc.]
NERSC Computational Power vs. Moore’s Law
[Chart: NERSC computational power vs. Moore's Law (relative to FY 1997), FY 1993 through FY 2003 (est.), in normalized CRUs (0 to 35,000,000). Series: lowest normalized CRU estimate, highest normalized CRU estimate, Moore's Law.]
NERSC 4
NERSC-4
• NERSC-4 IS ALREADY ON OUR MINDS
— PLAN IS FOR FY 2003 INSTALLATION
— PROCUREMENT PLANS BEING FORMULATED
— EXPERIMENTATION AND EVALUATION OF VENDORS IS STARTING
• ESP, ARCHITECTURES, BRIEFINGS
• CLUSTER EVALUATION EFFORTS
— USER REQUIREMENTS DOCUMENT (GREENBOOK) IMPORTANT
How Big Can NERSC-4 Be?

• Assume delivery in FY 2003
• Assume no other space is used in Oakland before NERSC-4
• Assume cost is not an issue (at least for now)
• Assume technology still progresses
— ASCI will have had a 30 Tflop/s system running for over 2 years
How Close Is 100 Tflop/s?

• Available gross space in Oakland is 3,000 sf without major changes
— Assume it is 70% usable; the rest goes to air handlers, columns, etc.
• That gives 3,000 sf of space for racks
• IBM system used for estimates
— Other vendors are similar
• Each processor is 1.5 GHz, yielding 6 Gflop/s
• An SMP node is made up of 32 processors
• 2 nodes in a frame
— 64 processors in a frame = 384 Gflop/s per frame
• Frames are 32-36" wide and 48" deep
— Service clearance of 3 feet in front and back (which can overlap)
— 3 ft by 7 ft = 21 sf per frame
Practical System Peak
• Rack distribution
— 60% of racks are for CPUs
• 90% are user/computation nodes
• 10% are system support nodes
— 20% of racks are for switch fabric
— 20% of racks are for disks
• 5,400 sf / 21 sf per frame = 257 frames
• 277 nodes are directly used for computation
— 8,870 CPUs for computation
— System total is 9,856 CPUs (308 nodes)
• Practical system peak is 53 Tflop/s
— 0.192 Tflop/s per node x 277 nodes
— Some other places would claim 60 Tflop/s
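The frame-count arithmetic on this and the preceding slide can be checked with a short sketch. All inputs come from the slides themselves; the 32-way, 1.5 GHz node and the 21 sf frame footprint are the slides' working assumptions, not vendor specifications:

```python
# Back-of-the-envelope NERSC-4 sizing, reproducing the slide's numbers.
# Inputs are the slide's working assumptions, not vendor specifications.

floor_sf = 5400          # usable rack floor space (sf) from the slide
sf_per_frame = 21        # 3 ft x 7 ft footprint incl. service clearance
cpu_rack_frac = 0.60     # 60% of racks hold CPUs (20% switch, 20% disk)
compute_frac = 0.90      # 90% of CPU nodes do user computation
nodes_per_frame = 2
cpus_per_node = 32
gflops_per_cpu = 6.0     # assumed 1.5 GHz processor at 6 Gflop/s

frames = floor_sf // sf_per_frame                # 257 frames
cpu_frames = int(frames * cpu_rack_frac)         # 154 frames of CPU racks
total_nodes = cpu_frames * nodes_per_frame       # 308 nodes
compute_nodes = int(total_nodes * compute_frac)  # 277 computation nodes
peak_tflops = compute_nodes * cpus_per_node * gflops_per_cpu / 1000.0

print(frames, total_nodes, compute_nodes, round(peak_tflops, 1))
# 257 308 277 53.2
```

This matches the slide's practical peak of ~53 Tflop/s; the slide's 8,870-CPU figure is 277 nodes x 32 CPUs with rounding (the sketch gives 8,864).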
How Much Use Will It Be?

• Sustained vs. peak performance
— Class A codes on the T3E sampled at 11% of peak
— LSMS: 44% of peak on the T3E; 60% of peak so far on Phase 2a (maybe more)
• Efficiency
— The T3E runs at about 95% utilization on a 30-day average
— The SP runs at about 80+% on a 30-day average
• Additional functionality is still planned
How Much Will It Cost?

• The current cost for a balanced system is about $7.8M per Tflop/s
• Aggressive estimate
— Cost should drop by a factor of 4
— $1-2M per Tflop/s
— Many assumptions
• Conservative estimate
— $3.5M per Tflop/s
• Added costs to install, operate, and balance the facility are 20%
• The full cost is $140M to $250M
• Too bad
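The cost range above can be reproduced in the same sketch style, using the previous slide's generous 60 Tflop/s peak claim and this slide's aggressive/conservative $/Tflop/s figures (all numbers are the slides' assumptions):

```python
# Rough NERSC-4 cost envelope from the slide's assumptions.

peak_tflops = 60.0   # the slide's "some would claim 60 Tflop/s" peak
overhead = 1.20      # +20% to install, operate, and balance the facility

for label, m_per_tflops in [("aggressive", 2.0), ("conservative", 3.5)]:
    total_m = peak_tflops * m_per_tflops * overhead
    print(f"{label}: ${total_m:.0f}M")
# aggressive: $144M
# conservative: $252M
```

That bracket is roughly the slide's quoted $140M to $250M full cost.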
The Real Strategy
• Traditional strategy within existing NERSC Program funding: acquire new computational capability every three years, a 3-4x capability increase over existing systems
• Early, commercial, balanced systems with a focus on:
— Stable programming environment
— Mature system management tools
— Good sustained-to-peak performance ratio
• Total value of $25M-$30M
— About $9-10M/yr using lease-to-own
• Have two generations in service at a time
— e.g., T3E and IBM SP
• Phased introduction if technology indicates
• Balance other system architecture components
Necessary Steps
1) Accumulate and evaluate benchmark candidates
2) Create a draft benchmark suite and run it on several systems
3) Create the draft benchmark rules
4) Set basic goals and options for the procurement, then create a draft RFP document
5) Conduct market surveys (vendor briefings, intelligence gathering, etc.). We do this after the first items so we can look for the right information and tell vendors what to expect; we often have to "market" to the vendors on why they should bid, since bidding costs them a lot.
6) Evaluate alternatives and options for the RFP and tests. This is also where we build a technology schedule (when what is available) and estimate prices, price/performance, etc.
7) Refine the RFP and benchmark rules for final release
8) Go through reviews
9) Release the RFP
10) Answer questions from vendors
11) Get responses; evaluate
12) Determine best value; present results and get concurrence
13) Prepare to negotiate
14) Negotiate
15) Put the contract package together
16) Get concurrence and approval
17) Vendor builds the system
18) Factory test
19) Vendor delivers the system
20) Acceptance testing, resolving issues found in testing (first payment is 2 months after acceptance)
21) Preparation for production
22) Production
Rough Schedule
Goal: NERSC-4 installation in the first half of CY 2003

• Vendor responses (step 11) in early CY 2002
• Award in late summer/fall of CY 2002
— Necessary to assure delivery and acceptance (step 22) in FY 2003
• A lot of work and long lead times (for example, we must account for review and approval times, 90 days for vendors to craft responses, time to negotiate, ...)
• NERSC staff kick-off meeting the first week of March
— Some planning work has already been done
NERSC-2 Decommissioning
• RETIRING NERSC-2 IS ALREADY ON OUR MINDS
— IF POSSIBLE, WE WOULD LIKE TO KEEP NERSC-2 IN SERVICE UNTIL 6 MONTHS BEFORE THE NERSC-4 INSTALLATION
• Therefore, expect retirement at the end of FY 2002
— It is "risky" to assume there will be a viable vector replacement
• A team is working to determine possible paths for traditional vector users
— Report due in early summer
SUMMARY
• NERSC does an exceptionally effective job delivering services to DOE and other researchers
• NERSC has made significant upgrades this year that position it well for future growth and continued excellence
• NERSC has a well-mapped strategy for the next several years