High performance computing exposed

Office of Instructional and Research Technology

Very large computing and the real world

a very few thoughts

Eric Marshall, Associate Director for Research Technology

Rutgers University

Shock and awe

• Bigger is better!

The shiny future

• Newer is better!

The real world

• Bugs, warts, and the eternal problem of hindsight

The problem of architecture

• Build as you go vs. predicting the future

Where do you put it, and for how long?

• The problem of a 2x footprint in the land of 24x7

Who is the expert?

• Is the architect, the programmer, the scientist, the owner, the vendor, or the bottle washer the expert? Complex problems are hard.

“Anyone who understands the system isn’t doing science!”

• The problem of users

Supercomputers are disposable

• 3 to 5 year ‘shelf life’

“This system sucks, the last one was better!” (no matter how many systems)

• The problem of transition: porting, change, and habits

Goldilocks’ paradox

• The problem of useful use: efficient programming, useful scaling, overhead, keeping track of results, allocation, etc.

Goldilocks’ paradox (cont’d)

• Someone will always say the solution is just around the corner!

Scaling is deadly

• Scaling problems: OS/SAN/code/people/etc.
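
Why deadly? A minimal Amdahl’s-law sketch may help (added here for illustration; the 5% serial fraction is an assumed figure, and the 2,432-PE point simply mirrors the LSC total in the GFDL summary below):

    # Amdahl's law: speedup is capped by the fraction of work that stays serial.
    def amdahl_speedup(serial_fraction: float, n_procs: int) -> float:
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)

    for n in (16, 128, 512, 2432):
        print(f"{n:5d} PEs -> {amdahl_speedup(0.05, n):5.1f}x speedup (5% serial)")

With just 5% serial work, the speedup stalls near 20x no matter how many PEs you add, and in practice that serial fraction hides in the OS, the SAN, and the code itself.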

GFDL HPCS, July 2005

Large Scale Cluster (LSC): SGI Origin 3800 + 3900, 600 MHz
• 2 nodes x 512 PE + 512 GB + 2.9 TB disk
• 5 nodes x 256 PE + 256 GB + 0.9 TB disk
• 1 node x 128 PE + 128 GB + 0.9 TB disk
• SAN bandwidth: 2 GB/s per LSC node
• Software: CXFS, PCP, Workshop Pro, GridEngine, S-Plus, TotalView, Matlab, NAG SMP, Mathematica

Analysis Cluster (ANC): SGI Origin 3900, 600 MHz
• 2 nodes x 96 PE + 96 GB + 4.2 TB disk
• SAN bandwidth: 2 GB/s per ANC node
• Software: GridEngine, CXFS, PCP, Workshop Pro

CCCI Cluster (IC): SGI Altix 3700, 1.5 GHz
• 2 nodes x 256 PE + 512 GB + 2 TB disk
• 1 node x 96 PE + 192 GB + 3 TB disk
• SAN bandwidth: 2 GigE per node, NFS mounted
• Software: PCP, Workshop Pro, GridEngine, TotalView, NAG

MetaData Server (MDS), HFS & HSMS server: SGI Origin 3800, 600 MHz
• 2 nodes x 64 PE + 64 GB, 2.8 TB disk
• Disk SAN: 4 GB/s per MDS node; tape SAN: 1 GB/s per MDS node
• Redundant access, dual-ported Fibre Channel
• Software: Failsafe, DMF, CXFS

Disk SAN: 23.6 TB SAN disk
• TP9100B, 5+P+HS RAID5 with dual controllers, 2 Gbit/s Fibre Channel
• SAN (FC) switches: Brocade 2800 & 3800

Tape SAN
• 4 x STK 9310 tape libraries
• 24 x 9940B drives (200 GB, 30 MB/s)
• 22 x 9840A drives (20 GB, 10 MB/s)
• 3.5 PB tape storage on-line, 1.5 PB off-line

LAN: Cisco Catalyst 6509
• 4 x 16 GbE
• 2 x 48 Fast Ethernet

Visualization: Onyx 3 with InfiniteReality 3 graphics

Computational capability and capacity
• 89 coupled climate model years per computational day (1 deg. ocean model, 2 deg. atmospheric model)
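
As a quick, hedged tally of the compute partitions above (assuming the memory figures are per node, which the PE-to-memory ratios suggest):

    # Sum PEs and memory across the GFDL HPCS clusters listed above.
    # Assumption: each "N nodes x P PE + G GB" line gives per-node figures.
    clusters = {
        "LSC": [(2, 512, 512), (5, 256, 256), (1, 128, 128)],  # (nodes, PE, GB)
        "ANC": [(2, 96, 96)],
        "IC":  [(2, 256, 512), (1, 96, 192)],
    }
    for name, rows in clusters.items():
        pes = sum(n * pe for n, pe, _ in rows)
        mem = sum(n * gb for n, _, gb in rows)
        print(f"{name}: {pes} PEs, {mem} GB memory")
    # LSC: 2432 PEs, 2432 GB; ANC: 192 PEs, 192 GB; IC: 608 PEs, 1216 GB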

Questions?

Eric Marshall
Office of Instructional and Research Technology
eric.marshall@rutgers.edu
732 445-2262