Distributed Shared-Memory Parallel Computing with UPC on SAN-based Clusters
Dr. Alan D. George, Principal Investigator
Mr. Burton C. Gordon, Sr. Research Assistant
Mr. Hung-Hsun Su, Sr. Research Assistant
HCS Research Laboratory, University of Florida
Outline
Objectives and Motivations
Background
Related Research
Approach
Results
Conclusions and Future Plans
Objectives and Motivations
Objectives
Support advancements for HPC with Unified Parallel C (UPC) on cluster systems exploiting high-throughput, low-latency System-Area Networks (SANs) and LANs
Design and analysis of tools to support UPC on SAN-based systems
Benchmarking and case studies with key UPC applications
Analysis of tradeoffs in application, network, service, and system design
Motivations
Increasing demand in sponsor and scientific computing community for shared-memory parallel computing with UPC
New and emerging technologies in system-area networking and cluster computing: Scalable Coherent Interface (SCI), Myrinet (GM), InfiniBand (VAPI), QsNet (Quadrics Elan), Gigabit Ethernet and 10 Gigabit Ethernet
Clusters offer excellent cost-performance potential
Background
Key sponsor applications and developments toward shared-memory parallel computing with UPC
UPC extends the C language to exploit parallelism (see the minimal sketch below)
Currently runs best on shared-memory multiprocessors or proprietary clusters (e.g. AlphaServer SC), notably with HP/Compaq's UPC compiler
First-generation UPC runtime systems becoming available for clusters: MuPC, Berkeley UPC
Significant potential advantage in cost-performance ratio with Commercial Off-The-Shelf (COTS) cluster configurations; leverage economy of scale
Clusters exhibit low cost relative to tightly coupled SMP, CC-NUMA, and MPP systems
Scalable performance with COTS technologies
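To make the language concrete, a minimal UPC sketch (illustrative only, not one of the applications discussed later): a cyclically distributed shared array updated in parallel with upc_forall.

    /* Minimal UPC sketch: each thread updates the shared-array elements it owns. */
    #include <upc_relaxed.h>
    #include <stdio.h>

    #define N 1024

    shared int a[N * THREADS];      /* default block size 1: elements cycle over threads */

    int main(void)
    {
        int i;

        upc_forall (i = 0; i < N * THREADS; i++; &a[i])   /* iteration i runs on the owner of a[i] */
            a[i] = 2 * i;

        upc_barrier;                                      /* all writes complete before any reads */

        if (MYTHREAD == 0)
            printf("THREADS=%d, a[1]=%d\n", THREADS, a[1]);
        return 0;
    }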
Related Research
University of California at Berkeley
UPC runtime system
UPC to C translator
Global-Address Space Networking (GASNet) design and development
Application benchmarks
George Washington University
UPC specification
UPC documentation
UPC testing strategies, testing suites
UPC benchmarking
UPC collective communications
Parallel I/O
Michigan Tech University
Michigan Tech UPC (MuPC) design and development
UPC collective communications
Memory model research
Programmability studies
Test suite development
Ohio State University
UPC benchmarking
HP/Compaq UPC compiler
Intrepid GCC UPC compiler
Related Research -- MuPC & DSM
MuPC (Michigan Tech UPC)
First open-source reference implementation of UPC for COTS clusters; any cluster that provides Pthreads and MPI can use it
Built as a reference implementation, so performance is secondary
Limitations in application size and memory model
Not suitable for performance-critical applications
UPC/DSM/SCI
SCI-VM (DSM system for SCI)
HAMSTER interface allows multiple modules to support MPI and shared-memory models
Created using Dolphin SISCI API and ANSI C
SCI-VM is not under active development, so future upgrades are uncertain
Not feasible given the amount of work needed versus the expected performance; better possibilities with GASNet
Related Research -- GASNet
Global-Address Space Networking (GASNet) [1]: communication system created by U.C. Berkeley
Target for Berkeley UPC system
Language-independent, low-level networking layer for high-performance communication
Segment region for communication on each node, three types:
Segment-fast: sacrifice size for speed
Segment-large: allows large memory area for shared space, perhaps with some loss in performance (though the firehose [2] algorithm is often employed for better efficiency)
Segment-everything: expose the entire virtual memory space of each process for shared access
Firehose algorithm allows memory to be managed in buckets for efficient transfers
Interface for high-level global address space SPMD languages: UPC [3] and Titanium [4]
Divided into two layers:
Core: Active Messages
Extended: high-level operations which take direct advantage of network capabilities (a brief client-side usage sketch follows below)
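For reference, a hedged sketch of how a GASNet client drives the extended API, loosely following the spec [1]; error checking, handler registration, and segment sizing are simplified, and the fixed-size seginfo array is an assumption of this sketch, not part of the API.

    /* GASNet extended-API sketch: blocking and non-blocking put/get into a peer's segment. */
    #include <gasnet.h>
    #include <string.h>

    static char buf[1024];

    int main(int argc, char **argv)
    {
        gasnet_node_t me, peer;
        gasnet_seginfo_t seg[256];                       /* assumes <= 256 nodes (sketch only) */
        gasnet_handle_t h;

        gasnet_init(&argc, &argv);
        gasnet_attach(NULL, 0, GASNET_PAGESIZE, 0);      /* no AM handlers, one-page segment */

        me   = gasnet_mynode();
        peer = (me + 1) % gasnet_nodes();
        gasnet_getSegmentInfo(seg, gasnet_nodes());
        memset(buf, (int) me, sizeof buf);

        /* Blocking one-sided transfers into and out of the peer's registered segment. */
        gasnet_put(peer, seg[peer].addr, buf, sizeof buf);
        gasnet_get(buf, peer, seg[peer].addr, sizeof buf);

        /* Non-blocking variant: initiate, overlap other work, then synchronize. */
        h = gasnet_put_nb(peer, seg[peer].addr, buf, sizeof buf);
        gasnet_wait_syncnb(h);

        gasnet_exit(0);
        return 0;
    }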
Related Research -- Berkeley UPC
Second open-source implementation of UPC for COTS clusters
First with a focus on performance
GASNet used for all accesses to remote memory
Network conduits allow for high performance over many different interconnects
Targets a variety of architectures: x86, Alpha, Itanium, PowerPC, SPARC, MIPS, PA-RISC
Best chance as of now for high-performance UPC applications on COTS clusters
Note: only supports strict shared-memory access and therefore only uses the blocking transfer functions in the GASNet spec
[Figure: Berkeley UPC software stack — UPC code, translator, translator-generated C code, Berkeley UPC runtime system, GASNet communication system, network hardware; the layer interfaces are platform-, network-, compiler-, and language-independent]
Approach
Collaboration
HP/Compaq UPC Compiler V2.1 running in our lab on ES80 AlphaServer (Marvel)
Support of testing by OSU, MTU, UCB/LBNL, UF, et al. with leading UPC tools and systems for function performance evaluation
Field test of newest compiler and system
Exploiting SAN Strengths for UPC
Design and develop new SCI conduit for GASNet in collaboration with UCB/LBNL
Evaluate DSM for SCI as an option for executing UPC
Benchmarking
Use and design of applications in UPC to grasp key concepts and understand performance issues
NAS benchmarks from GWU
DES-cypher benchmark from UF
Performance Analysis
Network communication experiments
UPC computing experiments
Emphasis on SAN Options and Tradeoffs
SCI, Myrinet, InfiniBand, Quadrics, GigE, 10GigE, etc.
[Figure: Collaboration across layers —
Upper layers (applications, translators, documentation): Ohio State: benchmarks; Michigan Tech: benchmarks, modeling, specification; UC Berkeley: benchmarks, UPC-to-C translator, specification; GWU: benchmarks, documents, specification; UF HCS Lab: benchmarks.
Middle layers (runtime systems, interfaces): Michigan Tech: UPC-to-MPI translation and runtime system; UC Berkeley: C runtime system, upper levels of GASNet; HP: UPC runtime system on AlphaServer; UF HCS Lab: GASNet collaboration, beta testing.
Lower layers (runtime systems, interfaces): UC Berkeley: GASNet; UF HCS Lab: GASNet collaboration, network performance analysis.]
GASNet SCI Conduit
Scalable Coherent Interface (SCI)
Low-latency, high-bandwidth SAN with shared-memory capabilities
Requires memory exporting and importing
PIO (requires importing) + DMA (needs 8-byte alignment)
Remote write ~10x faster than remote read
SCI conduit
AM enabling (core API)
Dedicated AM message channels (Command)
Request/Response pairs to prevent deadlock
Flags to signal arrival of new AM (Control)
Put/Get enabling (extended API): global segment (Payload); a hypothetical layout sketch follows the figure
[Figure: SCI conduit segment layout per node — Control segments (N total), Command segments (N×N total), and Payload segments (N total), mapped between SCI space, physical addresses, and virtual addresses via exporting/importing, plus local DMA queues]
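Purely as an illustration of the exported regions named above, a hypothetical C layout sketch; all names and sizes here are assumptions, not the conduit's actual declarations.

    /* Hypothetical segment layout for the SCI conduit's exported regions. */
    #define MAX_NODES 8            /* assumed cluster size for this sketch            */
    #define NUM_SLOTS 16           /* assumed AM command slots per node pair          */

    typedef struct {               /* Control segment: one per node (N total)         */
        volatile int new_message[MAX_NODES];    /* flags signalling AM arrival        */
    } control_segment_t;

    typedef struct {               /* Command segment: one per sender/receiver pair (N*N total) */
        struct {
            volatile int  ready_flag;           /* set to the AM type when complete   */
            unsigned char header[64];           /* AM header written by PIO           */
            unsigned char payload[1024];        /* medium-AM payload written by PIO   */
        } slot[NUM_SLOTS];
    } command_segment_t;

    typedef struct {               /* Payload segment: one per node (N total)         */
        unsigned char data[1 << 20];            /* long-AM / put-get data (DMA target) */
    } payload_segment_t;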
GASNet SCI Conduit - Core API
[Figure: Core API data structures and AM polling flow — command segment contents (AM header, medium AM payload, long AM payload, message ready flags) exchanged between nodes X and Y; the polling loop enqueues newly arrived messages, dequeues and handles them (or processes AM/control replies), and exits when no messages remain]
The queue is constructed by placing a next pointer in the header that points to the next available message.
Active Message Transferring
1. Obtain free slot; tracked locally using an array of flags
2. Package AM header
3. Transfer data
   Short AM: PIO write (header)
   Medium AM: PIO write (header), PIO write (medium payload)
   Long AM: PIO write (header), PIO write (payload up to 1024 bytes or unaligned portion of payload), DMA write (multiples of 8 bytes)
4. Wait for transfer completion
5. Signal AM arrival
   Message Ready Flag value = type of AM
   Message Exist Flag value = TRUE
6. Wait for reply/control signal; free up remote slot for reuse
(A client-level sketch of the core AM API these steps implement appears below.)
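For context, the client-visible side of these steps is the GASNet core (Active Message) API; a hedged ping-pong sketch per the GASNet spec [1], with error handling omitted and handler indices chosen arbitrarily.

    /* Sketch of a GASNet core-API (Active Message) ping-pong between nodes 0 and 1. */
    #include <gasnet.h>

    #define HIDX_PING 201              /* client handler indices: any values in 128..255 */
    #define HIDX_PONG 202

    static volatile int got_pong = 0;

    static void pong_handler(gasnet_token_t token) {
        got_pong = 1;                                  /* reply arrived back at the sender */
    }

    static void ping_handler(gasnet_token_t token, gasnet_handlerarg_t arg0) {
        gasnet_AMReplyShort0(token, HIDX_PONG);        /* replies go back via the token    */
    }

    static gasnet_handlerentry_t htable[] = {
        { HIDX_PING, (void (*)()) ping_handler },
        { HIDX_PONG, (void (*)()) pong_handler },
    };

    int main(int argc, char **argv)
    {
        gasnet_init(&argc, &argv);
        gasnet_attach(htable, 2, GASNET_PAGESIZE, 0);

        if (gasnet_mynode() == 0 && gasnet_nodes() > 1) {
            gasnet_AMRequestShort1(1, HIDX_PING, 42);  /* Short AM: header + one argument  */
            GASNET_BLOCKUNTIL(got_pong);               /* poll until the reply is handled  */
        }
        gasnet_barrier_notify(0, GASNET_BARRIERFLAG_ANONYMOUS);
        gasnet_barrier_wait(0, GASNET_BARRIERFLAG_ANONYMOUS);
        gasnet_exit(0);
        return 0;
    }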
GASNet SCI Conduit - Core API Testbed
Nodes: each with dual 2.4 GHz Intel Xeons, 1GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset
SAN: 667 MB/s (300 MB/s sustained) Dolphin SCI D337 (2D/3D) NICs, using PCI 64/66, 4x2 torus
Tests
SCI Conduit: latency/throughput (testam, 10000 iterations)
SCI Raw: PIO latency (scipp), DMA latency and throughput (dma_bench)
Analysis
Latency a little high, but overhead is constant (not exponential)
Throughput follows the SCI raw trend
[Figure: Short/Medium AM ping-pong latency (µs) vs. payload size (0-1024 bytes), SCI Raw vs. SCI Conduit]
[Figure: Long AM ping-pong latency (µs) vs. payload size (0-16K bytes), SCI Raw vs. SCI Conduit; PIO/DMA mode shift marked]
[Figure: Long AM throughput (MB/s) vs. payload size (64 bytes-256K), SCI Raw vs. SCI Conduit; PIO/DMA mode shift marked]
GASNet SCI Conduit – Extended API
SCI-Specific API
Put
1. Payload up to 1024 bytes, or the unaligned portion, sent using Long AM; wait for completion
2. Copy remaining data from local source to local DMA queue
3. DMA write from DMA queue to remote destination
4. Check DMA queue state for completion
Get
1. Unaligned portion transferred using the AM system; wait for completion
2. DMA read of remaining data from remote source into local DMA queue
3. Check DMA queue state for completion
4. Copy data from local DMA queue to local destination
(A simplified sketch of the put-side alignment split appears below.)
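A self-contained sketch of the alignment split behind the SCI-specific put; long_am_put() and dma_put() are plain-memcpy stand-ins rather than SISCI calls, the 1024-byte threshold is taken from the slides, and the staging copy through the local DMA queue plus any remote-address alignment constraints are omitted.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define DMA_ALIGN     8u
    #define PIO_THRESHOLD 1024u        /* threshold taken from the slides */

    static void long_am_put(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }
    static void dma_put(void *dst, const void *src, size_t n)     { memcpy(dst, src, n); }

    static void sci_put(void *dst, const void *src, size_t n)
    {
        size_t head, rest, tail;

        head = (DMA_ALIGN - ((uintptr_t) src % DMA_ALIGN)) % DMA_ALIGN;
        if (n <= PIO_THRESHOLD)
            head = n;                                     /* small puts: Long AM only      */
        if (head)
            long_am_put(dst, src, head);                  /* step 1: unaligned/small part  */

        rest = (n - head) & ~(size_t)(DMA_ALIGN - 1);     /* 8-byte multiples go via DMA   */
        if (rest)
            dma_put((char *) dst + head, (const char *) src + head, rest);   /* steps 2-4  */

        tail = n - head - rest;                           /* trailing bytes back to Long AM */
        if (tail)
            long_am_put((char *) dst + head + rest, (const char *) src + head + rest, tail);
    }

    int main(void)
    {
        static char src[4000], dst[4000];
        int i;

        for (i = 0; i < 4000; i++)
            src[i] = (char) i;
        sci_put(dst, src + 3, 3990);                      /* deliberately misaligned source */
        printf("copied ok: %d\n", memcmp(dst, src + 3, 3990) == 0);
        return 0;
    }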
[Figure: Data movement — Put (SCI-specific): unaligned data via Long AM, aligned data staged through the local DMA queue and remote-written to the destination; Get (SCI-specific): Short AM request, remote read of aligned data into the local DMA queue, unaligned data via Long AM; Put (Reference): Long AM(s) to remote destination; Get (Reference, in-segment): Short AM request answered with Long AM(s); Get (Reference, out-of-segment): Short AM request answered with Medium AM(s)]
Reference API (uses Core API AM system)
Put
1. Send Long AM(s)
2. Synchronization
Get
1. Send Short AM request to remote node
2. Remote node handles the request and sends either multiple Medium AMs (destination out of segment) or Long AM(s) (destination in segment)
3. Synchronization
Tests
Latency (testsmall -in/-out, 10000 iterations)
Throughput (testlarge -in/-out, 10000 iterations)
GASNet SCI Conduit – Extended API
Latency is the same for in-segment and out-of-segment targets in both versions
Reference Extended API
Blocking and non-blocking versions of put/get have nearly the same throughput results
Throughput for in-segment is ~10 MB/s better than out-of-segment
HCS SCI Extended API
Throughput is the same for in-segment and out-of-segment
Max 12 µs (~8 MB/s throughput) extra overhead for non-bulk (unaligned) data
Put performance comparison
HCS SCI conduit blocking version (max 214 MB/s) performs slightly better than the reference blocking version (max 208 MB/s)
HCS SCI non-blocking version has much higher throughput than the reference non-blocking version (max 241 MB/s, close to the SCI raw maximum)
SCI-specific > Reference
Get performance comparison
Out of segment: HCS SCI conduit comparable to reference version (~45 MB/s)
In segment: reference version significantly outperforms HCS SCI (max 207 MB/s); HCS SCI uses DMA read to implement get
Reference >= SCI-specific
Use HCS SCI put and reference get to obtain optimal performance
[Figure: Extended API latency (µs) vs. data size (1 byte-2K) for Put/Get, reference vs. SCI-specific, compared with SCI raw DMA and PIO; PIO/DMA mode shift marked]
[Figure: Extended API throughput (MB/s) vs. data size (64 bytes-256K) for Put/Get in- and out-of-segment, reference vs. SCI-specific (including non-blocking put), compared with SCI raw]
GASNet Conduit Comparison - Experimental Setup
Testbed
Elan, VAPI, MPI, and SCI conduits
Nodes: each with dual 2.4 GHz Intel Xeons, 1GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset
SCI: 667 MB/s (300 MB/s sustained) Dolphin SCI D337 (2D/3D) NICs, using PCI 64/66, 4x2 torus
Quadrics: 528 MB/s (340 MB/s sustained) Elan3, using PCI-X, in two nodes with QM-S16 16-port switch
InfiniBand: 4x (10 Gbps / 800 MB/s sustained) Infiniserv HCAs, using PCI-X 100, InfiniIO 2000 8-port switch from Infinicon
RedHat 9.0 with gcc compiler V3.3.2; SCI uses MP-MPICH beta from RWTH Aachen Univ., Germany
GM (Myrinet) conduit (c/o access to cluster at MTU)
Nodes*: each with dual 2.0 GHz Intel Xeons, 2GB DDR PC2100 (DDR266) RAM
Myrinet*: 250 MB/s Myrinet 2000, using PCI-X, on 8 nodes connected with 16-port M3F-SW16 switch
RedHat 7.3 with Intel C compiler V7.1
ES80 AlphaServer (Marvel)
Four 1 GHz EV7 Alpha processors, 8GB RD1600 RAM, proprietary inter-processor connections
Tru64 5.1B Unix, HP UPC V2.1 compiler
Experimental Setup
GASNet configured with segment Large
As fast as segment-fast for targets inside the segment
Makes use of Firehose for memory outside the segment (often more efficient than segment-fast)
GASNet conduit experiments
Berkeley GASNet test suite
Average of 1000 iterations
Each uses put/get operations to take advantage of implemented extended APIs
Executed with target memory falling inside and then outside the GASNet segment
Reported only inside results unless the difference was significant
Latency results use testsmall
Throughput results use testlarge
* via testbed made available courtesy of Michigan Tech
GASNet Latency on Conduits
[Figure: Round-trip latency (µs) vs. message size (1 byte-1K) for put and get on the GM, Elan, VAPI, MPI SCI, and HCS SCI conduits]
GASNet Throughput on Conduits
[Figure: Throughput (MB/s) vs. message size (128 bytes-256K) for put and get on the GM, Elan, VAPI, MPI SCI, and HCS SCI conduits]
GASNet Conduit Analysis
Elan shows best performance for latency of puts and gets
VAPI has by far the best bandwidth; latency very good
GM latencies a little higher than all the rest
HCS SCI conduit shows better put latency than MPI on SCI for sizes > 64 bytes; very close to MPI on SCI for smaller messages
HCS SCI conduit get latency slightly higher than MPI on SCI
GM and SCI provide about the same throughput; HCS SCI conduit slightly higher bandwidth for the largest message sizes
Quick look at estimated total cost to support 8 nodes of these interconnect architectures:
SCI: ~$8,700; Myrinet: ~$9,200; InfiniBand: ~$12,300; Elan3: ~$18,000 (based on Elan4 pricing structure, which is slightly higher)
Note: this study does not include the latest hardware from Myrinet (2000E w/ GM2) and Quadrics (Elan4 w/ QsNetII)
UPC function performance
A look at common shared-data operations
Comparison between accesses to local data through regular and private pointers
Block copies between shared and private memory: upc_memget, upc_memput
Pointer conversion (shared local to private)
Pointer addition (advancing a pointer to the next location)
Loads & stores (to a single location, local and remote)
Block copies
upc_memget & upc_memput translate directly into GASNet blocking put and get (even on local shared objects); see previous graph for results (a minimal usage sketch follows below)
Marvel with HP UPC compiler shows no appreciable difference between local and remote puts and gets and regular C operations
Steady increase from 0.27 to 1.83 µs for sizes 2 to 8K bytes
Difference of < 0.5 µs for remote operations
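A minimal sketch (assumed layout, not taken from the benchmark codes) of the bulk-copy calls measured here: upc_memget pulls a block of shared data into a private buffer, and upc_memput pushes it back.

    /* Bulk copies between shared and private memory with upc_memget/upc_memput. */
    #include <upc_relaxed.h>

    #define CHUNK 512

    shared [CHUNK] int table[CHUNK * THREADS];      /* one CHUNK-sized block per thread */

    int main(void)
    {
        int local[CHUNK];
        int i, next = (MYTHREAD + 1) % THREADS;

        /* Pull the neighbour's block into private memory (a blocking GASNet get underneath). */
        upc_memget(local, &table[next * CHUNK], CHUNK * sizeof(int));

        for (i = 0; i < CHUNK; i++)                 /* process privately at local-memory speed */
            local[i] += 1;

        /* Push the processed block back out (a blocking GASNet put underneath). */
        upc_memput(&table[next * CHUNK], local, CHUNK * sizeof(int));

        upc_barrier;
        return 0;
    }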
UPC function performance
Pointer operations
Cast local shared to private: all BUPC conduits ~2 ns, Marvel needed ~90 ns
Pointer addition results below (a small sketch of both operations follows the figure)
[Figure: Pointer-addition execution time (µs) for private vs. shared pointers on MPI-SCI, MPI GigE, Elan, GM, VAPI, and Marvel]
Shared-pointer manipulation about an order of magnitude greater than private.
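A small illustrative sketch of the two pointer operations measured above: casting a pointer-to-shared with local affinity down to an ordinary private pointer, and advancing a pointer-to-shared to the next element.

    /* Shared-to-private pointer casting and shared-pointer arithmetic in UPC. */
    #include <upc_relaxed.h>
    #include <stdio.h>

    shared int grid[100 * THREADS];    /* cyclic layout: grid[i] has affinity to thread i % THREADS */

    int main(void)
    {
        int *mine;
        shared int *p;

        /* Cast: a shared address with affinity to MYTHREAD becomes a plain private pointer,
         * so later accesses are ordinary local loads and stores. */
        mine = (int *) &grid[MYTHREAD];
        *mine = MYTHREAD;

        /* Shared-pointer addition: ++ advances to the next element of the cyclic layout,
         * updating thread/phase/address fields -- the costlier operation in the graph above. */
        p = &grid[0];
        p++;

        if (MYTHREAD == 0)
            printf("grid[1] lives on thread %d\n", (int) upc_threadof(p));

        upc_barrier;
        return 0;
    }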
UPC function performance
Loads and stores with pointers (not bulk)
Data local to the calling node; "Pvt Shared" denotes private pointers to the local shared space
[Figure: Execution time (µs) of private, shared, and private-to-shared (Pvt Shared) loads and stores on MPI SCI, MPI GigE, Elan, GM, VAPI, and Marvel]
MPI on GigE shared store takes 2 orders of magnitude longer, therefore not shown. Marvel shared loads and stores twice an order of magnitude greater than private.
UPC function performance
Remote loads and stores with pointers (not bulk)
Data remote to the calling node
Note: MPI GigE showed a time of ~450 µs for loads and ~500 µs for stores
[Figure: Execution time (µs) of remote stores and loads on MPI-SCI, Elan, GM, VAPI, and Marvel]
Marvel remote access through pointers the same as with local shared, two orders of magnitude less than Elan.
UPC Benchmarks – IS from NAS benchmarks*
Class A size of benchmark executed with Berkeley UPC runtime system V1.1, with gcc V3.3.2 for Elan and MPI, Intel V7.1 for GM, and HP UPC V2.1 for Marvel
IS (Integer Sort): lots of fine-grain communication, low computation
Only the un-optimized version executed correctly using Berkeley UPC, so the un-optimized version is used for comparison
Communication layer should have greatest effect on performance
Single thread shows performance without use of the communication layer
Poor performance in the GASNet communication system does not necessarily indicate poor performance in a UPC application
MPI results poor for GASNet but decent for UPC applications, though all other conduits clearly yield better performance
* Using code developed at GWU
[Figure: IS execution time (sec) for 1, 2, 4, and 8 threads on the GM, Elan, GigE MPI, VAPI, and SCI MPI conduits and Marvel]
Only two nodes available with Elan; unable to determine scalability at this point.
TCP/IP overhead outweighs the benefit of parallelization.
Many shared-memory region accesses throttle back performance. Also, Marvel does not perform well with many small messages [5].
HCS SCI conduit not yet integrated with BUPC.
GM conduit uses Intel C compiler, 2.0 GHz µP.
UPC Benchmarks – FT from NAS benchmarks*
Class A size of benchmark executed with the same setup as IS
FT: 3-D Fast Fourier Transform, medium communication, high computation
Used optimized version O1 (private pointers to local shared memory)
High-bandwidth networks perform best (VAPI followed by Elan)
VAPI conduit allows a cluster of Xeons to keep pace with Marvel's performance
MPI on GigE not well suited for these types of problems (high-latency, low-bandwidth traits limit performance)
MPI on SCI has lower bandwidth than VAPI but still maintains near-linear speedup for more than 2 nodes (skirts TCP/IP overhead)
GM performance a factor of processor speed (see 1 Thread)
[Figure: FT execution time (sec) for 1, 2, 4, and 8 threads on the GM, Elan, GigE MPI, VAPI, and SCI MPI conduits and Marvel]
* Using code developed at GWU
High latency of MPI on GigE impedes performance.
GM conduit uses Intel C compiler, 2.0 GHz µP.
UPC Benchmarks -- DES Differential Attack Simulator
Differential attack simulator of S-DES (96-bit key) cipher (integer-based)
Creates basic components used in differential cryptanalysis: S-Boxes, Difference Pair Tables (DPT), and Differential Distribution Tables (DDT)
Implemented in UPC to expose parallelism in DPT and DDT creation, choosing best pairs
Bandwidth-intensive application
Various access methods to shared memory
Used upc_memget and upc_memput for shared remote data access
Used private pointers to local shared data for DPT creation
[Figure: DES execution time (msec) for sequential, 1, 2, and 4 threads on the GM, Elan, VAPI, and SCI MPI conduits and Marvel]
MPI on GigE not shown due to high execution times (top ~4.5 sec).
MPI conduits (GigE and SCI) get worse with increased nodes, as large communication between multiple nodes is not implemented effectively.
GM conduit uses Intel C compiler, 2.0 GHz µP.
DES Analysis
DES copies parts of each shared array to local memory for processing, and copies them back when complete
With an increasing number of nodes, bandwidth and NIC response time become more important
Designed for a high cache miss rate, so very costly in terms of memory access
Interconnects with high bandwidth and fast response times perform the best
Marvel shows near-perfect linear speedup; processing time of integers an issue
VAPI shows constant speedup
Elan shows near-linear speedup from 1 to 2 nodes, but more nodes needed in the testbed for better analysis
GM does not begin to show any speedup until 4 nodes, then minimal
MPI conduit clearly inadequate for high-bandwidth programs
Notes on BUPC
Casting is a serious issue in the compiler
Many times it completely ignored pointer casts and issued warnings about not having the proper cast
Private pointer to local shared data was unreliable due to casting issues; some recent bug entries trying to address that issue
Shared-memory access was not always reliable or was very slow, hence the need to copy shared data to local memory for processing
10 Gigabit Ethernet – Preliminary results
Testbed
Nodes: each with dual 2.4 GHz Xeons, S2io Xframe 10GigE card in PCI-X 100, 1GB PC2100 DDR RAM, Intel PRO/1000 1GigE, RedHat 9.0 kernel 2.4.20-8smp, LAM-MPI V7.0.3
[Figure: Round-trip latency (µs) vs. message size (0-4K bytes), 10GigE vs. GigE]
[Figure: Throughput (MB/s) vs. message size (64 bytes-64K), 10GigE vs. GigE]
10GigE is promising due to the economy of scale expected of Ethernet
S2io 10GigE shows impressive throughput, though slightly less than half of the theoretical maximum; further tuning needed to go higher
Results show a much-needed decrease in latency versus other Ethernet options
Conclusions
Key insights
HCS SCI conduit shows promise for increased performance over MPI on SCI
Needs further development, especially for very large segments and memory operations targeted outside the pinned segment (implementing the Firehose algorithm)
On-going collaboration with the vendor (Dolphin) will help to solve these problems
Berkeley UPC system a promising COTS cluster tool
Performance on par with HP UPC (also see [6])
Performance of COTS clusters matches and sometimes beats the performance of high-end CC-NUMA
Various conduits allow UPC to execute on many interconnects; VAPI and Elan are initially found to be strongest
Some open issues with bugs and optimization; active bug reports and the development team help improvements
Very good solution for clusters to execute UPC, but may not quite be ready for production use
No debugging or performance tools available
10GigE provides high bandwidth with lower latencies than 1GigE
Conclusions & Future Work
Key accomplishments to date
Baselining of UPC on shared-memory multiprocessors
Evaluation of promising tools for UPC on clusters
Leveraging and extension of communication and UPC layers
Conceptual design of new tools for UPC
Preliminary network and system performance analyses for UPC systems
Completion of V1.0 (segment-fast) of the GASNet Core and Extended API SCI conduit for UPC
Future Work
Continue refining HCS SCI conduit
Dolphin SCI driver extensions
Benchmarking of HCS SCI conduit in Berkeley UPC framework
UPC performance analysis tool
UPC debugging tool
References
1. D. Bonachea, "GASNet Specification, v1.1," U.C. Berkeley Tech Report UCB/CSD-02-1207, October 2002.
2. C. Bell, D. Bonachea, "A New DMA Registration Strategy for Pinning-Based High Performance Networks," Workshop on Communication Architecture for Clusters (CAC'03), April 2003.
3. W. Carlson, J. Draper, D. Culler, K. Yelick, E. Brooks, K. Warren, "Introduction to UPC and Language Specification," IDA Center for Computing Sciences, Tech. Report CCS-TR-99-157, May 1999.
4. K. A. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. N. Hilfinger, S. L. Graham, D. Gay, P. Colella, and A. Aiken, "Titanium: A High-Performance Java Dialect," Concurrency: Practice and Experience, Vol. 10, No. 11-13, September-November 1998.
5. B. Gordon, S. Oral, G. Li, H. Su, and A. George, "Performance Analysis of HP AlphaServer ES80 vs. SAN-based Clusters," 22nd IEEE International Performance, Computing, and Communications Conference (IPCCC), April 2003.
6. W. Chen, D. Bonachea, J. Duell, P. Husbands, C. Iancu, K. Yelick, "A Performance Analysis of the Berkeley UPC Compiler," 17th Annual International Conference on Supercomputing (ICS), June 2003.