Distributed Shared-Memory Parallel Computing with UPC on SAN-based Clusters
Dr. Alan D. George, Principal Investigator
Mr. Burton C. Gordon, Sr. Research Assistant
Mr. Hung-Hsun Su, Sr. Research Assistant
HCS Research Laboratory, University of Florida
Outline
Objectives and Motivations
Background
Related Research
Approach
Results
Conclusions and Future Plans
Objectives and Motivations
Objectives
Support advancements for HPC with Unified Parallel C (UPC) on cluster systems exploiting high-throughput, low-latency System-Area Networks (SANs) and LANs
Design and analysis of tools to support UPC on SAN-based systems
Benchmarking and case studies with key UPC applications
Analysis of tradeoffs in application, network, service, and system design
Motivations
Increasing demand in sponsor and scientific computing community for shared-memory parallel computing with UPC
New and emerging technologies in system-area networking and cluster computing: Scalable Coherent Interface (SCI), Myrinet (GM), InfiniBand (VAPI), QsNet (Quadrics Elan), Gigabit Ethernet and 10 Gigabit Ethernet
Clusters offer excellent cost-performance potential
Background
Key sponsor applications and developments toward shared-memory parallel computing with UPC
UPC extends the C language to exploit parallelism (see the minimal sketch below)
Currently runs best on shared-memory multiprocessors or proprietary clusters (e.g. AlphaServer SC), notably with HP/Compaq's UPC compiler
First-generation UPC runtime systems becoming available for clusters: MuPC, Berkeley UPC
Significant potential advantage in cost-performance ratio with Commercial Off-The-Shelf (COTS) cluster configurations; leverage economy of scale
Clusters exhibit low cost relative to tightly coupled SMP, CC-NUMA, and MPP systems
Scalable performance with COTS technologies
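To make the language concrete, a minimal UPC sketch (illustrative only, not one of the applications discussed later): a cyclically distributed shared array updated in parallel with upc_forall.

    /* Minimal UPC sketch: each thread updates the shared-array elements it owns. */
    #include <upc_relaxed.h>
    #include <stdio.h>

    #define N 1024

    shared int a[N * THREADS];      /* default block size 1: elements cycle over threads */

    int main(void)
    {
        int i;

        upc_forall (i = 0; i < N * THREADS; i++; &a[i])   /* iteration i runs on the owner of a[i] */
            a[i] = 2 * i;

        upc_barrier;                                      /* all writes complete before any reads */

        if (MYTHREAD == 0)
            printf("THREADS=%d, a[1]=%d\n", THREADS, a[1]);
        return 0;
    }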
Related Research
University of California at Berkeley
UPC runtime system
UPC to C translator
Global-Address Space Networking (GASNet) design and development
Application benchmarks
George Washington University
UPC specification
UPC documentation
UPC testing strategies, testing suites
UPC benchmarking
UPC collective communications
Parallel I/O
Michigan Tech University
Michigan Tech UPC (MuPC) design and development
UPC collective communications
Memory model research
Programmability studies
Test suite development
Ohio State University
UPC benchmarking
HP/Compaq UPC compiler
Intrepid GCC UPC compiler
Related Research -- MuPC & DSM
MuPC (Michigan Tech UPC)
First open-source reference implementation of UPC for COTS clusters; any cluster that provides Pthreads and MPI can use it
Built as a reference implementation, so performance is secondary
Limitations in application size and memory model
Not suitable for performance-critical applications
UPC/DSM/SCI
SCI-VM (DSM system for SCI)
HAMSTER interface allows multiple modules to support MPI and shared-memory models
Created using Dolphin SISCI API and ANSI C
SCI-VM is not under active development, so future upgrades are uncertain
Not feasible given the amount of work needed versus the expected performance; better possibilities with GASNet
Related Research -- GASNet
Global-Address Space Networking (GASNet) [1]: communication system created by U.C. Berkeley
Target for Berkeley UPC system
Language-independent, low-level networking layer for high-performance communication
Segment region for communication on each node, three types:
Segment-fast: sacrifice size for speed
Segment-large: allows large memory area for shared space, perhaps with some loss in performance (though the firehose [2] algorithm is often employed for better efficiency)
Segment-everything: expose the entire virtual memory space of each process for shared access
Firehose algorithm allows memory to be managed in buckets for efficient transfers
Interface for high-level global address space SPMD languages: UPC [3] and Titanium [4]
Divided into two layers:
Core: Active Messages
Extended: high-level operations which take direct advantage of network capabilities (a brief client-side usage sketch follows below)
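For reference, a hedged sketch of how a GASNet client drives the extended API, loosely following the spec [1]; error checking, handler registration, and segment sizing are simplified, and the fixed-size seginfo array is an assumption of this sketch, not part of the API.

    /* GASNet extended-API sketch: blocking and non-blocking put/get into a peer's segment. */
    #include <gasnet.h>
    #include <string.h>

    static char buf[1024];

    int main(int argc, char **argv)
    {
        gasnet_node_t me, peer;
        gasnet_seginfo_t seg[256];                       /* assumes <= 256 nodes (sketch only) */
        gasnet_handle_t h;

        gasnet_init(&argc, &argv);
        gasnet_attach(NULL, 0, GASNET_PAGESIZE, 0);      /* no AM handlers, one-page segment */

        me   = gasnet_mynode();
        peer = (me + 1) % gasnet_nodes();
        gasnet_getSegmentInfo(seg, gasnet_nodes());
        memset(buf, (int) me, sizeof buf);

        /* Blocking one-sided transfers into and out of the peer's registered segment. */
        gasnet_put(peer, seg[peer].addr, buf, sizeof buf);
        gasnet_get(buf, peer, seg[peer].addr, sizeof buf);

        /* Non-blocking variant: initiate, overlap other work, then synchronize. */
        h = gasnet_put_nb(peer, seg[peer].addr, buf, sizeof buf);
        gasnet_wait_syncnb(h);

        gasnet_exit(0);
        return 0;
    }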
Related Research -- Berkeley UPC
Second open-source implementation of UPC for COTS clusters
First with a focus on performance
GASNet used for all accesses to remote memory
Network conduits allow for high performance over many different interconnects
Targets a variety of architectures: x86, Alpha, Itanium, PowerPC, SPARC, MIPS, PA-RISC
Best chance as of now for high-performance UPC applications on COTS clusters
Note: only supports strict shared-memory access and therefore only uses the blocking transfer functions in the GASNet spec
[Figure: Berkeley UPC software stack — UPC code, translator, translator-generated C code, Berkeley UPC runtime system, GASNet communication system, network hardware; the layer interfaces are platform-, network-, compiler-, and language-independent]
Approach
Collaboration
HP/Compaq UPC Compiler V2.1 running in our lab on ES80 AlphaServer (Marvel)
Support of testing by OSU, MTU, UCB/LBNL, UF, et al. with leading UPC tools and systems for function performance evaluation
Field test of newest compiler and system
Exploiting SAN Strengths for UPC
Design and develop new SCI conduit for GASNet in collaboration with UCB/LBNL
Evaluate DSM for SCI as an option for executing UPC
Benchmarking
Use and design of applications in UPC to grasp key concepts and understand performance issues
NAS benchmarks from GWU
DES-cypher benchmark from UF
Performance Analysis
Network communication experiments
UPC computing experiments
Emphasis on SAN Options and Tradeoffs
SCI, Myrinet, InfiniBand, Quadrics, GigE, 10GigE, etc.
[Figure: Collaboration across layers —
Upper layers (applications, translators, documentation): Ohio State: benchmarks; Michigan Tech: benchmarks, modeling, specification; UC Berkeley: benchmarks, UPC-to-C translator, specification; GWU: benchmarks, documents, specification; UF HCS Lab: benchmarks.
Middle layers (runtime systems, interfaces): Michigan Tech: UPC-to-MPI translation and runtime system; UC Berkeley: C runtime system, upper levels of GASNet; HP: UPC runtime system on AlphaServer; UF HCS Lab: GASNet collaboration, beta testing.
Lower layers (runtime systems, interfaces): UC Berkeley: GASNet; UF HCS Lab: GASNet collaboration, network performance analysis.]
GASNet SCI Conduit
Scalable Coherent Interface (SCI)
Low-latency, high-bandwidth SAN with shared-memory capabilities
Requires memory exporting and importing
PIO (requires importing) + DMA (needs 8-byte alignment)
Remote write ~10x faster than remote read
SCI conduit
AM enabling (core API)
Dedicated AM message channels (Command)
Request/Response pairs to prevent deadlock
Flags to signal arrival of new AM (Control)
Put/Get enabling (extended API): global segment (Payload); a hypothetical layout sketch follows the figure
[Figure: SCI conduit segment layout per node — Control segments (N total), Command segments (N×N total), and Payload segments (N total), mapped between SCI space, physical addresses, and virtual addresses via exporting/importing, plus local DMA queues]
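Purely as an illustration of the exported regions named above, a hypothetical C layout sketch; all names and sizes here are assumptions, not the conduit's actual declarations.

    /* Hypothetical segment layout for the SCI conduit's exported regions. */
    #define MAX_NODES 8            /* assumed cluster size for this sketch            */
    #define NUM_SLOTS 16           /* assumed AM command slots per node pair          */

    typedef struct {               /* Control segment: one per node (N total)         */
        volatile int new_message[MAX_NODES];    /* flags signalling AM arrival        */
    } control_segment_t;

    typedef struct {               /* Command segment: one per sender/receiver pair (N*N total) */
        struct {
            volatile int  ready_flag;           /* set to the AM type when complete   */
            unsigned char header[64];           /* AM header written by PIO           */
            unsigned char payload[1024];        /* medium-AM payload written by PIO   */
        } slot[NUM_SLOTS];
    } command_segment_t;

    typedef struct {               /* Payload segment: one per node (N total)         */
        unsigned char data[1 << 20];            /* long-AM / put-get data (DMA target) */
    } payload_segment_t;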
GASNet SCI Conduit - Core API
[Figure: Core API data structures and AM polling flow — command segment contents (AM header, medium AM payload, long AM payload, message ready flags) exchanged between nodes X and Y; the polling loop enqueues newly arrived messages, dequeues and handles them (or processes AM/control replies), and exits when no messages remain]
The queue is constructed by placing a next pointer in the header that points to the next available message.
Active Message Transferring
1. Obtain free slot; tracked locally using an array of flags
2. Package AM header
3. Transfer data
   Short AM: PIO write (header)
   Medium AM: PIO write (header), PIO write (medium payload)
   Long AM: PIO write (header), PIO write (payload up to 1024 bytes or unaligned portion of payload), DMA write (multiples of 8 bytes)
4. Wait for transfer completion
5. Signal AM arrival
   Message Ready Flag value = type of AM
   Message Exist Flag value = TRUE
6. Wait for reply/control signal; free up remote slot for reuse
(A client-level sketch of the core AM API these steps implement appears below.)
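For context, the client-visible side of these steps is the GASNet core (Active Message) API; a hedged ping-pong sketch per the GASNet spec [1], with error handling omitted and handler indices chosen arbitrarily.

    /* Sketch of a GASNet core-API (Active Message) ping-pong between nodes 0 and 1. */
    #include <gasnet.h>

    #define HIDX_PING 201              /* client handler indices: any values in 128..255 */
    #define HIDX_PONG 202

    static volatile int got_pong = 0;

    static void pong_handler(gasnet_token_t token) {
        got_pong = 1;                                  /* reply arrived back at the sender */
    }

    static void ping_handler(gasnet_token_t token, gasnet_handlerarg_t arg0) {
        gasnet_AMReplyShort0(token, HIDX_PONG);        /* replies go back via the token    */
    }

    static gasnet_handlerentry_t htable[] = {
        { HIDX_PING, (void (*)()) ping_handler },
        { HIDX_PONG, (void (*)()) pong_handler },
    };

    int main(int argc, char **argv)
    {
        gasnet_init(&argc, &argv);
        gasnet_attach(htable, 2, GASNET_PAGESIZE, 0);

        if (gasnet_mynode() == 0 && gasnet_nodes() > 1) {
            gasnet_AMRequestShort1(1, HIDX_PING, 42);  /* Short AM: header + one argument  */
            GASNET_BLOCKUNTIL(got_pong);               /* poll until the reply is handled  */
        }
        gasnet_barrier_notify(0, GASNET_BARRIERFLAG_ANONYMOUS);
        gasnet_barrier_wait(0, GASNET_BARRIERFLAG_ANONYMOUS);
        gasnet_exit(0);
        return 0;
    }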
GASNet SCI Conduit - Core API Testbed
Nodes: each with dual 2.4 GHz Intel Xeons, 1GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset
SAN: 667 MB/s (300 MB/s sustained) Dolphin SCI D337 (2D/3D) NICs, using PCI 64/66, 4x2 torus
Tests
SCI Conduit: latency/throughput (testam, 10000 iterations)
SCI Raw: PIO latency (scipp), DMA latency and throughput (dma_bench)
Analysis
Latency a little high, but overhead is constant (not exponential)
Throughput follows the SCI raw trend
[Figure: Short/Medium AM ping-pong latency (µs) vs. payload size (0-1024 bytes), SCI Raw vs. SCI Conduit]
[Figure: Long AM ping-pong latency (µs) vs. payload size (0-16K bytes), SCI Raw vs. SCI Conduit; PIO/DMA mode shift marked]
[Figure: Long AM throughput (MB/s) vs. payload size (64 bytes-256K), SCI Raw vs. SCI Conduit; PIO/DMA mode shift marked]
GASNet SCI Conduit – Extended API
SCI-Specific API
Put
1. Payload up to 1024 bytes, or the unaligned portion, sent using Long AM; wait for completion
2. Copy remaining data from local source to local DMA queue
3. DMA write from DMA queue to remote destination
4. Check DMA queue state for completion
Get
1. Unaligned portion transferred using the AM system; wait for completion
2. DMA read of remaining data from remote source into local DMA queue
3. Check DMA queue state for completion
4. Copy data from local DMA queue to local destination
(A simplified sketch of the put-side alignment split appears below.)
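A self-contained sketch of the alignment split behind the SCI-specific put; long_am_put() and dma_put() are plain-memcpy stand-ins rather than SISCI calls, the 1024-byte threshold is taken from the slides, and the staging copy through the local DMA queue plus any remote-address alignment constraints are omitted.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define DMA_ALIGN     8u
    #define PIO_THRESHOLD 1024u        /* threshold taken from the slides */

    static void long_am_put(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }
    static void dma_put(void *dst, const void *src, size_t n)     { memcpy(dst, src, n); }

    static void sci_put(void *dst, const void *src, size_t n)
    {
        size_t head, rest, tail;

        head = (DMA_ALIGN - ((uintptr_t) src % DMA_ALIGN)) % DMA_ALIGN;
        if (n <= PIO_THRESHOLD)
            head = n;                                     /* small puts: Long AM only      */
        if (head)
            long_am_put(dst, src, head);                  /* step 1: unaligned/small part  */

        rest = (n - head) & ~(size_t)(DMA_ALIGN - 1);     /* 8-byte multiples go via DMA   */
        if (rest)
            dma_put((char *) dst + head, (const char *) src + head, rest);   /* steps 2-4  */

        tail = n - head - rest;                           /* trailing bytes back to Long AM */
        if (tail)
            long_am_put((char *) dst + head + rest, (const char *) src + head + rest, tail);
    }

    int main(void)
    {
        static char src[4000], dst[4000];
        int i;

        for (i = 0; i < 4000; i++)
            src[i] = (char) i;
        sci_put(dst, src + 3, 3990);                      /* deliberately misaligned source */
        printf("copied ok: %d\n", memcmp(dst, src + 3, 3990) == 0);
        return 0;
    }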
[Figure: Data movement — Put (SCI-specific): unaligned data via Long AM, aligned data staged through the local DMA queue and remote-written to the destination; Get (SCI-specific): Short AM request, remote read of aligned data into the local DMA queue, unaligned data via Long AM; Put (Reference): Long AM(s) to remote destination; Get (Reference, in-segment): Short AM request answered with Long AM(s); Get (Reference, out-of-segment): Short AM request answered with Medium AM(s)]
Reference API (uses Core API AM system)
Put
1. Send Long AM(s)
2. Synchronization
Get
1. Send Short AM request to remote node
2. Remote node handles the request and sends either multiple Medium AMs (destination out of segment) or Long AM(s) (destination in segment)
3. Synchronization
Tests
Latency (testsmall -in/-out, 10000 iterations)
Throughput (testlarge -in/-out, 10000 iterations)
GASNet SCI Conduit – Extended API
Latency is the same for in-segment and out-of-segment targets in both versions
Reference Extended API
Blocking and non-blocking versions of put/get have nearly the same throughput results
Throughput for in-segment is ~10 MB/s better than out-of-segment
HCS SCI Extended API
Throughput is the same for in-segment and out-of-segment
Max 12 µs (~8 MB/s throughput) extra overhead for non-bulk (unaligned) data
Put performance comparison
HCS SCI conduit blocking version (max 214 MB/s) performs slightly better than the reference blocking version (max 208 MB/s)
HCS SCI non-blocking version has much higher throughput than the reference non-blocking version (max 241 MB/s, close to the SCI raw maximum)
SCI-specific > Reference
Get performance comparison
Out of segment: HCS SCI conduit comparable to reference version (~45 MB/s)
In segment: reference version significantly outperforms HCS SCI (max 207 MB/s); HCS SCI uses DMA read to implement get
Reference >= SCI-specific
Use HCS SCI put and reference get to obtain optimal performance
[Figure: Extended API latency (µs) vs. data size (1 byte-2K) for Put/Get, reference vs. SCI-specific, compared with SCI raw DMA and PIO; PIO/DMA mode shift marked]
[Figure: Extended API throughput (MB/s) vs. data size (64 bytes-256K) for Put/Get in- and out-of-segment, reference vs. SCI-specific (including non-blocking put), compared with SCI raw]
GASNet Conduit Comparison - Experimental Setup
Testbed
Elan, VAPI, MPI, and SCI conduits
Nodes: each with dual 2.4 GHz Intel Xeons, 1GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset
SCI: 667 MB/s (300 MB/s sustained) Dolphin SCI D337 (2D/3D) NICs, using PCI 64/66, 4x2 torus
Quadrics: 528 MB/s (340 MB/s sustained) Elan3, using PCI-X, in two nodes with QM-S16 16-port switch
InfiniBand: 4x (10 Gbps / 800 MB/s sustained) Infiniserv HCAs, using PCI-X 100, InfiniIO 2000 8-port switch from Infinicon
RedHat 9.0 with gcc compiler V3.3.2; SCI uses MP-MPICH beta from RWTH Aachen Univ., Germany
GM (Myrinet) conduit (c/o access to cluster at MTU)
Nodes*: each with dual 2.0 GHz Intel Xeons, 2GB DDR PC2100 (DDR266) RAM
Myrinet*: 250 MB/s Myrinet 2000, using PCI-X, on 8 nodes connected with 16-port M3F-SW16 switch
RedHat 7.3 with Intel C compiler V7.1
ES80 AlphaServer (Marvel)
Four 1 GHz EV7 Alpha processors, 8GB RD1600 RAM, proprietary inter-processor connections
Tru64 5.1B Unix, HP UPC V2.1 compiler
Experimental Setup
GASNet configured with segment Large
As fast as segment-fast for targets inside the segment
Makes use of Firehose for memory outside the segment (often more efficient than segment-fast)
GASNet conduit experiments
Berkeley GASNet test suite
Average of 1000 iterations
Each uses put/get operations to take advantage of implemented extended APIs
Executed with target memory falling inside and then outside the GASNet segment
Reported only inside results unless the difference was significant
Latency results use testsmall
Throughput results use testlarge
* via testbed made available courtesy of Michigan Tech
GASNet Latency on Conduits
[Figure: Round-trip latency (µs) vs. message size (1 byte-1K) for put and get on the GM, Elan, VAPI, MPI SCI, and HCS SCI conduits]
GASNet Throughput on Conduits
[Figure: Throughput (MB/s) vs. message size (128 bytes-256K) for put and get on the GM, Elan, VAPI, MPI SCI, and HCS SCI conduits]
GASNet Conduit Analysis
Elan shows best performance for latency of puts and gets
VAPI has by far the best bandwidth; latency very good
GM latencies a little higher than all the rest
HCS SCI conduit shows better put latency than MPI on SCI for sizes > 64 bytes; very close to MPI on SCI for smaller messages
HCS SCI conduit get latency slightly higher than MPI on SCI
GM and SCI provide about the same throughput; HCS SCI conduit slightly higher bandwidth for the largest message sizes
Quick look at estimated total cost to support 8 nodes of these interconnect architectures:
SCI: ~$8,700; Myrinet: ~$9,200; InfiniBand: ~$12,300; Elan3: ~$18,000 (based on Elan4 pricing structure, which is slightly higher)
Note: this study does not include the latest hardware from Myrinet (2000E w/ GM2) and Quadrics (Elan4 w/ QsNetII)
UPC function performance
A look at common shared-data operations
Comparison between accesses to local data through regular and private pointers
Block copies between shared and private memory: upc_memget, upc_memput
Pointer conversion (shared local to private)
Pointer addition (advancing a pointer to the next location)
Loads & stores (to a single location, local and remote)
Block copies
upc_memget & upc_memput translate directly into GASNet blocking put and get (even on local shared objects); see previous graph for results (a minimal usage sketch follows below)
Marvel with HP UPC compiler shows no appreciable difference between local and remote puts and gets and regular C operations
Steady increase from 0.27 to 1.83 µs for sizes 2 to 8K bytes
Difference of < 0.5 µs for remote operations
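A minimal sketch (assumed layout, not taken from the benchmark codes) of the bulk-copy calls measured here: upc_memget pulls a block of shared data into a private buffer, and upc_memput pushes it back.

    /* Bulk copies between shared and private memory with upc_memget/upc_memput. */
    #include <upc_relaxed.h>

    #define CHUNK 512

    shared [CHUNK] int table[CHUNK * THREADS];      /* one CHUNK-sized block per thread */

    int main(void)
    {
        int local[CHUNK];
        int i, next = (MYTHREAD + 1) % THREADS;

        /* Pull the neighbour's block into private memory (a blocking GASNet get underneath). */
        upc_memget(local, &table[next * CHUNK], CHUNK * sizeof(int));

        for (i = 0; i < CHUNK; i++)                 /* process privately at local-memory speed */
            local[i] += 1;

        /* Push the processed block back out (a blocking GASNet put underneath). */
        upc_memput(&table[next * CHUNK], local, CHUNK * sizeof(int));

        upc_barrier;
        return 0;
    }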
UPC function performance
Pointer operations
Cast local shared to private: all BUPC conduits ~2 ns, Marvel needed ~90 ns
Pointer addition results below (a small sketch of both operations follows the figure)
[Figure: Pointer-addition execution time (µs) for private vs. shared pointers on MPI-SCI, MPI GigE, Elan, GM, VAPI, and Marvel]
Shared-pointer manipulation about an order of magnitude greater than private.
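A small illustrative sketch of the two pointer operations measured above: casting a pointer-to-shared with local affinity down to an ordinary private pointer, and advancing a pointer-to-shared to the next element.

    /* Shared-to-private pointer casting and shared-pointer arithmetic in UPC. */
    #include <upc_relaxed.h>
    #include <stdio.h>

    shared int grid[100 * THREADS];    /* cyclic layout: grid[i] has affinity to thread i % THREADS */

    int main(void)
    {
        int *mine;
        shared int *p;

        /* Cast: a shared address with affinity to MYTHREAD becomes a plain private pointer,
         * so later accesses are ordinary local loads and stores. */
        mine = (int *) &grid[MYTHREAD];
        *mine = MYTHREAD;

        /* Shared-pointer addition: ++ advances to the next element of the cyclic layout,
         * updating thread/phase/address fields -- the costlier operation in the graph above. */
        p = &grid[0];
        p++;

        if (MYTHREAD == 0)
            printf("grid[1] lives on thread %d\n", (int) upc_threadof(p));

        upc_barrier;
        return 0;
    }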
UPC function performance
Loads and stores with pointers (not bulk)
Data local to the calling node; "Pvt Shared" denotes private pointers to the local shared space
[Figure: Execution time (µs) of private, shared, and private-to-shared (Pvt Shared) loads and stores on MPI SCI, MPI GigE, Elan, GM, VAPI, and Marvel]
MPI on GigE shared store takes 2 orders of magnitude longer, therefore not shown. Marvel shared loads and stores twice an order of magnitude greater than private.
UPC function performance
Remote loads and stores with pointers (not bulk)
Data remote to the calling node
Note: MPI GigE showed a time of ~450 µs for loads and ~500 µs for stores
[Figure: Execution time (µs) of remote stores and loads on MPI-SCI, Elan, GM, VAPI, and Marvel]
Marvel remote access through pointers the same as with local shared, two orders of magnitude less than Elan.
UPC Benchmarks – IS from NAS benchmarks*
Class A size of benchmark executed with Berkeley UPC runtime system V1.1, with gcc V3.3.2 for Elan and MPI, Intel V7.1 for GM, and HP UPC V2.1 for Marvel
IS (Integer Sort): lots of fine-grain communication, low computation
Only the un-optimized version executed correctly using Berkeley UPC, so the un-optimized version is used for comparison
Communication layer should have greatest effect on performance
Single thread shows performance without use of the communication layer
Poor performance in the GASNet communication system does not necessarily indicate poor performance in a UPC application
MPI results poor for GASNet but decent for UPC applications, though all other conduits clearly yield better performance
* Using code developed at GWU
[Figure: IS execution time (sec) for 1, 2, 4, and 8 threads on the GM, Elan, GigE MPI, VAPI, and SCI MPI conduits and Marvel]
Only two nodes available with Elan; unable to determine scalability at this point.
TCP/IP overhead outweighs the benefit of parallelization.
Many shared-memory region accesses throttle back performance. Also, Marvel does not perform well with many small messages [5].
HCS SCI conduit not yet integrated with BUPC.
GM conduit uses Intel C compiler, 2.0 GHz µP.
UPC Benchmarks – FT from NAS benchmarks*
Class A size of benchmark executed with the same setup as IS
FT: 3-D Fast Fourier Transform, medium communication, high computation
Used optimized version O1 (private pointers to local shared memory)
High-bandwidth networks perform best (VAPI followed by Elan)
VAPI conduit allows a cluster of Xeons to keep pace with Marvel's performance
MPI on GigE not well suited for these types of problems (high-latency, low-bandwidth traits limit performance)
MPI on SCI has lower bandwidth than VAPI but still maintains near-linear speedup for more than 2 nodes (skirts TCP/IP overhead)
GM performance a factor of processor speed (see 1 Thread)
[Figure: FT execution time (sec) for 1, 2, 4, and 8 threads on the GM, Elan, GigE MPI, VAPI, and SCI MPI conduits and Marvel]
* Using code developed at GWU
High latency of MPI on GigE impedes performance.
GM conduit uses Intel C compiler, 2.0 GHz µP.
UPC Benchmarks -- DES Differential Attack Simulator
Differential attack simulator of S-DES (96-bit key) cipher (integer-based)
Creates basic components used in differential cryptanalysis: S-Boxes, Difference Pair Tables (DPT), and Differential Distribution Tables (DDT)
Implemented in UPC to expose parallelism in DPT and DDT creation, choosing best pairs
Bandwidth-intensive application
Various access methods to shared memory
Used upc_memget and upc_memput for shared remote data access
Used private pointers to local shared data for DPT creation
[Figure: DES execution time (msec) for sequential, 1, 2, and 4 threads on the GM, Elan, VAPI, and SCI MPI conduits and Marvel]
MPI on GigE not shown due to high execution times (top ~4.5 sec).
MPI conduits (GigE and SCI) get worse with increased nodes, as large communication between multiple nodes is not implemented effectively.
GM conduit uses Intel C compiler, 2.0 GHz µP.
DES Analysis
DES copies parts of each shared array to local memory for processing, and copies them back when complete
With an increasing number of nodes, bandwidth and NIC response time become more important
Designed for a high cache miss rate, so very costly in terms of memory access
Interconnects with high bandwidth and fast response times perform the best
Marvel shows near-perfect linear speedup; processing time of integers an issue
VAPI shows constant speedup
Elan shows near-linear speedup from 1 to 2 nodes, but more nodes needed in the testbed for better analysis
GM does not begin to show any speedup until 4 nodes, then minimal
MPI conduit clearly inadequate for high-bandwidth programs
Notes on BUPC
Casting is a serious issue in the compiler
Many times it completely ignored pointer casts and issued warnings about not having the proper cast
Private pointer to local shared data was unreliable due to casting issues; some recent bug entries trying to address that issue
Shared-memory access was not always reliable or was very slow, hence the need to copy shared data to local memory for processing
10 Gigabit Ethernet – Preliminary results
Testbed
Nodes: each with dual 2.4 GHz Xeons, S2io Xframe 10GigE card in PCI-X 100, 1GB PC2100 DDR RAM, Intel PRO/1000 1GigE, RedHat 9.0 kernel 2.4.20-8smp, LAM-MPI V7.0.3
[Figure: Round-trip latency (µs) vs. message size (0-4K bytes), 10GigE vs. GigE]
[Figure: Throughput (MB/s) vs. message size (64 bytes-64K), 10GigE vs. GigE]
10GigE is promising due to the economy of scale expected of Ethernet
S2io 10GigE shows impressive throughput, though slightly less than half of the theoretical maximum; further tuning needed to go higher
Results show a much-needed decrease in latency versus other Ethernet options
Conclusions
Key insights
HCS SCI conduit shows promise for increased performance over MPI on SCI
Needs further development, especially for very large segments and memory operations targeted outside the pinned segment (implementing the Firehose algorithm)
On-going collaboration with the vendor (Dolphin) will help to solve these problems
Berkeley UPC system a promising COTS cluster tool
Performance on par with HP UPC (also see [6])
Performance of COTS clusters matches and sometimes beats the performance of high-end CC-NUMA
Various conduits allow UPC to execute on many interconnects; VAPI and Elan are initially found to be strongest
Some open issues with bugs and optimization; active bug reports and the development team help improvements
Very good solution for clusters to execute UPC, but may not quite be ready for production use
No debugging or performance tools available
10GigE provides high bandwidth with lower latencies than 1GigE
Conclusions & Future Work
Key accomplishments to date
Baselining of UPC on shared-memory multiprocessors
Evaluation of promising tools for UPC on clusters
Leveraging and extension of communication and UPC layers
Conceptual design of new tools for UPC
Preliminary network and system performance analyses for UPC systems
Completion of V1.0 (segment-fast) of the GASNet Core and Extended API SCI conduit for UPC
Future Work
Continue refining HCS SCI conduit
Dolphin SCI driver extensions
Benchmarking of HCS SCI conduit in Berkeley UPC framework
UPC performance analysis tool
UPC debugging tool
References
1. D. Bonachea, "GASNet Specification, v1.1," U.C. Berkeley Tech Report UCB/CSD-02-1207, October 2002.
2. C. Bell, D. Bonachea, "A New DMA Registration Strategy for Pinning-Based High Performance Networks," Workshop on Communication Architecture for Clusters (CAC'03), April 2003.
3. W. Carlson, J. Draper, D. Culler, K. Yelick, E. Brooks, K. Warren, "Introduction to UPC and Language Specification," IDA Center for Computing Sciences, Tech. Report CCS-TR-99-157, May 1999.
4. K. A. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. N. Hilfinger, S. L. Graham, D. Gay, P. Colella, and A. Aiken, "Titanium: A High-Performance Java Dialect," Concurrency: Practice and Experience, Vol. 10, No. 11-13, September-November 1998.
5. B. Gordon, S. Oral, G. Li, H. Su, and A. George, "Performance Analysis of HP AlphaServer ES80 vs. SAN-based Clusters," 22nd IEEE International Performance, Computing, and Communications Conference (IPCCC), April 2003.
6. W. Chen, D. Bonachea, J. Duell, P. Husbands, C. Iancu, K. Yelick, "A Performance Analysis of the Berkeley UPC Compiler," 17th Annual International Conference on Supercomputing (ICS), June 2003.