Post on 16-Jan-2016
BSP on the Origin2000
Lab for the course: Seminar in Scientific Computing with BSP
Dr. Anne Weill, anne@tx.technion.ac.il, ph: 4997
Origin2000 (SGI)
32 processors
Origin2000/3000 architecture features
Important hardware and software components:
* node board: processors + memory
* node interconnect topology and configurations
* scalability of the architecture
* directory-based cache coherency
* single system image components
Origin2000 node board
Origin2000 – two nodes
Origin2000 interconnect
Origin2000 interconnect
32 processors
64 processors
Origin router interconnect
- The router chip has 6 CrayLink interfaces: 2 for connections to nodes (HUBs) and 4 for connections to other routers in the network
  * 4-dimensional interconnect
- Router links are point-to-point connections, 17+7 wires @ 400 MHz (that is, a wire speed of 800 MB/s)
- Wormhole routing with a static routing table loaded at boot
- Router delay is 50 ns in one direction
- The interconnect topology is determined by the size of the computer (number of nodes):
  * direct (back-to-back) connection for 2 nodes (4 cpu)
  * strongly connected cube for up to 32 cpu
  * hypercube for up to 64 cpu
  * hypercube of hypercubes for up to 256 cpu
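To get a feeling for what the hypercube topology and the 50 ns router delay mean in practice, the following is a rough sketch in C, not a description of the actual CrayLink routing tables. It assumes that routers are numbered so that directly linked routers differ in exactly one bit of their id (the textbook hypercube numbering), that a message passes through one router more than the number of router-to-router hops, and that only the router delay counts. The names hypercube_hops and router_delay_ns are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Router delay in one direction, as quoted on the slide. */
#define ROUTER_DELAY_NS 50

/* Minimum number of router-to-router hops in a hypercube.
 * ASSUMPTION: router ids are numbered so that linked routers differ in
 * exactly one bit, so the hop count is the Hamming distance of the ids. */
int hypercube_hops(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;   /* bits in which the two ids differ */
    int hops = 0;

    while (diff != 0) {
        diff &= diff - 1;    /* clear the lowest set bit */
        hops++;
    }
    return hops;
}

/* Rough one-way delay spent inside routers.
 * ASSUMPTION: the message crosses hops + 1 routers (the source router
 * included); wire and HUB delays are ignored. */
int router_delay_ns(uint32_t a, uint32_t b)
{
    assert(ROUTER_DELAY_NS > 0);
    return (hypercube_hops(a, b) + 1) * ROUTER_DELAY_NS;
}
```

Under these assumptions the two most distant routers of a 4-dimensional hypercube (ids 0 and 15) are 4 hops apart.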
Origin address space
- Physically the memory is distributed and not contiguous
- Node id is assigned at boot time
- Logically memory is a single shared contiguous address space; the virtual address space is 44 bits (16 TB)
- A program (compiler) uses the virtual address space
- The CPU translates from the virtual to the physical address space
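[Figure: translation from virtual to physical addresses through the TLB (Translation Look-aside Buffer). Physical address layout: node id, 8 bits (bits 39-32); node offset, 32 bits (bits 31-0, 4 GB). Physical memory is drawn per node id (0, 1, 2, 3, ...). Legend: empty slot, memory present, page.]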
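The physical address layout above (8-bit node id in bits 39-32, 32-bit node offset in bits 31-0) can be made concrete with a few lines of C. This is only an illustration of the bit layout; the function names are invented here and are not part of any SGI interface.

```c
#include <assert.h>
#include <stdint.h>

#define NODE_OFFSET_BITS 32   /* bits 31-0: offset inside one node (4 GB) */
#define NODE_ID_BITS      8   /* bits 39-32: node id */

/* Extract the node id from a 40-bit physical address. */
unsigned phys_node_id(uint64_t phys)
{
    return (unsigned)((phys >> NODE_OFFSET_BITS) & ((1u << NODE_ID_BITS) - 1));
}

/* Extract the offset inside the node's local memory. */
uint32_t phys_node_offset(uint64_t phys)
{
    return (uint32_t)(phys & 0xFFFFFFFFu);
}

/* Build a physical address from a node id and an offset. */
uint64_t phys_make(unsigned node_id, uint32_t offset)
{
    assert(node_id < (1u << NODE_ID_BITS));
    return ((uint64_t)node_id << NODE_OFFSET_BITS) | offset;
}
```

For example, offset 0x1000 on node 3 corresponds to the physical address 0x300001000.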
Login to carmel
1. Open an ssh window to:
carmel.technion.ac.il
2. Username: course01-course20
Password: bsp2006
Contact: Dr. Anne Weill, anne@tx.technion.ac.il,
phone: 4997
Compiling and running codes
1. Setting the path
% set path=($path /u/tcc/anne/BSP/bin)
2. Compiling
% bspcc prog1.c -o prog1
% bspcc -flibrary-level 1 prog1.c -o prog1 (for a non-dedicated machine)
3. Running
% bsprun -npes 4 prog1
Running on carmel
1. Interactive mode:
% ./prog.exe <parameters>
2. NQE queues:
% qsub -q qcourse script.bat
BSP functions
bsp_begin(maxpr) Start of program with at most maxpr processes
bsp_end() End of program
bsp_nprocs() Number of processes currently running
bsp_pid() Returns process id
bsp_time() Returns elapsed wallclock time
Sample program
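A minimal sketch of a hello program that uses only the functions listed above. It is not necessarily the program shown in the lecture. It assumes that the BSPlib header is called bsp.h (the course directory also provides bspedupack.h, which may be the header to include instead) and that bsp_nprocs() called before bsp_begin returns the number of available processes.

```c
#include <stdio.h>
#include "bsp.h"   /* ASSUMPTION: name of the BSPlib header on this system */

int main(void)
{
    /* Start the SPMD part with as many processes as are available. */
    bsp_begin(bsp_nprocs());

    /* Every process executes this line with its own pid. */
    printf("Hello from process %d of %d, t = %f s\n",
           bsp_pid(), bsp_nprocs(), bsp_time());

    bsp_end();
    return 0;
}
```

The order in which the lines appear is not fixed, because the processes run independently between bsp_begin and bsp_end.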
Output of hello program
How it works
[Figure: bsprun starts one copy of Prog.exe on each of the processors P0, P1, P2 and P3.]
SPMD – single program multiple data
• Each processor views only its local memory.
• Contents of variable X are different in different processors.
• Transfer of data can occur in principle through one-sided or two-sided communication.
DRMA – direct remote memory access
• All processors must register the memory area that remote “read” and “write” operations will access
• Calls to bsp_put
• Calls to bsp_get
• Call to bsp_sync: all processors synchronize, and all communication is complete after the call
BSP functions for communication
bsp_push_reg(var,nbytes) Registration of variable
bsp_put(pid,source,dest,offset,nbytes)
pid is the destination processor
bsp_get(pid,source,offset,dest,nbytes)
pid is the source processor
bsp_pop_reg(var) Deregistration of variable
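The calls above can be combined as in the following sketch, in which every process sends its pid to its right neighbour with bsp_put and reads the pid of its right neighbour with bsp_get. This is an illustration written for these notes, not one of the course programs; the header name bsp.h and the variable names are assumptions.

```c
#include <stdio.h>
#include "bsp.h"   /* ASSUMPTION: name of the BSPlib header on this system */

int main(void)
{
    int p, me, right;
    int from_left = -1;    /* will be written remotely by the left neighbour */
    int from_right = -1;   /* will be filled locally by bsp_get */

    bsp_begin(bsp_nprocs());
    p = bsp_nprocs();
    me = bsp_pid();
    right = (me + 1) % p;

    /* Register the variables that are accessed remotely. */
    bsp_push_reg(&from_left, sizeof(int));
    bsp_push_reg(&me, sizeof(int));
    bsp_sync();            /* registration takes effect after this sync */

    /* put: copy my pid into from_left on the right neighbour */
    bsp_put(right, &me, &from_left, 0, sizeof(int));
    /* get: copy the right neighbour's pid into my from_right */
    bsp_get(right, &me, 0, &from_right, sizeof(int));
    bsp_sync();            /* all communication is complete after this call */

    printf("P%d: from_left = %d, from_right = %d\n", me, from_left, from_right);

    bsp_pop_reg(&me);
    bsp_pop_reg(&from_left);
    bsp_end();
    return 0;
}
```

Both values must be read only after the second bsp_sync; before it they still hold their initial value -1.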
Script file for batch
Output of command: “qstat -a”
Another example
• What does the following program do?
• What will the program print?
Output of program
Another example
• Is there a problem with the following example?
• What will the program print?
Answer
• As written, the program will not print any output: the data is actually transferred only after the bsp_sync statement
• Additional question: what will the program print if bsp_sync is placed right after the put statement?
• NB: the programs are in the directory /u/tcc/anne/BSPcourse, as prog2.c and prog2wrong.c. Try them.
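The rule used in the answer, that a put only becomes visible at the destination after bsp_sync, can be imitated with a small sequential toy model. The code below is not BSPlib and does not run in parallel; it only queues "puts" and delivers them when the toy sync is called. All names (ToyBsp, toy_put, toy_sync) are invented for this illustration.

```c
#include <assert.h>

#define TOY_P 4             /* number of simulated processes */
#define TOY_MAX_PENDING 16  /* capacity of the put queue */

typedef struct {
    int dest_pid;           /* process whose copy of x is written */
    int value;              /* value captured when the put was issued */
} ToyPut;

typedef struct {
    int x[TOY_P];                    /* one copy of variable x per process */
    ToyPut pending[TOY_MAX_PENDING]; /* puts issued in the current superstep */
    int npending;
} ToyBsp;

/* Every process starts with x = -1 and no outstanding communication. */
void toy_init(ToyBsp *m)
{
    int i;
    for (i = 0; i < TOY_P; i++)
        m->x[i] = -1;
    m->npending = 0;
}

/* Like a put: the request is only recorded, nothing is written yet. */
void toy_put(ToyBsp *m, int dest_pid, int value)
{
    assert(dest_pid >= 0 && dest_pid < TOY_P);
    assert(m->npending < TOY_MAX_PENDING);
    m->pending[m->npending].dest_pid = dest_pid;
    m->pending[m->npending].value = value;
    m->npending++;
}

/* Like a sync: all recorded puts are delivered, then the queue is emptied. */
void toy_sync(ToyBsp *m)
{
    int i;
    for (i = 0; i < m->npending; i++)
        m->x[m->pending[i].dest_pid] = m->pending[i].value;
    m->npending = 0;
}
```

After toy_put(&m, 1, 42) the value m.x[1] is still -1; it becomes 42 only once toy_sync(&m) has been called, which mirrors why reading the destination variable before bsp_sync gives the old contents.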
Exercise 1 (due Nov. 26th 2006)
1. Copy the directory /u/tcc/anne/BSPcourse to your own directory. Take a look at the bspedupack.h file.
2. Write a C program in which each processor writes its pid into an array PIDS(0:p-1) on P0, so that PIDS(i)=i.
3. Run the program for p=1,2,4,8,16 processors and print PIDS. You can run it interactively.
4. Do the same using a get instruction.