San Diego, March 27th 2003
Roberto De Pietri -- chep03 1
apeNEXT* The apeNEXT multi-TFlops LGT supercomputer: architecture description and project status report
* The apeNEXT project
Roberto De Pietri ([email protected])
Università di Parma & INFN gruppo collegato di Parma
The APE family: our line of home-made computers

                                  APE (1988)    APE100 (1993)  APEmille (1999)  apeNEXT (2003)
Architecture                      SIMD          SIMD           SIMD             SIMD++
# nodes                           16            2048           2048             4096
Topology                          flexible 1D   rigid 3D       flexible 3D      flexible 3D
Memory                            256 MB        8 GB           64 GB            1 TB
# registers (w.size)              64 (x32)      128 (x32)      512 (x32)        512 (x64)
Clock speed                       8 MHz         25 MHz         66 MHz           200 MHz
Total computing power of all …    ~1.5 GFlops   ~250 GFlops    ~2 TFlops        ~8-20 TFlops
APE (‘88) 1 GFlops
The APE paradigm
• Very efficient for LQCD
  – The normal operation as a basic operation
  – Native implementation of the complex type
  – a x b + c (complex numbers)
• Large number of registers
  – Efficient optimizations
• VLIW (very long instruction word)
• Reliable and safe HW solution
• Easy to program
• Software tools
  – APEse, TAO
  – Machine simulator
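The complex "normal" operation a x b + c can be written out in real arithmetic to show where the 8 flops quoted for it later in the talk come from. The sketch below is only an illustration in Python; the function name `complex_normal` and the tuple representation are assumptions, not APE software.

```python
def complex_normal(a, b, c):
    """Compute a*b + c on (re, im) pairs using only real arithmetic."""
    ar, ai = a
    br, bi = b
    cr, ci = c
    # Four real multiplications.
    p0 = ar * br
    p1 = ai * bi
    p2 = ar * bi
    p3 = ai * br
    # Four real additions/subtractions.
    re = (p0 - p1) + cr
    im = (p2 + p3) + ci
    return (re, im)


FLOPS_PER_NORMAL = 4 + 4  # multiplications + additions/subtractions

a, b, c = (1.0, 2.0), (3.0, -1.0), (0.5, 0.25)
result = complex_normal(a, b, c)
# Cross-check against Python's built-in complex type.
reference = complex(*a) * complex(*b) + complex(*c)
print(result, reference)
```

Four multiplications plus four additions or subtractions give the 8 floating-point operations per normal operation.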
Since APE100
• Our own designed VLSI
  – Pipelined normal operation on a chip (MAD)
• 3D topology
  – Remote I/O and X links on cable
  – Y and Z links on the backplane
• Large number of APEmille installations in Europe
  – 30 crates (~65 GFlops each)
  – Almost 2 TeraFlops of computing power
APEmille installations
Site         Peak      Crates
Bielefeld    130 GF    2
Zeuthen      520 GF    8
Milan        130 GF    2
Bari          65 GF    1
Trento        65 GF    1
Pisa         325 GF    5
Rome 1       520 GF    8
Rome 2       130 GF    2
Orsay         16 GF    1/4
Swansea       65 GF    1
Grand total  ~1966 GF
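The grand total can be checked by summing the per-site figures. The short Python tally below uses only the numbers listed for each installation; the dictionary layout is an assumption made for the example.

```python
# (peak GFlops, crates) per site, as listed on the slide.
installations = {
    "Bielefeld": (130, 2),
    "Zeuthen": (520, 8),
    "Milan": (130, 2),
    "Bari": (65, 1),
    "Trento": (65, 1),
    "Pisa": (325, 5),
    "Rome 1": (520, 8),
    "Rome 2": (130, 2),
    "Orsay": (16, 0.25),
    "Swansea": (65, 1),
}

total_gflops = sum(gf for gf, _ in installations.values())
total_crates = sum(crates for _, crates in installations.values())
print(total_gflops, total_crates)
```

The sum is 1966 GFlops over 30.25 crates, in line with the "30 crates" and "almost 2 TeraFlops" quoted earlier in the talk.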
The apeNEXT architecture
3D mesh of computing nodes
Each node is a complete, self-sufficient computing engine (1.6 GFlops)
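As a rough illustration of nearest-neighbour addressing on such a 3D mesh, the Python sketch below computes the six neighbours of a node. Two things are assumptions and are not stated on the slides: the x-fastest node numbering and the periodic wrap-around in each direction. The function name `neighbours` is likewise hypothetical.

```python
def neighbours(node, dims):
    """Return the six nearest neighbours of `node` on a periodic 3D lattice.

    Assumptions (not from the slides): nodes are numbered x-fastest,
    node = x + lx * (y + ly * z), and every direction wraps around.
    """
    lx, ly, lz = dims
    x = node % lx
    y = (node // lx) % ly
    z = node // (lx * ly)

    def index(i, j, k):
        # Wrap each coordinate, then flatten back to a node number.
        return (i % lx) + lx * ((j % ly) + ly * (k % lz))

    return {
        "X+": index(x + 1, y, z), "X-": index(x - 1, y, z),
        "Y+": index(x, y + 1, z), "Y-": index(x, y - 1, z),
        "Z+": index(x, y, z + 1), "Z-": index(x, y, z - 1),
    }


print(neighbours(0, (8, 8, 8)))
```

With this numbering, node 0 of an 8 x 8 x 8 lattice has neighbours 1 and 7 along X, 8 and 56 along Y, and 64 and 448 along Z.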
[Figure: processing-board node diagram (nodes 0-15) with the Z+ (bp), Y+ (bp) and X+ (cables) directions; each node is a J&T with DDR-MEM, links X+ …… Z- and the 7th link]
The apeNEXT architecture (2)
Two directions (Y,Z) on the backplane
Direction X through front panel cables
System topologies:
Processing Board     4 x 2 x 2        ~ 26 GF
subCrate (16 PB)     4 x 8 x 8        ~ 0.4 TF
Crate (32 PB)        8 x 8 x 8        ~ 0.8 TF
Large systems        (8*n) x 8 x 8
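The peak figures for each topology follow from the node count times the 1.6 GFlops per node quoted earlier. A minimal Python check, with the helper name `peak_gflops` chosen only for this example:

```python
NODE_PEAK_GFLOPS = 1.6  # per-node peak quoted in the talk

topologies = {
    "Processing Board": (4, 2, 2),
    "subCrate (16 PB)": (4, 8, 8),
    "Crate (32 PB)": (8, 8, 8),
}


def peak_gflops(dims):
    """Peak performance of an x * y * z mesh of nodes, in GFlops."""
    x, y, z = dims
    return x * y * z * NODE_PEAK_GFLOPS


for name, dims in topologies.items():
    print(f"{name}: {dims[0] * dims[1] * dims[2]} nodes, {peak_gflops(dims):.1f} GFlops")
```

A large system built from n crates along X would be `peak_gflops((8 * n, 8, 8))`.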
[Figure: processing-board node diagram (nodes 0-15) with the Z+ (bp), Y+ (bp) and X+ (cables) directions; each node is a J&T with DDR-MEM and links X+ …… Z-]
Components (1)
The CHIP
The J&T chip is the core of apeNEXT and everything is built around it !!
Components (2)
J&T Module
• 1 J&T chip
• 9 DRAM chips
  – 256 Mbit memory chips
  – 1024 Mbit memory chips (supported)
Components (3)
Processing Board
[Figure: processing board with nodes 0-15 and the Z+ (bp), Y+ (bp) and X+ (cables) directions]
Components (4)
Back Plane
[Figure: backplane carrying the Z+, Z- links and the Y+, Y- links, shown with the node diagram (nodes 0-15) and the Z+ (bp), Y+ (bp) and X+ (cables) directions]
Components (5)
The Cabinet
• Standard 1U rack-mounted PC
• Standard 48 V power supplies
Host Interface
• I2C: bootstrap & control
• 7th-Link (200 MB/s)
Host I/O Interface
[Figure: block diagram of the host interface board in PCI form factor (64 bit, 66 MHz): I2C (x4) and 7th Link ports, 7Link Ctrl and I2C Ctrl blocks, Fifos, QDR Mem Ctrl and QDR Mem Bank, PCI Master Ctrl and PCI Target Ctrl, PLDA PCI interface on an Altera APEX II]
• PCI interface: 64 bit, 66 MHz
• PCI master mode for the 7th Link interface
• PCI target mode for the I2C interface
• Quad Data Rate memory (x32)
• Altera APEX II based
• 7th Link: 1(2) bidir chan. (200*9 M/s)
• I2C: 4 independent ports
PB
• 16 nodes, 3D-interconnected
• 4x2x2 topology: 26 GFlops, 4.6 GB memory
• Light system:
  – J&T Module connectors
  – Glue logic (clock tree 10 MHz)
  – Global signal interconnection (FPGA)
  – DC-DC converters (48 V to 3.3/2.5/1.8 V)
• Dominant technologies:
  – LVDS: 1728 (16*6*2*9) differential signals at 200 MB/s, 144 routed via cables, 576 via backplane, on 12 controlled-impedance (100 Ω) layers
  – High-speed differential connectors:
    • Samtec QTS (J&T Module)
    • Erni ERMET-ZD (Backplane)
• Collaboration with NEURICAM spa
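The LVDS signal counts can be reproduced from the 4x2x2 board geometry. The Python sketch below is one reading that matches the quoted numbers; the interpretation of the factors 2 (one send and one receive channel per link) and 9 (differential pairs per channel) is an assumption, as are all variable names.

```python
X, Y, Z = 4, 2, 2        # board topology from the slide
LINKS_PER_NODE = 6       # X+, X-, Y+, Y-, Z+, Z-
DIRECTIONS = 2           # assumption: one send and one receive channel per link
PAIRS_PER_CHANNEL = 9    # assumption: meaning of the factor 9 in 16*6*2*9

nodes = X * Y * Z
total_signals = nodes * LINKS_PER_NODE * DIRECTIONS * PAIRS_PER_CHANNEL

# Link ends that leave the board: one per node on each outer face.
x_face_ends = 2 * (Y * Z)                  # X+ and X- faces, front-panel cables
yz_face_ends = 2 * (X * Z) + 2 * (X * Y)   # Y and Z faces, backplane

cable_signals = x_face_ends * DIRECTIONS * PAIRS_PER_CHANNEL
backplane_signals = yz_face_ends * DIRECTIONS * PAIRS_PER_CHANNEL
print(total_signals, cable_signals, backplane_signals)
```

Under these assumptions the script prints 1728, 144 and 576, the three figures given for the board.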
J&T Module
• J&T
• 9 DDR-SDRAM, 256 Mbit (x16) memory chips
• 6 LVDS links, up to 400 MB/s
• Host fast I/O link (7th Link)
• I2C link (slow control network)
• Dual power 2.5 V + 1.8 V, 7-10 W estimated
• Dominant technologies:
  – SSTL-II (memory interface)
  – LVDS (network interface + I/O)
Overview of the J&T Architecture
• Peak floating point performance of about 1.6 GFlops
  – IEEE compliant double precision
• Integer arithmetic performance of about 400 Mips
• Link bandwidth of about 200 Mbyte/sec each, full duplex
  – 7 links: X+, X-, Y+, Y-, Z+, Z- and the 7th link
• Support for current generation DDR memory
• Memory bandwidth of 3.2 Gbyte/sec
  – 400 Mword/sec
J&T
• Computing & control integrated
• No glue logic
• Reduced time for design, simulation and test of the prototype
J&T: Top Level Diagram
The J&T Arithmetic BOX
• Pipelined complex “normal” a*b+c (8 flops) per cycle
• 4 multipliers
• 4 adder/sub
• At 200 MHz (fully pipelined) = 1.6 GFlops
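The 1.6 GFlops peak is simply the number of floating-point units times the clock, assuming every unit retires one operation per cycle. A small Python check with illustrative variable names:

```python
MULTIPLIERS = 4
ADDERS = 4
CLOCK_HZ = 200_000_000  # 200 MHz

# Fully pipelined: each unit completes one operation per clock cycle.
flops_per_cycle = MULTIPLIERS + ADDERS
peak_flops = flops_per_cycle * CLOCK_HZ
print(flops_per_cycle, peak_flops / 1e9, "GFlops")
```

This is a peak figure; it is reached only when a complex normal operation is issued on every cycle.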
The J&T remote IO
fifo-based communication:
LVDS
1.6 Gb/s per link (8 bit @ 200MHz)
6 (+1) independent links
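The per-link figure ties the 1.6 Gb/s quoted here to the 200 Mbyte/sec quoted in the architecture overview. A short Python conversion, with variable names chosen for the example:

```python
BITS_PER_CYCLE = 8
CLOCK_HZ = 200_000_000  # 200 MHz

link_bits_per_s = BITS_PER_CYCLE * CLOCK_HZ
link_gbit_per_s = link_bits_per_s / 1e9
link_mbyte_per_s = link_bits_per_s / 8 / 1e6
print(link_gbit_per_s, "Gb/s =", link_mbyte_per_s, "MB/s per link and direction")
```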
J&T summary
• CMOS 0.18 µm, 7 metal (ATMEL)
• 200 MHz
• Double precision complex normal operation
• 64 bit AGU
• 8 KW program cache
• 128 bit local memory channel
• 6+1 LVDS 200 MB/s links
• BGA package, 600 pins
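The memory figures quoted in the architecture overview (3.2 Gbyte/sec, 400 Mword/sec) are consistent with the 128 bit memory channel listed here. The Python check below assumes one 128 bit transfer per 200 MHz cycle and 64 bit words; neither assumption is stated explicitly on the slides.

```python
CHANNEL_BITS = 128
TRANSFERS_PER_S = 200_000_000  # assumption: one transfer per 200 MHz cycle
WORD_BITS = 64                 # assumption: a word is one 64 bit value

bytes_per_s = CHANNEL_BITS // 8 * TRANSFERS_PER_S
words_per_s = bytes_per_s // (WORD_BITS // 8)
print(bytes_per_s / 1e9, "GB/s,", words_per_s / 1e6, "Mword/s")
```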
Key steps of the J&T design
✔ January 2001: VHDL design starts
✔ May 2001: Contract with Atmel established
✔ November 2001: First placement experiment started
✔ February 2002: Major rework on the network protocol (to increase robustness against transmission errors)
✔ April 2002: Network OK, re-start placement exercises
✔ June 2002: Good placement available
✔ June 2002 (end): Satisfactory routing available
✔ July 2002 (beginning): Power routing not OK and 5% of “random logic” removed
✔ July 2002 (end): Both problems solved
(continues on the next slide)
Key steps of the J&T design (2)
✔ September 2002: New placement available (with new power layout)
✔ September 2002: Excessive congestion .... OR
✔ October 2002: Very bad timing closure
✔ November 2002: Satisfactory placement OK
✔ Dec. 9th 2002: Successful routing completed
✔ January 2003: Timing analysis reasonably satisfactory
✔ January 2003: Simulations with back annotation OK
✔ January 2003: Analysis of critical path (dangerous and not)
✔ February 2003: Hammering down remaining timing problems
✔ February 2003: Careful analysis of all risky corners
✔ February 2003: Transfer of simulation data to Atmel
✔ End of March: Final sign off (Laura …. is working on it…..)
Timing
• J&T ready June 03
  – We will receive between 300 and 600 chips
  – We need 256 processors to assemble a crate!
• We expect them to work!
  – The same team designed 7 ASICs of similar complexity
  – Impressive, fully detailed simulations of multiple J&T systems
  – The more one simulates, the less one has to test!
• Everything else ready and tested
  – Within days/weeks the first working apeNEXT computer will operate
• September ’03: mass production will start (hopefully) at Neuricam
• INFN has already funded 8 TFlops of computing power!
Mechanics
[Figure: local top view of the apeNEXT PB in its frame, showing the J&T Modules, the DC/DC converters, the board-to-board connector and air-flow channels 1, 2 and 3]
PB constraints:
• Power consumption: up to 340 W
• PB-BP insertion force: 80-150 kg (!)
• Fully populated PB weight: 4-5 kg
Custom design of card frame and insertion tool
Detailed study of airflow
PB Prototype
• T, V, I monitored
• Interfaced to I2C control network
PB (preliminary) Test
• Next test-bed: metal frame with power supply
• I2C test, i.e. test of the “slow-control” I/O interface
• Minimal set of components assembled
• Simple/short test (1 week)
• Done successfully (Dec 01)
• Clock distribution test
• PB LVDS characterization
PB Status
Activity                                                Status   Who                  Cost       Note
PB development (inc. feasibility study and LVDS EVB)    Done     Neuricam             67 KEuro
PB ver.1 prototypes (3)                                 Done     Neuricam, DDI        10 KEuro
J&T Module develop.                                     Done     Neuricam             23 KEuro
PB ver.2 prototypes (3)                                 Done     Neuricam, SOMACIS    10 KEuro
NEXT BackPlane
• 16 PB slots + root slot
• Size 447x600 mm²
• 4600 LVDS differential signals, point-to-point, up to 600 Mb/s
• 16 controlled-impedance layers (32 total)
• Press-fit only
• Erni/Tyco connectors
  – ERMET-ZD
• Providers:
  – APW (primary)
  – ERNI (2nd source)
• Connector kit cost: 7 KEuro (!)
• PB insertion force: 80-150 kg (!)

Activity             Status   Who          Cost       Note
BP development       Done     APW(ERNI)    32 KEuro
BP prototypes (3)    Done     APW          41 KEuro
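The backplane signal count is close to what one gets from the per-board figure. The Python estimate below is only a plausible reading: it assumes each of the 16 PB slots brings the 576 backplane signals quoted on the PB slide, and that every point-to-point signal is shared by exactly two boards. Signals to the root slot are ignored.

```python
PB_SLOTS = 16
BACKPLANE_SIGNALS_PER_PB = 576  # per-board figure quoted on the PB slide
ENDS_PER_SIGNAL = 2             # assumption: point-to-point, two boards per signal

estimated_signals = PB_SLOTS * BACKPLANE_SIGNALS_PER_PB // ENDS_PER_SIGNAL
print(estimated_signals)
```

The estimate gives 4608, which rounds to the 4600 differential signals listed for the backplane.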
Host I/O Interface
• PCI interface: 64 bit, 66 MHz
• PCI master mode for the 7th Link interface
• PCI target mode for the I2C interface
• Quad Data Rate memory (x32)
• Altera APEX II based
• 7th Link: 1(2) bidir chan. (200*9 M/s)
• I2C: 4 independent ports

Activity                     Status   Who         Cost       Note
Altera design                Done     INFN
PCB design and prototypes    Done     NEURICAM    3 KEuro
Cabinets
• Problem:
  – PB weight: 4-5 kg, PB consumption: 340 W (est.)
  – 32 PB + 2 root boards (2 independent subcrates)
  – Power supply: (<48 V x 150 A per subcrate)
  – Integrated host PCs
  – Forced air cooling
  – Robust, expandable/modular, CE, EMC ....
• Solution:
  – 42U rack (h: 2.10 m):
    • EMC proof
    • efficient cable routing
  – 19”-1U slots for 9 “host PCs” (rack mounted)
  – Hot-swap power supply cabinet (modular)
  – Custom design of “card cage” and “tie bar”
  – Custom design of cooling system

Activity                                           Status             Who              Cost          Note
Design of rack (inc. selection of power supply)    Done (Apr ’02)     APW(NEURICAM)    50 KEuro
Full rack prototype                                Done (Sept ’02)    APW              8-10 KEuro
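The power requirement can be checked against the supply limit with a few lines of Python. The sketch takes 16 PB per subcrate (32 PB in 2 subcrates) and the estimated 340 W per PB. It ignores the root board, fans and conversion losses, which the slides do not quantify.

```python
PB_PER_SUBCRATE = 16          # 32 PB split over 2 independent subcrates
PB_POWER_W = 340              # estimated consumption per PB
SUPPLY_VOLTAGE_V = 48
SUPPLY_CURRENT_LIMIT_A = 150  # upper bound quoted per subcrate

subcrate_load_w = PB_PER_SUBCRATE * PB_POWER_W
subcrate_limit_w = SUPPLY_VOLTAGE_V * SUPPLY_CURRENT_LIMIT_A
headroom_w = subcrate_limit_w - subcrate_load_w
print(subcrate_load_w, subcrate_limit_w, headroom_w)
```

The boards alone draw about 5.4 kW per subcrate against a 7.2 kW ceiling, so a full cabinet needs roughly 11 kW for the PBs.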
Software
• TAO compilers and linker ….. READY
  – All existing APE programs will run with no change
  – Physics code has already been run on the simulator
  – Kernels of PHYSICS codes used to benchmark the efficiency of the FP unit
• C COMPILER
  – gcc (2.93) and lcc have been retargeted
  – lcc WORKS (almost). Factor 5 on performance
  – http://www.cs.princeton.edu/software/lcc/
Project Costs
• Total development cost of 1700 k€uro
  – 1050 k€uro for VLSI development
  – 550 k€uro non VLSI
• Manpower involved = 20 man/year
• Mass production cost ~0.5 €uro/MFlops
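As a worked example of the cost figures, the Python snippet below applies the quoted ~0.5 €uro/MFlops to the 8 TFlops installation mentioned elsewhere in the talk and compares the two itemised development costs with the stated total. The variable names are illustrative, and the result is only as accurate as the rounded inputs.

```python
TOTAL_DEVELOPMENT_KEURO = 1700
VLSI_KEURO = 1050
NON_VLSI_KEURO = 550
COST_PER_MFLOPS_EURO = 0.5  # approximate mass-production cost

# Part of the total that is not covered by the two itemised entries.
unitemised_keuro = TOTAL_DEVELOPMENT_KEURO - VLSI_KEURO - NON_VLSI_KEURO

installation_tflops = 8
installation_mflops = installation_tflops * 1_000_000
production_cost_euro = installation_mflops * COST_PER_MFLOPS_EURO
print(unitemised_keuro, production_cost_euro)
```

At that rate an 8 TFlops machine costs about 4 million euro to produce. The two itemised entries add up to 1600 k€uro, and the slide does not say what the remaining 100 k€uro covers.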
Conclusions
• J&T ready June 03 (300….600 chips)
• Everything else ready and tested !!!
• If tests ok, mass production starting September ‘03 at Neuricam
• All components over-dimensioned
  – Cooling, LVDS tested @ 400 Mb/s, power supply on boards …
  – Makes possible a technology step with no extra design and test effort
Conclusions (2)
• Installation plans
  – INFN: 8 TFlops (10 cabinets) already approved (on delivery of a working machine)
  – DESY: considering between 8 TFlops and 16 TFlops
  – Paris ……….
• Inversion of Dirac operator (APEmille program)
  – 54% efficiency on the VHDL hardware simulator
  – Communications, memory refresh, synchronization wait …….. all included …
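To put the 54% efficiency in perspective, the Python sketch below converts it into sustained performance per node and for the 10-cabinet INFN installation. It assumes one cabinet holds 32 PB of 16 nodes each and that the simulated efficiency carries over unchanged to the full machine, which the slides do not claim.

```python
NODE_PEAK_GFLOPS = 1.6
NODES_PER_CABINET = 32 * 16  # assumption: 32 PB x 16 nodes per cabinet
CABINETS = 10
DIRAC_EFFICIENCY = 0.54      # measured on the VHDL hardware simulator

peak_tflops = CABINETS * NODES_PER_CABINET * NODE_PEAK_GFLOPS / 1000
sustained_per_node_gflops = DIRAC_EFFICIENCY * NODE_PEAK_GFLOPS
sustained_tflops = DIRAC_EFFICIENCY * peak_tflops
print(f"{peak_tflops:.2f} TFlops peak, {sustained_tflops:.2f} TFlops sustained, "
      f"{sustained_per_node_gflops:.3f} GFlops per node")
```

Ten cabinets give about 8.2 TFlops peak, matching the 8 TFlops quoted for INFN. At 54% efficiency this would correspond to roughly 4.4 TFlops sustained on the Dirac inversion.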
apeNEXT vs. cluster
[Figure labels: 72.5 GFlops; 409.6 GFlops; 819.2 GFlops; 1.6*16*16*2 GFlops]
ASICs of similar complexity
ADD322      3-input integer adder. Prototype for APE100, integrated into ZCPU
MAD         APE100      Floating point engine
ZCPU        APE100      Sequencer + Integer ALU + AGU
Commuter    APE100      Communication device
T1000       APEmille    Integer ALU + AGU + Program controller
J1000       APEmille    Floating point engine
COMM1000    APEmille    Communication device