Correlator Options for 128T
MWA Cambridge Meeting
Roger Cappallo
MIT Haystack Observatory
2011.6.6
Current Status
Correlator Hardware Inventory
• 10 each of v.2 Correlator Boards, PFB Boards, CB/RTM’s, PFB/RTM’s
• 2 full-size card cages + 1 small, with power supplies
e2e simulation software
• diagram labels: file input, packets, module, file output, packets
PFB FPGA Firmware for 32T
• very limited de-skew capability
• no inter-board transfer (via mesh backplane)
• corner-turns specific to 32T case
• PFB to 10 kHz channels needs no changes
Current Status (cont’d)
CB FPGA Firmware
• 32T: operational code uses every other 50 ms interval, though 100% duty cycle code is available
• 512T: error-free CMAC (only) code for 115 cells working at 180 MHz
128T Correlator Requirements
• 30.72 MHz BW in 24 coarse channels of 1.28 MHz
• 256 inputs
• 16 Rx’s with 48 fibres
• 82.6 Gb/s aggregate bit rate
• ~32K correlation products
• F stage: ~150 GCMAC/s (12 tap FIR, 40 kHz channels)
• X stage: 1.01 TCMAC/s
KEY (compared to 32T): same, x4, x16
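The product count and X-stage rate on this slide can be reproduced with a short back-of-envelope calculation. The sketch below is not the author's derivation: it assumes that autocorrelations are included in the product count, that the signal is complex-sampled at the full bandwidth, and that each product needs one CMAC per sample. The variable names are illustrative.

```python
# Back-of-envelope check of the X-stage figures (illustrative sketch only).
N_INPUTS = 256           # from the slide
COARSE_CHANNELS = 24     # from the slide
COARSE_BW_HZ = 1.28e6    # from the slide

total_bw_hz = COARSE_CHANNELS * COARSE_BW_HZ

# Assumption: every input pair plus each input's autocorrelation is one product.
n_products = N_INPUTS * (N_INPUTS + 1) // 2

# Assumption: complex samples at the bandwidth, one CMAC per product per sample.
x_stage_cmac_per_s = n_products * total_bw_hz

print(f"total bandwidth : {total_bw_hz / 1e6:.2f} MHz")
print(f"products        : {n_products}")
print(f"X stage         : {x_stage_cmac_per_s / 1e12:.2f} TCMAC/s")
```

Under these assumptions the numbers land on the slide's 30.72 MHz, ~32K products and 1.01 TCMAC/s.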
Top Level Choices
• hardware: use current hardware, developing FPGA firmware as necessary
• software: get Rx signals into standardized format (10 gigE) ASAP, do PFB and correlation in GPU-equipped server
• hybrid: use existing PFB’s for F stage and to form 10 gigE packets to be correlated in software
Hardware Solution
• Using existing 32T firmware it should take 4 PFB boards and 16 CB’s, but the architecture doesn’t scale in a fully-parallel sense due to cross-correlations, and it would really take 6 PFB’s and 18 CB’s, with firmware mods
• unchanged 32T firmware leads to a system with 20 PFB’s and 20 CB’s!
• using the tested CMAC design (115 cells @ 180 MHz) yields enough computation in ~6.5 CB’s; the optimal partition appears to be 8 PFB’s and 8 CB’s
18 CB System
• split system into thirds, each getting 8 coarse chans
• each PFB gets 8 input fibres (need to do deskew)
• routing logic on CB’s changes, CMAC’s same
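The partition figures for this option follow from simple division of the system totals. The sketch below uses the 6 PFB / 18 CB counts from the Hardware Solution slide; the even split of CB's across the three groups is an assumption made for illustration, not something the slides state.

```python
# Partition arithmetic for the 18 CB option (illustrative sketch only).
TOTAL_FIBRES = 48        # from the requirements slide
COARSE_CHANNELS = 24     # from the requirements slide
N_PFB = 6                # from the Hardware Solution slide
N_CB = 18                # from the Hardware Solution slide
N_GROUPS = 3             # "split system into thirds"

fibres_per_pfb = TOTAL_FIBRES // N_PFB          # input fibres handled by each PFB
chans_per_group = COARSE_CHANNELS // N_GROUPS   # coarse channels per third

# Assumption: the CB's are divided evenly among the three groups.
cbs_per_group = N_CB // N_GROUPS

print(fibres_per_pfb, chans_per_group, cbs_per_group)
```

The first two values match the 8 fibres per PFB and 8 coarse channels per third quoted on the slide.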
18 CB Hardware Assessment
PRO
• relatively minor FPGA design work on PFB
• modest amount of change to FPGA code on CB’s
• system interfaces all tested and working
• use is made of all purpose-built boards
CON
• another build of ~10 CB’s (and CB/RTM’s) necessary (~120 K$)
8 CB System
• Each PFB gets 6 input fibres total, from 2 Rx’s
• Each PFB outputs to 8 different CB’s
• CB uses CMAC design from 512T at only 80% of achieved speed
• CB needs some cleverness in allocating cells to CMAC chips
• LTA could be skipped due to low output rate (10 Hz dump rate)
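The "80% of achieved speed" figure is consistent with the ~6.5 CB estimate from the Hardware Solution slide. The sketch below assumes that CMAC throughput scales linearly with clock rate, which the slides do not state; the names are illustrative.

```python
# Consistency check for the 8 CB option (illustrative sketch only).
CBS_NEEDED_AT_FULL_SPEED = 6.5   # "~6.5 CB's" at 180 MHz, from the Hardware Solution slide
CBS_USED = 8
ACHIEVED_CLOCK_MHZ = 180.0       # tested 512T CMAC design
TOTAL_FIBRES = 48
TOTAL_RX = 16
N_PFB = 8

# Assumption: throughput is proportional to clock rate.
speed_fraction = CBS_NEEDED_AT_FULL_SPEED / CBS_USED
required_clock_mhz = ACHIEVED_CLOCK_MHZ * speed_fraction

fibres_per_pfb = TOTAL_FIBRES // N_PFB
rx_per_pfb = TOTAL_RX // N_PFB

print(f"speed fraction : {speed_fraction:.0%}")
print(f"required clock : {required_clock_mhz:.1f} MHz")
print(f"per PFB        : {fibres_per_pfb} fibres from {rx_per_pfb} Rx's")
```

Spreading 6.5 boards' worth of work over 8 boards leaves each running at roughly 81% of the tested clock, in line with the slide's 80%.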
8 CB Hardware Assessment
PRO
• no additional cost for hardware
• relatively minor FPGA design work on PFB
• system interfaces all tested and working
• use is made of all purpose-built boards
CON
• significant amount of modified FPGA code on CB
Software Solution
• Put Rx coarse channel data into 10 gigE packets, by (e.g.)
  • modifying AgFo design
  • OTS programmable modules (a la 2PIP)
• F stage in host servers or GPU’s
• Do X stage in multiple GPU’s
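To make the F stage concrete, here is a minimal pure-Python sketch of a polyphase filter bank that splits one 1.28 MHz coarse channel into 32 fine channels of 40 kHz using a 12-tap FIR, as listed on the requirements slide. The Hamming-windowed sinc prototype filter, the critically sampled complex input, and all function names are assumptions chosen for illustration; the slides do not specify the filter design, and a production version would use an FFT on a GPU or host rather than the direct DFT shown here.

```python
import cmath
import math

N_CHAN = 32   # 1.28 MHz coarse channel / 40 kHz fine channels
N_TAPS = 12   # "12 tap FIR" from the requirements slide


def prototype_filter(n_chan, n_taps):
    """Hamming-windowed sinc prototype (an assumed, illustrative choice)."""
    n = n_chan * n_taps
    coeffs = []
    for i in range(n):
        t = (i - n / 2) / n_chan
        sinc = 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)
        hamming = 0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1))
        coeffs.append(sinc * hamming)
    return coeffs


def pfb_frame(samples, coeffs, n_chan, n_taps):
    """Channelize one frame of n_chan * n_taps complex samples."""
    # FIR stage: weight the samples, then fold the taps onto n_chan points.
    folded = [
        sum(coeffs[m * n_chan + p] * samples[m * n_chan + p] for m in range(n_taps))
        for p in range(n_chan)
    ]
    # Transform stage, written as a direct DFT for clarity.
    return [
        sum(folded[p] * cmath.exp(-2j * math.pi * k * p / n_chan) for p in range(n_chan))
        for k in range(n_chan)
    ]


# Usage: a complex tone centred on fine channel 5 should peak in that channel.
coeffs = prototype_filter(N_CHAN, N_TAPS)
tone_bin = 5
tone = [cmath.exp(2j * math.pi * tone_bin * i / N_CHAN) for i in range(N_CHAN * N_TAPS)]
spectrum = pfb_frame(tone, coeffs, N_CHAN, N_TAPS)
peak_bin = max(range(N_CHAN), key=lambda k: abs(spectrum[k]))
print(peak_bin)
```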
GPU Correlation
• Wayth et al. (2009) correlated 1 coarse channel for 32T in realtime, using a single Nvidia C1060 GPU
• How can we gain a factor of 24 x 16 = 384 in performance?
  • 4x duty cycle: Wayth’s code did 1 s of processing in 0.19 s
  • 2x memory BW reduction: by using a channel width of 40 kHz a larger block can be fit into shared memory
  • 2x: by using a smaller word size (4 Re + 4 Im bits)
  • Tesla C2050 has triple the shared memory of C1060
  • integer arithmetic uses less shared memory space
  • multiple GPU units in parallel
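A quick tally shows how much of the factor of 384 the explicitly quantified items cover. The sketch below assumes the listed gains are independent and multiply together, which the slide does not claim; the remainder would have to come from the unquantified items (C2050 improvements and multiple GPU's).

```python
# Tally of the quantified speed-up factors (illustrative sketch only).
TARGET_FACTOR = 24 * 16   # 24 coarse channels x 16 times the products of 32T

quantified_gains = {
    "duty cycle": 4,
    "memory BW reduction": 2,
    "smaller word size": 2,
}

# Assumption: the gains are independent, so they multiply.
combined_gain = 1
for gain in quantified_gains.values():
    combined_gain *= gain

remaining_factor = TARGET_FACTOR / combined_gain

# The measured headroom behind the "4x duty cycle" entry: 1 s of data in 0.19 s.
measured_duty_headroom = 1.0 / 0.19

print(f"target factor    : {TARGET_FACTOR}")
print(f"quantified gains : {combined_gain}x")
print(f"still needed     : {remaining_factor:.0f}x")
print(f"duty headroom    : {measured_duty_headroom:.2f}x")
```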
GPU Bottlenecks
• NIC input rate: max of 7 or 8 Gb/s to Host
• Host to Device BW (set by PCIe bus): PCI gen 2 x16 spec max of 8 GB/s
• Global memory to processor BW: spec max for C2050 is 144 GB/s
• Multiply & accumulate rate: spec max for C2050 is 1.01 Tflops (single prec or 32 bit int)
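These limits translate into rough lower bounds on the amount of hardware required. The sketch below is a hedged estimate, not a sizing from the slides: it assumes the full 82.6 Gb/s aggregate rate must enter hosts through NICs, that one CMAC costs 8 real operations (4 multiplies and 4 adds), and that a GPU could sustain its spec maximum, which real code will not reach.

```python
import math

# Lower-bound sizing from the bottleneck figures (illustrative sketch only).
AGGREGATE_GBPS = 82.6              # from the requirements slide
NIC_GBPS_RANGE = (7.0, 8.0)        # "7 or 8 Gb/s to Host"
X_STAGE_CMAC_PER_S = 1.01e12       # from the requirements slide
GPU_PEAK_OPS_PER_S = 1.01e12       # C2050 spec max quoted on this slide

# Assumption: one complex multiply-accumulate = 4 real multiplies + 4 real adds.
OPS_PER_CMAC = 8

# Number of NIC links needed at each assumed sustained rate.
nic_links_needed = [math.ceil(AGGREGATE_GBPS / rate) for rate in NIC_GBPS_RANGE]

# Number of GPUs needed if each ran at its spec maximum (an optimistic bound).
min_gpus_at_peak = math.ceil(X_STAGE_CMAC_PER_S * OPS_PER_CMAC / GPU_PEAK_OPS_PER_S)

print(f"NIC links needed : {nic_links_needed[1]} to {nic_links_needed[0]}")
print(f"GPUs at peak     : at least {min_gpus_at_peak}")
```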
Software Assessment
PRO
• greatest flexibility, as all code is in software
• switched topology allows good match between # of servers and load
• easily expandable
CON
• format conversion to 10 gigE will require some mixture of hardware acquisition and FPGA coding
• acquisition cost of GPU-equipped servers
Hybrid System
• modified PFB output stage in INF chip forms 10 gigE packets
• 4 lanes through CX-4 connector to unidirectional optical transceiver
• GPU-equipped servers only do 4+4 bit cross mult & sum
• 8 PFB’s used
  • 6 inputs each
  • 1 stream of 8 Gb/s per PFB output
• more real-estate
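The "4+4 bit cross mult & sum" left to the servers can be sketched in integer arithmetic as follows. The byte layout (real part in the high nibble, imaginary part in the low nibble, both two's complement) and the function names are hypothetical; the slides do not define the packet format.

```python
def unpack_4plus4(byte):
    """Split one byte into signed 4-bit (re, im) parts.

    Assumption: Re in the high nibble, Im in the low nibble, two's complement.
    """
    def signed4(nibble):
        return nibble - 16 if nibble >= 8 else nibble

    return signed4((byte >> 4) & 0xF), signed4(byte & 0xF)


def pack_4plus4(re, im):
    """Inverse of unpack_4plus4, for values in the range -8..7."""
    return ((re & 0xF) << 4) | (im & 0xF)


def cross_mult_and_sum(stream_a, stream_b):
    """Accumulate a * conj(b) over two equal-length byte streams."""
    acc_re = 0
    acc_im = 0
    for byte_a, byte_b in zip(stream_a, stream_b):
        a_re, a_im = unpack_4plus4(byte_a)
        b_re, b_im = unpack_4plus4(byte_b)
        # (a_re + i a_im) * (b_re - i b_im)
        acc_re += a_re * b_re + a_im * b_im
        acc_im += a_im * b_re - a_re * b_im
    return acc_re, acc_im


# Usage: two samples per input, packed into bytes.
stream_a = bytes([pack_4plus4(3, -2), pack_4plus4(1, 4)])
stream_b = bytes([pack_4plus4(1, 1), pack_4plus4(-2, 5)])
visibility = cross_mult_and_sum(stream_a, stream_b)
print(visibility)
```

Keeping the accumulation in integers mirrors the slide's point that small word sizes reduce memory traffic; a GPU kernel would apply the same arithmetic across many input pairs in parallel.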
Hybrid Assessment
PRO
• little additional cost to convert data to 10 gigE
• minimal FPGA design work
• relieves GPU of filtering burden
• switched topology allows good match between # of servers and load
• easily expandable
CON
• some risk in unidirectional 10 gigE transceiver mods
• acquisition cost of GPU-equipped servers
Level of Effort - none/modest/significant