Correlator Options for 128T

MWA Cambridge Meeting

Roger Cappallo

MIT Haystack Observatory

2011.6.6

Current Status

Correlator Hardware Inventory

10 each of v.2 Correlator Boards, PFB Boards, CB/RTM’s, PFB/RTM’s

2 full-size card cages + 1 small, with power supplies

e2e simulation software file input packets module file output packets

PFB FPGA Firmware for 32T

• very limited de-skew capability

• no inter-board transfer (via mesh backplane)

• corner-turns specific to 32T case

• PFB to 10 kHz channels needs no changes

Current Status (cont’d)

CB FPGA Firmware

32T: operational code uses every other 50 ms interval, though 100% duty cycle code is available

512T: error-free CMAC (only) code for 115 cells working at 180 MHz

128T Correlator Requirements

• 30.72 MHz BW in 24 coarse channels of 1.28 MHz

• 256 inputs

• 16 Rx’s with 48 fibres

• 82.6 Gb/s aggregate bit rate

• ~32K correlation products

• F stage: ~150 GCMAC/s (12 tap FIR, 40 kHz channels)

• X stage: 1.01 TCMAC/s

KEY (compared to 32T): same, x4, x16
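The headline figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes that the ~32K products count every input pair plus the autocorrelations, and that the X stage performs one CMAC per product per complex sample across the full band. The names are illustrative and not taken from the slides.

```python
# Cross-check of the 128T requirement figures (assumptions noted inline).
N_INPUTS = 256          # signal inputs, from the slide
N_COARSE = 24           # coarse channels, from the slide
COARSE_BW_HZ = 1.28e6   # bandwidth of one coarse channel, from the slide


def total_bandwidth_hz():
    """Total processed bandwidth across all coarse channels."""
    return N_COARSE * COARSE_BW_HZ


def correlation_products(n_inputs=N_INPUTS):
    """Assumption: all input pairs plus autocorrelations, n(n+1)/2."""
    return n_inputs * (n_inputs + 1) // 2


def x_stage_cmac_per_s():
    """Assumption: one CMAC per product per complex sample of the band."""
    return correlation_products() * total_bandwidth_hz()


print(f"bandwidth: {total_bandwidth_hz() / 1e6:.2f} MHz")
print(f"products:  {correlation_products()}")
print(f"X stage:   {x_stage_cmac_per_s() / 1e12:.2f} TCMAC/s")
```

Under these assumptions the product count and the X stage rate agree with the ~32K and 1.01 TCMAC/s quoted above.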

Top Level Choices

hardware: use current hardware, developing FPGA firmware as necessary

software: get RX signals into standardized format (10 gigE) ASAP, do PFB and correlation in GPU-equipped server

hybrid: use existing PFB’s for F stage and to form 10 gigE packets to be correlated in software

Hardware Solution

Using existing 32T firmware it should take 4 PFB boards and 16 CB’s, but the architecture doesn’t scale in a fully-parallel sense due to cross-correlations, and it would really take 6 PFB’s and 18 CB’s, with firmware mods

unchanged 32T firmware leads to a system with 20 PFB’s and 20 CB’s!

using tested CMAC design (115 cells @ 180 MHz) yields enough computation in ~6.5 CB’s, optimal partition appears to be 8 PFB’s and 8 CB’s

18 CB System

• split system into thirds, each getting 8 coarse chans

• each PFB gets 8 input fibres (need to do deskew)

• routing logic on CB’s changes, CMAC’s same
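As a quick consistency check on this partition, the sketch below divides the channels, fibres and boards evenly. The 6 PFB’s are taken from the Hardware Solution slide; the even split of the 18 CB’s across the thirds is an assumption, not something the slide states.

```python
# Consistency check for the 18 CB partition (even split is an assumption).
N_COARSE = 24   # coarse channels, from the requirements slide
N_FIBRES = 48   # input fibres, from the requirements slide
N_PFB = 6       # PFB boards, from the Hardware Solution slide
N_CB = 18       # correlator boards
N_THIRDS = 3    # the system is split into thirds

chans_per_third = N_COARSE // N_THIRDS   # coarse channels handled per third
fibres_per_pfb = N_FIBRES // N_PFB       # input fibres landing on each PFB
cbs_per_third = N_CB // N_THIRDS         # assumption: CB's shared out evenly

print(chans_per_third, fibres_per_pfb, cbs_per_third)
```

The first two values reproduce the 8 coarse channels per third and 8 fibres per PFB listed above.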

18 CB Hardware Assessment

PRO

relatively minor FPGA design work on PFB

modest amount of change to FPGA code on CB’s

system interfaces all tested and working

use is made of all purpose-built boards

CON

another build of ~10 CB’s (and CB/RTM’s) necessary (~120 K$)

8 CB System

• Each PFB gets 6 input fibres total, from 2 Rx’s

• Each PFB outputs to 8 different CB’s

• CB uses CMAC design from 512T at only 80% of achieved speed

• CB needs some cleverness in allocating cells to CMAC chips

• LTA could be skipped due to low output rate (10 Hz dump rate)
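The numbers in this option can be tied back to the earlier slides with simple ratios. The sketch below assumes fibres and receivers are shared evenly among the 8 PFB’s. Reading the 80% figure as the ~6.5 CB’s of required computation spread over 8 boards is an inference, not something the slides state.

```python
# Ratios behind the 8 CB option (even sharing is an assumption).
N_FIBRES = 48            # input fibres, from the requirements slide
N_RX = 16                # receivers, from the requirements slide
N_PFB = 8                # PFB boards in this option
N_CB = 8                 # correlator boards in this option
CB_NEEDED = 6.5          # CB's worth of CMAC capacity needed at full speed
CMAC_CLOCK_MHZ = 180.0   # achieved clock of the tested 512T CMAC design

fibres_per_rx = N_FIBRES // N_RX    # fibres leaving each receiver
fibres_per_pfb = N_FIBRES // N_PFB  # fibres arriving at each PFB
rx_per_pfb = N_RX // N_PFB          # receivers feeding each PFB

# Inference: fraction of the achieved speed needed when 8 CB's share the load.
speed_fraction = CB_NEEDED / N_CB
implied_clock_mhz = speed_fraction * CMAC_CLOCK_MHZ

print(fibres_per_rx, fibres_per_pfb, rx_per_pfb)
print(f"speed fraction: {speed_fraction:.2f}, clock: {implied_clock_mhz:.1f} MHz")
```

On this reading the required fraction is about 0.81, in line with the "80% of achieved speed" bullet.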

8 CB Hardware Assessment

PRO

no additional cost for hardware

relatively minor FPGA design work on PFB

system interfaces all tested and working

use is made of all purpose-built boards

CON

significant amount of modified FPGA code on CB

Software Solution

Put Rx coarse channel data into 10 gigE packets, by (e.g.)

• modifying AgFo design

• OTS programmable modules (a la 2PIP)

F stage in host servers or GPU’s

Do X stage in multiple GPU’s

GPU Correlation

Wayth et al. (2009) correlated 1 coarse channel for 32T in real time, using a single Nvidia C1060 GPU

How can we gain a factor of 24 x 16 = 384 in performance?

• 4x duty cycle: Wayth’s code did 1 s of processing in 0.19 s

• 2x memory BW reduction: by using a channel width of 40 kHz a larger block can be fit into shared memory

• 2x: by using a smaller word size (4 Re + 4 Im bits)

• Tesla C2050 has triple the shared memory of C1060

• integer arithmetic uses less shared memory space

• multiple GPU units in parallel
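To see how far the quantified gains go, the sketch below multiplies the three factors that the slide puts numbers on and reports what is left. It assumes the gains combine multiplicatively, which the slide does not state; the remaining factor would have to come from the unquantified items (the C2050, integer arithmetic and multiple GPUs).

```python
# Speed-up budget relative to the single-C1060, one-channel 32T result.
N_COARSE = 24         # coarse channels to process instead of one
PRODUCT_SCALING = 16  # 128T has 4x the inputs of 32T, so ~16x the products

required = N_COARSE * PRODUCT_SCALING

# Gains the slide quantifies (assumption: they multiply independently).
quantified_gains = {
    "duty cycle": 4,
    "memory BW reduction (40 kHz channels)": 2,
    "smaller word size (4 Re + 4 Im bits)": 2,
}

combined = 1
for factor in quantified_gains.values():
    combined *= factor

remaining = required / combined

print(f"required: {required}x, quantified: {combined}x, remaining: {remaining:.0f}x")
```

Under that assumption a factor of 24 remains to be covered by newer and additional GPUs.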

GPU Bottlenecks

• NIC to Host: input rate max of 7 or 8 Gb/s

• Host to Device BW (set by PCIe bus): PCI gen 2 x16 spec max of 8 GB/s

• Global memory to processor BW: spec max for C2050 is 144 GB/s

• Multiply & accumulate rate: spec max for C2050 is 1.01 Tflops (single prec or 32 bit int)
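These limits can be set against the requirements stated earlier to get rough lower bounds on the number of links and GPUs. The sketch below assumes the full 82.6 Gb/s aggregate rate has to pass through the NICs, and that one CMAC costs 8 real operations (4 multiplies and 4 adds). Both are simplifying assumptions, and real throughput is typically below the spec maxima.

```python
import math

# Requirements and spec maxima quoted in the slides.
AGGREGATE_GBPS = 82.6     # aggregate input bit rate
X_STAGE_CMAC = 1.01e12    # X stage load in CMAC/s
PCIE_GBYTES = 8.0         # PCIe gen 2 x16 spec max, GB/s
GPU_PEAK_FLOPS = 1.01e12  # C2050 spec max

FLOPS_PER_CMAC = 8        # assumption: 4 real multiplies + 4 real adds


def min_nic_links(nic_gbps):
    """Fewest NIC links that can carry the aggregate input rate."""
    return math.ceil(AGGREGATE_GBPS / nic_gbps)


def gpus_at_peak():
    """GPUs needed for the X stage if each ran at its spec maximum."""
    return X_STAGE_CMAC * FLOPS_PER_CMAC / GPU_PEAK_FLOPS


pcie_gbps = PCIE_GBYTES * 8  # PCIe limit expressed in Gb/s

print(min_nic_links(8.0), min_nic_links(7.0))
print(f"GPUs at peak: {gpus_at_peak():.1f}, PCIe: {pcie_gbps:.0f} Gb/s")
```

With these assumptions the NIC rate, not the PCIe bus, is the tighter input limit per server.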

Software Assessment

PRO

greatest flexibility, as all code is in software

switched topology allows good match between # of servers and load

easily expandable

CON

format conversion to 10 gigE will require some mixture of hardware acquisition and FPGA coding

acquisition cost of GPU-equipped servers

Hybrid System

• modified PFB output stage in INF chip forms 10 gigE packets

• 4 lanes through CX-4 connector to unidirectional optical transceiver

• GPU-equipped servers only do 4+4 bit cross mult & sum

• 8 PFB’s used

• 6 inputs each

• 1 stream of 8 Gb/s per PFB output

• more real-estate
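The 8 Gb/s per-PFB stream can be sanity-checked from the sample format. The sketch below assumes the 256 signal inputs are shared evenly over the 8 PFB’s, that the PFB output is critically sampled (one complex sample per input per Hz of bandwidth), and that packet overhead is ignored. None of these assumptions is stated on the slide.

```python
# Payload rate out of each PFB in the hybrid option (assumptions inline).
N_INPUTS = 256          # signal inputs, from the requirements slide
N_PFB = 8               # PFB boards used in the hybrid system
BANDWIDTH_HZ = 30.72e6  # total processed bandwidth
BITS_PER_SAMPLE = 8     # 4 Re + 4 Im bits

inputs_per_pfb = N_INPUTS // N_PFB  # assumption: inputs shared evenly

# Assumption: critically sampled, so complex samples/s per input = bandwidth.
per_pfb_gbps = inputs_per_pfb * BANDWIDTH_HZ * BITS_PER_SAMPLE / 1e9
total_gbps = per_pfb_gbps * N_PFB

print(f"per PFB: {per_pfb_gbps:.2f} Gb/s, total: {total_gbps:.2f} Gb/s")
```

The payload comes out just under 8 Gb/s per PFB, which leaves little headroom for packet overhead on a single stream.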

Hybrid Assessment

PRO

little additional cost to convert data to 10 gigE

minimal FPGA design work

relieves GPU of filtering burden

switched topology allows good match between # of servers and load

easily expandable

CON

some risk in unidirectional 10 gigE transceiver mods

acquisition cost of GPU-equipped servers

Level of Effort - none/modest/significant