Slides: tauceti.caltech.edu/casper-workshop-2017/slides/7_roshanineshat.pdf · Author: arash · Created Date: 8/14/2017

Increasing the speed of wideband VLBI correlation using GPUs

Arash Roshanineshat

Summer 2017

Overview

• Background
• Correlation
• DiFX
• Hardware
• DiFX with GPU
• Benchmarks
• Conclusion
• Future Work
• Acknowledgements


Background


The Event Horizon Telescope (EHT) is an international collaboration aiming to capture the first image of a black hole by creating a virtual Earth-sized telescope.

[Figure: EHT Telescope Array]

Correlation


The result appears to have come from a single antenna whose surface is made of the actual individual antennas.

Source

The goal is to make high-resolution maps of radio sources.

Correlation


Correlation Method:

• XF
  1. Cross Multiplication
  2. Frequency Transformation

• FX
  1. Frequency Transformation
  2. Cross Multiplication

[Figure: FX correlator block diagram; two input voltages V(t), each passed through a Filter Bank, followed by cross-multipliers (X) producing R(x)]
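The two orderings can be compared numerically. The following is a minimal sketch, not DiFX code: it uses a naive DFT and a circular cross-correlation on two short, made-up sample sequences (`v1`, `v2`, and all function names are illustrative assumptions) and checks that XF and FX give the same cross-power spectrum.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (O(n^2)); fine for a tiny demo."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def xf_correlate(v1, v2):
    """XF: cross-multiply over all lags first, then transform to frequency."""
    n = len(v1)
    lags = [
        sum(v1[(t + lag) % n] * complex(v2[t]).conjugate() for t in range(n))
        for lag in range(n)
    ]
    return dft(lags)

def fx_correlate(v1, v2):
    """FX: transform each stream first, then cross-multiply per channel."""
    f1, f2 = dft(v1), dft(v2)
    return [a * b.conjugate() for a, b in zip(f1, f2)]

# Hypothetical voltage samples from two antennas (illustrative values only).
v1 = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 0.25, 1.0]
v2 = [1.0, 0.5, -1.5, 2.0, 0.0, 1.0, -0.75, 0.5]

xf = xf_correlate(v1, v2)
fx = fx_correlate(v1, v2)
max_diff = max(abs(a - b) for a, b in zip(xf, fx))
print(f"largest XF/FX difference: {max_diff:.2e}")
```

In this sketch the FX path needs only one multiplication per frequency channel after the transform, whereas the XF path multiplies once per sample for every lag.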

Correlation


Correlation Platforms:

• Application-Specific Integrated Circuits (ASICs)
• Field Programmable Gate Arrays (FPGAs)

In 2007, Adam Deller introduced the Distributed FX (DiFX) software package.

DiFX


• The core is programmed in C and C++
• Suitable for generic multi-processor systems
• Supports modern hard-drive recording systems naturally
• Easy to configure

[Figure: DiFX architecture diagram; labels: Source data; DataStream 1, DataStream 2, DataStream N; Core 1, Core 2, DataStream M; FX Manager]

DiFX


EHT data is recorded at 4096 mega-samples per second

The DiFX correlator does not process it in real time.

Can we speed up the process?
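To get a feel for the data volume behind that question, the sample rate can be converted to a raw byte rate. This is a rough sketch only: the slides give the sample rate but not the sample width, so `BITS_PER_SAMPLE` below is an assumption, and the function name is illustrative.

```python
def recording_rate_bytes_per_second(samples_per_second, bits_per_sample):
    """Raw data rate of one recorded stream, in bytes per second."""
    return samples_per_second * bits_per_sample // 8

SAMPLE_RATE = 4096 * 10**6   # 4096 mega-samples per second (from the slide)
BITS_PER_SAMPLE = 2          # ASSUMPTION: not stated in the slides

rate = recording_rate_bytes_per_second(SAMPLE_RATE, BITS_PER_SAMPLE)
print(f"{rate / 1e9:.3f} GB/s per stream")
```

Changing `BITS_PER_SAMPLE` scales the result linearly, so the same helper covers other recording modes.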

DiFX


2010, Andrew Woods, “Accelerating Software Radio Astronomy FX Correlation with GPU and FPGA Co-processors”

• He simplified the DiFX code and used only the core.
• He focused on the X-Engine of DiFX.
• He concluded that co-processors such as GPUs and FPGAs speed up the process.

To study the effect of GPU co-processors on the full DiFX package, we set up a cluster from scratch.

Hardware

[Figure: the four CPU machines, labeled 1 to 4]

• 4 machines without GPU
• 20 CPUs each
• 3.4 GHz max frequency

Hardware

• 1 GPU machine
• 16 CPUs
• 3.0 GHz max frequency
• 4 x 1080 Ti GPUs

[Figure: GPU machine with labels 1 to 4]

Hardware

Star-shaped network

[Figure: network diagram; labels: Loading Data, 40 GbE, LAN]

Dependencies

Dependencies of DiFX:

• MPI Libraries (OpenMPI)
  Provides the message-passing functions used to distribute data across the cluster

• Haystack Observatory Postprocessing System (HOPS)
  Provides the “fourfit” program used to plot the output data

• Intel® Integrated Performance Primitives (Intel® IPP)
  A highly optimized vector library for Intel CPUs

Additional Dependency:

• CUDA Driver
  Enables the use of NVIDIA GPUs

Software

DiFX Operation Block Diagram:

[Figure: config files (.v2d, .vex) → $ vex2difx → .input, .calc; .calc → $ calcif2; .input + VDIF files → $ mpifxcorr (core) → output files → $ difx2mark4 → Mark4 data files → $ fourfit → plots]

Software

[Figure: output of DiFX on CPU]

DiFX with GPU

Vector operations and FFT libraries are defined in the file:

architecture.h.in

The following architectures are defined:

• INTEL
• GENERIC


DiFX with GPU

- INTEL architecture uses:
  • Intel Integrated Performance Primitives (IPP) for the F-Engine
  • Intel Integrated Performance Primitives (IPP) for the X-Engine

- GENERIC architecture uses:
  • FFTW for the F-Engine
  • Standard C++ for the X-Engine

FFTW → cuFFTW

CUDA makes porting the code much easier than before, requiring little modification of the code.


DiFX with GPU

X Engine:

vectorMul(src1, src2, dst, length) {
    // each output element depends only on the inputs at the same index
    for i = 0 to length - 1:
        dst[i] = src1[i] * src2[i]
}

Embarrassingly parallel problem
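Why the element-wise multiply is embarrassingly parallel can be shown with a small sketch. It assumes the X-Engine multiply takes two input vectors; the function names and the chunking scheme are illustrative and are not taken from DiFX or CUDA. Each chunk is computed without reference to any other chunk, and the result matches the serial loop.

```python
from concurrent.futures import ThreadPoolExecutor

def vector_mul(src1, src2):
    """Serial reference: element-wise product of two equal-length vectors."""
    return [a * b for a, b in zip(src1, src2)]

def vector_mul_chunked(src1, src2, workers=4):
    """Split the vectors into chunks and multiply each chunk independently."""
    n = len(src1)
    step = max(1, -(-n // workers))  # ceiling division
    bounds = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda b: vector_mul(src1[b[0]:b[1]], src2[b[0]:b[1]]), bounds
        )
        # No chunk needs data from another chunk, so order of execution is free.
        return [value for part in parts for value in part]

# Illustrative complex samples (made-up values).
a = [complex(i, -i) for i in range(10)]
b = [complex(1, i) for i in range(10)]
serial = vector_mul(a, b)
chunked = vector_mul_chunked(a, b)
print(serial == chunked)
```

Python threads are used here only to illustrate the independence of the chunks; they are not expected to run faster than the serial loop. On a GPU the same independence lets each element be assigned to its own thread.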

F Engine:

All FFT functions use the cuFFTW library


DiFX with GPU


• The difference between the CPU and GPU outputs is very small; floating-point operations are not guaranteed to give identical results.
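One common reason for such differences is that floating-point addition is not associative, so a change in the order of operations can change the last bits of a result. The slides do not name the cause, so this is an assumption offered as illustration. The sketch below also shows a tolerance-based comparison, which is the usual way to compare two such outputs.

```python
import math

# Same three numbers, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

bitwise_equal = left_to_right == right_to_left
close_enough = math.isclose(left_to_right, right_to_left, rel_tol=1e-12)
print(bitwise_equal, close_enough)
```

The two sums are not bit-for-bit equal, yet they agree to within a tight relative tolerance.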

Benchmarks

Benchmark of DiFX: time (seconds)

                     1     2     3     4
Only CPU machines  396   198   130    90

[Figure: bar chart "Benchmark of DiFX"; y-axis: Time (second); series: CPU Machine]

Benchmarks


Replace one of the CPU machines with the GPU machine.

The GPU is off!

Benchmarks

Benchmark of DiFX: time (seconds)

                     1     2     3     4
Only CPU machines  396   198   130    90
GPU Disabled       290   138    94    71

[Figure: bar chart "Benchmark of DiFX"; y-axis: Time (second); series: CPU Machine; GPU Machine, GPU Off]

Benchmarks


Turn on the GPU!

Benchmarks

Benchmark of DiFX: time (seconds)

                     1     2     3     4
Only CPU machines  396   198   130    90
GPU Disabled       290   138    94    71
GPU Enabled        258   119    80    63

[Figure: bar chart "Benchmark of DiFX"; y-axis: Time (second); series: CPU Machine; GPU Machine, GPU Off; GPU Machine, GPU On]
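The benchmark times can be turned into relative reductions in run time. The sketch below uses only the numbers from the table; the helper name `percent_reduction` is illustrative. It reports the GPU-enabled times against both possible baselines, since the percentage depends on which baseline is chosen.

```python
# Run times in seconds, copied from the benchmark table (columns 1 to 4).
times = {
    "Only CPU machines": [396, 198, 130, 90],
    "GPU Disabled": [290, 138, 94, 71],
    "GPU Enabled": [258, 119, 80, 63],
}

def percent_reduction(baseline, candidate):
    """Percentage by which `candidate` run times undercut `baseline`."""
    return [100.0 * (b - c) / b for b, c in zip(baseline, candidate)]

vs_gpu_off = percent_reduction(times["GPU Disabled"], times["GPU Enabled"])
vs_cpu_only = percent_reduction(times["Only CPU machines"], times["GPU Enabled"])

for label, values in [
    ("vs GPU Disabled", vs_gpu_off),
    ("vs Only CPU machines", vs_cpu_only),
]:
    print(label, [f"{v:.1f}%" for v in values])
```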

Conclusion


• The GPU makes the process faster, by about 20% to 25%
• From a financial perspective, a GPU machine is cheaper yet performs better

Future Work


• Optimize the GPU process
• Study alternative libraries that support more co-processors, such as TensorFlow and Thrust

Acknowledgments


Jonathan Weintroub

Shep Doeleman

Andre Young

Lindy Blackburn

Rurik Primiani

Mark Peryer

Geoff Crew

DiFX Community

Questions?
