
GPUs for GEC Competition @ GECCO-2013

Date post: 13-May-2015
Upload: daniele-loiacono
Description:
Results of the GPUs for GEC Competition held at GECCO 2013. Organizers: Daniele Loiacono (Politecnico di Milano) and Antonino Tumeo (Pacific Northwest National Laboratory). Webpage: http://gpu.geccocompetitions.com
Transcript
Page 1: GPUs for GEC Competition @ GECCO-2013

GECCO 2013 GPUs for GEC

GECCO 2013 GPUs for Genetic and Evolutionary Computation Competition Daniele Loiacono and Antonino Tumeo

Page 2: GPUs for GEC Competition @ GECCO-2013

Why GPUs?

• The GPU has evolved into a very flexible and powerful processor:
  – It's programmable using high-level languages
  – Now supports 32-bit and 64-bit floating point IEEE-754 precision
  – It offers lots of GFLOPS
• GPU in every PC and workstation

Page 3: GPUs for GEC Competition @ GECCO-2013

This competition…

• Goal
  – Attract applications of genetic and evolutionary computation that can maximally exploit the parallelism provided by low-cost consumer graphics cards.
• Evaluation
  – 50%: Quality and Performance
  – 30%: Relevance for EC community
  – 20%: Novelty
• Panel
  – Simon Harding, El-Ghazali Talbi, Antonino Tumeo, Jaume Bacardit … and myself

Page 4: GPUs for GEC Competition @ GECCO-2013

Entries

• "Fast QAP Solver with ACO and Taboo Search on Multiple GPUs with the Move-Cost Adjusted Thread Assignment". Shigeyoshi Tsutsui and Noriyuki Fujimoto
• "GPOCL: A Massively Parallel GP Implementation in OpenCL". Douglas A. Augusto and Helio J.C. Barbosa

Page 5: GPUs for GEC Competition @ GECCO-2013

GECCO Competitions: GPUs for Genetic and Evolutionary Computation, Evolutionary

Fast QAP Solver with ACO and Taboo Search on Multiple GPUs with the Move-Cost Adjusted Thread Assignment

Page 6: GPUs for GEC Competition @ GECCO-2013

Quadratic Assignment Problem (QAP)

• One of the hardest combinatorial optimization problems
  – There are many real-world applications:
    • Optimum location allocation of factories in a multinational company
    • Optimum section allocation in a big building
    • …
• Definition:
  – Given n locations and n facilities, the task is to assign the facilities to the locations to minimize the cost

$$f(\phi) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}\, b_{\phi(i)\phi(j)}$$

• a_ij is the distance matrix for each pair of locations i and j
• b_ij is the flow matrix for each pair of facilities i and j
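The QAP cost above can be evaluated directly. The following is a minimal sketch, assuming 0-based indices (the slide counts from 1), nested Python lists for the matrices, and a permutation `phi` in which `phi[i]` is the facility placed at location `i`. The function name `qap_cost` and the tiny 3×3 instance are made up for illustration.

```python
def qap_cost(a, b, phi):
    """Return f(phi) = sum over i, j of a[i][j] * b[phi[i]][phi[j]]."""
    n = len(phi)
    return sum(a[i][j] * b[phi[i]][phi[j]] for i in range(n) for j in range(n))


# Hypothetical 3x3 instance: a holds distances, b holds flows.
a = [[0, 2, 3],
     [2, 0, 1],
     [3, 1, 0]]
b = [[0, 5, 2],
     [5, 0, 3],
     [2, 3, 0]]

identity_cost = qap_cost(a, b, [0, 1, 2])
reversed_cost = qap_cost(a, b, [2, 1, 0])
```

Each evaluation touches all n² pairs, which is why the later slides work with move costs rather than recomputing f from scratch.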

Page 7: GPUs for GEC Competition @ GECCO-2013

ACO+TS on a Single GPU

• We combined ACO and Taboo Search (TS)

[Flowchart: start → Initialize pheromone density τ → Construct solutions based on τ → Apply local search (Tabu search) → Update pheromone density τ → Terminate? → end. Also labeled: Pheromone Density Matrix τ.]

Time distribution in sequential run on CPU

Instances   Construction of solutions   TS        Updating Trail
tai40a      0.007%                      99.992%   0.001%
tai50a      0.005%                      99.994%   0.000%
tai60a      0.004%                      99.996%   0.000%
tai80a      0.002%                      99.997%   0.000%
tai100a     0.002%                      99.998%   0.000%
tai50b      0.022%                      99.976%   0.002%
tai60b      0.017%                      99.982%   0.001%
tai80b      0.011%                      99.988%   0.001%
tai100b     0.008%                      99.991%   0.000%
tai150b     0.005%                      99.995%   0.000%

Page 8: GPUs for GEC Competition @ GECCO-2013

Neighborhood in the QAP

• A neighbor φ' of φ in QAP is obtained by a swap of two elements, e.g. φ = (2 1 0 3) → φ' = (0 1 2 3)
• Neighborhood size of N(φ) is Nsize = n(n-1)/2
• To choose the best φ', we need to calculate costs for all Nsize neighbors
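The swap neighborhood can be enumerated in a few lines. This is a small sketch, assuming permutations are Python lists and positions are 0-based; the generator name `swap_neighbors` is made up for illustration.

```python
from itertools import combinations


def swap_neighbors(phi):
    """Yield ((r, s), neighbor) for every pair of positions r < s."""
    for r, s in combinations(range(len(phi)), 2):
        neighbor = list(phi)
        neighbor[r], neighbor[s] = neighbor[s], neighbor[r]
        yield (r, s), neighbor


# The permutation from the slide: swapping positions 0 and 2 gives (0 1 2 3).
phi = [2, 1, 0, 3]
neighbors = dict(swap_neighbors(phi))
```

For n = 4 this yields n(n-1)/2 = 6 neighbors, matching the neighborhood size given on the slide.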

Page 9: GPUs for GEC Competition @ GECCO-2013

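Computation Cost of a Neighboring Solution

• Let φ' be a neighbor of φ obtained by exchanging the r-th and s-th elements of φ; then the move cost Δ(φ, r, s) = f(φ') - f(φ) can be obtained in O(n) as

$$
\begin{aligned}
\Delta(\phi, r, s) ={}& a_{rr}\,\bigl(b_{\phi(s)\phi(s)} - b_{\phi(r)\phi(r)}\bigr) + a_{rs}\,\bigl(b_{\phi(s)\phi(r)} - b_{\phi(r)\phi(s)}\bigr) \\
&+ a_{sr}\,\bigl(b_{\phi(r)\phi(s)} - b_{\phi(s)\phi(r)}\bigr) + a_{ss}\,\bigl(b_{\phi(r)\phi(r)} - b_{\phi(s)\phi(s)}\bigr) \\
&+ \sum_{k=0,\; k \neq r,s}^{n-1} \Bigl( a_{kr}\,\bigl(b_{\phi(k)\phi(s)} - b_{\phi(k)\phi(r)}\bigr) + a_{ks}\,\bigl(b_{\phi(k)\phi(r)} - b_{\phi(k)\phi(s)}\bigr) \\
&\qquad\qquad\quad + a_{rk}\,\bigl(b_{\phi(s)\phi(k)} - b_{\phi(r)\phi(k)}\bigr) + a_{sk}\,\bigl(b_{\phi(r)\phi(k)} - b_{\phi(s)\phi(k)}\bigr) \Bigr)
\end{aligned}
$$

• Fast update [Taillard 04]:
  – if we have memory of Δ(φ, r, s) for all pairs r, s,
  – and {u, v} ∩ {r, s} = ∅ holds, Δ(φ', u, v) can be obtained in O(1) as:

$$
\begin{aligned}
\Delta(\phi', u, v) ={}& \Delta(\phi, u, v) \\
&+ (a_{ru} - a_{rv} + a_{sv} - a_{su})\,\bigl(b_{\phi'(s)\phi'(u)} - b_{\phi'(s)\phi'(v)} + b_{\phi'(r)\phi'(v)} - b_{\phi'(r)\phi'(u)}\bigr) \\
&+ (a_{ur} - a_{vr} + a_{vs} - a_{us})\,\bigl(b_{\phi'(u)\phi'(s)} - b_{\phi'(v)\phi'(s)} + b_{\phi'(v)\phi'(r)} - b_{\phi'(u)\phi'(r)}\bigr)
\end{aligned}
$$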
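The two move-cost computations can be checked against a brute-force recomputation of f. The sketch below assumes the standard form of Taillard's formulas for possibly asymmetric matrices, 0-based indices, and nested lists. The function names and the small deterministic instance are made up for illustration and are not taken from the authors' code.

```python
def qap_cost(a, b, phi):
    """Full O(n^2) objective: sum of a[i][j] * b[phi[i]][phi[j]]."""
    n = len(phi)
    return sum(a[i][j] * b[phi[i]][phi[j]] for i in range(n) for j in range(n))


def move_cost(a, b, phi, r, s):
    """O(n) change in cost when positions r and s of phi are exchanged."""
    pr, ps = phi[r], phi[s]
    delta = (a[r][r] * (b[ps][ps] - b[pr][pr]) + a[r][s] * (b[ps][pr] - b[pr][ps])
             + a[s][r] * (b[pr][ps] - b[ps][pr]) + a[s][s] * (b[pr][pr] - b[ps][ps]))
    for k in range(len(phi)):
        if k == r or k == s:
            continue
        pk = phi[k]
        delta += (a[k][r] * (b[pk][ps] - b[pk][pr]) + a[k][s] * (b[pk][pr] - b[pk][ps])
                  + a[r][k] * (b[ps][pk] - b[pr][pk]) + a[s][k] * (b[pr][pk] - b[ps][pk]))
    return delta


def fast_move_cost(a, b, phi_new, old_delta_uv, r, s, u, v):
    """O(1) update of the (u, v) move cost after (r, s) was exchanged.

    phi_new is the permutation after the exchange, old_delta_uv is the
    (u, v) move cost before it; {u, v} and {r, s} must be disjoint.
    """
    pr, ps, pu, pv = phi_new[r], phi_new[s], phi_new[u], phi_new[v]
    return (old_delta_uv
            + (a[r][u] - a[r][v] + a[s][v] - a[s][u])
            * (b[ps][pu] - b[ps][pv] + b[pr][pv] - b[pr][pu])
            + (a[u][r] - a[v][r] + a[v][s] - a[u][s])
            * (b[pu][ps] - b[pv][ps] + b[pv][pr] - b[pu][pr]))


# Hypothetical asymmetric 6x6 instance built from simple modular patterns.
n = 6
a = [[(3 * i + 5 * j + 1) % 7 for j in range(n)] for i in range(n)]
b = [[(2 * i + 7 * j + 3) % 11 for j in range(n)] for i in range(n)]
phi = [3, 0, 5, 1, 4, 2]
```

Only the pairs that share an index with the last exchanged pair need the O(n) routine; all other pairs can reuse the stored value through the O(1) update, which is the asymmetry the thread assignment slides deal with.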

Page 10: GPUs for GEC Competition @ GECCO-2013

Parallel computation of move cost: the simplest thread assignment

• Assign m agents to blocks: blockIdx.x = 0, 1, …, m-1
• Assign move calculations to threads: threadIdx.x = 0, 1, 2, …, Nsize-1 in each block, with Nsize = n(n-1)/2

[Figure: triangular table, with axes labeled u and v, numbering the 91 index pairs for n = 14 by thread index: row 1 holds 0; row 2 holds 1, 2; row 3 holds 3, 4, 5; …; row 13 holds 78 to 90.]
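The triangular numbering in the figure can be inverted arithmetically, so each thread can find its own pair from its index. The sketch below is in Python rather than CUDA and assumes that the row index is the larger element of the pair. The slide shows only the table, so the function name `thread_to_pair` and the closed-form inversion are illustrative assumptions.

```python
from math import isqrt


def thread_to_pair(t):
    """Map thread index t to the pair (u, v), u < v, in triangular order.

    Row v of the table starts at index v * (v - 1) // 2 and holds v entries.
    """
    v = (1 + isqrt(1 + 8 * t)) // 2   # row that contains index t
    u = t - v * (v - 1) // 2          # offset of t inside that row
    return u, v


n = 14
pairs = [thread_to_pair(t) for t in range(n * (n - 1) // 2)]
```

With n = 14 this covers thread indices 0 to 90, the same range as the table on the slide.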

Page 11: GPUs for GEC Competition @ GECCO-2013

Move-Cost Adjusted Thread Assignment (MATA)

[Figure: two panels plotting computational time against thread index for warp 0 (threads 0 to 31) and warp 1 (threads 32 to 63), with computations labeled O(1) and O(n). Panel titles: "Delay Caused by Branch Divergence" and "No branch divergence in each warp!"]
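The slide gives only the figure, so the following is one possible reading of the idea and not the authors' implementation: group the move computations by cost class so that no warp mixes O(1) and O(n) work. The sketch assumes a warp size of 32, that the pairs sharing an index with the last exchanged pair (r, s) are the O(n) ones, and that idle padding slots are acceptable. The names `mata_layout` and `warp_classes` are made up.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def mata_layout(n, r, s, warp_size=WARP_SIZE):
    """Return a thread-slot list where each warp holds a single cost class."""
    pairs = [(u, v) for v in range(1, n) for u in range(v)]
    fast = [p for p in pairs if r not in p and s not in p]  # O(1) update applies
    slow = [p for p in pairs if r in p or s in p]           # O(n) recomputation
    padding = [None] * (-len(fast) % warp_size)             # idle slots up to a warp boundary
    return fast + padding + slow


def warp_classes(layout, r, s, warp_size=WARP_SIZE):
    """Return, per warp, the set of cost classes found in that warp."""
    classes = []
    for start in range(0, len(layout), warp_size):
        warp = [p for p in layout[start:start + warp_size] if p is not None]
        classes.append({"O(n)" if (r in p or s in p) else "O(1)" for p in warp})
    return classes


layout = mata_layout(14, 2, 5)
classes = warp_classes(layout, 2, 5)
```

For n = 14 there are 66 pairs of the cheap kind and 25 of the expensive kind, so the layout uses three warps for the former (with 30 idle slots) and one for the latter.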

Page 12: GPUs for GEC Competition @ GECCO-2013

Speedup on a Single GTX480

CPU: i7 965, 3.2GHz

QAP Instances   Series 1   Series 2
tai50a          3.7        26.1
tai60a          3.4        27.7
tai80a          4.3        20.3
tai100a         3.4        18.3
tai50b          3.9        24.9
tai60b          4.6        35.5
tai80b          4.2        21.4
tai100b         5.4        29.5
Average         4.1        25.5

(The chart legend is illegible in the source; the Series 2 average of 25.5 is the MATA speedup quoted in the Conclusion.)

Page 13: GPUs for GEC Competition @ GECCO-2013

Implementation on Multiple GPUs

[Figure: a CPU with work memory and four GPUs (GPU0 to GPU3) running ACO0 to ACO3; the legend marks solutions.]

Page 14: GPUs for GEC Competition @ GECCO-2013

4 Types of Island Models

• We implemented the following 4 types of island models:
  1. IM-INDP: Island model with independent runs
  2. IM-ELIT: Island model with elitist
  3. IM-RING: Island model with ring connected
  4. IM-ELMR: Island model with elitist and massive ring connected

Page 15: GPUs for GEC Competition @ GECCO-2013

IM-INDP: Island model with independent runs

[Figure: CPU with ACO0, ACO1, ACO2, and ACO3.]

Page 16: GPUs for GEC Competition @ GECCO-2013

IM-ELIT: Island model with elitist

[Figure: ACO0 to ACO3, with solutions labeled "best guy", "worst guy", and "global best guy".]

Page 17: GPUs for GEC Competition @ GECCO-2013

IM-RING: Island model with ring connected

[Figure: ACO0 to ACO3 connected in a ring, with solutions labeled "best guy" and "worst guy".]

Page 18: GPUs for GEC Competition @ GECCO-2013

IM-ELMR: Island model with elitist and massive ring connected

[Figure: CPU with ACO0 to ACO3, annotated "IM-ELIT +".]

Page 19: GPUs for GEC Competition @ GECCO-2013

Results of Island Models with 4 GPUs

Speedup

QAP Instances   Series 1   Series 2   Series 3   Series 4
tai50a          2.1        2.6        2.9        3.3
tai60a          1.9        2.2        2.3        2.5
tai80a          2.4        2.5        2.7        2.9
tai100a         1.7        2.1        2.2        2.5
tai50b          1.5        2.3        2.5        3.0
tai60b          1.2        1.4        1.9        2.6
tai80b          1.5        4.7        4.3        6.5
tai100b         1.4        2.3        2.3        3.2
Average         1.7        2.5        2.6        3.3

(The chart legend is illegible in the source; the Series 4 average of 3.3 is the IM-ELMR speedup quoted in the Conclusion.)

Page 20: GPUs for GEC Competition @ GECCO-2013

Conclusion

• On a single GPU with "MATA"
  – 25.5 times speedup over the CPU (i7 965, 3.2GHz)
• On 4 GPUs (GTX480)
  – The IM-ELMR model has a 3.3 times speedup over a single GPU
• As a result, 25.5 × 3.3 = 84.2 times speedup compared with the CPU computation

Page 21: GPUs for GEC Competition @ GECCO-2013

GPOCL: A Massively Parallel GP Implementation in OpenCL

Douglas A. Augusto, Helio J.C. Barbosa

Laboratório Nacional de Computação Científica (LNCC)

Rio de Janeiro, Brazil

Page 22: GPUs for GEC Competition @ GECCO-2013

GPOCL’s Features

• Fast and efficient C/C++ implementation based on a compact linear tree representation.
• Massively parallel tree interpretation using OpenCL.
• It can be executed on virtually any parallel device, comprising different architectures and vendors.
• It implements three different parallel strategies (fitness-based, population-based, and a mixture of both).
• To improve diversity it can evolve loosely-coupled subpopulations (neighborhoods).
• It has a rich set of command-line options, including primitives' set definition, probabilities of the genetic operators, stopping criteria, minimum and maximum tree sizes, and the configuration of neighborhoods.
• It is Free Software (http://gpocl.sf.net).

Page 23: GPUs for GEC Competition @ GECCO-2013

Open Computing Language (OpenCL)

• Open Computing Language, or simply OpenCL, is an open-standard programming language for heterogeneous parallel computing. [1]
• It aims at efficiently exploiting the computing power of all processing devices, such as traditional processors (CPU) and accelerators (GPU, FPGA, DSP, Intel's MIC, and so forth).
• It provides a uniform programming interface, which saves the programmer from writing different codes in different languages when targeting multiple compute architectures, thus providing portability.
• It is very flexible (low-level language).

[1] http://www.khronos.org

Page 24: GPUs for GEC Competition @ GECCO-2013

GPOCL

GPOCL implements a GP system using a prefix linear tree representation. Its main routine performs the following high-level procedures:

1. OpenCL initialization: This is the step where the general OpenCL-related tasks are initialized.
2. Calculating n-dimensional ranges: Defines how much parallel work there will be and how it is distributed among the compute units.
3. Memory buffers creation: In this phase all global memory regions accessed by the OpenCL kernels are allocated on the device and possibly initialized. The fitness cases are transferred and enough space is reserved for the population and error vectors.
4. Kernel building: An OpenCL kernel, relative to a given strategy of parallelization, is compiled just-in-time, targeting the compute device.
5. Evolving: This iterative routine implements the actual genetic programming dynamics.
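A prefix linear tree stores a program as a flat sequence with each operator ahead of its operands, which makes it easy to interpret without pointers or recursion. The sketch below shows one way such a sequence could be evaluated with a stack. The token encoding (strings for operators and the variable `x`, numbers for constants) and the name `eval_prefix` are assumptions for illustration and are not GPOCL's actual format.

```python
import operator

# Hypothetical primitive set: binary arithmetic operators only.
BINARY_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def eval_prefix(program, x):
    """Evaluate a prefix-encoded expression by scanning it right to left."""
    stack = []
    for token in reversed(program):
        if token in BINARY_OPS:
            left = stack.pop()    # pushed last, so it is the left operand
            right = stack.pop()
            stack.append(BINARY_OPS[token](left, right))
        elif token == "x":
            stack.append(x)
        else:
            stack.append(float(token))
    return stack.pop()


# x * x + 1 and x - 1 in prefix form.
square_plus_one = eval_prefix(["+", "*", "x", "x", 1.0], 3.0)
minus_one = eval_prefix(["-", "x", 1.0], 3.0)
```

Because the program is a flat array, the whole population can be copied to a device buffer as contiguous memory, which fits the memory buffers step in the list above.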

Page 25: GPUs for GEC Competition @ GECCO-2013

Main Evolutionary Algorithm

Create (randomly) the initial population P;
Evaluate(P);
for generation ← 1 to NG do
    Copy the best (elitism) programs of P to the temporary population Ptmp;
    while |Ptmp| < |P| do
        Select and copy from P two fit programs, p1 and p2;
        if [probabilistically] crossover then
            Recombine p1 and p2, generating p1' and p2';
            p1 ← p1'; p2 ← p2';
        end
        if [probabilistically] mutation then
            Apply mutation in p1 and p2, creating p1' and p2';
            p1 ← p1'; p2 ← p2';
        end
        Insert p1 and p2 into Ptmp;
    end
    P ← Ptmp; then reset Ptmp;
    Evaluate(P);
end
return the best program found;
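The loop above can be sketched as runnable Python. To keep the example short it evolves fixed-length bit tuples rather than GP trees, and it uses tournament selection. Both are substitutions made for illustration, as are all names and parameter defaults; none of them come from GPOCL.

```python
import random


def tournament_pick(population, errors, k, rng):
    """Return the lowest-error individual among k random contenders."""
    contenders = rng.sample(range(len(population)), k)
    return population[min(contenders, key=lambda i: errors[i])]


def evolve(population, error, crossover, mutate, generations,
           n_elite=1, p_cross=0.9, p_mut=0.1, tournament=3, seed=0):
    """Generational loop with elitism; lower error means a fitter program."""
    rng = random.Random(seed)
    size = len(population)
    errors = [error(p) for p in population]                  # Evaluate(P)
    for _ in range(generations):
        ranked = sorted(range(size), key=lambda i: errors[i])
        tmp = [population[i] for i in ranked[:n_elite]]      # elitism
        while len(tmp) < size:
            p1 = tournament_pick(population, errors, tournament, rng)
            p2 = tournament_pick(population, errors, tournament, rng)
            if rng.random() < p_cross:
                p1, p2 = crossover(p1, p2, rng)
            if rng.random() < p_mut:
                p1, p2 = mutate(p1, rng), mutate(p2, rng)
            tmp.extend([p1, p2])
        population = tmp[:size]                              # P <- Ptmp
        errors = [error(p) for p in population]              # Evaluate(P)
    best = min(range(size), key=lambda i: errors[i])
    return population[best], errors[best]


# Toy problem (not GP): minimize the number of zero bits in a tuple.
def count_zeros(p):
    return p.count(0)


def one_point_crossover(p1, p2, rng):
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]


def flip_one_bit(p, rng):
    i = rng.randrange(len(p))
    return p[:i] + (1 - p[i],) + p[i + 1:]


init_rng = random.Random(1)
initial = [tuple(init_rng.randint(0, 1) for _ in range(16)) for _ in range(20)]
best, best_error = evolve(initial, count_zeros, one_point_crossover,
                          flip_one_bit, generations=30)
```

Because the best individual is always copied into the next generation, the best error can never get worse from one generation to the next. In GPOCL the two `Evaluate(P)` steps are where the OpenCL kernels would run.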

Page 26: GPUs for GEC Competition @ GECCO-2013

Evaluate(P)

The evaluation step itself does not do much; the hard work is done mostly by the OpenCL kernels. Basically, three things happen within Evaluate(P):

1. Population transfer: All programs of P are transferred to the target compute device.
2. Kernel execution: For any non-trivial problem, this is the most demanding phase. Here, the entire recently transferred population is evaluated on the compute device by interpreting each program over each fitness case. Fortunately, this step can be both run in parallel and accelerated by GPUs.
3. Error retrieval: After being computed and accumulated in the previous step, the population's prediction errors need to be transferred to the host so that this information is available to the evolutionary process.

Page 27: GPUs for GEC Competition @ GECCO-2013

Overall Best Parallelization Strategy

• The population of programs and fitness cases are parallelized.
• A mixture of the fitness- and population-based strategies.
• While different programs are evaluated simultaneously on different compute units (CU), the processing elements (PE) within each CU take care, in parallel, of the whole training data set.
• Since internally to each CU the PEs will be interpreting the same program, the event of instruction divergence is unlikely.
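The work decomposition described above can be emulated sequentially to make the mapping concrete. In this sketch the outer loop stands in for the compute units (one program each) and the inner loops stand in for the processing elements, each taking a strided share of the fitness cases before a final reduction. Programs are plain Python callables and the error is an absolute difference; both are simplifying assumptions, and the name `evaluate_population` is made up.

```python
def evaluate_population(programs, cases, local_size=4):
    """Emulate the mixed strategy: one program per CU, cases split across PEs."""
    errors = []
    for program in programs:                     # one compute unit per program
        partial = [0.0] * local_size             # one accumulator per processing element
        for local_id in range(local_size):
            # Each PE handles cases local_id, local_id + local_size, ...
            for i in range(local_id, len(cases), local_size):
                x, y = cases[i]
                partial[local_id] += abs(program(x) - y)
        errors.append(sum(partial))              # reduction inside the CU
    return errors


# Hypothetical data: the target is x squared, with two candidate programs.
cases = [(float(x), float(x * x)) for x in range(10)]
programs = [lambda x: x * x, lambda x: x + 1.0]
errors = evaluate_population(programs, cases)
```

All processing elements of one compute unit execute the same program and differ only in the data they read, which is why this layout avoids the instruction divergence mentioned in the last bullet.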

Page 28: GPUs for GEC Competition @ GECCO-2013

Some benchmarks on an NVIDIA GTX-285 GPU

An old generation GPU (released in early 2009)

Page 29: GPUs for GEC Competition @ GECCO-2013

Fitness-based Parallelization Strategy

[Figure: throughput in billion GPop/s (axis from 1.000 to 10.000) plotted over population size and data set size, with tick values from 100 to 50000.]

9.540 billion GPop/s (good performance, but requires a lot of fitness cases)

Page 30: GPUs for GEC Competition @ GECCO-2013

Population-based Parallelization Strategy

[Figure: throughput in billion GPop/s (axis from 0.100 to 0.700) plotted over population size and data set size, with tick values from 100 to 50000.]

0.690 billion GPop/s (bad performance, causes a lot of instruction divergence)

Page 31: GPUs for GEC Competition @ GECCO-2013

Combined Fitness- and Population-based Parallelization Strategy

[Figure: throughput in billion GPop/s (axis from 7.000 to 12.000) plotted over population size and data set size, with tick values from 100 to 50000.]

11.85 billion GPop/s

Page 32: GPUs for GEC Competition @ GECCO-2013


Thank you!

Page 33: GPUs for GEC Competition @ GECCO-2013

And the winner is....

Fast QAP Solver with ACO and Taboo Search on Multiple GPUs with the Move-Cost Adjusted Thread Assignment

Shigeyoshi Tsutsui, Hannan University, and Noriyuki Fujimoto, Osaka Prefecture University

