
IBM Labs in Haifa


GCC Tutorial – The compilation flow of the auto-vectorizer

Dorit Nuzman

dorit@il.ibm.com

Haifa IBM Labs

2nd HiPEAC GCC Tutorial

Ghent, Belgium, January 2007


What is vectorization

- Data elements are packed into vectors, held in vector registers (VR1 ... VR5)
- Vector length determines the Vectorization Factor (VF)
- A vector operation works on all elements of a vector at once

[Figure: data in memory (a b c d e f g h i j k l m n o p). The scalar operations OP(a), OP(b), OP(c), OP(d) are replaced by one vector operation VOP(a, b, c, d) on vector register VR1, which holds a b c d in lanes 0 1 2 3.]

VF = 4

original serial loop:

    for (i=0; i<N; i++) {
      a[i] = a[i] + b[i];
    }

vectorization

loop in vector notation:

    for (i=0; i<N; i+=VF) {
      a[i:i+VF-1] = a[i:i+VF-1] + b[i:i+VF-1];
    }
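The relation between a scalar operation OP and a vector operation VOP can be sketched in plain C. This is only an emulation: the `vreg` type, the `op_add`/`vop_add` names and the choice VF = 4 are assumptions made for illustration, and the inner loop stands in for what a real vector unit does in one instruction.

```c
#include <stddef.h>

#define VF 4  /* assumed vectorization factor: elements per vector */

/* Hypothetical stand-in for a vector register holding VF packed ints. */
typedef struct { int lane[VF]; } vreg;

/* Scalar operation OP applied to one pair of elements. */
static int op_add(int x, int y) { return x + y; }

/* Vector operation VOP: OP applied to all VF lanes (emulated with a loop). */
static vreg vop_add(vreg x, vreg y)
{
    vreg r;
    for (size_t k = 0; k < VF; k++)
        r.lane[k] = op_add(x.lane[k], y.lane[k]);
    return r;
}
```

One call to `vop_add` replaces VF calls to `op_add`, which is the source of the potential speedup.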


GCC Passes

[Figure: GCC compilation flow. The front-ends (C, C++, Fortran, … Ada) produce parse trees; the middle-end works on GIMPLE trees; the back-end works on RTL and, using the machine description of each port (i386, rs6000, … mips), emits assembly.]

loop analyses and optimizations:
- data-dependence
- scalar-evolution
- number of iters
- invariant motion
- iv-canon/optimize
- linear transform
- unswitching
- if-conversion
- unrolling
- vectorization

Questions the vectorizer has to answer:
- loop form ok?
- any data-deps?
- scalar-cycles?
- aliasing?
- access-patterns?
- vector size?
- supportable?
- alignment?
- data shuffle?
- cost?

Why study the vectorizer?
- middle-end & back-end aspects
- performance impact potential
- there's a lot to do…


Talk Layout

What is vectorization

Back-end aspects
- Machine-description and operation tables
- Querying target-support in vectorizer
- Enabling vectorization for a new port

Tree-level aspects
- Adding a tree-optimization pass
- Vectorization analyses and transformation

Detailed example: Reduction
- Adding a new idiom
- Compilation flow example
- Two advanced cases

Using the vectorizer
- Programming and tuning hints


GCC Backend – machine-description files and operation tables

A GCC "port": target-specific files

gcc/gcc/config/<myport>/
- for example: i386, ia64, rs6000, spu…

target-specific compiler options: <target>.opt
- command-line options of GCC specific to the target
- for example: -maltivec, -msse2, -mtune=power4, -minsert-sched-nops=

target-specific definitions: <target>.h
- basic parameters and features
- for example:

    #define POINTER_SIZE (TARGET_32BIT ? 32 : 64)
    #define BYTES_BIG_ENDIAN 1
    #define FIXED_REGISTERS \
      {0, 1, FIXED_R2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, FIXED_R13, 0, 0, \
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
       ….
    #define CALL_USED_REGISTERS \
      {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, FIXED_R13, 0, 0, \
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
       ...

target-specific support functions: <target>.c
- target predicates, code generation functions, target variants

machine description: <target>.md
- definition of RTL instructions and their translations to assembly
- content of machine description determines which features (operations, modes) are available


machine-description file

RTL operations: rtl.def

    DEF_RTL_EXPR(SMIN, "smin", "ee", RTX_COMM_ARITH)
    DEF_RTL_EXPR(SMAX, "smax", "ee", RTX_COMM_ARITH)
    DEF_RTL_EXPR(UMIN, "umin", "ee", RTX_COMM_ARITH)
    DEF_RTL_EXPR(UMAX, "umax", "ee", RTX_COMM_ARITH)

alpha/alpha.md

    (define_insn "sminqi3"
      [(set (match_operand:QI 0 "register_operand" "=r")
            (smin:QI (match_operand:QI 1 "reg_or_0_operand" "%rJ")
                     (match_operand:QI 2 "reg_or_8bit_operand" "rI")))]
      "TARGET_MAX"
      "minsb8 %r1,%2,%0"
      [(set_attr "type" "mvi")])

    (define_insn "sminv8qi3"
      [(set (match_operand:V8QI 0 "register_operand" "=r")
            (smin:V8QI (match_operand:V8QI 1 "reg_or_0_operand" "rW")
                       (match_operand:V8QI 2 "reg_or_0_operand" "rW")))]
      "TARGET_MAX"
      "minsb8 %r1,%r2,%0"
      [(set_attr "type" "mvi")])

Files:
- gcc/gcc: rtl.def
- gcc/gcc/config/<port>: <target>.opt, <target>.h, <target>.c, <target>.md

http://gcc.gnu.org/onlinedocs/gccint/


Elements of a machine-description pattern:
- machine-modes: qi, hi, si, di, sf, df
- vector machine-modes:
    alpha: v8qi, v4hi
    altivec: v16qi, v8hi, v4si
- constraints
- conditions
- attributes
- assembly
- scalar and vector operations differ only in operand modes
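To make the last point concrete, here is a small C emulation of the two alpha patterns: a signed minimum on one 8-bit value (QI mode) and the same operation applied to eight 8-bit lanes packed in a 64-bit word (V8QI mode). The function names are made up for this sketch, and the byte-wise behaviour is an assumption based on the pattern semantics (`smin:V8QI`), not a statement about any particular hardware.

```c
#include <stdint.h>

/* Scalar signed minimum on one QImode-sized (8-bit) value. */
static int8_t smin_qi(int8_t a, int8_t b) { return a < b ? a : b; }

/* Emulated V8QI signed minimum: eight 8-bit lanes packed in a 64-bit word.
   The operation is the same as smin_qi; only the operand "mode" differs.
   The casts assume the usual two's-complement conversion to int8_t. */
static uint64_t smin_v8qi(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int k = 0; k < 8; k++) {
        int8_t la = (int8_t)(uint8_t)(a >> (8 * k));
        int8_t lb = (int8_t)(uint8_t)(b >> (8 * k));
        r |= (uint64_t)(uint8_t)smin_qi(la, lb) << (8 * k);
    }
    return r;
}
```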


rs6000/rs6000.md

    (define_expand "sminsi3"
      [(set (match_dup 3)
            (if_then_else:SI (gt:SI (match_operand:SI 1 "gpc_reg_operand" "")
                                    (match_operand:SI 2 "reg_or_short_operand" ""))
                             (const_int 0)
                             (minus:SI (match_dup 2) (match_dup 1))))
       (set (match_operand:SI 0 "gpc_reg_operand" "")
            (minus:SI (match_dup 2) (match_dup 3)))]
      "TARGET_POWER || TARGET_ISEL"
      "{
        if (TARGET_ISEL) {
          operands[2] = force_reg (SImode, operands[2]);
          rs6000_emit_minmax (operands[0], SMIN, operands[1], operands[2]);
          DONE;
        }
        operands[3] = gen_reg_rtx (SImode);
      }")

RTL operations: rtl.def

    DEF_RTL_EXPR(IF_THEN_ELSE, "if_then_else", "eee", RTX_TERNARY)
    DEF_RTL_EXPR(GT, "gt", "ee", RTX_COMPARE)
    DEF_RTL_EXPR(MINUS, "minus", "ee", RTX_BIN_ARITH)

Support function rs6000_emit_minmax: rs6000/rs6000.c


rs6000/altivec.md

When the same pattern applies to multiple modes: use mode macros to generate an entire family of patterns

    ;; Vec int modes
    (define_mode_macro VI [V4SI V8HI V16QI])

    (define_insn "smin<mode>3"
      [(set (match_operand:VI 0 "register_operand" "=v")
            (smin:VI (match_operand:VI 1 "register_operand" "v")
                     (match_operand:VI 2 "register_operand" "v")))]
      "TARGET_ALTIVEC"
      "vmins<VI_char> %0,%1,%2"
      [(set_attr "type" "vecsimple")])

    (define_insn "sminv4sf3"
      [(set (match_operand:V4SF 0 "register_operand" "=v")
            (smin:V4SF (match_operand:V4SF 1 "register_operand" "v")
                       (match_operand:V4SF 2 "register_operand" "v")))]
      "TARGET_ALTIVEC"
      "vminfp %0,%1,%2"
      [(set_attr "type" "veccmp")])


GCC Backend – machine-description files and operation tables

optabs.c,h
- tables of RTL operations sharing common semantics, but differing in operand size and/or structure
- no type information available anymore

    optab/type   qi    hi    si                 v4si   v2si               …
    smin_optab   700   701   CODE_FOR_nothing   753    CODE_FOR_nothing
    umin_optab   702   703   CODE_FOR_nothing   754    CODE_FOR_nothing

build/gcc/insn-emit.c

    rtx
    gen_sminv4si3 (rtx operand0 ATTRIBUTE_UNUSED,
                   rtx operand1 ATTRIBUTE_UNUSED,
                   rtx operand2 ATTRIBUTE_UNUSED)
    {
      return gen_rtx_SET (VOIDmode,
                          operand0,
                          gen_rtx_SMIN (V4SImode, operand1, operand2));
    }

build/gcc/insn-output.c

    { "sminv4si3",
      { "vminsw %0,%1,%2", 0, 0 },
      (insn_gen_fn) gen_sminv4si3,
      &operand_data[1427],
      3, 0, 1, 1 }


Querying the backend for target support in the vectorizer

Statement to vectorize:

    min_27 = MIN_EXPR <tmp_26, min_50>;

Query in the vectorizer:

    optab = optab_for_tree_code (code, vectype);
    vec_mode = TYPE_MODE (vectype);
    icode = (int) optab->handlers[(int) vec_mode].insn_code;
    if (icode == CODE_FOR_nothing)
      {
        if (vect_print_dump_info (REPORT_DETAILS))
          fprintf (vect_dump, "operation not supported by target.");
        return false;
      }

    optab/type   qi    hi    si                 v8qi   v4hi   v2si
    smin_optab   700   701   CODE_FOR_nothing   752    753    CODE_FOR_nothing
    umin_optab   702   703   CODE_FOR_nothing   754    755    CODE_FOR_nothing

In this example vectype is vector int, vec_mode is v2si and the optab is smin_optab. The table entry for that combination is CODE_FOR_nothing, so the operation is reported as not supported by the target.
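The lookup can be modelled as a two-dimensional table indexed by operation and machine mode. The sketch below is a simplified stand-in, not GCC's data structure: the enum names, `CODE_FOR_NOTHING` and `target_supports` are hypothetical, and only the instruction codes are taken from the table on the slide.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of an operation table. */
enum mode { MODE_QI, MODE_HI, MODE_SI, MODE_V8QI, MODE_V4HI, MODE_V2SI, NUM_MODES };
enum op   { OP_SMIN, OP_UMIN, NUM_OPS };

#define CODE_FOR_NOTHING (-1)  /* marks "no instruction for this mode" */

/* Instruction codes copied from the table on the slide. */
static const int handlers[NUM_OPS][NUM_MODES] = {
    [OP_SMIN] = { 700, 701, CODE_FOR_NOTHING, 752, 753, CODE_FOR_NOTHING },
    [OP_UMIN] = { 702, 703, CODE_FOR_NOTHING, 754, 755, CODE_FOR_NOTHING },
};

/* True when the table has an instruction for this operation and mode. */
static bool target_supports(enum op op, enum mode mode)
{
    return handlers[op][mode] != CODE_FOR_NOTHING;
}
```

With this table, a signed minimum on v4hi is supported, while the same operation on v2si is rejected, mirroring the check against CODE_FOR_nothing above.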


Enabling vectorization for a new port

Basic features:

<target>.md
- distinction between scalar and vector ops: operand modes
- availability of vector ops: deduced from MD file

<target>.h
- specify supported vector length in bytes:

    #define UNITS_PER_SIMD_WORD 16

<target>-modes.def
- specify supported vector modes:

    /* Vector modes. */
    VECTOR_MODES (INT, 8);        /* V8QI V4HI V2SI */
    VECTOR_MODES (INT, 16);       /* V16QI V8HI V4SI V2DI */
    VECTOR_MODE (INT, DI, 1);
    VECTOR_MODES (FLOAT, 8);      /* V4HF V2SF */
    VECTOR_MODES (FLOAT, 16);     /* V8HF V4SF V2DF */
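The mode names in the comments follow from simple arithmetic: the number of elements in a vector is the vector length in bytes divided by the element size. A minimal sketch, where `elements_per_vector` is a hypothetical helper written for this note and not a GCC function:

```c
#include <stddef.h>

#define UNITS_PER_SIMD_WORD 16  /* vector length in bytes, as on the slide */

/* Number of elements of elem_size bytes that fit in one vector.
   Returns 0 when the element size does not divide the vector length. */
static size_t elements_per_vector(size_t elem_size)
{
    if (elem_size == 0 || UNITS_PER_SIMD_WORD % elem_size != 0)
        return 0;
    return UNITS_PER_SIMD_WORD / elem_size;
}
```

For a 16-byte vector this gives 16, 8, 4 and 2 elements for 1-, 2-, 4- and 8-byte integers, matching V16QI, V8HI, V4SI and V2DI.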


Enabling vectorization for a new port

Advanced features:

Special idioms, generic vector operations: look over list of idioms in optabs.h

    #define reduc_smax_optab (optab_table[OTI_reduc_smax])
    #define reduc_umax_optab (optab_table[OTI_reduc_umax])
    #define reduc_smin_optab (optab_table[OTI_reduc_smin])
    #define reduc_umin_optab (optab_table[OTI_reduc_umin])
    #define reduc_splus_optab (optab_table[OTI_reduc_splus])
    #define reduc_uplus_optab (optab_table[OTI_reduc_uplus])

    #define ssum_widen_optab (optab_table[OTI_ssum_widen])
    #define usum_widen_optab (optab_table[OTI_usum_widen])
    #define sdot_prod_optab (optab_table[OTI_sdot_prod])
    #define udot_prod_optab (optab_table[OTI_udot_prod])

    #define vec_set_optab (optab_table[OTI_vec_set])
    #define vec_extract_optab (optab_table[OTI_vec_extract])
    #define vec_extract_even_optab (optab_table[OTI_vec_extract_even])
    #define vec_extract_odd_optab (optab_table[OTI_vec_extract_odd])
    #define vec_interleave_high_optab (optab_table[OTI_vec_interleave_high])
    #define vec_interleave_low_optab (optab_table[OTI_vec_interleave_low])
    #define vec_init_optab (optab_table[OTI_vec_init])
    #define vec_shl_optab (optab_table[OTI_vec_shl])
    #define vec_shr_optab (optab_table[OTI_vec_shr])
    #define vec_realign_load_optab (optab_table[OTI_vec_realign_load])
    #define vec_widen_umult_hi_optab (optab_table[OTI_vec_widen_umult_hi])
    #define vec_widen_umult_lo_optab (optab_table[OTI_vec_widen_umult_lo])
    #define vec_widen_smult_hi_optab (optab_table[OTI_vec_widen_smult_hi])
    #define vec_widen_smult_lo_optab (optab_table[OTI_vec_widen_smult_lo])
    #define vec_unpacks_hi_optab (optab_table[OTI_vec_unpacks_hi])
    #define vec_unpacku_hi_optab (optab_table[OTI_vec_unpacku_hi])
    #define vec_unpacks_lo_optab (optab_table[OTI_vec_unpacks_lo])
    #define vec_unpacku_lo_optab (optab_table[OTI_vec_unpacku_lo])
    #define vec_pack_mod_optab (optab_table[OTI_vec_pack_mod])
    #define vec_pack_ssat_optab (optab_table[OTI_vec_pack_ssat])
    #define vec_pack_usat_optab (optab_table[OTI_vec_pack_usat])

Specialized vector operations: look over target.h

    /* Functions relating to vectorization. */
    struct vectorize
    {
      tree (* builtin_mask_for_load) (void);
      tree (* builtin_vectorized_function) (unsigned, tree);
      tree (* builtin_mul_widen_even) (tree);
      tree (* builtin_mul_widen_odd) (tree);
    } vectorize;


Enabling vectorization for a new port

Enable the vectorizer testcases

- testcases are in gcc/gcc/testsuite/gcc.dg/vect
- additional target-specific testcases: testsuite/gcc.target/i386/vect1.c
- vect.exp: add logic to decide whether to compile/run and with which target-specific options
- add where relevant in testsuite/lib/target-supports.exp

Excerpt from vect.exp:

    if [istarget "powerpc*-*-*"] {
    }
    } elseif { [istarget "spu-*-*"] } {
        set dg-do-what-default run
    } elseif { [istarget "i?86-*-*"] || [istarget "x86_64-*-*"] } {
        lappend DEFAULT_VECTCFLAGS "-msse2"
        set dg-do-what-default run
    } elseif { [istarget "mipsisa64*-*-*"]
               && [check_effective_target_mpaired_single] } {
        lappend DEFAULT_VECTCFLAGS "-mpaired-single"
        set dg-do-what-default run
    } elseif [istarget "sparc*-*-*"] {
    } elseif [istarget "alpha*-*-*"] {
        lappend DEFAULT_VECTCFLAGS "-mmax"
        if [check_alpha_max_hw_available] {
            set dg-do-what-default run
        } else {
            set dg-do-what-default compile
        }
    } elseif [istarget "ia64-*-*"] {
        set dg-do-what-default run
    } else {
        return


Effective-target procedures in testsuite/lib/target-supports.exp:

    proc check_effective_target_vect_int
    check_effective_target_vect_shift
    check_effective_target_vect_long
    proc check_effective_target_vect_float

    proc check_effective_target_vect_double { } {
        global et_vect_double_saved
        if [info exists et_vect_double_saved] {
            verbose "using cached result" 2
        } else {
            set et_vect_double_saved 0
            if { [istarget i?86-*-*]
                 || [istarget x86_64-*-*]
                 || [istarget spu-*-*] } {
                set et_vect_double_saved 1
            }
        }
        return $et_vect_double_saved
    }

    check_effective_target_vect_no_int_max
    check_effective_target_vect_no_int_add
    check_effective_target_vect_sdot_hi
    check_effective_target_vect_udot_hi
    check_effective_target_vect_sdot_si
    check_effective_target_vect_udot_si
    ….


A tree-level pass

- New C files in gcc/gcc: tree-vectorizer.c, tree-vect-analyze.c, tree-vect-transform.c, tree-vect-patterns.c, tree-vectorizer.h
- tree-flow.h: prototype for pass function

    unsigned vectorize_loops (void);

- gcc/Makefile.in entries
- The pass is invoked for each function:

    unsigned vectorize_loops (void)
    {
      unsigned int i;
      unsigned int num_vectorized_loops = 0;
      unsigned int vect_loops_num;
      loop_iterator li;
      struct loop *loop;

      vect_loops_num = number_of_loops ();

      /* Analyze each loop; transform the ones found vectorizable.  */
      FOR_EACH_LOOP (li, loop, LI_ONLY_OLD)
        {
          loop_vec_info loop_vinfo;

          vect_loop_location = find_loop_location (loop);
          loop_vinfo = vect_analyze_loop (loop);
          loop->aux = loop_vinfo;

          if (!loop_vinfo || !LOOP_VINFO_VECTORIZABLE_P (loop_vinfo))
            continue;

          vect_transform_loop (loop_vinfo);
          num_vectorized_loops++;
        }

      if (vect_print_dump_info (REPORT_VECTORIZED_LOOPS))
        fprintf (vect_dump, "vectorized %u loops in function.\n",
                 num_vectorized_loops);
    }


A tree-level pass

- add the pass to the pass hierarchy in passes.c:

    …
    NEXT_PASS (pass_split_crit_edges);
    NEXT_PASS (pass_pre);
    NEXT_PASS (pass_may_alias);
    NEXT_PASS (pass_sink_code);
    NEXT_PASS (pass_tree_loop);
    NEXT_PASS (pass_cse_reciprocals);
    NEXT_PASS (pass_reassoc);
    NEXT_PASS (pass_vrp);
    NEXT_PASS (pass_dominator);

    p = &pass_tree_loop.sub;
    NEXT_PASS (pass_tree_loop_init);
    NEXT_PASS (pass_copy_prop);
    NEXT_PASS (pass_lim);
    NEXT_PASS (pass_tree_unswitch);
    NEXT_PASS (pass_scev_cprop);
    NEXT_PASS (pass_empty_loop);
    NEXT_PASS (pass_record_bounds);
    NEXT_PASS (pass_linear_transform);
    NEXT_PASS (pass_iv_canon);
    NEXT_PASS (pass_if_conversion);
    NEXT_PASS (pass_vectorize);
    NEXT_PASS (pass_complete_unroll);
    NEXT_PASS (pass_loop_prefetch);
    NEXT_PASS (pass_iv_optimize);
    NEXT_PASS (pass_tree_loop_done);
    *p = NULL;

    p = &pass_vectorize.sub;
    NEXT_PASS (pass_lower_vector_ssa);
    NEXT_PASS (pass_dce_loop);
    *p = NULL;

- in tree-pass.h: prototype for pass structure

    extern struct tree_opt_pass pass_vectorize;

- pass-structure definition in tree-ssa-loop.c


A tree-level pass

• pass structure definition:

    struct tree_opt_pass pass_vectorize =
    {
      "vect",                               /* name */
      gate_tree_vectorize,                  /* gate */
      tree_vectorize,                       /* execute */
      NULL,                                 /* sub */
      NULL,                                 /* next */
      0,                                    /* static_pass_number */
      TV_TREE_VECTORIZATION,                /* tv_id */
      PROP_cfg | PROP_ssa,                  /* properties_required */
      0,                                    /* properties_provided */
      0,                                    /* properties_destroyed */
      TODO_verify_loops,                    /* todo_flags_start */
      TODO_dump_func | TODO_update_ssa,     /* todo_flags_finish */
      0                                     /* letter */
    };

• timevar.def: variable used for timing and for identification in timing reports:

    DEFTIMEVAR (TV_TREE_VECTORIZATION, "tree vectorization")

• gate function:

    static bool
    gate_tree_vectorize (void)
    {
      return flag_tree_vectorize && current_loops;
    }

• execute function:

    static unsigned int
    tree_vectorize (void)
    {
      return vectorize_loops ();
    }

• common.opt: add command line option:

    ftree-vectorize
    Common Report Var(flag_tree_vectorize)
    Enable loop vectorization on trees
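How the gate and execute callbacks cooperate can be illustrated with a toy pass runner. Everything here is hypothetical and heavily simplified: `struct simple_pass`, `run_pass` and the two global flags are stand-ins invented for this sketch, not GCC's `tree_opt_pass` or its pass manager.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified pass descriptor (not GCC's tree_opt_pass). */
struct simple_pass {
    const char *name;
    bool (*gate)(void);         /* should the pass run at all? */
    unsigned (*execute)(void);  /* do the work, return TODO-style flags */
};

static int flag_tree_vectorize = 0;  /* stands in for -ftree-vectorize */
static int have_loops = 1;           /* stands in for current_loops != NULL */
static unsigned times_executed = 0;  /* counts how often the pass body ran */

static bool gate_vectorize(void) { return flag_tree_vectorize && have_loops; }
static unsigned exec_vectorize(void) { times_executed++; return 0; }

static const struct simple_pass pass_vect = {
    "vect", gate_vectorize, exec_vectorize
};

/* Run a pass only when it has no gate or its gate returns true. */
static bool run_pass(const struct simple_pass *p)
{
    if (p->gate != NULL && !p->gate())
        return false;
    p->execute();
    return true;
}
```

With the flag cleared the pass is skipped; setting it (as the command-line option would) lets the execute callback run.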


A tree-level pass

invoke.texi: document the pass for the GCC manual:

    @item -ftree-vectorize
    Perform loop vectorization on trees.

    @item vect
    @opindex fdump-tree-vect
    Dump each function after applying vectorization of loops. The file name is
    made by appending @file{.vect} to the source file name.

Example command lines:

    gcc -O2 -ftree-vectorize example.c
    gcc -O2 -ftree-vectorize -maltivec example.c
    gcc -O2 -ftree-vectorize -msse2 example.c
    gcc -O2 -ftree-vectorize -maltivec -fdump-tree-vect example.c
    gcc -O2 -ftree-vectorize -maltivec -fdump-tree-vect-details example.c
    gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=2 example.c
    gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=7 -fdump-tree-vect example.c

Files touched when adding the pass:

1. [tree-vect*.c]
2. tree-flow.h
3. Makefile.in
4. [tree-ssa-loop.c]
5. timevar.def
6. common.opt
7. invoke.texi

Files seen so far:
- gcc/gcc: rtl.def, target.h, optabs.h
- gcc/gcc/config/<port>: <target>.opt, <target>.h, <target>.c, <target>.md


Example: vectorizer dump reports

autocorrelation:

    int main1 (short *in, int off, short scale, int n)
    {
      int i, sum = 0;
      for (i = 0; i < n; i++) {
        sum += ((int) in[i] * (int) in[i+off]) >> scale;
      }
      return sum;
    }

Speedups:
- powerpc970: 5-6x
- Cell SPU: 4-5x

    [dorit@mac-ira vect]$ gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=5 vect-widen-mult-sum.c
    vect-widen-mult-sum.c:16: note: Vectorizing an unaligned access.
    vect-widen-mult-sum.c:16: note: Vectorizing an unaligned access.
    vect-widen-mult-sum.c:16: note: LOOP VECTORIZED.
    vect-widen-mult-sum.c:12: note: vectorized 1 loops in function.


Auto-vectorization Skeleton

Analysis (tree-vect-analyze.c):

    vect_analyze_loop (loop) {
      if (!1_analyze_loop_form (loop)) FAIL
      if (!2_analyze_data_refs (loop)) FAIL
      if (!3_analyze_scalar_dependence_cycles (loop)) FAIL
      if (!4_pattern_recog (loop)) FAIL
      if (!5_analyze_data_alignment (loop)) FAIL
      if (!6_determine_VF (loop)) FAIL
      if (!7_analyze_data_dependence_distances (loop)) FAIL
      if (!8_analyze_memory_access_patterns (loop)) FAIL
      if (!9_analyze_all_operations_supported (loop)) FAIL
      SUCCEED
    }

Transformation, if SUCCEED (tree-vect-transform.c):

    vect_transform_loop (loop) {
      FOR_ALL_STMTS_IN_LOOP (loop, stmt)
        replace_OP_by_VOP (stmt);
      decrease_loop_bound_by_factor_VF (loop);
    }
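The analysis phase is a chain of checks in which the first failure stops the whole attempt. A minimal sketch of that control structure follows; the `toy_loop` summary and the three checks are invented for illustration, whereas the real analyses inspect the loop's intermediate representation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy summary of a loop. */
struct toy_loop {
    bool countable;        /* loop form ok? */
    bool has_unknown_dep;  /* unresolved data dependence? */
    bool ops_supported;    /* every operation has a vector form? */
};

typedef bool (*analysis_fn)(const struct toy_loop *);

static bool check_form(const struct toy_loop *l) { return l->countable; }
static bool check_deps(const struct toy_loop *l) { return !l->has_unknown_dep; }
static bool check_ops(const struct toy_loop *l)  { return l->ops_supported; }

/* Runs the analyses in order. Returns -1 when all succeed, otherwise the
   index of the first analysis that failed. */
static int analyze_loop(const struct toy_loop *l)
{
    static const analysis_fn steps[] = { check_form, check_deps, check_ops };
    for (size_t k = 0; k < sizeof steps / sizeof steps[0]; k++)
        if (!steps[k](l))
            return (int)k;
    return -1;
}
```

Returning the index of the failing step is a design choice of this sketch; it makes it easy to report why a loop was rejected.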


Auto-Vectorization Transformation

- Replace scalar statements with vector statements
- Modify loop bound: strip-mine, create epilog loop

original serial loop:

    for (i=0; i<N; i++) {
      a[i] = a[i] + b[i];
    }

loop in vector notation:

    for (i=0; i<N; i+=VF) {
      a[i:i+VF-1] = a[i:i+VF-1] + b[i:i+VF-1];
    }

loop in vector notation, with the bound adjusted for N not a multiple of VF:

    /* vectorized loop */
    for (i=0; i<(N-N%VF); i+=VF) {
      a[i:i+VF-1] = a[i:i+VF-1] + b[i:i+VF-1];
    }

    /* epilog loop */
    for ( ; i < N; i++) {
      a[i] = a[i] + b[i];
    }
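The strip-mined loop plus epilog can be written as ordinary C. This is an emulation under assumptions: VF is fixed at 4, `add_arrays` is a name chosen for this sketch, and the inner loop over `k` stands in for a single vector operation.

```c
#include <stddef.h>

#define VF 4  /* assumed vectorization factor */

/* a[i] = a[i] + b[i] for i in [0, n), structured as a vectorized main loop
   followed by a scalar epilog loop. */
static void add_arrays(int *a, const int *b, size_t n)
{
    size_t i;
    size_t main_bound = n - n % VF;  /* largest multiple of VF not above n */

    /* Vectorized loop: each iteration handles VF consecutive elements. */
    for (i = 0; i < main_bound; i += VF)
        for (size_t k = 0; k < VF; k++)
            a[i + k] = a[i + k] + b[i + k];

    /* Epilog loop: the remaining n % VF elements, one at a time. */
    for (; i < n; i++)
        a[i] = a[i] + b[i];
}
```

With n = 10 the main loop covers elements 0 to 7 in two iterations and the epilog handles elements 8 and 9.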


Vectorization on SSA-ed GIMPLE trees

source loop:

    int i;
    float a[N], b[N];
    for (i=0; i < 16; i++)
      a[i] = a[i] * b[i];

scalar form:

    float T.1, T.2, T.3;
    loop:
      if (i >= 16) break;
      S1: T.1 = a[i];
      S2: T.2 = b[i];
      S3: T.3 = T.1 * T.2;
      S4: a[i] = T.3;
      S5: i = i + 1;
      goto loop;

VF = 4, "unroll by VF and replace":

    loop:
      if (i >= 16) break;
      T.11 = a[i];  T.12 = a[i+1];  T.13 = a[i+2];  T.14 = a[i+3];
      T.21 = b[i];  T.22 = b[i+1];  T.23 = b[i+2];  T.24 = b[i+3];
      T.31 = T.11 * T.21;  T.32 = T.12 * T.22;
      T.33 = T.13 * T.23;  T.34 = T.14 * T.24;
      a[i] = T.31;  a[i+1] = T.32;  a[i+2] = T.33;  a[i+3] = T.34;
      i = i + 4;
      goto loop;

vectorized form:

    v4sf VT.1, VT.2, VT.3;
    v4sf *VPa = (v4sf *)a, *VPb = (v4sf *)b;
    int indx;
    loop:
      if (indx >= 4) break;
      VT.1 = VPa[indx];
      VT.2 = VPb[indx];
      VT.3 = VT.1 * VT.2;
      VPa[indx] = VT.3;
      indx = indx + 1;
      goto loop;


Vectorizer analyses and transformation: Reduction

    s = 0;
    for (i=0; i<N; i++) {
      s += a[i] * b[i];
    }

    loop:
      s_1 = phi (0, s_2)        <- cross-iteration dependence: reduction
      i_1 = phi (0, i_2)        <- cross-iteration dependence: induction
      xa_1 = a[i_1]
      xb_1 = b[i_1]
      tmp_1 = xa_1 * xb_1
      s_2 = s_1 + tmp_1
      i_2 = i_1 + 1
      if (i_2 < N) goto loop

Analysis
- Detect scalar dependence cycles
- Identify scalar cycles that are reduction/induction

[Figure: the values 0 1 2 3 4 5 6 7 8 9 10 11 (labelled tmp_1) are accumulated four at a time into a vector of partial sums: 0 0 0 0, then 0 1 2 3, then 4 6 8 10, then 12 15 18 21.]
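The partial-sum scheme shown in the figure can be reproduced in plain C. This is a sketch under assumptions: VF is fixed at 4, `reduce_sum` is a name invented here, and the lanes of the vector accumulator are emulated with a small array that is also returned to the caller for inspection.

```c
#include <stddef.h>

#define VF 4  /* assumed vectorization factor */

/* Sums n ints using VF partial accumulators. The final lane values are
   left in partial[] so the caller can inspect them. */
static int reduce_sum(const int *x, size_t n, int partial[VF])
{
    size_t i;
    size_t main_bound = n - n % VF;
    int s = 0;

    /* Initialize the vector accumulator to zero in every lane. */
    for (size_t k = 0; k < VF; k++)
        partial[k] = 0;

    /* Vectorized loop: lane k accumulates elements k, k+VF, k+2*VF, ... */
    for (i = 0; i < main_bound; i += VF)
        for (size_t k = 0; k < VF; k++)
            partial[k] += x[i + k];

    /* Reduction epilog: combine the lanes into one scalar. */
    for (size_t k = 0; k < VF; k++)
        s += partial[k];

    /* Scalar epilog for the leftover n % VF elements. */
    for (; i < n; i++)
        s += x[i];

    return s;
}
```

For the input 0 ... 11 the lanes end up as 12, 15, 18, 21, as in the figure, and the final sum is 66. The result equals the sequential sum only because integer addition can be reordered freely.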


tree-vect-analyze.c

    static void
    vect_analyze_scalar_cycles (loop_vec_info loop_vinfo)
    {
      tree phi;
      struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
      basic_block bb = loop->header;

      if (vect_print_dump_info (REPORT_DETAILS))
        fprintf (vect_dump, "=== vect_analyze_scalar_cycles ===");

      /* Classify each phi node in the loop header.  */
      for (phi = phi_nodes (bb); phi; phi = PHI_CHAIN (phi))
        {
          stmt_vec_info stmt_vinfo = vinfo_for_stmt (phi);
          tree def = PHI_RESULT (phi);

          if (!is_gimple_reg (SSA_NAME_VAR (def)))
            continue;

          STMT_VINFO_DEF_TYPE (stmt_vinfo) = vect_unknown_def_type;

          tree access_fn = analyze_scalar_evolution (loop, def);
          if (!access_fn)
            continue;

          /* Induction?  */
          if (vect_is_simple_iv_evolution (loop->num, access_fn))
            {
              STMT_VINFO_DEF_TYPE (stmt_vinfo) = vect_induction_def;
              continue;
            }

          /* Reduction?  */
          tree rstmt = vect_is_simple_reduction (loop, phi);
          if (rstmt)
            {
              STMT_VINFO_DEF_TYPE (stmt_vinfo) =
                STMT_VINFO_DEF_TYPE (vinfo_for_stmt (rstmt)) =
                  vect_reduction_def;
            }
          else
            if (vect_print_dump_info (REPORT_DETAILS))
              fprintf (vect_dump, "Unknown def-use cycle pattern.");
        } /* End for loop */

      return;
    }

Applied to the example loop:

    s_1 = phi (0, s_2)        <- def type: unknown, then reduc
    i_1 = phi (0, i_2)
    xa_1 = a[i_1]
    xb_1 = b[i_1]
    tmp_1 = xa_1 * xb_1
    s_2 = s_1 + tmp_1
    i_2 = i_1 + 1


Snippet from vect_is_simple_reduction (tree-vectorizer.c):

  edge latch_e = loop_latch_edge (loop);
  tree loop_arg = PHI_ARG_DEF_FROM_EDGE (phi, latch_e);
  tree def_stmt = SSA_NAME_DEF_STMT (loop_arg);
  tree operation = GIMPLE_STMT_OPERAND (def_stmt, 1);
  enum tree_code code = TREE_CODE (operation);
  …
  if (!commutative_tree_code (code) || !associative_tree_code (code))
    {
      if (vect_print_dump_info (REPORT_DETAILS))
        {
          fprintf (vect_dump, "reduction: not commutative/associative: ");
          print_generic_expr (vect_dump, operation, TDF_SLIM);
        }
      return NULL_TREE;
    }
  if (SCALAR_FLOAT_TYPE_P (type) && !flag_unsafe_math_optimizations)
    {
      if (vect_print_dump_info (REPORT_DETAILS))
        {
          fprintf (vect_dump, "reduction: unsafe fp math optimization: ");
          print_generic_expr (vect_dump, operation, TDF_SLIM);
        }
      return NULL_TREE;
    }
  …

  s_1 = phi (0, s_2)
  i_1 = phi (0, i_2)
  xa_1 = a[i_1]
  xb_1 = b[i_1]
  tmp_1 = xa_1 * xb_1
  s_2 = s_1 + tmp_1
  i_2 = i_1 + 1
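The floating-point check in vect_is_simple_reduction exists because vectorizing a reduction reorders the additions, and float addition is not associative. The sketch below illustrates this with plain scalar C; it is not GCC code. The function names and the sample data are made up for illustration, and the "reassociated" version only models what a VF = 2 vectorized loop would compute.

```c
#include <assert.h>
#include <stddef.h>

/* Serial reduction: adds the elements strictly in program order. */
static float sum_serial(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Reassociated reduction, modeling a VF = 2 vectorized loop:
   two independent partial sums, combined after the loop. */
static float sum_reassociated(const float *a, size_t n)
{
    assert(n % 2 == 0);
    float lane0 = 0.0f, lane1 = 0.0f;
    for (size_t i = 0; i < n; i += 2) {
        lane0 += a[i];
        lane1 += a[i + 1];
    }
    return lane0 + lane1;
}

/* Hypothetical sample input: the small terms are absorbed by 1e20f
   in the serial order, but survive in the reassociated order. */
static const float sample[4] = { 1e20f, 1.0f, -1e20f, 1.0f };
```

On this input the two orders give different sums (1 for the serial order, 2 for the reassociated one), assuming the file itself is compiled without -ffast-math. This is why the vectorizer asks for permission before changing the order.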


Vectorizer analyses and transformation: Reduction

loop:
  s_1 = phi (0, s_2)
  i_1 = phi (0, i_2)
  xa_1 = a[i_1]
  xb_1 = b[i_1]
  tmp_1 = xa_1 * xb_1
  s_2 = s_1 + tmp_1
  i_2 = i_1 + 1
  if (i_2 < N) goto loop

Transformation

loop:
  vs_1 = phi (vs_0, vs_2)
  i_1 = phi (0, i_2)
  vxa_1 = vpa[i_1]
  vxb_1 = vpb[i_1]
  vtmp_1 = vxa_1 * vxb_1
  vs_2 = vs_1 + vtmp_1
  i_2 = i_1 + 1
  if (i_2 < N/VF) goto loop

tree-vect-transform.c:

  vec_dest = vect_create_destination_var (scalar_dest, vectype);
  expr = build2 (code, vectype, loop_vec_def0, reduc_def);
  new_stmt = build2 (GIMPLE_MODIFY_STMT, void_type_node, vec_dest, expr);
  new_temp = make_ssa_name (vec_dest, new_stmt);
  GIMPLE_STMT_OPERAND (new_stmt, 0) = new_temp;
  bsi_insert_before (bsi, vec_stmt, BSI_SAME_STMT);


Vectorizer analyses and transformation: Reduction

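s = 0;
for (i=0; i<N; i++) {
  s += a[i] * b[i];
}
printf ("sum = %f\n", s);

Transformation

loop:
  vs_1 = phi (vs_0, vs_2)
  i_1 = phi (0, i_2)
  vxa_1 = vpa[i_1]
  vxb_1 = vpb[i_1]
  vtmp_1 = vxa_1 * vxb_1
  vs_2 = vs_1 + vtmp_1
  i_2 = i_1 + 1
  if (i_2 < N/VF) goto loop

Example (VF = 4):
  vs_0   = {0, 0, 0, 0}   (partial sums s1, s2, s3, s4)
  vtmp_1 = {0, 1, 2, 3}, then {4, 5, 6, 7}
  vs_2   = {4, 6, 8, 10}

Epilog: reduce vs_2 to the scalar s, using one of: scalar epilog, whole vector shifts, sum across.
With whole vector shifts:
  {4, 6, 8, 10} + {8, 10, -, -} = {12, 16, -, -}
  {12, 16, -, -} + {16, -, -, -} = {28, -, -, -}
  s = 28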
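The loop and epilog above can be modeled in plain C. This is a minimal sketch, not the code GCC generates: a 4-element int array stands in for a vector register, the function name is hypothetical, and the epilog imitates the "whole vector shifts" scheme with log2(VF) shift-and-add steps.

```c
#include <assert.h>

#define VF 4   /* vectorization factor: elements per modeled vector */

/* Dot product with the reduction kept in a VF-wide accumulator. */
static int dot_vectorized(const int *a, const int *b, int n)
{
    assert(n % VF == 0);                   /* no scalar remainder loop here */
    int vs[VF] = { 0, 0, 0, 0 };           /* vs_0: neutral element of + */

    for (int i = 0; i < n; i += VF)        /* vector loop, N/VF iterations */
        for (int l = 0; l < VF; l++)
            vs[l] += a[i + l] * b[i + l];  /* vs_2 = vs_1 + vtmp_1 */

    /* Epilog: add the upper half onto the lower half, halving each time. */
    for (int shift = VF / 2; shift >= 1; shift /= 2)
        for (int l = 0; l < shift; l++)
            vs[l] += vs[l + shift];

    return vs[0];                          /* the scalar result s */
}
```

With a = {0, 1, ..., 7} and b all ones, the accumulator holds {4, 6, 8, 10} after the loop, and the epilog reduces it through {12, 16} to 28, matching the example above.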


Adding new idioms

tree.def: define the tree-code:

/* Reduction operations. Operations that take a vector of elements and "reduce" it to a scalar result (e.g. summing the elements of the vector, finding the minimum over the vector elements, etc). Operand 0 is a vector; the first element in the vector has the result. Operand 1 is a vector. */

DEFTREECODE (REDUC_PLUS_EXPR, "reduc_plus_expr", tcc_unary, 1)

tree-pretty-print.c: dump_generic_node, op_prio, op_symbol

tree-inline.c: estimate_num_insns_1 ()


Adding new idioms

optabs.h: add a new operator table (optab) index to enum optab_index

/* Reduction operations on a vector operand. */
OTI_reduc_splus,
OTI_reduc_uplus,

optabs.h: define matching shortcuts

#define reduc_splus_optab (optab_table[OTI_reduc_splus])
#define reduc_uplus_optab (optab_table[OTI_reduc_uplus])


Adding new idioms

optabs.c: add selection of appropriate optab in the dispatch function optab_for_tree_code():

case REDUC_PLUS_EXPR:
  return TYPE_UNSIGNED (type) ? reduc_uplus_optab : reduc_splus_optab;

optabs.c: initialize the new optabs in init_optabs()

reduc_splus_optab = init_optab (UNKNOWN);
reduc_uplus_optab = init_optab (UNKNOWN);


Adding new idioms

genopinit.c: fill in the optabs:

"reduc_splus_optab->handlers[$A].insn_code = CODE_FOR_$(reduc_splus_$a$)",
"reduc_uplus_optab->handlers[$A].insn_code = CODE_FOR_$(reduc_uplus_$a$)",

optab / type         qi   hi   si   v8qi   v4hi   v2si
reduc_splus_optab    CODE_FOR_nothing   CODE_FOR_nothing   CODE_FOR_nothing
reduc_uplus_optab    CODE_FOR_nothing   CODE_FOR_nothing   CODE_FOR_nothing

gcc/gcc: rtl.def, target.h, optabs.h

gcc/gcc/config/<port>: <target>.opt, <target>.h, <target>.c, <target>.md

1. tree.def

2. tree-pretty-print.c

3. tree-inline.c

4. optabs.h

5. optabs.c

6. genopinit.c

7. expr.c

8. <target>.md
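The role of the table can be shown with a toy model: each optab maps a machine mode to an instruction code, and an entry left at "nothing" means the target has no pattern for that operation in that mode. Everything below is hypothetical (the toy_ names, the three modes, the single filled entry); it only imitates the lookup and is not GCC's data structure.

```c
#include <assert.h>

/* Hypothetical stand-ins for GCC's generated enums. */
enum toy_mode { TOY_V16QI, TOY_V8HI, TOY_V4SI, TOY_NUM_MODES };
enum toy_insn_code { TOY_CODE_FOR_nothing = 0, TOY_CODE_FOR_reduc_splus_v4si };
enum toy_optab { TOY_reduc_splus, TOY_reduc_uplus, TOY_NUM_OPTABS };

/* Zero-initialized, so every entry starts as TOY_CODE_FOR_nothing. */
static enum toy_insn_code toy_handlers[TOY_NUM_OPTABS][TOY_NUM_MODES];

/* Models the generated initialization: only patterns that the
   (imaginary) machine description defines get an entry. */
static void toy_init_optabs(void)
{
    toy_handlers[TOY_reduc_splus][TOY_V4SI] = TOY_CODE_FOR_reduc_splus_v4si;
}

/* Models the query a vectorizer would make before using an idiom. */
static int toy_target_supports(enum toy_optab op, enum toy_mode mode)
{
    assert(op < TOY_NUM_OPTABS && mode < TOY_NUM_MODES);
    return toy_handlers[op][mode] != TOY_CODE_FOR_nothing;
}
```

In this model, a port gains support for an idiom simply by defining the named pattern, which fills the corresponding table entry.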


Adding new idioms

expr.c: tree-to-rtl expansion:

    case REDUC_PLUS_EXPR:
      {
        op0 = expand_normal (TREE_OPERAND (exp, 0));
        this_optab = optab_for_tree_code (code, type);
        temp = expand_unop (mode, this_optab, op0, target, unsignedp);
        gcc_assert (temp);
        return temp;
      }

<target>.md: RTL instruction definition:

(define_expand "reduc_splus_<mode>"
  [(set (match_operand:VIshort 0 "register_operand" "=v")
        (unspec:VIshort [(match_operand:VIshort 1 "register_operand" "v")]
                        UNSPEC_REDUC_PLUS))]
  "TARGET_ALTIVEC"
  "
{
  rtx vzero = gen_reg_rtx (V4SImode);
  rtx vtmp1 = gen_reg_rtx (V4SImode);
  emit_insn (gen_altivec_vspltisw (vzero, const0_rtx));
  emit_insn (gen_altivec_vsum4s<VI_char>s (vtmp1, operands[1], vzero));
  emit_insn (gen_altivec_vsumsws_nomode (operands[0], vtmp1, vzero));
  DONE;
}")


Compilation Flow Example

vect-reduc-min.c:

#define N 16

int main1 ()
{
  int i;
  float c[N] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
  float min = 10;

  for (i = 0; i < N; i++) {
    min = min > c[i] ? c[i] : min;
  }

  /* check results: */
  if (min != 0)
    abort ();

  return 0;
}

gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=4 vect-reduc-min.c

vect-reduc-min.c:14: note: not vectorized: unsupported use in stmt.
vect-reduc-min.c:9: note: vectorized 0 loops in function.

gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=7 vect-reduc-min.c

…
vect-reduc-min.c:14: note: === vect_analyze_scalar_cycles ===
vect-reduc-min.c:14: note: Analyze phi: min_6 = PHI <min_3(6), 1.0e+1(2)>
vect-reduc-min.c:14: note: reduction: not commutative/associative: min_6 > min_7 ? min_7 : min_6
vect-reduc-min.c:14: note: Unknown def-use cycle pattern
…
vect-reduc-min.c:14: note: Unsupported pattern.
vect-reduc-min.c:14: note: not vectorized: unsupported use in stmt.
vect-reduc-min.c:14: note: unexpected pattern.
vect-reduc-min.c:9: note: vectorized 0 loops in function.

gcc -O2 -ftree-vectorize -maltivec vect-reduc-min.c -ftree-vectorizer-verbose=4 -ffast-math

vect-reduc-min.c:14: note: LOOP VECTORIZED.
vect-reduc-min.c:9: note: vectorized 1 loops in function.


Dumps produced with -fdump-tree-all -da

vect-min.c.081t.ifcvt

main1 ()
{
  unsigned int ivtmp.31;
  int pretmp.25;
  float min;
  float c[16];
  int i;
  float D.2429;
  static float C.3[16] = {…};

<bb 2>:
  c = C.3;

  # ivtmp.31_2 = PHI <ivtmp.31_3(4), 16(2)>
  # min_15 = PHI <min_7(4), 1.0e+1(2)>
  # i_14 = PHI <i_8(4), 0(2)>
<L0>:;
  D.2429_6 = c[i_14];
  min_7 = MIN_EXPR <D.2429_6, min_15>;
  i_8 = i_14 + 1;
  ivtmp.31_3 = ivtmp.31_2 - 1;
  if (ivtmp.31_3 != 0) goto <L8>; else goto <L2>;

<L8>:;
  goto <bb 3> (<L0>);

  # min_1 = PHI <min_7(3)>
<L2>:;
  if (min_1 != 0.0) goto <L3>; else goto <L4>;

<L3>:;
  abort ();

<L4>:;
  return 0;
}

vect-min.c.004t.gimple

  c = C.3;
  min = 1.0e+1;
  i = 0;
  goto <D2425>;
  <D2424>:;
  i.4 = i;
  D.2429 = c[i.4];
  min = MIN_EXPR <D.2429, min>;
  i = i + 1;
  <D2425>:;
  if (i <= 15)
    {
      goto <D2424>;
    }
  else
    {
      goto <D2426>;
    }
  <D2426>:;
  if (min != 0.0)
    {
      abort ();
    }
  else
    {
    }
  D.2430 = 0;
  return D.2430;


vect-min.c.082t.vect

<bb 2>:
  c = C.3;
  vect_pc.32_5 = (__vector float *) &c;
  vect_cst_.40_21 = { 1.0e+1, 1.0e+1, 1.0e+1, 1.0e+1 };

  # ivtmp.43_28 = PHI <ivtmp.43_29(4), 0(2)>
  # vect_var_.39_19 = PHI <vect_var_.39_20, vect_cst_.40_21>
  # ivtmp.37_16 = PHI <ivtmp.37_17(4), vect_pc.32_5(2)>
  # ivtmp.31_2 = PHI <ivtmp.31_3(4), 16(2)>
  # min_15 = PHI <min_7(4), 1.0e+1(2)>
  # i_14 = PHI <i_8(4), 0(2)>
<L0>:;
  vect_var_.38_18 = *ivtmp.37_16;
  D.2429_6 = c[i_14];
  vect_var_.39_20 = MIN_EXPR <vect_var_.38_18, vect_var_.39_19>;
  min_7 = MIN_EXPR <D.2429_6, min_15>;
  i_8 = i_14 + 1;
  ivtmp.31_3 = ivtmp.31_2 - 1;
  ivtmp.37_17 = ivtmp.37_16 + 16B;
  ivtmp.43_29 = ivtmp.43_28 + 1;
  if (ivtmp.43_29 < 4) goto <L8>; else goto <L2>;

<L8>:;
  goto <bb 3> (<L0>);

Continued:

  # vect_var_.39_22 = PHI <vect_var_.39_20(3)>
  # min_1 = PHI <min_7(3)>
<L2>:;
  vect_var_.42_23 = vect_var_.39_22 v>> 64;
  vect_var_.42_24 = MIN_EXPR <vect_var_.42_23, vect_var_.39_22>;
  vect_var_.42_25 = vect_var_.42_24 v>> 32;
  vect_var_.42_26 = MIN_EXPR <vect_var_.42_25, vect_var_.42_24>;
  vect_var_.41_27 = BIT_FIELD_REF <vect_var_.42_26, 32, 96>;
  if (vect_var_.41_27 != 0.0) goto <L3>; else goto <L4>;

<L3>:;
  abort ();

<L4>:;
  return 0;
}


vect-min.c.095t.dse2

  c = C.3;
  vect_pc.36_4 = (__vector float *) &c;
  vect_var_.38_6 = *vect_pc.36_4;
  vect_var_.39_1 = MIN_EXPR <vect_var_.38_6, { 1.0e+1, 1.0e+1, 1.0e+1, 1.0e+1 }>;
  ivtmp.37_14 = vect_pc.36_4 + 16B;
  vect_var_.38_32 = *ivtmp.37_14;
  vect_var_.39_33 = MIN_EXPR <vect_var_.39_1, vect_var_.38_32>;
  ivtmp.37_34 = ivtmp.37_14 + 16B;
  vect_var_.38_39 = *ivtmp.37_34;
  vect_var_.39_40 = MIN_EXPR <vect_var_.39_33, vect_var_.38_39>;
  ivtmp.37_41 = ivtmp.37_34 + 16B;
  vect_var_.38_18 = *ivtmp.37_41;
  vect_var_.39_20 = MIN_EXPR <vect_var_.38_18, vect_var_.39_40>;
  vect_var_.42_23 = vect_var_.39_20 v>> 64;
  vect_var_.42_24 = MIN_EXPR <vect_var_.39_20, vect_var_.42_23>;
  vect_var_.42_25 = vect_var_.42_24 v>> 32;
  vect_var_.42_26 = MIN_EXPR <vect_var_.42_25, vect_var_.42_24>;
  vect_var_.41_27 = BIT_FIELD_REF <vect_var_.42_26, 32, 96>;
  if (vect_var_.41_27 != 0.0) goto <L3>; else goto <L4>;

<L3>:;
  abort ();

<L4>:;
  return 0;
}


vect-min.c.138r.life2

(insn:HI 26 25 27 2 (set (reg:V4SF 138)
        (mem/u/c/i:V4SF (reg/f:SI 139) [2 S16 A128])) 632 {altivec_lvx_v4sf} ))

(insn:HI 27 26 28 2 (set (reg:V4SF 141)
        (mem:V4SF (plus:SI (reg/f:SI 113 sfp) (const_int 16 [0x10])) [2 S16 A128])) 632 {altivec_lvx_v4sf} (nil) (nil))

(insn:HI 28 27 29 2 (set (reg:V4SF 126 [ vect_var_.39 ])
        (smin:V4SF (reg:V4SF 138) (reg:V4SF 141))) 706 {sminv4sf3}))

(insn:HI 29 28 30 2 (set (reg/f:SI 127 [ ivtmp.37 ])
        (plus:SI (reg/f:SI 134) (const_int 16 [0x10]))) 79 {*addsi3_internal1} (nil) (nil))

(insn:HI 30 29 31 2 (set (reg:V4SF 142)
        (mem:V4SF (plus:SI (reg/f:SI 134) (const_int 16 [0x10])) [2 S16 A128])) 632 {altivec_lvx_v4sf} (nil) (nil)))

(insn:HI 31 30 32 2 (set (reg:V4SF 121 [ vect_var_.50 ])
        (smin:V4SF (reg:V4SF 126 [ vect_var_.39 ]) (reg:V4SF 142))) 706 {sminv4sf3} (nil))))

(insn:HI 33 32 34 2 (set (reg:V4SF 143)
        (mem:V4SF (plus:SI (reg/f:SI 127 [ ivtmp.37 ]) (const_int 16 [0x10])) [2 S16 A128])) 632 {altivec_lvx_v4sf} (nil)) (nil))

(insn:HI 34 33 35 2 (set (reg:V4SF 119 [ vect_var_.53 ])
        (smin:V4SF (reg:V4SF 121 [ vect_var_.50 ]) (reg:V4SF 143))) 706 {sminv4sf3} (nil))))

vect-min.c.153r.sched2

(insn:TI 82 84 89 2 (set (reg:V4SF 77 0 [138])
        (mem/u/c/i:V4SF (reg/f:SI 9 9 [139]) [2 S16 A128])) 632 {altivec_lvx_v4sf} (nil) (nil))))

(insn 89 82 83 2 (set (reg:SI 9 9)
        (plus:SI (reg/f:SI 1 1) (const_int 16 [0x10]))) 79 {*addsi3_internal1} (nil) (nil))

(insn:TI 83 89 90 2 (set (reg:V4SF 78 1 [141])
        (mem:V4SF (reg:SI 9 9) [2 S16 A128])){altivec_lvx_v4sf} ))

(insn 90 83 92 2 (set (reg:SI 9 9)
        (plus:SI (reg/f:SI 29 29 [orig:127 ivtmp.37 ] [127]) (const_int 16 [0x10]))) 79 {*addsi3_internal1} (nil) (nil))

(insn 92 90 28 2 (set (reg:SI 29 29)
        (plus:SI (reg/f:SI 29 29 [orig:127 ivtmp.37 ] [127]) (const_int 32 [0x20]))) 79 {*addsi3_internal1} (nil) (nil))

(insn:TI 28 92 33 2 (set (reg:V4SF 77 0 [orig:126 vect_var_.39 ] [126])
        (smin:V4SF (reg:V4SF 77 0 [138]) (reg:V4SF 78 1 [141]))) 706 {sminv4sf3} (nil) (nil)))

(insn 33 28 35 2 (set (reg:V4SF 78 1 [143])
        (mem:V4SF (reg:SI 9 9) [2 S16 A128])){altivec_lvx_v4sf}

(insn 35 33 93 2 (set (reg:V4SF 89 12 [144])
        (mem:V4SF (reg:SI 29 29) [2 S16 A128])) {altivec_lvx_v4sf}))


vect-min.s

main1:
  stwu 1,-128(1)
  lis 4,.LANCHOR0@ha
  mflr 0
  la 4,.LANCHOR0@l(4)
  li 5,64
  stw 29,116(1)
  stw 0,132(1)
  addi 29,1,16
  mr 3,29
  bl memcpy
  addi 9,29,16
  addi 29,29,16
  lvx 13,0,9
  lis 9,.LC0@ha
  la 9,.LC0@l(9)
  lvx 0,0,9
  addi 9,1,16
  lvx 1,0,9
  addi 9,29,16
  addi 29,29,32
  vminfp 0,0,1
  lvx 1,0,9
  lvx 12,0,29
  addi 9,1,108
  vminfp 0,0,13
  vminfp 0,0,1
  vminfp 0,0,12
  vsldoi 13,0,0,8
  vminfp 0,0,13
  vsldoi 1,0,0,12
  vminfp 1,1,0
  stvewx 1,0,9
  lis 9,.LC1@ha
  lfs 13,108(1)
  lfs 0,.LC1@l(9)
  fcmpu 7,13,0
  bne- 7,.L7
  lwz 0,132(1)
  lwz 29,116(1)
  li 3,0
  addi 1,1,128
  mtlr 0
  blr


Using the Vectorizer – Programming Hints

Don't unroll the loop. Instead of:

  for (i=0; i<N; i+=4){
    a[i] = x;
    a[i+1] = x;
    a[i+2] = x;
    a[i+3] = x;
  }

write:

  for (i=0; i<N; i++)
    a[i] = x;

Use countable loops, with no side-effects
  No function-calls in the loop (distribute into a separate loop)
  No 'break'/'continue'

Avoid aliasing problems
  Use __restrict__ qualified pointers:

  foo (float * __restrict__ p, float * __restrict__ q)

Keep the memory access-pattern simple
  Don't use array of structures, e.g.:

  struct {int f1; int f2;} a[N];
  for (i=0; i<N; i++)
    a[i].f1 = x;

  Use separate arrays instead:

  int af1[N], af2[N];
  for (i=0; i<N; i++)
    af1[i] = x;

  Use constant increment, i.e., don't use the following:

  for (i=0; i<N; i+=incr)
    a[i] = x;

Alignment
  Use alignment attributes
  If a loop has more than a single misaligned store, distribute it into separate loops (currently the vectorizer peels the loop to align a misaligned store).
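Several of these hints can be combined in one small function. The following is a sketch only: the function and array names are made up, the 16-byte alignment is an assumption that matches a 128-bit vector unit, and whether a given compiler version actually vectorizes the loop has to be checked with its vectorizer report.

```c
#include <assert.h>
#include <stddef.h>

#define N 64

/* GCC-specific attribute: request 16-byte alignment for both arrays. */
static float src[N] __attribute__((aligned(16)));
static float dst[N] __attribute__((aligned(16)));

/* __restrict__ promises that p and q never overlap, so the compiler
   does not have to assume a dependence between the load and the store.
   The loop is countable, has unit stride, no calls, no break/continue. */
static void scale_add(float *__restrict__ p, const float *__restrict__ q,
                      float x, size_t n)
{
    assert(p != NULL && q != NULL);
    for (size_t i = 0; i < n; i++)
        p[i] = q[i] * x + 1.0f;
}
```

A call such as scale_add (dst, src, 2.0f, N) keeps the promise made by __restrict__, because the two arrays are distinct objects.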


Using the Vectorizer – Tuning Hints

-ffast-math if operating on floats in a reduction computation (to allow the vectorizer to change the order of the computation):

  float *b, *c, diff, min, max;
  for (i = 0; i < N; i++) {
    diff += (b[i] - c[i]);
  }
  for (i = 0; i < N; i++) {
    max = max < c[i] ? c[i] : max;
  }
  for (i = 0; i < N; i++) {
    min = min > c[i] ? c[i] : min;
  }

-fwrapv if operating on signed subword integers (to avoid casts to int that currently confuse the vectorizer):

  signed char *b, *c, diff;
  for (i = 0; i < N; i++) {
    diff += (signed char)(b[i] - c[i]);
  }

--param min-vect-loop-bound=[X] if you have loops with a short trip-count

-fno-tree-vect-loop-version if worried about code size. Loop versioning turns

  for (i=0; i<N; i++){
    p[i] = q[i];
  }

into:

  if (q is aligned) {
    for (i=0; i<N; i++){
      x = q[i]; // q is aligned
      p[i] = x;
    }
  } else {
    for (i=0; i<N; i++){
      x = q[i]; // q's alignment unknown
      p[i] = x;
    }
  }

-funroll-loops -fvariable-expansion-in-unroller --param max-variable-expansions-in-unroller=[X] for improved scheduling of summation (breaking the accumulation into X+1 accumulators to increase ILP).
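The effect of variable expansion on a summation can be written out by hand. The sketch below shows the shape of the transformed loop for X = 1, that is, X+1 = 2 accumulators; it is a source-level illustration with a hypothetical function name, not the output of the unroller.

```c
#include <assert.h>
#include <stddef.h>

/* Summation with two independent accumulators. The two additions in
   the loop body do not depend on each other, so they can be scheduled
   in parallel; a single accumulator would serialize every addition. */
static float sum_two_accumulators(const float *a, size_t n)
{
    assert(a != NULL);
    float acc0 = 0.0f, acc1 = 0.0f;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        acc0 += a[i];
        acc1 += a[i + 1];
    }
    if (i < n)              /* leftover element when n is odd */
        acc0 += a[i];

    return acc0 + acc1;     /* combine the partial sums */
}
```

Because the additions are reordered, the result for floats can differ from the single-accumulator sum, which is the same reason -ffast-math is needed for float reductions.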


More information

Vectorizer:
  http://gcc.gnu.org/projects/tree-ssa/vectorization.html
  http://gcc.gnu.org/wiki/VectorizationTasks
  Summit papers:
    http://www.gccsummit.org/2006/2006-GCC-Summit-Proceedings.pdf
    ftp://gcc.gnu.org/pub/gcc/summit/2004/Autovectorization.pdf

General:
  http://gcc.gnu.org/onlinedocs/gccint/
  http://gcc.gnu.org/wiki
  Summit papers

Happy Hacking!


The End


for (i = 0; i < n; i++) {

sum += ((int) in[i] * (int) in[i+off]) >> scale;

}


Non-consecutive access patterns

A[i], i={0,5,10,15,…}; access_fn(i) = (0,+,5)

Data in Memory: a b c d e f g h i j k l m n o p

Scalar operations: OP(a), OP(f), OP(k), OP(p)

Vector Registers (elements 0 1 2 3):
  VR1 = a b c d
  VR2 = e f g h
  VR3 = i j k l
  VR4 = m n o p
  VR5 = a f k p

Vector operation: VOP( a, f, k, p ) on VR5


Basic unpacking and packing operations for strided access

Use two pairs of inverse operations widely supported on SIMD platforms:

extract_even, extract_odd:

interleave_high, interleave_low:

Use them recursively to support strided accesses with power-of-2 strides

Support several data types
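The two pairs of operations can be modeled on small arrays to show that they are inverses of each other. This is a sketch under assumptions: VF = 4 ints per vector, and "low"/"high" taken to mean the first and second half of each input, which on real hardware depends on the target's element ordering. The function names mirror the slide but are plain C helpers, not GCC tree codes.

```c
#include <assert.h>

#define VF 4   /* elements per modeled vector (assumption) */

/* Concatenate x and y, then keep every second element from `first`. */
static void extract_stride2(const int *x, const int *y, int first, int *out)
{
    int cat[2 * VF];
    assert(first == 0 || first == 1);
    for (int i = 0; i < VF; i++) {
        cat[i] = x[i];
        cat[VF + i] = y[i];
    }
    for (int i = 0; i < VF; i++)
        out[i] = cat[2 * i + first];
}

static void extract_even(const int *x, const int *y, int *out)
{
    extract_stride2(x, y, 0, out);
}

static void extract_odd(const int *x, const int *y, int *out)
{
    extract_stride2(x, y, 1, out);
}

/* Alternate elements of x and y taken from the half that begins at
   `start`. out must not alias x or y. */
static void interleave_half(const int *x, const int *y, int start, int *out)
{
    for (int i = 0; i < VF / 2; i++) {
        out[2 * i] = x[start + i];
        out[2 * i + 1] = y[start + i];
    }
}

static void interleave_low(const int *x, const int *y, int *out)
{
    interleave_half(x, y, 0, out);
}

static void interleave_high(const int *x, const int *y, int *out)
{
    interleave_half(x, y, VF / 2, out);
}
```

Interleaving two vectors and then extracting the even and odd elements of the result returns the original vectors. Applying extract_even/extract_odd again to their own outputs separates stride-4 streams, and so on for larger power-of-2 strides.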


S1: a = x [8*i]
S2: b = x [8*i+1]
S3: c = x [8*i+2]
S4: d = x [8*i+3]
S5: e = x [8*i+4]
S6: f = x [8*i+5]
S7: g = x [8*i+6]
S8: h = x [8*i+7]
S9: y [2*i] = k = f (a,…,h)
S10: y [2*i+1] = l = g (a,…,h)

δ=8, VF=4

load δ*VF elements

generate δ*log δ extracts (odd/even)

[Figure: the 32 loaded elements 0..31 (fields a b c d e f g h of x, positions 0..7; fields k l of y, positions 0..1) pass through successive extract_even/extract_odd stages until each vector holds the elements of a single stride-8 stream, e.g. 0 8 16 24 and 1 9 17 25.]


Strided Accesses (Interleaved Data)

Very common in real world computations:
  Complex data
  rgba images (alpha blend)
  multi-channel audio streams (down mix)

Viterbi decoder: 5x improvement on entire benchmark

PLDI 2006


Mixed data types

short b[N];
int a[N];
for (i=0; i<N; i++)
  a[i] = (int) b[i];

Unpack


Multiple Data-Types & Type Conversions

VF = 16

Scalar statements (units per vector):
  S1: x_int = memref            4
  S2: z_int = x_int + 1         4
  S3: y_char = (char) z_int     16
  ….

Vectorized statements ("unroll" each statement by VF/units):
  V1 = {1, 1, 1, 1}

  VS1.0: vx0 = memref0
  VS1.1: vx1 = memref1
  VS1.2: vx2 = memref2
  VS1.3: vx3 = memref3

  VS2.0: vz0 = vx0 + V1
  VS2.1: vz1 = vx1 + V1
  VS2.2: vz2 = vx2 + V1
  VS2.3: vz3 = vx3 + V1

  VS3.0: vy0 = vpack (vz0, vz1)
  VS3.1: vy1 = vpack (vz2, vz3)
  VS3: vy = vpack (vy0, vy1)
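The scheme can be traced in scalar C for one group of VF = 16 elements. This is an illustrative model, not generated code: the function name is hypothetical, 16-byte vectors are assumed (4 ints or 16 chars each), and the two-level vpack is modeled as a plain narrowing copy that preserves element order.

```c
#include <assert.h>
#include <stddef.h>

#define VF        16   /* chars per 16-byte vector */
#define INT_UNITS  4   /* ints per 16-byte vector */

/* Computes y[i] = (char) (x[i] + 1) for one group of VF elements. */
static void add_one_and_pack(const int *x, signed char *y)
{
    assert(x != NULL && y != NULL);
    int vz[VF / INT_UNITS][INT_UNITS];    /* models vz0..vz3 */

    /* The int statements are replicated VF/units = 4 times. */
    for (int v = 0; v < VF / INT_UNITS; v++)
        for (int l = 0; l < INT_UNITS; l++)
            vz[v][l] = x[v * INT_UNITS + l] + 1;    /* VS1.v, VS2.v */

    /* Packing: four int vectors narrow into one char vector. */
    for (int v = 0; v < VF / INT_UNITS; v++)
        for (int l = 0; l < INT_UNITS; l++)
            y[v * INT_UNITS + l] = (signed char) vz[v][l];
}
```

The point of the layout is that one char vector covers the same 16 loop iterations as four int vectors, which is why the int statements appear four times per vectorized iteration.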


Multiple Data-Types & Type Conversions

Very common in multimedia computations:
  Video: unsigned chars to shorts
  Audio: signed shorts to ints
  Filters, autocorrelation, dot product, alpha-blending…

Autocorrelation: 6x improvement on benchmark

for (i = 0; i < n; i++) {
  acc += ((int) short_in1[i] * (int) short_in2[i+lag]) >> Scale;
}
