
IBM Labs in Haifa


GCC Tutorial – The compilation flow of the auto-vectorizer

Dorit Nuzman


Haifa IBM Labs

2nd HiPEAC GCC Tutorial

Ghent, Belgium, January 2007


What is vectorization

[Figure: data in memory (a b c d e f g h i j k l m n o p) is loaded into vector registers VR1..VR5; VR1 holds the elements a b c d in lanes 0 1 2 3. The four scalar operations OP(a), OP(b), OP(c), OP(d) become one vector operation VOP( a, b, c, d ) on VR1.]

- Data elements packed into vectors
- Vector length: Vectorization Factor (VF); here VF = 4

original serial loop:

    for (i=0; i<N; i++) {
      a[i] = a[i] + b[i];
    }

loop in vector notation (after vectorization):

    for (i=0; i<N; i+=VF) {
      a[i:i+VF-1] = a[i:i+VF-1] + b[i:i+VF-1];
    }


GCC Passes

[Figure: GCC compilation flow. The front-ends (C, C++, Fortran, Ada, …) produce parse trees; the middle-end works on GIMPLE trees; the back-end works on RTL and uses the machine description of each port (i386, rs6000, mips, …) to emit assembly.]

loop analyses and optimizations:

- data-dependence
- scalar-evolution
- number of iters
- invariant motion
- iv-canon/optimize
- linear transform
- unswitching
- if-conversion
- unrolling
- vectorization

Questions the vectorizer asks about the loop:

- loop form ok?
- any data-deps?
- scalar-cycles?
- aliasing?
- access-patterns?

original serial loop:

    for (i=0; i<N; i++) {
      a[i] = a[i] + b[i];
    }

loop in vector notation:

    for (i=0; i<N; i+=VF) {
      a[i:i+VF-1] = a[i:i+VF-1] + b[i:i+VF-1];
    }

Questions the vectorizer asks about the target:

- vector size?
- supportable?
- alignment?
- data shuffle?
- cost?

Why study the vectorizer?

- middle-end & back-end aspects

- performance impact potential

- there’s a lot to do…


Talk Layout

- What is vectorization
- Back-end aspects
  - Machine-description and operation tables
  - Querying target-support in vectorizer
  - Enabling vectorization for a new port
- Tree-level aspects
  - Adding a tree-optimization pass
  - Vectorization analyses and transformation
- Detailed example: Reduction
  - Adding a new idiom
  - Compilation flow example
  - Two advanced cases
- Using the vectorizer
  - Programming and tuning hints


GCC Backend – machine-description files and operation tables

A GCC “port”: Target specific files

gcc/gcc/config/<myport>/
– for example: i386, ia64, rs6000, spu…

target-specific compiler options: <target>.opt
– command-line options of GCC specific to the target
– for example: -maltivec, -msse2, -mtune=power4, -minsert-sched-nops=

target-specific definitions: <target>.h
– basic parameters and features
– for example:

    #define POINTER_SIZE (TARGET_32BIT ? 32 : 64)
    #define BYTES_BIG_ENDIAN 1
    #define FIXED_REGISTERS \
      {0, 1, FIXED_R2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, FIXED_R13, 0, 0, \
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
       ….
    #define CALL_USED_REGISTERS \
      {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, FIXED_R13, 0, 0, \
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
       ...

target-specific support functions: <target>.c
– target predicates, code generation functions, target variants

machine description: <target>.md
– definition of RTL instructions and their translations to assembly
– content of machine description determines which features (operations, modes) are available


machine-description file

alpha/alpha.md

(define_insn "sminqi3"

[(set (match_operand:QI 0 "register_operand" "=r")

(smin:QI (match_operand:QI 1 "reg_or_0_operand" "%rJ")

(match_operand:QI 2 "reg_or_8bit_operand" "rI")))]

"TARGET_MAX"

"minsb8 %r1,%2,%0"

[(set_attr "type" "mvi")])

(define_insn "sminv8qi3"

[(set (match_operand:V8QI 0 "register_operand" "=r")

(smin:V8QI (match_operand:V8QI 1 "reg_or_0_operand" "rW")

(match_operand:V8QI 2 "reg_or_0_operand" "rW")))]

"TARGET_MAX"

"minsb8 %r1,%r2,%0"


RTL operations: rtl.def

DEF_RTL_EXPR(SMIN, "smin", "ee", RTX_COMM_ARITH)

DEF_RTL_EXPR(SMAX, "smax", "ee", RTX_COMM_ARITH)

DEF_RTL_EXPR(UMIN, "umin", "ee", RTX_COMM_ARITH)

DEF_RTL_EXPR(UMAX, "umax", "ee", RTX_COMM_ARITH)

gcc/gcc: rtl.def

gcc/gcc/config/<port>: <target>.opt, <target>.h, <target>.c, <target>.md

http://gcc.gnu.org/onlinedocs/gccint/


- machine-modes: qi, hi, si, di, sf, df

- vector machine-modes:
  alpha: v8qi, v4hi
  altivec: v16qi, v8hi, v4si

- constraints

- conditions

- attributes

- assembly

- scalar and vector operations differ only in operand modes


rs6000/rs6000.md

(define_expand "sminsi3"

[(set (match_dup 3)

(if_then_else:SI (gt:SI (match_operand:SI 1 "gpc_reg_operand" "")

(match_operand:SI 2 "reg_or_short_operand" ""))

(const_int 0)

(minus:SI (match_dup 2) (match_dup 1))))

(set (match_operand:SI 0 "gpc_reg_operand" "")

(minus:SI (match_dup 2) (match_dup 3)))]

"TARGET_POWER || TARGET_ISEL"

"{

if (TARGET_ISEL) {

operands[2] = force_reg (SImode, operands[2]);

rs6000_emit_minmax (operands[0], SMIN, operands[1], operands[2]);

DONE;

}

operands[3] = gen_reg_rtx (SImode);

}")

RTL operations: rtl.def

DEF_RTL_EXPR(IF_THEN_ELSE, "if_then_else", "eee", RTX_TERNARY)

DEF_RTL_EXPR(GT, "gt", "ee", RTX_COMPARE)

DEF_RTL_EXPR(MINUS, "minus", "ee", RTX_BIN_ARITH)

rs6000/rs6000.c


When the same pattern applies to multiple modes: use mode macros to generate an entire family of patterns.

rs6000/altivec.md

    ;; Vec int modes
    (define_mode_macro VI [V4SI V8HI V16QI])

    (define_insn "smin<mode>3"
      [(set (match_operand:VI 0 "register_operand" "=v")
            (smin:VI (match_operand:VI 1 "register_operand" "v")
                     (match_operand:VI 2 "register_operand" "v")))]
      "TARGET_ALTIVEC"
      "vmins<VI_char> %0,%1,%2"
      [(set_attr "type" "vecsimple")])

    (define_insn "sminv4sf3"
      [(set (match_operand:V4SF 0 "register_operand" "=v")
            (smin:V4SF (match_operand:V4SF 1 "register_operand" "v")
                       (match_operand:V4SF 2 "register_operand" "v")))]
      "TARGET_ALTIVEC"
      "vminfp %0,%1,%2"
      [(set_attr "type" "veccmp")])


optabs.c,h

    optab/type    qi    hi    si                  v4si   v2si                …
    smin_optab    700   701   CODE_FOR_nothing    753    CODE_FOR_nothing    …
    umin_optab    702   703   CODE_FOR_nothing    754    CODE_FOR_nothing    …

build/gcc/insn-emit.c

    rtx
    gen_sminv4si3 (rtx operand0 ATTRIBUTE_UNUSED,
                   rtx operand1 ATTRIBUTE_UNUSED,
                   rtx operand2 ATTRIBUTE_UNUSED)
    {
      return gen_rtx_SET (VOIDmode,
                          operand0,
                          gen_rtx_SMIN (V4SImode, operand1, operand2));
    }

build/gcc/insn-output.c

    { "sminv4si3",
      {
        "vminsw %0,%1,%2", 0, 0 },
      (insn_gen_fn) gen_sminv4si3,
      &operand_data[1427],
      3, 0, 1, 1 }

- tables of RTL operations sharing common semantics, but differing in operand size and/or structure

- no type information available anymore



Querying the backend for target support in the vectorizer

Statement being vectorized:

    min_27 = MIN_EXPR <tmp_26, min_50>;

Query in the vectorizer:

    optab = optab_for_tree_code (code, vectype);
    vec_mode = TYPE_MODE (vectype);
    icode = (int) optab->handlers[(int) vec_mode].insn_code;
    if (icode == CODE_FOR_nothing)
      {
        if (vect_print_dump_info (REPORT_DETAILS))
          fprintf (vect_dump, "operation not supported by target.");
        return false;
      }

[Figure: optab table with columns qi, hi, si, v8qi, v4hi, v2si; the visible entries are 752, 753, CODE_FOR_nothing in one row and 754, 755, CODE_FOR_nothing in the next. For a "vector int" vectype the lookup selects smin_optab and the v2si column.]
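The lookup above can be sketched in plain C as a two-dimensional table indexed by operation and machine mode. This is only an illustration of the idea, not GCC code: the enums, the `toy_optab_table` contents, the instruction numbers, and the helper names are all hypothetical; only the `CODE_FOR_nothing` convention (spelled `CODE_FOR_NOTHING` here) mirrors the slide.

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins for GCC's machine modes. */
enum toy_mode { MODE_QI, MODE_HI, MODE_SI, MODE_V8QI, MODE_V4HI, MODE_V2SI, NUM_MODES };

/* Hypothetical operation codes: one optab row per operation. */
enum toy_op { OP_SMIN, OP_UMIN, NUM_OPS };

/* Sentinel meaning "the target has no instruction for this op/mode pair". */
#define CODE_FOR_NOTHING (-1)

/* Rows are operations, columns are modes.  The numbers are invented
   instruction codes, not taken from any real port. */
static const int toy_optab_table[NUM_OPS][NUM_MODES] = {
  [OP_SMIN] = { 10, 11, CODE_FOR_NOTHING, 20, 21, CODE_FOR_NOTHING },
  [OP_UMIN] = { 12, 13, CODE_FOR_NOTHING, 22, 23, CODE_FOR_NOTHING },
};

/* Pick the table row from the signedness of the type: at this level
   the operation itself carries what used to be type information. */
static enum toy_op
toy_op_for_min (int is_unsigned)
{
  return is_unsigned ? OP_UMIN : OP_SMIN;
}

/* Return nonzero if the (hypothetical) target supports OP in MODE. */
static int
toy_target_supports (enum toy_op op, enum toy_mode mode)
{
  assert (op < NUM_OPS && mode < NUM_MODES);
  return toy_optab_table[op][mode] != CODE_FOR_NOTHING;
}
```

With this table, a signed minimum on `MODE_V8QI` is reported as supported, while any minimum on `MODE_V2SI` is not, which is the point at which the vectorizer would give up on the statement.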


Enabling vectorization for a new port

Basic features:

<target>.md
- distinction between scalar and vector ops: operand modes
- availability of vector ops: deduced from MD file

<target>.h
- specify supported vector length in bytes:

    #define UNITS_PER_SIMD_WORD 16

<target>-modes.def
- specify supported vector modes:

    /* Vector modes. */
    VECTOR_MODES (INT, 8);        /* V8QI V4HI V2SI */
    VECTOR_MODES (INT, 16);       /* V16QI V8HI V4SI V2DI */
    VECTOR_MODE (INT, DI, 1);
    VECTOR_MODES (FLOAT, 8);      /* V4HF V2SF */
    VECTOR_MODES (FLOAT, 16);     /* V8HF V4SF V2DF */
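The relation between the vector length in bytes and the number of elements per vector can be written out as a small sketch. It assumes the vectorization factor is simply the vector size divided by the element size, which is what the mode names for 16-byte vectors suggest (V16QI, V8HI, V4SI, V2DI); the helper name `vectorization_factor` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Vector length in bytes, as in the <target>.h example. */
#define UNITS_PER_SIMD_WORD 16

/* Hypothetical helper: number of elements of ELEM_SIZE bytes that fit
   in one vector, assuming the element size divides the vector size. */
static size_t
vectorization_factor (size_t elem_size)
{
  assert (elem_size > 0 && UNITS_PER_SIMD_WORD % elem_size == 0);
  return UNITS_PER_SIMD_WORD / elem_size;
}
```

For element sizes of 1, 2, 4 and 8 bytes this gives 16, 8, 4 and 2 elements, matching the element counts in the 16-byte integer mode names.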



Advanced features:

Special idioms:

generic vector operations: look over list of idioms in optabs.h

    #define reduc_smax_optab (optab_table[OTI_reduc_smax])
    #define reduc_umax_optab (optab_table[OTI_reduc_umax])
    #define reduc_smin_optab (optab_table[OTI_reduc_smin])
    #define reduc_umin_optab (optab_table[OTI_reduc_umin])
    #define reduc_splus_optab (optab_table[OTI_reduc_splus])
    #define reduc_uplus_optab (optab_table[OTI_reduc_uplus])

    #define ssum_widen_optab (optab_table[OTI_ssum_widen])
    #define usum_widen_optab (optab_table[OTI_usum_widen])
    #define sdot_prod_optab (optab_table[OTI_sdot_prod])
    #define udot_prod_optab (optab_table[OTI_udot_prod])

    #define vec_set_optab (optab_table[OTI_vec_set])
    #define vec_extract_optab (optab_table[OTI_vec_extract])
    #define vec_extract_even_optab (optab_table[OTI_vec_extract_even])
    #define vec_extract_odd_optab (optab_table[OTI_vec_extract_odd])
    #define vec_interleave_high_optab (optab_table[OTI_vec_interleave_high])
    #define vec_interleave_low_optab (optab_table[OTI_vec_interleave_low])
    #define vec_init_optab (optab_table[OTI_vec_init])
    #define vec_shl_optab (optab_table[OTI_vec_shl])
    #define vec_shr_optab (optab_table[OTI_vec_shr])
    #define vec_realign_load_optab (optab_table[OTI_vec_realign_load])
    #define vec_widen_umult_hi_optab (optab_table[OTI_vec_widen_umult_hi])
    #define vec_widen_umult_lo_optab (optab_table[OTI_vec_widen_umult_lo])
    #define vec_widen_smult_hi_optab (optab_table[OTI_vec_widen_smult_hi])
    #define vec_widen_smult_lo_optab (optab_table[OTI_vec_widen_smult_lo])
    #define vec_unpacks_hi_optab (optab_table[OTI_vec_unpacks_hi])
    #define vec_unpacku_hi_optab (optab_table[OTI_vec_unpacku_hi])
    #define vec_unpacks_lo_optab (optab_table[OTI_vec_unpacks_lo])
    #define vec_unpacku_lo_optab (optab_table[OTI_vec_unpacku_lo])
    #define vec_pack_mod_optab (optab_table[OTI_vec_pack_mod])
    #define vec_pack_ssat_optab (optab_table[OTI_vec_pack_ssat])
    #define vec_pack_usat_optab (optab_table[OTI_vec_pack_usat])

specialized vector operations: look over target.h

    /* Functions relating to vectorization. */
    struct vectorize
    {
      tree (* builtin_mask_for_load) (void);
      tree (* builtin_vectorized_function) (unsigned, tree);
      tree (* builtin_mul_widen_even) (tree);
      tree (* builtin_mul_widen_odd) (tree);
    } vectorize;


gcc/gcc: rtl.def, target.h, optabs.h



Enable the vectorizer testcases

- testcases are in gcc/gcc/testsuite/gcc.dg/vect
- additional target-specific testcases: testsuite/gcc.target/i386/vect1.c
- Add where relevant in testsuite/lib/target-supports.exp
- vect.exp: add logic to decide whether to compile/run and with which target-specific options:

    if [istarget "powerpc*-*-*"] {
        …
        }
    } elseif { [istarget "spu-*-*"] } {
        set dg-do-what-default run
    } elseif { [istarget "i?86-*-*"] || [istarget "x86_64-*-*"] } {
        lappend DEFAULT_VECTCFLAGS "-msse2"

    } elseif { [istarget "mipsisa64*-*-*"]
               && [check_effective_target_mpaired_single] } {
        lappend DEFAULT_VECTCFLAGS "-mpaired-single"

    } elseif [istarget "sparc*-*-*"] {
        …
    } elseif [istarget "alpha*-*-*"] {
        lappend DEFAULT_VECTCFLAGS "-mmax"
        if [check_alpha_max_hw_available] {

        } else {
            set dg-do-what-default compile
        }
    } elseif [istarget "ia64-*-*"] {

    } else {
        return


Add where relevant in:testsuite/lib/target-supports.exp:

Enabling vectorization for a new port: Enable the vectorizer testcases

proc check_effective_target_vect_int

check_effective_target_vect_shift

check_effective_target_vect_long

proc check_effective_target_vect_float

proc check_effective_target_vect_double { } {
    global et_vect_double_saved

    if [info exists et_vect_double_saved] {
        verbose "using cached result" 2
    } else {
        set et_vect_double_saved 0
        if { [istarget i?86-*-*]
             || [istarget x86_64-*-*]
             || [istarget spu-*-*] } {
            set et_vect_double_saved 1
        }
    }
    return $et_vect_double_saved
}

check_effective_target_vect_no_int_max

check_effective_target_vect_no_int_add

check_effective_target_vect_sdot_hi

check_effective_target_vect_udot_hi

check_effective_target_vect_sdot_si

check_effective_target_vect_udot_si

….


A tree-level pass

New files in gcc/gcc:

- tree-vectorizer.c
- tree-vect-analyze.c
- tree-vect-transform.c
- tree-vect-patterns.c
- tree-vectorizer.h

tree-flow.h – prototype for pass function:

    unsigned vectorize_loops (void);

gcc/Makefile.in entries

The pass is invoked for each function:

    unsigned vectorize_loops (void)
    {
      unsigned int i;
      unsigned int num_vectorized_loops = 0;
      unsigned int vect_loops_num;
      loop_iterator li;
      struct loop *loop;
      …
      vect_loops_num = number_of_loops ();
      FOR_EACH_LOOP (li, loop, LI_ONLY_OLD)
        {
          loop_vec_info loop_vinfo;

          vect_loop_location = find_loop_location (loop);
          loop_vinfo = vect_analyze_loop (loop);
          loop->aux = loop_vinfo;
          if (!loop_vinfo || !LOOP_VINFO_VECTORIZABLE_P (loop_vinfo))
            continue;
          vect_transform_loop (loop_vinfo);
          num_vectorized_loops++;
        }
      if (vect_print_dump_info (REPORT_VECTORIZED_LOOPS))
        fprintf (vect_dump, "vectorized %u loops in function.\n",
                 num_vectorized_loops);
      …
    }


A tree-level pass

add the pass to the pass hierarchy in passes.c:

    …
    NEXT_PASS (pass_split_crit_edges);
    NEXT_PASS (pass_pre);
    NEXT_PASS (pass_may_alias);
    NEXT_PASS (pass_sink_code);
    NEXT_PASS (pass_tree_loop);
    NEXT_PASS (pass_cse_reciprocals);
    NEXT_PASS (pass_reassoc);
    NEXT_PASS (pass_vrp);
    NEXT_PASS (pass_dominator);

    p = &pass_tree_loop.sub;
    NEXT_PASS (pass_tree_loop_init);
    NEXT_PASS (pass_copy_prop);
    NEXT_PASS (pass_lim);
    NEXT_PASS (pass_tree_unswitch);
    NEXT_PASS (pass_scev_cprop);
    NEXT_PASS (pass_empty_loop);
    NEXT_PASS (pass_record_bounds);
    NEXT_PASS (pass_linear_transform);
    NEXT_PASS (pass_iv_canon);
    NEXT_PASS (pass_if_conversion);
    NEXT_PASS (pass_vectorize);
    NEXT_PASS (pass_complete_unroll);
    NEXT_PASS (pass_loop_prefetch);
    NEXT_PASS (pass_iv_optimize);
    NEXT_PASS (pass_tree_loop_done);
    *p = NULL;

    p = &pass_vectorize.sub;
    NEXT_PASS (pass_lower_vector_ssa);
    NEXT_PASS (pass_dce_loop);
    *p = NULL;

in tree-pass.h – prototype for pass structure:

    extern struct tree_opt_pass pass_vectorize;

pass-structure definition in tree-ssa-loop.c


A tree-level pass

• pass structure definition:

    struct tree_opt_pass pass_vectorize =
    {
      "vect",                       /* name */
      gate_tree_vectorize,          /* gate */
      tree_vectorize,               /* execute */
      NULL,                         /* sub */
      NULL,                         /* next */
      0,                            /* static_pass_number */
      TV_TREE_VECTORIZATION,        /* tv_id */
      PROP_cfg | PROP_ssa,          /* properties_required */
      0,                            /* properties_provided */
      0,                            /* properties_destroyed */
      TODO_verify_loops,            /* todo_flags_start */
      TODO_dump_func
        | TODO_update_ssa,          /* todo_flags_finish */
      0                             /* letter */
    };

• timevar.def: variable used for timing and for identification in timing reports:

    DEFTIMEVAR (TV_TREE_VECTORIZATION , "tree vectorization")

• gate function:

    static bool
    gate_tree_vectorize (void)
    {
      return flag_tree_vectorize && current_loops;
    }

• execute function:

    static unsigned int
    tree_vectorize (void)
    {
      return vectorize_loops ();
    }

• common.opt: Add command line option:

    ftree-vectorize
    Common Report Var(flag_tree_vectorize)
    Enable loop vectorization on trees
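The gate/execute split of the pass structure can be illustrated with a toy pass manager. This is a minimal sketch under simplifying assumptions, not GCC's pass manager: `struct toy_pass`, `toy_run_passes`, `toy_flag_vectorize` and the other `toy_` names are hypothetical, and the real structure carries many more fields (sub-passes, properties, TODO flags).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down pass descriptor: a name, a gate that
   decides whether the pass runs, and the function that does the work. */
struct toy_pass
{
  const char *name;
  bool (*gate) (void);
  unsigned (*execute) (void);
};

/* Stands in for the variable set by a command-line option. */
static bool toy_flag_vectorize = false;

/* Counts how often the toy pass body ran. */
static unsigned toy_executions = 0;

static bool
toy_gate_vectorize (void)
{
  return toy_flag_vectorize;
}

static unsigned
toy_execute_vectorize (void)
{
  toy_executions++;
  return 0;
}

static const struct toy_pass toy_pass_vectorize =
  { "vect", toy_gate_vectorize, toy_execute_vectorize };

/* Run every pass whose gate is absent or returns true; return how many ran. */
static unsigned
toy_run_passes (const struct toy_pass *const *passes, size_t n)
{
  unsigned ran = 0;
  for (size_t i = 0; i < n; i++)
    if (passes[i]->gate == NULL || passes[i]->gate ())
      {
        passes[i]->execute ();
        ran++;
      }
  return ran;
}
```

With `toy_flag_vectorize` left at `false` the pass list runs nothing; setting it to `true` (the analogue of passing the option on the command line) lets the pass body execute once per invocation.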


A tree-level pass

invoke.texi: Document the pass for the GCC manual:

    @item -ftree-vectorize
    Perform loop vectorization on trees.

    @item vect
    @opindex fdump-tree-vect
    Dump each function after applying vectorization of loops. The file name is
    made by appending @file{.vect} to the source file name.

Example invocations:

    gcc -O2 -ftree-vectorize example.c
    gcc -O2 -ftree-vectorize -maltivec example.c
    gcc -O2 -ftree-vectorize -msse2 example.c
    gcc -O2 -ftree-vectorize -maltivec -fdump-tree-vect example.c
    gcc -O2 -ftree-vectorize -maltivec -fdump-tree-vect-details example.c
    gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=2 example.c
    gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=7 -fdump-tree-vect example.c

Files involved:

1. [tree-vect*.c]

2. tree-flow.h

3. Makefile.in

4. [tree-ssa-loop.c]

5. timevar.def

6. common.opt

7. invoke.texi


Example: vectorizer dump reports

int main1 (short *in, int off, short scale, int n)

{

int i, sum = 0;

for (i = 0; i < n; i++) {

sum += ((int) in[i] * (int) in[i+off]) >> scale;

}

return sum;

}

Kernel: autocorrelation

Speedups:
- powerpc970 – 5-6x
- Cell SPU – 4-5x

[dorit@mac-ira vect]$ gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=5 vect-widen-mult-sum.c

vect-widen-mult-sum.c:16: note: Vectorizing an unaligned access.

vect-widen-mult-sum.c:16: note: Vectorizing an unaligned access.

vect-widen-mult-sum.c:16: note: LOOP VECTORIZED.

vect-widen-mult-sum.c:12: note: vectorized 1 loops in function.


Auto-vectorization Skeleton

tree-vect-analyze.c:

    vect_analyze_loop (loop) {
      if (!1_analyze_loop_form (loop)) FAIL
      if (!2_analyze_data_refs (loop)) FAIL
      if (!3_analyze_scalar_dependence_cycles (loop)) FAIL
      if (!4_pattern_recog (loop)) FAIL
      if (!5_analyze_data_alignment (loop)) FAIL
      if (!6_determine_VF (loop)) FAIL
      if (!7_analyze_data_dependence_distances (loop)) FAIL
      if (!8_analyze_memory_access_patterns (loop)) FAIL
      if (!9_analyze_all_operations_supported (loop)) FAIL
      SUCCEED
    }

tree-vect-transform.c (if SUCCEED):

    vect_transform_loop (loop) {
      FOR_ALL_STMTS_IN_LOOP(loop, stmt)
        replace_OP_by_VOP (stmt);
      decrease_loop_bound_by_factor_VF (loop);
    }


Auto-Vectorization Transformation

original serial loop:

    for (i=0; i<N; i++) {
      a[i] = a[i] + b[i];
    }

vectorization:
- Replace scalar statements with vector statements
- Modify loop bound: strip-mine, create epilog loop

loop in vector notation:

    for (i=0; i<N; i+=VF) {
      a[i:i+VF-1] = a[i:i+VF-1] + b[i:i+VF-1];
    }

loop in vector notation, with modified loop bound:

vectorized loop:

    for (i=0; i<(N-N%VF); i+=VF) {
      a[i:i+VF-1] = a[i:i+VF-1] + b[i:i+VF-1];
    }

epilog loop:

    for ( ; i < N; i++) {
      a[i] = a[i] + b[i];
    }
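The strip-mined loop and its epilog can be written in plain C by emulating each vector statement with a short inner loop over VF lanes. This is a sketch of the loop structure only, assuming VF = 4; the function name `add_strip_mined` is hypothetical, and a real vectorizer would emit one vector instruction where the inner loop stands.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4   /* vectorization factor assumed for this sketch */

/* a[i] = a[i] + b[i] for i in [0, n), split into a main loop that
   handles VF elements per iteration and a scalar epilog loop. */
static void
add_strip_mined (float *a, const float *b, size_t n)
{
  size_t i;
  size_t main_bound = n - n % VF;   /* N - N%VF from the slide */

  /* "Vectorized" loop: the inner loop plays the role of one vector add. */
  for (i = 0; i < main_bound; i += VF)
    for (size_t lane = 0; lane < VF; lane++)
      a[i + lane] = a[i + lane] + b[i + lane];

  /* Epilog loop: the remaining n % VF elements, one at a time. */
  for (; i < n; i++)
    a[i] = a[i] + b[i];
}
```

For n = 10 the main loop covers elements 0..7 in two iterations and the epilog handles elements 8 and 9.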


Vectorization on SSA-ed GIMPLE trees

source loop:

    int i;
    float a[N], b[N];
    for (i=0; i < 16; i++)
      a[i] = a[i] * b[i];

scalar loop in GIMPLE-like form:

    float T.1, T.2, T.3;
    loop:
      if (i >= 16) break;
      S1: T.1 = a[i];
      S2: T.2 = b[i];
      S3: T.3 = T.1 * T.2;
      S4: a[i] = T.3;
      S5: i = i + 1;
      goto loop;

VF = 4: “unroll by VF and replace”

    loop:
      if (i >= 16) break;
      T.11 = a[i];    T.12 = a[i+1];  T.13 = a[i+2];  T.14 = a[i+3];
      T.21 = b[i];    T.22 = b[i+1];  T.23 = b[i+2];  T.24 = b[i+3];
      T.31 = T.11 * T.21;
      T.32 = T.12 * T.22;
      T.33 = T.13 * T.23;
      T.34 = T.14 * T.24;
      a[i] = T.31;    a[i+1] = T.32;  a[i+2] = T.33;  a[i+3] = T.34;
      i = i + 4;
      goto loop;

vectorized loop:

    v4sf VT.1, VT.2, VT.3;
    v4sf *VPa = (v4sf *)a, *VPb = (v4sf *)b;
    int indx = 0;
    loop:
      if (indx >= 4) break;
      VT.1 = VPa[indx];
      VT.2 = VPb[indx];
      VT.3 = VT.1 * VT.2;
      VPa[indx] = VT.3;
      indx = indx + 1;
      goto loop;
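The vectorized form can be tried out by hand with the vector extension supported by GCC (and compatible compilers), where a type declared with `__attribute__ ((vector_size (16)))` holds four floats and `*` multiplies lane by lane. This is a sketch, not vectorizer output: the function name `mul_by_hand` is hypothetical, and `memcpy` is used for the loads and stores so that the arrays need no particular alignment.

```c
#include <assert.h>
#include <string.h>

/* GCC vector extension: four floats in one 16-byte vector. */
typedef float v4sf __attribute__ ((vector_size (16)));

#define N 16

/* a[i] = a[i] * b[i] for 16 elements, four lanes per iteration. */
static void
mul_by_hand (float *a, const float *b)
{
  for (int indx = 0; indx < N / 4; indx++)
    {
      v4sf va, vb, vt;
      memcpy (&va, a + 4 * indx, sizeof va);   /* VT.1 = VPa[indx] */
      memcpy (&vb, b + 4 * indx, sizeof vb);   /* VT.2 = VPb[indx] */
      vt = va * vb;                            /* VT.3 = VT.1 * VT.2 */
      memcpy (a + 4 * indx, &vt, sizeof vt);   /* VPa[indx] = VT.3 */
    }
}
```

The loop runs four times instead of sixteen, which is the "decrease loop bound by a factor of VF" step of the transformation.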


Vectorizer analyses and transformation: Reduction

s = 0;

for (i=0; i<N; i++) {

s += a[i] * b[i];

}

    loop:
      s_1 = phi (0, s_2)
      i_1 = phi (0, i_2)
      xa_1 = a[i_1]
      xb_1 = b[i_1]
      tmp_1 = xa_1 * xb_1
      s_2 = s_1 + tmp_1
      i_2 = i_1 + 1
      if (i_2 < N) goto loop

cross iteration dependences:
- s_1 / s_2: reduction
- i_1 / i_2: induction

Analysis
- Detect scalar dependence cycles
- Identify scalar cycles that are reduction/induction

[Figure: input values 0 1 2 3 4 5 6 7 8 9 10 11; vector values shown are 0 0 0 0, then 0 1 2 3 (tmp_1), then 4 6 8 10, then 12 15 18 21.]


static void

vect_analyze_scalar_cycles (loop_vec_info loop_vinfo)

{

tree phi;

struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);

basic_block bb = loop->header;


fprintf (vect_dump, "=== vect_analyze_scalar_cycles ===");

for (phi = phi_nodes (bb); phi; phi = PHI_CHAIN (phi))

{

stmt_vec_info stmt_vinfo = vinfo_for_stmt (phi);

tree def = PHI_RESULT (phi);

if (!is_gimple_reg (SSA_NAME_VAR (def)))

continue;

STMT_VINFO_DEF_TYPE (stmt_vinfo) = vect_unknown_def_type;

tree access_fn = analyze_scalar_evolution (loop, def);

if (!access_fn)

continue;

if (vect_is_simple_iv_evolution (loop->num, access_fn))

{

STMT_VINFO_DEF_TYPE (stmt_vinfo) = vect_induction_def;

continue;

}

tree rstmt = vect_is_simple_reduction (loop, phi);

if (rstmt)

{

STMT_VINFO_DEF_TYPE (stmt_vinfo) =

STMT_VINFO_DEF_TYPE (vinfo_for_stmt (rstmt)) =

vect_reduction_def;

}

else


fprintf (vect_dump, "Unknown def-use cycle pattern.");

} /* End for loop */

return;

}

[Figure: the SSA statements of the reduction loop (s_1 = phi (0, s_2) … i_2 = i_1 + 1), annotated with the def types "unknown" and "reduc".]

tree-vect-analyze.c


Snippet from vect_is_simple_reduction (tree-vectorizer.c):

    edge latch_e = loop_latch_edge (loop);
    tree loop_arg = PHI_ARG_DEF_FROM_EDGE (phi, latch_e);
    tree def_stmt = SSA_NAME_DEF_STMT (loop_arg);
    tree operation = GIMPLE_STMT_OPERAND (def_stmt, 1);
    enum tree_code code = TREE_CODE (operation);
    …
    if (!commutative_tree_code (code) || !associative_tree_code (code))
      {
        if (vect_print_dump_info (REPORT_DETAILS))
          {
            fprintf (vect_dump, "reduction: not commutative/associative: ");
            print_generic_expr (vect_dump, operation, TDF_SLIM);
          }
        return NULL_TREE;
      }
    if (SCALAR_FLOAT_TYPE_P (type) && !flag_unsafe_math_optimizations)
      {
        if (vect_print_dump_info (REPORT_DETAILS))
          {
            fprintf (vect_dump, "reduction: unsafe fp math optimization: ");
            print_generic_expr (vect_dump, operation, TDF_SLIM);
          }
        return NULL_TREE;
      }
    …
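The floating-point check in the snippet exists because vectorizing a reduction reorders the additions, and floating-point addition is not associative. A small self-contained illustration, assuming IEEE double arithmetic and compilation without unsafe math options; the function names are hypothetical:

```c
#include <assert.h>

/* Serial reduction: adds the elements strictly left to right. */
static double
sum_left_to_right (const double *x, int n)
{
  double s = 0.0;
  for (int i = 0; i < n; i++)
    s += x[i];
  return s;
}

/* Reduction with two partial sums (even and odd elements), the way a
   vectorized loop with two lanes would accumulate, combined at the end. */
static double
sum_two_lanes (const double *x, int n)
{
  double lane[2] = { 0.0, 0.0 };
  int i;
  for (i = 0; i + 1 < n; i += 2)
    {
      lane[0] += x[i];
      lane[1] += x[i + 1];
    }
  for (; i < n; i++)          /* leftover element, if n is odd */
    lane[0] += x[i];
  return lane[0] + lane[1];
}
```

For the input `{1e20, 1.0, -1e20, 1.0}` the serial order loses the first `1.0` when it is added to `1e20`, while the two-lane order cancels `1e20` against `-1e20` first and keeps both ones, so the two functions return different sums for the same data. This is why the reduction is only accepted for floating-point types when `flag_unsafe_math_optimizations` is set.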



Transformation

scalar loop:

    loop:
      s_1 = phi (0, s_2)
      i_1 = phi (0, i_2)
      xa_1 = a[i_1]
      xb_1 = b[i_1]
      tmp_1 = xa_1 * xb_1
      s_2 = s_1 + tmp_1
      i_2 = i_1 + 1
      if (i_2 < N) goto loop

vectorized loop:

    loop:
      vs_1 = phi (vs_0, vs_2)
      i_1 = phi (0, i_2)
      vxa_1 = vpa[i_1]
      vxb_1 = vpb[i_1]
      vtmp_1 = vxa_1 * vxb_1
      vs_2 = vs_1 + vtmp_1
      i_2 = i_1 + 1
      if (i_2 < N/VF) goto loop

vec_dest = vect_create_destination_var (scalar_dest, vectype);

expr = build2 (code, vectype, loop_vec_def0, reduc_def);

new_stmt = build2 (GIMPLE_MODIFY_STMT, void_type_node, vec_dest, expr);

new_temp = make_ssa_name (vec_dest, new_stmt);

GIMPLE_STMT_OPERAND (new_stmt, 0) = new_temp;

bsi_insert_before (bsi, vec_stmt, BSI_SAME_STMT);

tree-vect-transform.c


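Transformation

    s = 0;
    for (i=0; i<N; i++) {
      s += a[i] * b[i];
    }
    printf ("sum = %f\n", s);

vectorized loop:

    loop:
      vs_1 = phi (vs_0, vs_2)
      i_1 = phi (0, i_2)
      vxa_1 = vpa[i_1]
      vxb_1 = vpb[i_1]
      vtmp_1 = vxa_1 * vxb_1
      vs_2 = vs_1 + vtmp_1
      i_2 = i_1 + 1
      if (i_2 < N/VF) goto loop

[Figure: reduction of the values 0..7 with VF = 4. vs_0 = 0 0 0 0 (lanes s1, s2, s3, s4); vtmp_1 is 0 1 2 3 in the first iteration and 4 5 6 7 in the second, giving vs_2 = 4 6 8 10. After the loop an epilog sums across the lanes (labels on the slide: scalar epilog, whole vector shifts, sum across): 4 6 8 10 + 8 10 gives 12 16, and 12 + 16 gives s = 28.]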
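The vectorized reduction and its epilog can be sketched in plain C, with a four-element array standing in for the vector accumulator vs. This is an illustration under simplifying assumptions (integer data, VF = 4, lanes emulated by an inner loop); the function name `dot_product_vf4` is hypothetical, and the halving loop mimics the "whole vector shifts" style of epilog.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4   /* vectorization factor assumed for this sketch */

/* Sum of a[i] * b[i] using VF partial sums, then a reduction epilog. */
static int
dot_product_vf4 (const int *a, const int *b, size_t n)
{
  int vs[VF] = { 0, 0, 0, 0 };   /* vs_0: one partial sum per lane */
  size_t i;

  /* Main loop: each iteration adds VF products into the VF lanes. */
  for (i = 0; i + VF <= n; i += VF)
    for (size_t lane = 0; lane < VF; lane++)
      vs[lane] += a[i + lane] * b[i + lane];

  /* Reduction epilog: fold the upper half onto the lower half until
     one lane is left (4 lanes -> 2 lanes -> 1 lane). */
  for (size_t width = VF / 2; width >= 1; width /= 2)
    for (size_t lane = 0; lane < width; lane++)
      vs[lane] += vs[lane + width];

  /* Scalar loop for the elements that did not fill a whole vector. */
  int s = vs[0];
  for (; i < n; i++)
    s += a[i] * b[i];
  return s;
}
```

With a = 0..7 and b all ones, the lanes hold 4 6 8 10 after the main loop, the first fold gives 12 16, and the second gives 28, the same numbers as in the figure.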


Adding new idioms

tree.def: define the tree-code:

    /* Reduction operations.
       Operations that take a vector of elements and "reduce" it to a scalar
       result (e.g. summing the elements of the vector, finding the minimum over
       the vector elements, etc).
       Operand 0 is a vector; the first element in the vector has the result.
       Operand 1 is a vector. */
    DEFTREECODE (REDUC_PLUS_EXPR, "reduc_plus_expr", tcc_unary, 1)

tree-pretty-print.c: dump_generic_node, op_prio, op_symbol

tree-inline.c: estimate_num_insns_1 ()


Adding new idioms

optabs.h: add a new operator table (optab) index to enum optab_index:

    /* Reduction operations on a vector operand. */
    OTI_reduc_splus,
    OTI_reduc_uplus,

optabs.h: define matching shortcuts:

    #define reduc_splus_optab (optab_table[OTI_reduc_splus])
    #define reduc_uplus_optab (optab_table[OTI_reduc_uplus])


Adding new idioms

optabs.c: add selection of appropriate optab in the dispatch function optab_for_tree_code():

    case REDUC_PLUS_EXPR:
      return TYPE_UNSIGNED (type) ? reduc_uplus_optab : reduc_splus_optab;

optabs.c: initialize the new optabs in init_optabs():

    reduc_splus_optab = init_optab (UNKNOWN);
    reduc_uplus_optab = init_optab (UNKNOWN);


Adding new idioms

genopinit.c: fill in the optabs:

    "reduc_splus_optab->handlers[$A].insn_code = CODE_FOR_$(reduc_splus_$a$)",
    "reduc_uplus_optab->handlers[$A].insn_code = CODE_FOR_$(reduc_uplus_$a$)",

[Table: optab/type with columns qi, hi, si, v8qi, v4hi, v2si; the visible entries in the reduc_splus_optab and reduc_uplus_optab rows are all CODE_FOR_nothing.]

Checklist of files:

1. tree.def

2. tree-pretty-print.c

3. tree-inline.c

4. optabs.h

5. optabs.c

6. genopinit.c

7. expr.c

8. <target>.md


Adding new idioms

expr.c: tree-to-rtl expansion:

    case REDUC_PLUS_EXPR:
      {
        op0 = expand_normal (TREE_OPERAND (exp, 0));
        this_optab = optab_for_tree_code (code, type);
        temp = expand_unop (mode, this_optab, op0, target, unsignedp);
        gcc_assert (temp);
        return temp;
      }

<target>.md: RTL instruction definition:

    (define_expand "reduc_splus_<mode>"
      [(set (match_operand:VIshort 0 "register_operand" "=v")
            (unspec:VIshort [(match_operand:VIshort 1 "register_operand" "v")]
                            UNSPEC_REDUC_PLUS))]
      "TARGET_ALTIVEC"
      "{
        rtx vzero = gen_reg_rtx (V4SImode);
        rtx vtmp1 = gen_reg_rtx (V4SImode);
        emit_insn (gen_altivec_vspltisw (vzero, const0_rtx));
        emit_insn (gen_altivec_vsum4s<VI_char>s (vtmp1, operands[1], vzero));
        emit_insn (gen_altivec_vsumsws_nomode (operands[0], vtmp1, vzero));
        DONE;
      }")


Compilation Flow Example

vect-reduc-min.c

    #define N 16

    int main1 ()
    {
      int i;
      float c[N] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
      float min = 10;

      for (i = 0; i < N; i++) {
        min = min > c[i] ? c[i] : min;
      }

      /* check results: */
      if (min != 0)
        abort ();
      return 0;
    }

gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=4 vect-reduc-min.c

    vect-reduc-min.c:14: note: not vectorized: unsupported use in stmt.
    vect-reduc-min.c:9: note: vectorized 0 loops in function.

gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=7 vect-reduc-min.c

    …
    vect-reduc-min.c:14: note: === vect_analyze_scalar_cycles ===
    vect-reduc-min.c:14: note: Analyze phi: min_6 = PHI <min_3(6), 1.0e+1(2)>
    vect-reduc-min.c:14: note: reduction: not commutative/associative: min_6 > min_7 ? min_7 : min_6
    vect-reduc-min.c:14: note: Unknown def-use cycle pattern…
    vect-reduc-min.c:14: note: Unsupported pattern.
    vect-reduc-min.c:14: note: not vectorized: unsupported use in stmt.
    vect-reduc-min.c:14: note: unexpected pattern.
    vect-reduc-min.c:9: note: vectorized 0 loops in function.

gcc -O2 -ftree-vectorize -maltivec vect-reduc-min.c -ftree-vectorizer-verbose=4 -ffast-math

    vect-reduc-min.c:14: note: LOOP VECTORIZED.
    vect-reduc-min.c:9: note: vectorized 1 loops in function.


vect-min.c.081t.ifcvt

    main1 ()
    {
      unsigned int ivtmp.31;
      int pretmp.25;
      float min;
      float c[16];
      int i;
      float D.2429;
      static float C.3[16] = {…};

    <bb 2>:
      c = C.3;

      # ivtmp.31_2 = PHI <ivtmp.31_3(4), 16(2)>
      # min_15 = PHI <min_7(4), 1.0e+1(2)>
      # i_14 = PHI <i_8(4), 0(2)>
    <L0>:;
      D.2429_6 = c[i_14];
      min_7 = MIN_EXPR <D.2429_6, min_15>;
      i_8 = i_14 + 1;
      ivtmp.31_3 = ivtmp.31_2 - 1;
      if (ivtmp.31_3 != 0) goto <L8>; else goto <L2>;

    <L8>:;
      goto <bb 3> (<L0>);

      # min_1 = PHI <min_7(3)>
    <L2>:;
      if (min_1 != 0.0) goto <L3>; else goto <L4>;

    <L3>:;
      abort ();

    <L4>:;
      return 0;
    }

vect-min.c.004t.gimple

c = C.3;

min = 1.0e+1;

i = 0;

goto <D2425>;

<D2424>:;

i.4 = i;

D.2429 = c[i.4];

min = MIN_EXPR <D.2429, min>;

i = i + 1;

<D2425>:;

if (i <= 15)

{

goto <D2424>;

}

else

{

goto <D2426>;

}

<D2426>:;

if (min != 0.0)

{

abort ();

}

else

{

}

D.2430 = 0;

return D.2430;

Dump options used: -fdump-tree-all -da


vect-min.c.082t.vect

<bb 2>:

c = C.3;

vect_pc.32_5 = (__vector float *) &c;

vect_cst_.40_21 = { 1.0e+1, 1.0e+1, 1.0e+1, 1.0e+1 };

# ivtmp.43_28 = PHI <ivtmp.43_29(4), 0(2)>

# vect_var_.39_19 = PHI <vect_var_.39_20(4), vect_cst_.40_21(2)>

# ivtmp.37_16 = PHI <ivtmp.37_17(4), vect_pc.32_5(2)>

# ivtmp.31_2 = PHI <ivtmp.31_3(4), 16(2)>

# min_15 = PHI <min_7(4), 1.0e+1(2)>

# i_14 = PHI <i_8(4), 0(2)>

<L0>:;

vect_var_.38_18 = *ivtmp.37_16;

D.2429_6 = c[i_14];

vect_var_.39_20 = MIN_EXPR <vect_var_.38_18, vect_var_.39_19>;

min_7 = MIN_EXPR <D.2429_6, min_15>;

i_8 = i_14 + 1;

ivtmp.31_3 = ivtmp.31_2 - 1;

ivtmp.37_17 = ivtmp.37_16 + 16B;

ivtmp.43_29 = ivtmp.43_28 + 1;

if (ivtmp.43_29 < 4) goto <L8>; else goto <L2>;

<L8>:;

goto <bb 3> (<L0>);

Continued:

# vect_var_.39_22 = PHI <vect_var_.39_20(3)>

# min_1 = PHI <min_7(3)>

<L2>:;

vect_var_.42_23 = vect_var_.39_22 v>> 64;

vect_var_.42_24 = MIN_EXPR <vect_var_.42_23, vect_var_.39_22>;

vect_var_.42_25 = vect_var_.42_24 v>> 32;

vect_var_.42_26 = MIN_EXPR <vect_var_.42_25, vect_var_.42_24>;

vect_var_.41_27 =

BIT_FIELD_REF <vect_var_.42_26, 32, 96>;

if (vect_var_.41_27 != 0.0) goto <L3>; else goto <L4>;

<L3>:;

abort ();

<L4>:;

return 0;

}
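
The vectorized dump above can be read as the following scalar model. This is a minimal sketch, not the code GCC emits: the function name `vec_min_reduce` and the helper `min2` are hypothetical, the vector is modeled as a plain array of VF = 4 floats, and the result lane (the last one) is chosen to mirror the `BIT_FIELD_REF <…, 32, 96>` in the dump, which assumes the big-endian lane order of the AltiVec target.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4 /* vectorization factor: four floats in a 128-bit vector */

static float min2(float a, float b) { return a < b ? a : b; }

/* Hypothetical model of the vectorized min-reduction. */
float vec_min_reduce(const float *c, size_t n, float init)
{
    float acc[VF];

    assert(n % VF == 0); /* the sketch has no scalar epilogue loop */

    /* Before the loop: splat the initial value (vect_cst_ in the dump). */
    for (int l = 0; l < VF; l++)
        acc[l] = init;

    /* Vector loop: one lane-wise MIN_EXPR per group of VF elements. */
    for (size_t i = 0; i < n; i += VF)
        for (int l = 0; l < VF; l++)
            acc[l] = min2(c[i + l], acc[l]);

    /* Reduction epilogue: shift by half a vector, take the min, then
       shift by a quarter and take the min again. Only lanes >= shift
       are updated; the remaining lanes are never read afterwards. */
    for (int shift = VF / 2; shift >= 1; shift /= 2)
        for (int l = VF - 1; l >= shift; l--)
            acc[l] = min2(acc[l], acc[l - shift]);

    return acc[VF - 1]; /* the lane extracted by BIT_FIELD_REF */
}
```

Because each lane accumulates its own partial minimum, the elements are combined in a different order than in the serial loop, which is why the floating-point case needs -ffast-math.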


vect-min.c.095t.dse2

c = C.3;

vect_pc.36_4 = (__vector float *) &c;

vect_var_.38_6 = *vect_pc.36_4;

vect_var_.39_1 = MIN_EXPR <vect_var_.38_6, { 1.0e+1, 1.0e+1, 1.0e+1, 1.0e+1 }>;

ivtmp.37_14 = vect_pc.36_4 + 16B;

vect_var_.38_32 = *ivtmp.37_14;

vect_var_.39_33 = MIN_EXPR <vect_var_.39_1, vect_var_.38_32>;

ivtmp.37_34 = ivtmp.37_14 + 16B;

vect_var_.38_39 = *ivtmp.37_34;


ivtmp.37_41 = ivtmp.37_34 + 16B;

vect_var_.38_18 = *ivtmp.37_41;






vect_var_.41_27 = BIT_FIELD_REF <vect_var_.42_26, 32, 96>;

if (vect_var_.41_27 != 0.0) goto <L3>; else goto <L4>;

<L3>:;

abort ();

<L4>:;

return 0;

}


vect-min.c.138r.life2

(insn:HI 26 25 27 2 (set (reg:V4SF 138) (mem/u/c/i:V4SF (reg/f:SI 139) [2 S16 A128])) 632

{altivec_lvx_v4sf} ))

(insn:HI 27 26 28 2 (set (reg:V4SF 141) (mem:V4SF (plus:SI (reg/f:SI 113 sfp) (const_int 16 [0x10])) [2 S16 A128])) 632

{altivec_lvx_v4sf} (nil) (nil))

(insn:HI 28 27 29 2 (set (reg:V4SF 126 [ vect_var_.39 ]) (smin:V4SF (reg:V4SF 138) (reg:V4SF 141))) 706 {sminv4sf3}))

(insn:HI 29 28 30 2 (set (reg/f:SI 127 [ ivtmp.37 ]) (plus:SI (reg/f:SI 134) (const_int 16 [0x10]))) 79 {*addsi3_internal1} (nil) (nil))

(insn:HI 30 29 31 2 (set (reg:V4SF 142) (mem:V4SF (plus:SI (reg/f:SI 134) (const_int 16 [0x10])) [2 S16 A128])) 632

{altivec_lvx_v4sf} (nil) (nil)))

(insn:HI 31 30 32 2 (set (reg:V4SF 121 [ vect_var_.50 ]) (smin:V4SF (reg:V4SF 126 [ vect_var_.39 ]) (reg:V4SF 142))) 706 {sminv4sf3} (nil))))

(insn:HI 33 32 34 2 (set (reg:V4SF 143) (mem:V4SF (plus:SI (reg/f:SI 127 [ ivtmp.37 ]) (const_int 16 [0x10])) [2 S16 A128])) 632

{altivec_lvx_v4sf} (nil)) (nil))

(insn:HI 34 33 35 2 (set (reg:V4SF 119 [ vect_var_.53 ]) (smin:V4SF (reg:V4SF 121 [ vect_var_.50 ]) (reg:V4SF 143))) 706 {sminv4sf3} (nil))))

vect-min.c.153r.sched2

(insn:TI 82 84 89 2 (set (reg:V4SF 77 0 [138])

(mem/u/c/i:V4SF (reg/f:SI 9 9 [139]) [2 S16 A128])) 632 {altivec_lvx_v4sf} (nil) (nil))))

(insn 89 82 83 2 (set (reg:SI 9 9) (plus:SI (reg/f:SI 1 1)

(const_int 16 [0x10]))) 79 {*addsi3_internal1} (nil) (nil))

(insn:TI 83 89 90 2 (set (reg:V4SF 78 1 [141])

(mem:V4SF (reg:SI 9 9) [2 S16 A128])){altivec_lvx_v4sf} ))

(insn 90 83 92 2 (set (reg:SI 9 9)

(plus:SI (reg/f:SI 29 29 [orig:127 ivtmp.37 ] [127])


(insn 92 90 28 2 (set (reg:SI 29 29)

(plus:SI (reg/f:SI 29 29 [orig:127 ivtmp.37 ] [127])


(insn:TI 28 92 33 2 (set (reg:V4SF 77 0[orig:126 vect_var.39] [126]) (smin:V4SF (reg:V4SF 77 0 [138])

(reg:V4SF 78 1 [141]))) 706 {sminv4sf3} (nil) (nil)))

(insn 33 28 35 2 (set (reg:V4SF 78 1 [143])

(mem:V4SF (reg:SI 9 9) [2 S16 A128])){altivec_lvx_v4sf}

(insn 35 33 93 2 (set (reg:V4SF 89 12 [144])

(mem:V4SF (reg:SI 29 29) [2 S16 A128])) {altivec_lvx_v4sf}))


vect-min.s

main1:
  stwu 1,-128(1)
  lis 4,.LANCHOR0@ha
  mflr 0
  la 4,.LANCHOR0@l(4)
  li 5,64
  stw 29,116(1)
  stw 0,132(1)
  addi 29,1,16
  mr 3,29
  bl memcpy
  addi 9,29,16
  addi 29,29,16
  lvx 13,0,9
  lis 9,.LC0@ha
  la 9,.LC0@l(9)
  lvx 0,0,9
  addi 9,1,16
  lvx 1,0,9
  addi 9,29,16
  addi 29,29,32
  vminfp 0,0,1
  lvx 1,0,9
  lvx 12,0,29
  addi 9,1,108
  vminfp 0,0,13
  vminfp 0,0,1
  vminfp 0,0,12
  vsldoi 13,0,0,8
  vminfp 0,0,13
  vsldoi 1,0,0,12
  vminfp 1,1,0
  stvewx 1,0,9
  lis 9,.LC1@ha
  lfs 13,108(1)
  lfs 0,.LC1@l(9)
  fcmpu 7,13,0
  bne- 7,.L7
  lwz 0,132(1)
  lwz 29,116(1)
  li 3,0
  addi 1,1,128
  mtlr 0
  blr

[Figure: GCC compilation flow: front-end (parse trees), middle-end (GIMPLE trees, including vectorization), back-end (RTL, machine description, e.g. the mips port), assembly]


Using the Vectorizer – Programming Hints

- Don't unroll the loop by hand.
    Avoid:  for (i=0; i<N; i+=4){ a[i] = x; a[i+1] = x; a[i+2] = x; a[i+3] = x;}
    Prefer: for (i=0; i<N; i++) a[i] = x;

- Use countable loops with no side effects.
    No function calls in the loop (distribute them into a separate loop).
    No 'break'/'continue'.

- Avoid aliasing problems: use __restrict__ qualified pointers.
    foo (float * __restrict__ p, float * __restrict__ q)

- Keep the memory access pattern simple.
    Don't use an array of structures, e.g.:
      struct {int f1; int f2;} a[N]; for (i=0; i<N; i++) a[i].f1 = x;
    Prefer separate arrays:
      int af1[N], af2[N]; for (i=0; i<N; i++) af1[i] = x;
    Use a constant increment, i.e., don't use the following:
      for (i=0; i<N; i+=incr) a[i] = x;

- Alignment
    Use alignment attributes.
    If a loop has more than a single misaligned store, distribute it into separate loops (currently the vectorizer peels the loop to align a misaligned store).
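
As an illustration of the aliasing hint, here is a minimal sketch of a loop written the way the slide recommends. The function name `add_arrays` is hypothetical. `__restrict__` is the GCC spelling used on the slide; it is a promise by the programmer that the two arrays do not overlap, which the compiler does not verify.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: a countable loop with a constant increment,
   simple consecutive accesses, and no calls or breaks in the body. */
void add_arrays(float *__restrict__ p, const float *__restrict__ q, size_t n)
{
    /* Cheap sanity check only; it does not prove the arrays are disjoint. */
    assert(n == 0 || p != q);

    for (size_t i = 0; i < n; i++)
        p[i] = p[i] + q[i];
}
```

Without the qualifiers the compiler has to assume that a store to p[i] may change a later q[j], so it must either prove independence or guard the vector loop with a run-time check.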


Using the Vectorizer – Tuning Hints

- -ffast-math if operating on floats in a reduction computation (to allow the vectorizer to change the order of the computation):

    float *b, *c, diff, min, max;
    for (i = 0; i < N; i++) {
      diff += (b[i] - c[i]);
    }
    for (i = 0; i < N; i++) {
      max = max < c[i] ? c[i] : max;
    }
    for (i = 0; i < N; i++) {
      min = min > c[i] ? c[i] : min;
    }

- -fwrapv if operating on signed subword integers (to avoid casts to int that currently confuse the vectorizer):

    signed char *b, *c, diff;
    for (i = 0; i < N; i++) {
      diff += (signed char)(b[i] - c[i]);
    }

- --param min-vect-loop-bound=[X] if the code has loops with a short trip-count

- -fno-tree-vect-loop-version if worried about code size. With loop versioning, a loop such as

    for (i=0; i<N; i++){
      p[i] = q[i];
    }

  is duplicated behind a run-time check:

    if (q is aligned) {
      for (i=0; i<N; i++){
        x = q[i]; // q is aligned
        p[i] = x;
      }
    } else {
      for (i=0; i<N; i++){
        x = q[i]; // q's alignment unknown
        p[i] = x;
      }
    }

- -funroll-loops -fvariable-expansion-in-unroller --param max-variable-expansions-in-unroller=[X] for improved scheduling of summation (breaking the accumulation into X+1 accumulators to increase ILP).
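
The effect that -fvariable-expansion-in-unroller aims for on a summation can be written out by hand. The following is only a sketch of the idea, not the unroller's output: the function name `sum_expanded` and the constant `NACC` (standing for X+1 accumulators) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NACC 4 /* X+1 independent accumulators (hypothetical choice) */

/* Sum n floats using NACC partial sums instead of one serial chain. */
float sum_expanded(const float *b, size_t n)
{
    float acc[NACC] = {0};
    float total = 0;
    size_t i = 0;

    assert(b != NULL || n == 0);

    /* Unrolled body: the NACC additions do not depend on each other,
       so they can be scheduled in parallel. */
    for (; i + NACC <= n; i += NACC)
        for (int k = 0; k < NACC; k++)
            acc[k] += b[i + k];

    /* Leftover iterations go to the first accumulator. */
    for (; i < n; i++)
        acc[0] += b[i];

    /* Combine the partial sums once, after the loop. */
    for (int k = 0; k < NACC; k++)
        total += acc[k];
    return total;
}
```

The partial sums are added in a different order than in the serial loop, so for floats the result can differ in the last bits; this is the same reordering that -ffast-math permits for vectorized reductions.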


More information

Vectorizer:
- http://gcc.gnu.org/projects/tree-ssa/vectorization.html
- http://gcc.gnu.org/wiki/VectorizationTasks
- Summit papers:
  - http://www.gccsummit.org/2006/2006-GCC-Summit-Proceedings.pdf
  - ftp://gcc.gnu.org/pub/gcc/summit/2004/Autovectorization.pdf

General:
- http://gcc.gnu.org/onlinedocs/gccint/
- http://gcc.gnu.org/wiki
- Summit papers

Happy Hacking!


The End


for (i = 0; i < n; i++) {

sum += ((int) in[i] * (int) in[i+off]) >> scale;

}


Non-consecutive access patterns

[Figure: the data in memory "a b c d e f g h i j k l m n o p" is loaded into vector registers VR1..VR4 (a b c d | e f g h | i j k l | m n o p). The elements a, f, k, p are then gathered into VR5, so that the scalar operations OP(a), OP(f), OP(k), OP(p) become one vector operation VOP( a, f, k, p ).]

A[i], i={0,5,10,15,…}; access_fn(i) = (0,+,5)


Basic unpacking and packing operations for strided access

Use two pairs of inverse operations widely supported on SIMD platforms:
- extract_even, extract_odd
- interleave_high, interleave_low

Use them recursively to support strided accesses with power-of-2 strides.

Support several data types.
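
The four operations can be modeled in scalar C as follows. This is a sketch under stated assumptions: vectors are plain arrays of VF = 4 ints, the function names follow the slide, and the choice of which half counts as "high" is an assumption, since it depends on the target's lane order. The helper `concat_elem` is hypothetical.

```c
#include <assert.h>

#define VF 4 /* elements per vector */

/* Element idx of the 2*VF-element concatenation v1|v2. */
static int concat_elem(const int *v1, const int *v2, int idx)
{
    return idx < VF ? v1[idx] : v2[idx - VF];
}

/* out = elements 0, 2, 4, 6 of v1|v2 */
void extract_even(const int *v1, const int *v2, int *out)
{
    assert(out != v1 && out != v2);
    for (int i = 0; i < VF; i++)
        out[i] = concat_elem(v1, v2, 2 * i);
}

/* out = elements 1, 3, 5, 7 of v1|v2 */
void extract_odd(const int *v1, const int *v2, int *out)
{
    assert(out != v1 && out != v2);
    for (int i = 0; i < VF; i++)
        out[i] = concat_elem(v1, v2, 2 * i + 1);
}

/* out = first halves of v1 and v2, interleaved */
void interleave_high(const int *v1, const int *v2, int *out)
{
    assert(out != v1 && out != v2);
    for (int i = 0; i < VF / 2; i++) {
        out[2 * i] = v1[i];
        out[2 * i + 1] = v2[i];
    }
}

/* out = second halves of v1 and v2, interleaved */
void interleave_low(const int *v1, const int *v2, int *out)
{
    assert(out != v1 && out != v2);
    for (int i = 0; i < VF / 2; i++) {
        out[2 * i] = v1[VF / 2 + i];
        out[2 * i + 1] = v2[VF / 2 + i];
    }
}
```

Applying interleave_high and interleave_low to the results of extract_even and extract_odd gives back the two original vectors, which is the sense in which the pairs are inverse: extracts serve strided loads, interleaves serve strided stores.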


S1: a = x [8*i]

S2: b = x [8*i+1]

S3: c = x [8*i+2]

S4: d = x [8*i+3]

S5: e = x [8*i+4]

S6: f = x [8*i+5]

S7: g = x [8*i+6]

S8: h = x [8*i+7]

S9: y [2*i] = k = f (a,…,h)

S10: y [2*i+1] = l = g (a,…,h)

[Figure: the 32 loaded elements x[0..31] are separated by three levels of extract_even/extract_odd into eight vectors of four elements each (a..h, with a = x[0], x[8], x[16], x[24], and so on); the two result vectors k and l are computed from them.]

δ=8 VF=4

load δ *VF elements

generate δ *log δ extracts (odd/even)
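
The counts on this slide can be checked with a small scalar simulation. It is a sketch only: the names `split_even_odd` and `deinterleave` are hypothetical, the logarithm is taken base 2, and each output vector of VF elements is counted as one extract operation.

```c
#include <assert.h>
#include <string.h>

#define VF 4             /* elements per vector */
#define STRIDE 8         /* the stride, written δ on the slide */
#define N (STRIDE * VF)  /* elements loaded: δ*VF = 32 */

/* One even/odd split of a segment: evens go to the first half, odds to
   the second half. Returns the number of extract operations, one per
   output vector. */
static int split_even_odd(int *seg, int len)
{
    int tmp[N];
    int half = len / 2;

    assert(len <= N && len % (2 * VF) == 0);
    for (int i = 0; i < half; i++) {
        tmp[i] = seg[2 * i];
        tmp[half + i] = seg[2 * i + 1];
    }
    memcpy(seg, tmp, (size_t)len * sizeof(int));
    return len / VF;
}

/* Apply the split recursively (log2(STRIDE) levels) to all N elements.
   Afterwards every group of VF elements holds one strided stream. */
int deinterleave(int *x)
{
    int extracts = 0;

    for (int len = N; len > VF; len /= 2)
        for (int off = 0; off < N; off += len)
            extracts += split_even_odd(x + off, len);
    return extracts;
}
```

For δ=8 there are three levels of eight extracts each, 24 in total. The eight resulting vectors come out in bit-reversed order of their starting offset (0, 4, 2, 6, 1, 5, 3, 7) in this model; the order on the slide may differ.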


Strided Accesses (Interleaved Data)

Very common in real-world computations:
- Complex data
- rgba images (alpha blend)
- multi-channel audio streams (down mix)

Viterbi decoder: 5x improvement on entire benchmark

PLDI 2006


Mixed data types

short b[N];
int a[N];
for (i=0; i<N; i++)
  a[i] = (int) b[i];

Unpack
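
A scalar model of the unpack step, as a minimal sketch: one vector of eight shorts is widened into two vectors of four ints. The names `unpack_hi`, `unpack_lo` and `widen_copy` are hypothetical, and which half is called "hi" is an assumption that depends on the target's lane order.

```c
#include <assert.h>
#include <stddef.h>

#define VF_SHORT 8 /* shorts per 128-bit vector */
#define VF_INT 4   /* ints per 128-bit vector */

/* Sign-extend the first four shorts of v into four ints. */
static void unpack_hi(const short *v, int *out)
{
    for (int l = 0; l < VF_INT; l++)
        out[l] = v[l];
}

/* Sign-extend the last four shorts of v into four ints. */
static void unpack_lo(const short *v, int *out)
{
    for (int l = 0; l < VF_INT; l++)
        out[l] = v[VF_INT + l];
}

/* a[i] = (int) b[i], one short vector and two int vectors per iteration. */
void widen_copy(const short *b, int *a, size_t n)
{
    assert(n % VF_SHORT == 0); /* the sketch has no scalar epilogue */

    for (size_t i = 0; i < n; i += VF_SHORT) {
        unpack_hi(b + i, a + i);
        unpack_lo(b + i, a + i + VF_INT);
    }
}
```

The vectorization factor is set by the smaller type: each iteration consumes eight elements, so every statement on the int side is emitted twice.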


Multiple Data-Types & Type Conversions

S1:x_int = memref

S2:z_int = x_int + 1

S3:y_char = memref

….

VS1.0: vx0 = memref0




VS2.0: vz0 = vx0 + v1

VS2.1: vz1 = vx1 + v1

VS2.2: vz2 = vx2 + v1

VS2.3: vz3 = vx3 + v1

V1 = {1, 1, 1, 1}

VS3: vy = memref

VF = 16

4

4

16

VS3.0: vy0 = vpack (vz0, vz1)

VS3.1: vy1 = vpack (vz2, vz3)

VS3: vy = vpack (vy0, vy1)

(char) z_int

units

“unroll” by VF/units
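
The statement numbering on this slide (VS2.0..VS2.3, VS3.0, VS3.1, VS3) can be followed in a scalar model. This is a sketch under assumptions: the function names are hypothetical, vpack is modeled as plain truncation, and the test values are chosen to fit in a char so the truncation is lossless.

```c
#include <assert.h>
#include <stddef.h>

/* Pack two vectors of 4 ints into one vector of 8 shorts (truncating). */
static void vpack_int_to_short(const int *v0, const int *v1, short *out)
{
    for (int l = 0; l < 4; l++) {
        out[l] = (short)v0[l];
        out[4 + l] = (short)v1[l];
    }
}

/* Pack two vectors of 8 shorts into one vector of 16 chars (truncating). */
static void vpack_short_to_char(const short *v0, const short *v1,
                                signed char *out)
{
    for (int l = 0; l < 8; l++) {
        out[l] = (signed char)v0[l];
        out[8 + l] = (signed char)v1[l];
    }
}

/* One vector iteration with VF = 16: y[i] = (char)(x[i] + 1). */
void add1_and_narrow(const int *x, signed char *y)
{
    int vz[4][4];
    short vy0[8], vy1[8];

    assert(x != NULL && y != NULL);

    /* VS2.0..VS2.3: the int statement is "unrolled" VF/units = 4 times. */
    for (int j = 0; j < 4; j++)
        for (int l = 0; l < 4; l++)
            vz[j][l] = x[4 * j + l] + 1;

    vpack_int_to_short(vz[0], vz[1], vy0); /* VS3.0 */
    vpack_int_to_short(vz[2], vz[3], vy1); /* VS3.1 */
    vpack_short_to_char(vy0, vy1, y);      /* VS3   */
}
```

Here the smallest type (char, 16 per vector) sets VF = 16, and each statement on a wider type is replicated VF/units times, where units is the number of elements of that type in one vector.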


Multiple Data-Types & Type Conversions

Very common in multimedia computations:
- Video: unsigned chars -> shorts
- Audio: signed shorts -> ints
- Filters, autocorrelation, dot product, alpha-blending…

Autocorrelation: 6x improvement on benchmark

for (i = 0; i < n; i++) {
  acc += ((int) short_in1[i] * (int) short_in2[i+lag]) >> Scale;
}
