IBM Labs in Haifa
GCC Tutorial – The compilation flow of the auto-vectorizer
Dorit Nuzman
Haifa IBM Labs
2nd HiPEAC GCC Tutorial
Ghent, Belgium, January 2007
What is vectorization
Data in memory: a b c d e f g h i j k l m n o p
Scalar operations: OP(a), OP(b), OP(c), OP(d)
Vector operation: VOP(a, b, c, d) on vector register VR1
Vector registers: VR1 to VR5; VR1 holds a b c d in element positions 0 1 2 3
Data elements are packed into vectors. The vector length gives the Vectorization Factor (VF); here VF = 4.
original serial loop:
for (i=0; i<N; i++) { a[i] = a[i] + b[i]; }
loop in vector notation:
for (i=0; i<N; i+=VF) { a[i:i+VF-1] = a[i:i+VF-1] + b[i:i+VF-1]; }
GCC Passes
front-ends (C, C++, Fortran, Ada, …) -> parse trees
middle-end -> GIMPLE trees
back-end -> RTL -> assembly
machine description: one per port (rs6000, i386, mips, …)
loop analyses and optimizations:
- data-dependence
- scalar-evolution
- number of iters
- invariant motion
- iv-canon/optimize
- linear transform
- unswitching
- if-conversion
- unrolling
- vectorization
vectorization:
- loop form ok?
- any data-deps?
- scalar-cycles?
- aliasing?
- access-patterns?
- vector size?
- supportable?
- alignment?
- data shuffle?
- cost?
Why study the vectorizer?
- middle-end & back-end aspects
- performance impact potential
- there’s a lot to do…
Talk Layout
- What is vectorization
- Back-end aspects
  - Machine-description and operation tables
  - Querying target-support in vectorizer
  - Enabling vectorization for a new port
- Tree-level aspects
  - Adding a tree-optimization pass
  - Vectorization analyses and transformation
- Detailed example: Reduction
  - Adding a new idiom
  - Compilation flow example
  - Two advanced cases
- Using the vectorizer
  - Programming and tuning hints
GCC Backend: machine-description files and operation tables
A GCC “port”: Target specific files
gcc/gcc/config/<myport>/
- for example: i386, ia64, rs6000, spu…
target-specific compiler options: <target>.opt
- command-line options of GCC specific to the target
- for example: -maltivec, -msse2, -mtune=power4, -minsert-sched-nops=
target-specific definitions: <target>.h
- basic parameters and features
- for example:
#define POINTER_SIZE (TARGET_32BIT ? 32 : 64)
#define BYTES_BIG_ENDIAN 1
#define FIXED_REGISTERS \
  {0, 1, FIXED_R2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, FIXED_R13, 0, 0, \
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
   ….
#define CALL_USED_REGISTERS \
  {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, FIXED_R13, 0, 0, \
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
   ...
target-specific support functions: <target>.c
- target predicates, code generation functions, target variants
machine description: <target>.md
- definition of RTL instructions and their translations to assembly
- content of machine description determines which features (operations, modes) are available
machine-description file
alpha/alpha.md
(define_insn "sminqi3"
[(set (match_operand:QI 0 "register_operand" "=r")
(smin:QI (match_operand:QI 1 "reg_or_0_operand" "%rJ")
(match_operand:QI 2 "reg_or_8bit_operand" "rI")))]
"TARGET_MAX"
"minsb8 %r1,%2,%0"
[(set_attr "type" "mvi")])
(define_insn "sminv8qi3"
[(set (match_operand:V8QI 0 "register_operand" "=r")
(smin:V8QI (match_operand:V8QI 1 "reg_or_0_operand" "rW")
(match_operand:V8QI 2 "reg_or_0_operand" "rW")))]
"TARGET_MAX"
"minsb8 %r1,%r2,%0"
[(set_attr "type" "mvi")])
RTL operations: rtl.def
DEF_RTL_EXPR(SMIN, "smin", "ee", RTX_COMM_ARITH)
DEF_RTL_EXPR(SMAX, "smax", "ee", RTX_COMM_ARITH)
DEF_RTL_EXPR(UMIN, "umin", "ee", RTX_COMM_ARITH)
DEF_RTL_EXPR(UMAX, "umax", "ee", RTX_COMM_ARITH)
gcc/gcc: rtl.def
gcc/gcc/config/<port>: <target>.opt, <target>.h, <target>.c, <target>.md
http://gcc.gnu.org/onlinedocs/gccint/
- machine-modes: qi, hi, si, di, sf, df
- vector machine-modes:
  alpha: v8qi, v4hi
  altivec: v16qi, v8hi, v4si
- constraints
- conditions
- attributes
- assembly
- scalar and vector operations differ only in operand modes
rs6000/rs6000.md
(define_expand "sminsi3"
[(set (match_dup 3)
(if_then_else:SI (gt:SI (match_operand:SI 1 "gpc_reg_operand" "")
(match_operand:SI 2 "reg_or_short_operand" ""))
(const_int 0)
(minus:SI (match_dup 2) (match_dup 1))))
(set (match_operand:SI 0 "gpc_reg_operand" "")
(minus:SI (match_dup 2) (match_dup 3)))]
"TARGET_POWER || TARGET_ISEL"
"{
if (TARGET_ISEL) {
operands[2] = force_reg (SImode, operands[2]);
rs6000_emit_minmax (operands[0], SMIN, operands[1], operands[2]);
DONE;
}
operands[3] = gen_reg_rtx (SImode);
}")
RTL operations: rtl.def
DEF_RTL_EXPR(IF_THEN_ELSE, "if_then_else", "eee", RTX_TERNARY)
DEF_RTL_EXPR(GT, "gt", "ee", RTX_COMPARE)
DEF_RTL_EXPR(MINUS, "minus", "ee", RTX_BIN_ARITH)
rs6000/rs6000.c
rs6000/altivec.md
When the same pattern applies to multiple modes, use mode macros to generate an entire family of patterns:
;; Vec int modes
(define_mode_macro VI [V4SI V8HI V16QI])
(define_insn "smin<mode>3"
  [(set (match_operand:VI 0 "register_operand" "=v")
        (smin:VI (match_operand:VI 1 "register_operand" "v")
                 (match_operand:VI 2 "register_operand" "v")))]
  "TARGET_ALTIVEC"
  "vmins<VI_char> %0,%1,%2"
  [(set_attr "type" "vecsimple")])
(define_insn "sminv4sf3"
  [(set (match_operand:V4SF 0 "register_operand" "=v")
        (smin:V4SF (match_operand:V4SF 1 "register_operand" "v")
                   (match_operand:V4SF 2 "register_operand" "v")))]
  "TARGET_ALTIVEC"
  "vminfp %0,%1,%2"
  [(set_attr "type" "veccmp")])
optabs.c,h
- tables of RTL operations sharing common semantics, but differing in operand size and/or structure
- no type information available anymore
optab/type | qi | hi | si | v4si | v2si | …
smin_optab | 700 | 701 | CODE_FOR_nothing | 753 | CODE_FOR_nothing | …
umin_optab | 702 | 703 | CODE_FOR_nothing | 754 | CODE_FOR_nothing | …
build/gcc/insn-emit.c
rtx
gen_sminv4si3 (rtx operand0 ATTRIBUTE_UNUSED,
               rtx operand1 ATTRIBUTE_UNUSED,
               rtx operand2 ATTRIBUTE_UNUSED) {
  return gen_rtx_SET (VOIDmode,
                      operand0,
                      gen_rtx_SMIN (V4SImode, operand1, operand2));
}
build/gcc/insn-output.c
  { "sminv4si3",
    {
     "vminsw %0,%1,%2", 0, 0 },
    (insn_gen_fn) gen_sminv4si3,
    &operand_data[1427],
    3, 0, 1, 1 }
Querying the backend for target support in the vectorizer
min_27 = MIN_EXPR <tmp_26, min_50>;
optab = optab_for_tree_code (code, vectype);
vec_mode = TYPE_MODE (vectype);
icode = (int) optab->handlers[(int) vec_mode].insn_code;
if (icode == CODE_FOR_nothing)
  {
    if (vect_print_dump_info (REPORT_DETAILS))
      fprintf (vect_dump, "operation not supported by target.");
    return false;
  }
In this example: vectype is vector int, vec_mode is v2si, optab is smin_optab.
optab/type | qi | hi | si | v8qi | v4hi | v2si
smin_optab | 700 | 701 | CODE_FOR_nothing | 752 | 753 | CODE_FOR_nothing
umin_optab | 702 | 703 | CODE_FOR_nothing | 754 | 755 | CODE_FOR_nothing
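The lookup on this slide can be mimicked outside GCC with a small table. The sketch below is only an illustration: the enum, struct, and function names are hypothetical stand-ins, not GCC's real declarations, and the numbers are the insn codes shown in the slide's table.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for GCC's machine modes and insn codes. */
enum mode { QI, HI, SI, V8QI, V4HI, V2SI, NUM_MODES };
enum { CODE_FOR_nothing = -1 };

/* One row of the operation table: an insn code per machine mode. */
struct optab {
  const char *name;
  int insn_code[NUM_MODES];
};

/* Rows copied from the slide's example table. */
static const struct optab smin_optab =
  { "smin", { 700, 701, CODE_FOR_nothing, 752, 753, CODE_FOR_nothing } };
static const struct optab umin_optab =
  { "umin", { 702, 703, CODE_FOR_nothing, 754, 755, CODE_FOR_nothing } };

/* Mirrors the vectorizer's check: an operation is supported in a mode
   only if the table holds something other than CODE_FOR_nothing. */
static bool target_supports (const struct optab *op, enum mode m)
{
  assert (m >= 0 && m < NUM_MODES);
  return op->insn_code[m] != CODE_FOR_nothing;
}
```

With these rows, target_supports (&smin_optab, V2SI) is false, which corresponds to the "operation not supported by target." branch in the snippet, while target_supports (&smin_optab, V4HI) is true.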
Enabling vectorization for a new port
Basic features:
<target>.md
- distinction between scalar and vector ops: operand modes
- availability of vector ops: deduced from MD file
<target>.h
- specify supported vector length in bytes:
#define UNITS_PER_SIMD_WORD 16
<target>-modes.def
- specify supported vector modes:
/* Vector modes. */
VECTOR_MODES (INT, 8);    /* V8QI V4HI V2SI */
VECTOR_MODES (INT, 16);   /* V16QI V8HI V4SI V2DI */
VECTOR_MODE (INT, DI, 1);
VECTOR_MODES (FLOAT, 8);  /* V4HF V2SF */
VECTOR_MODES (FLOAT, 16); /* V8HF V4SF V2DF */
Advanced features:
Special idioms:
- generic vector operations: look over list of idioms in optabs.h
#define reduc_smax_optab (optab_table[OTI_reduc_smax])
#define reduc_umax_optab (optab_table[OTI_reduc_umax])
#define reduc_smin_optab (optab_table[OTI_reduc_smin])
#define reduc_umin_optab (optab_table[OTI_reduc_umin])
#define reduc_splus_optab (optab_table[OTI_reduc_splus])
#define reduc_uplus_optab (optab_table[OTI_reduc_uplus])
#define ssum_widen_optab (optab_table[OTI_ssum_widen])
#define usum_widen_optab (optab_table[OTI_usum_widen])
#define sdot_prod_optab (optab_table[OTI_sdot_prod])
#define udot_prod_optab (optab_table[OTI_udot_prod])
#define vec_set_optab (optab_table[OTI_vec_set])
#define vec_extract_optab (optab_table[OTI_vec_extract])
#define vec_extract_even_optab (optab_table[OTI_vec_extract_even])
#define vec_extract_odd_optab (optab_table[OTI_vec_extract_odd])
#define vec_interleave_high_optab (optab_table[OTI_vec_interleave_high])
#define vec_interleave_low_optab (optab_table[OTI_vec_interleave_low])
#define vec_init_optab (optab_table[OTI_vec_init])
#define vec_shl_optab (optab_table[OTI_vec_shl])
#define vec_shr_optab (optab_table[OTI_vec_shr])
#define vec_realign_load_optab (optab_table[OTI_vec_realign_load])
#define vec_widen_umult_hi_optab (optab_table[OTI_vec_widen_umult_hi])
#define vec_widen_umult_lo_optab (optab_table[OTI_vec_widen_umult_lo])
#define vec_widen_smult_hi_optab (optab_table[OTI_vec_widen_smult_hi])
#define vec_widen_smult_lo_optab (optab_table[OTI_vec_widen_smult_lo])
#define vec_unpacks_hi_optab (optab_table[OTI_vec_unpacks_hi])
#define vec_unpacku_hi_optab (optab_table[OTI_vec_unpacku_hi])
#define vec_unpacks_lo_optab (optab_table[OTI_vec_unpacks_lo])
#define vec_unpacku_lo_optab (optab_table[OTI_vec_unpacku_lo])
#define vec_pack_mod_optab (optab_table[OTI_vec_pack_mod])
#define vec_pack_ssat_optab (optab_table[OTI_vec_pack_ssat])
#define vec_pack_usat_optab (optab_table[OTI_vec_pack_usat])
- specialized vector operations: look over target.h
/* Functions relating to vectorization. */
struct vectorize
{
  tree (* builtin_mask_for_load) (void);
  tree (* builtin_vectorized_function) (unsigned, tree);
  tree (* builtin_mul_widen_even) (tree);
  tree (* builtin_mul_widen_odd) (tree);
} vectorize;
gcc/gcc: rtl.def, target.h, optabs.h
gcc/gcc/config/<port>: <target>.opt, <target>.h, <target>.c, <target>.md
Enable the vectorizer testcases
- testcases are in gcc/gcc/testsuite/gcc.dg/vect
- additional target-specific testcases: testsuite/gcc.target/i386/vect1.c
- vect.exp: add logic to decide whether to compile/run and with which target-specific options:
if [istarget "powerpc*-*-*"] {
    …
    }
} elseif { [istarget "spu-*-*"] } {
    set dg-do-what-default run
} elseif { [istarget "i?86-*-*"] || [istarget "x86_64-*-*"] } {
    lappend DEFAULT_VECTCFLAGS "-msse2"
    set dg-do-what-default run
} elseif { [istarget "mipsisa64*-*-*"]
           && [check_effective_target_mpaired_single] } {
    lappend DEFAULT_VECTCFLAGS "-mpaired-single"
    set dg-do-what-default run
} elseif [istarget "sparc*-*-*"] {
    …
} elseif [istarget "alpha*-*-*"] {
    lappend DEFAULT_VECTCFLAGS "-mmax"
    if [check_alpha_max_hw_available] {
        set dg-do-what-default run
    } else {
        set dg-do-what-default compile
    }
} elseif [istarget "ia64-*-*"] {
    set dg-do-what-default run
} else {
    return
Add where relevant in testsuite/lib/target-supports.exp:
proc check_effective_target_vect_int
check_effective_target_vect_shift
check_effective_target_vect_long
proc check_effective_target_vect_float
proc check_effective_target_vect_double { } {
global et_vect_double_saved
if [info exists et_vect_double_saved] {
verbose "using cached result" 2
} else {
set et_vect_double_saved 0
if { [istarget i?86-*-*]
|| [istarget x86_64-*-*]
|| [istarget spu-*-*] } {
set et_vect_double_saved 1
}
}
return $et_vect_double_saved
}
check_effective_target_vect_no_int_max
check_effective_target_vect_no_int_add
check_effective_target_vect_sdot_hi
check_effective_target_vect_udot_hi
check_effective_target_vect_sdot_si
check_effective_target_vect_udot_si
….
A tree-level pass
New C files in gcc/gcc: tree-vectorizer.c, tree-vect-analyze.c, tree-vect-transform.c, tree-vect-patterns.c, tree-vectorizer.h
tree-flow.h: prototype for pass function:
unsigned vectorize_loops (void);
gcc/Makefile.in entries
The pass is invoked for each function
unsigned vectorize_loops (void)
{
unsigned int i;
unsigned int num_vectorized_loops = 0;
unsigned int vect_loops_num;
loop_iterator li;
struct loop *loop;
…
vect_loops_num = number_of_loops ();
FOR_EACH_LOOP (li, loop, LI_ONLY_OLD)
{
loop_vec_info loop_vinfo;
vect_loop_location = find_loop_location (loop);
loop_vinfo = vect_analyze_loop (loop);
loop->aux = loop_vinfo;
if (!loop_vinfo || !LOOP_VINFO_VECTORIZABLE_P (loop_vinfo))
continue;
vect_transform_loop (loop_vinfo);
num_vectorized_loops++;
}
if (vect_print_dump_info (REPORT_VECTORIZED_LOOPS))
fprintf (vect_dump, "vectorized %u loops in function.\n",
num_vectorized_loops);
…
}
add the pass to the pass hierarchy in passes.c:
  …
  NEXT_PASS (pass_split_crit_edges);
  NEXT_PASS (pass_pre);
  NEXT_PASS (pass_may_alias);
  NEXT_PASS (pass_sink_code);
  NEXT_PASS (pass_tree_loop);
  NEXT_PASS (pass_cse_reciprocals);
  NEXT_PASS (pass_reassoc);
  NEXT_PASS (pass_vrp);
  NEXT_PASS (pass_dominator);
  p = &pass_tree_loop.sub;
  NEXT_PASS (pass_tree_loop_init);
  NEXT_PASS (pass_copy_prop);
  NEXT_PASS (pass_lim);
  NEXT_PASS (pass_tree_unswitch);
  NEXT_PASS (pass_scev_cprop);
  NEXT_PASS (pass_empty_loop);
  NEXT_PASS (pass_record_bounds);
  NEXT_PASS (pass_linear_transform);
  NEXT_PASS (pass_iv_canon);
  NEXT_PASS (pass_if_conversion);
  NEXT_PASS (pass_vectorize);
  NEXT_PASS (pass_complete_unroll);
  NEXT_PASS (pass_loop_prefetch);
  NEXT_PASS (pass_iv_optimize);
  NEXT_PASS (pass_tree_loop_done);
  *p = NULL;
  p = &pass_vectorize.sub;
  NEXT_PASS (pass_lower_vector_ssa);
  NEXT_PASS (pass_dce_loop);
  *p = NULL;
in tree-pass.h: prototype for pass structure:
extern struct tree_opt_pass pass_vectorize;
pass-structure definition in tree-ssa-loop.c
• pass structure definition:
struct tree_opt_pass pass_vectorize =
{
  "vect",                             /* name */
  gate_tree_vectorize,                /* gate */
  tree_vectorize,                     /* execute */
  NULL,                               /* sub */
  NULL,                               /* next */
  0,                                  /* static_pass_number */
  TV_TREE_VECTORIZATION,              /* tv_id */
  PROP_cfg | PROP_ssa,                /* properties_required */
  0,                                  /* properties_provided */
  0,                                  /* properties_destroyed */
  TODO_verify_loops,                  /* todo_flags_start */
  TODO_dump_func | TODO_update_ssa,   /* todo_flags_finish */
  0                                   /* letter */
};
• timevar.def: variable used for timing and for identification in timing reports:
DEFTIMEVAR (TV_TREE_VECTORIZATION , "tree vectorization")
• gate function:
static bool
gate_tree_vectorize (void)
{
  return flag_tree_vectorize && current_loops;
}
• execute function:
static unsigned int
tree_vectorize (void)
{
  return vectorize_loops ();
}
• common.opt: add command line option:
ftree-vectorize
Common Report Var(flag_tree_vectorize)
Enable loop vectorization on trees
invoke.texi: document the pass for the GCC manual:
@item -ftree-vectorize
Perform loop vectorization on trees.
@item vect
@opindex fdump-tree-vect
Dump each function after applying vectorization of loops. The file name is
made by appending @file{.vect} to the source file name.
gcc -O2 -ftree-vectorize example.c
gcc -O2 -ftree-vectorize -maltivec example.c
gcc -O2 -ftree-vectorize -msse2 example.c
gcc -O2 -ftree-vectorize -maltivec -fdump-tree-vect example.c
gcc -O2 -ftree-vectorize -maltivec -fdump-tree-vect-details example.c
gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=2 example.c
gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=7 -fdump-tree-vect example.c
1. [tree-vect*.c]
2. tree-flow.h
3. Makefile.in
4. [tree-ssa-loop.c]
5. timevar.def
6. common.opt
7. invoke.texi
Example: vectorizer dump reports
int main1 (short *in, int off, short scale, int n)
{
int i, sum = 0;
for (i = 0; i < n; i++) {
sum += ((int) in[i] * (int) in[i+off]) >> scale;
}
return sum;
}
Kernel: autocorrelation
Speedups:
- powerpc970: 5-6x
- Cell SPU: 4-5x
[dorit@mac-ira vect]$ gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=5 vect-widen-mult-sum.c
vect-widen-mult-sum.c:16: note: Vectorizing an unaligned access.
vect-widen-mult-sum.c:16: note: Vectorizing an unaligned access.
vect-widen-mult-sum.c:16: note: LOOP VECTORIZED.
vect-widen-mult-sum.c:12: note: vectorized 1 loops in function.
Auto-vectorization Skeleton
tree-vect-analyze.c:
vect_analyze_loop (loop) {
  if (!1_analyze_loop_form (loop)) FAIL
  if (!2_analyze_data_refs (loop)) FAIL
  if (!3_analyze_scalar_dependence_cycles (loop)) FAIL
  if (!4_pattern_recog (loop)) FAIL
  if (!5_analyze_data_alignment (loop)) FAIL
  if (!6_determine_VF (loop)) FAIL
  if (!7_analyze_data_dependence_distances (loop)) FAIL
  if (!8_analyze_memory_access_patterns (loop)) FAIL
  if (!9_analyze_all_operations_supported (loop)) FAIL
  SUCCEED
}
tree-vect-transform.c:
if SUCCEED:
vect_transform_loop (loop) {
  FOR_ALL_STMTS_IN_LOOP(loop, stmt)
    replace_OP_by_VOP (stmt);
  decrease_loop_bound_by_factor_VF (loop);
}
Auto-Vectorization Transformation
original serial loop:
for (i=0; i<N; i++) { a[i] = a[i] + b[i]; }
vectorization, step 1: replace scalar statements with vector statements
loop in vector notation:
for (i=0; i<N; i+=VF) { a[i:i+VF-1] = a[i:i+VF-1] + b[i:i+VF-1]; }
vectorization, step 2: modify loop bound (strip-mine) and create epilog loop
vectorized loop:
for (i=0; i<(N-N%VF); i+=VF) { a[i:i+VF-1] = a[i:i+VF-1] + b[i:i+VF-1]; }
epilog loop:
for ( ; i < N; i++) { a[i] = a[i] + b[i]; }
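The strip-mined loop and its epilog can be written out in plain C. This is a sketch under stated assumptions, not the code GCC generates: the function name add_strip_mined is hypothetical, VF is fixed at 4 as in the slides, and the inner j loop merely stands in for a single vector operation.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4  /* vectorization factor, as in the slides */

/* a[i] = a[i] + b[i] for i in [0, n), strip-mined by VF.
   The inner loop over j stands in for one vector add. */
void add_strip_mined (float *a, const float *b, size_t n)
{
  size_t i;
  size_t main_bound = n - n % VF;   /* N - N%VF from the slide */

  assert (a != NULL && b != NULL);

  /* "vectorized" loop: whole groups of VF elements */
  for (i = 0; i < main_bound; i += VF)
    for (size_t j = 0; j < VF; j++)
      a[i + j] = a[i + j] + b[i + j];

  /* epilog loop: the remaining n % VF elements, one at a time */
  for (; i < n; i++)
    a[i] = a[i] + b[i];
}
```

For n = 10 the main loop covers elements 0 to 7 in two steps and the epilog handles elements 8 and 9.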
Vectorization on SSA-ed GIMPLE trees
source loop:
int i;
float a[N], b[N];
for (i=0; i < 16; i++)
  a[i] = a[i] * b[i];

scalar loop:
float T.1, T.2, T.3;
loop:
  if (i >= 16) break;
  S1: T.1 = a[i];
  S2: T.2 = b[i];
  S3: T.3 = T.1 * T.2;
  S4: a[i] = T.3;
  S5: i = i + 1;
  goto loop;

VF = 4: “unroll by VF and replace”
loop:
  if (i >= 16) break;
  T.11 = a[i]; T.12 = a[i+1]; T.13 = a[i+2]; T.14 = a[i+3];
  T.21 = b[i]; T.22 = b[i+1]; T.23 = b[i+2]; T.24 = b[i+3];
  T.31 = T.11 * T.21; T.32 = T.12 * T.22; T.33 = T.13 * T.23; T.34 = T.14 * T.24;
  a[i] = T.31; a[i+1] = T.32; a[i+2] = T.33; a[i+3] = T.34;
  i = i + 4;
  goto loop;

vectorized loop:
v4sf VT.1, VT.2, VT.3;
v4sf *VPa = (v4sf *)a, *VPb = (v4sf *)b;
int indx;
loop:
  if (indx >= 4) break;
  VT.1 = VPa[indx];
  VT.2 = VPb[indx];
  VT.3 = VT.1 * VT.2;
  VPa[indx] = VT.3;
  indx = indx + 1;
  goto loop;
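The vectorized form can be approximated in source code with GCC's generic vector extension. The sketch below assumes a compiler that supports __attribute__ ((vector_size)) (GCC and Clang do); the function name mul16_vec is hypothetical, and memcpy is used in place of the slide's pointer casts so the example makes no alignment assumption.

```c
#include <assert.h>
#include <string.h>

/* Four packed floats, matching the v4sf type used on the slide. */
typedef float v4sf __attribute__ ((vector_size (16)));

/* a[i] = a[i] * b[i] for 16 elements, four at a time (VF = 4). */
void mul16_vec (float *a, const float *b)
{
  assert (a != NULL && b != NULL);
  for (int indx = 0; indx < 4; indx++)
    {
      v4sf vt1, vt2, vt3;
      memcpy (&vt1, a + 4 * indx, sizeof vt1);   /* VT.1 = VPa[indx] */
      memcpy (&vt2, b + 4 * indx, sizeof vt2);   /* VT.2 = VPb[indx] */
      vt3 = vt1 * vt2;                           /* one vector multiply */
      memcpy (a + 4 * indx, &vt3, sizeof vt3);   /* VPa[indx] = VT.3 */
    }
}
```

The loop runs 4 times instead of 16, which is the "decrease loop bound by a factor of VF" step from the skeleton.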
Vectorizer analyses and transformation: Reduction
s = 0;
for (i=0; i<N; i++) {
s += a[i] * b[i];
}
loop:
  s_1 = phi (0, s_2)      /* reduction: cross-iteration dependence */
  i_1 = phi (0, i_2)      /* induction: cross-iteration dependence */
  xa_1 = a[i_1]
  xb_1 = b[i_1]
  tmp_1 = xa_1 * xb_1
  s_2 = s_1 + tmp_1
  i_2 = i_1 + 1
  if (i_2 < N) goto loop
Analysis
- Detect scalar dependence cycles
- Identify scalar cycles that are reduction/induction
[Figure: numeric illustration of the reduction (tmp_1 values and running sums)]
static void
vect_analyze_scalar_cycles (loop_vec_info loop_vinfo)
{
tree phi;
struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
basic_block bb = loop->header;
if (vect_print_dump_info (REPORT_DETAILS))
fprintf (vect_dump, "=== vect_analyze_scalar_cycles ===");
for (phi = phi_nodes (bb); phi; phi = PHI_CHAIN (phi))
{
stmt_vec_info stmt_vinfo = vinfo_for_stmt (phi);
tree def = PHI_RESULT (phi);
if (!is_gimple_reg (SSA_NAME_VAR (def)))
continue;
STMT_VINFO_DEF_TYPE (stmt_vinfo) = vect_unknown_def_type;
tree access_fn = analyze_scalar_evolution (loop, def);
if (!access_fn)
continue;
if (vect_is_simple_iv_evolution (loop->num, access_fn))
{
STMT_VINFO_DEF_TYPE (stmt_vinfo) = vect_induction_def;
continue;
}
tree rstmt = vect_is_simple_reduction (loop, phi);
if (rstmt)
{
STMT_VINFO_DEF_TYPE (stmt_vinfo) =
STMT_VINFO_DEF_TYPE (vinfo_for_stmt (rstmt)) =
vect_reduction_def;
}
else
if (vect_print_dump_info (REPORT_DETAILS))
fprintf (vect_dump, "Unknown def-use cycle pattern.");
} /* End for loop */
return;
}
def types: unknown, reduc
tree-vect-analyze.c
Snippet from vect_is_simple_reduction (tree-vectorizer.c):
  edge latch_e = loop_latch_edge (loop);
  tree loop_arg = PHI_ARG_DEF_FROM_EDGE (phi, latch_e);
  tree def_stmt = SSA_NAME_DEF_STMT (loop_arg);
  tree operation = GIMPLE_STMT_OPERAND (def_stmt, 1);
  enum tree_code code = TREE_CODE (operation);
  …
  if (!commutative_tree_code (code) || !associative_tree_code (code))
    {
      if (vect_print_dump_info (REPORT_DETAILS))
        {
          fprintf (vect_dump, "reduction: not commutative/associative: ");
          print_generic_expr (vect_dump, operation, TDF_SLIM);
        }
      return NULL_TREE;
    }
  if (SCALAR_FLOAT_TYPE_P (type) && !flag_unsafe_math_optimizations)
    {
      if (vect_print_dump_info (REPORT_DETAILS))
        {
          fprintf (vect_dump, "reduction: unsafe fp math optimization: ");
          print_generic_expr (vect_dump, operation, TDF_SLIM);
        }
      return NULL_TREE;
    }
  …
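The floating-point check in this snippet exists because vectorizing a reduction changes the order in which the elements are combined, and floating-point addition is not associative. The small sketch below illustrates this; the function names are hypothetical, and the volatile temporaries are only there to force each intermediate result to be rounded to float.

```c
#include <assert.h>

/* (x + y) + z, with the intermediate rounded to float */
float sum_left (float x, float y, float z)
{
  volatile float t = x + y;
  return t + z;
}

/* x + (y + z), with the intermediate rounded to float */
float sum_right (float x, float y, float z)
{
  volatile float t = y + z;
  return x + t;
}

/* The two groupings disagree for these inputs: 1.0f is absorbed
   when it is first added to -1e20f, but survives when the two
   large values cancel first. */
void check_reassociation_changes_result (void)
{
  assert (sum_left (1e20f, -1e20f, 1.0f) == 1.0f);
  assert (sum_right (1e20f, -1e20f, 1.0f) == 0.0f);
}
```

A vectorized sum keeps VF partial sums and combines them at the end, which is exactly such a regrouping, so it is only applied to floating-point data when the user has allowed it.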
loop:
  s_1 = phi (0, s_2)
  i_1 = phi (0, i_2)
  xa_1 = a[i_1]
  xb_1 = b[i_1]
  tmp_1 = xa_1 * xb_1
  s_2 = s_1 + tmp_1
  i_2 = i_1 + 1
  if (i_2 < N) goto loop
Transformation
loop:
  vs_1 = phi (vs_0, vs_2)
  i_1 = phi (0, i_2)
  vxa_1 = vpa[i_1]
  vxb_1 = vpb[i_1]
  vtmp_1 = vxa_1 * vxb_1
  vs_2 = vs_1 + vtmp_1
  i_2 = i_1 + 1
  if (i_2 < N/VF) goto loop
vec_dest = vect_create_destination_var (scalar_dest, vectype);
expr = build2 (code, vectype, loop_vec_def0, reduc_def);
new_stmt = build2 (GIMPLE_MODIFY_STMT, void_type_node, vec_dest, expr);
new_temp = make_ssa_name (vec_dest, new_stmt);
GIMPLE_STMT_OPERAND (new_stmt, 0) = new_temp;
bsi_insert_before (bsi, vec_stmt, BSI_SAME_STMT);
tree-vect-transform.c
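Transformation (continued): finishing the reduction after the loop
s = 0;
for (i=0; i<N; i++) {
  s += a[i] * b[i];
}
printf ("sum = %f\n", s);
loop:
  vs_1 = phi (vs_0, vs_2)
  i_1 = phi (0, i_2)
  vxa_1 = vpa[i_1]
  vxb_1 = vpb[i_1]
  vtmp_1 = vxa_1 * vxb_1
  vs_2 = vs_1 + vtmp_1
  i_2 = i_1 + 1
  if (i_2 < N/VF) goto loop
Worked example (VF = 4):
vs_0 = {0, 0, 0, 0}
iteration 1: vtmp_1 = {0, 1, 2, 3}, vs_2 = {0, 1, 2, 3}
iteration 2: vtmp_1 = {4, 5, 6, 7}, vs_2 = {4, 6, 8, 10}
Reducing vs_2 to the scalar s after the loop:
- scalar epilog: extract s1, s2, s3, s4 and add them: s = 28
- whole vector shifts: {4, 6, 8, 10} + {8, 10} = {12, 16}; 12 + 16 = 28
- sum across: a single reduction operation gives 28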
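The worked example can be replayed in plain C. The sketch below is an illustration only: dot_reduce is a hypothetical name, an array of VF partial sums stands in for the vector accumulator vs, and the halving loop imitates the "whole vector shifts" epilog. Integer data is used so that regrouping the additions cannot change the result.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4  /* vectorization factor, as in the slides */

/* Sum of a[i] * b[i], accumulated in VF partial sums. */
int dot_reduce (const int *a, const int *b, size_t n)
{
  int vs[VF] = { 0, 0, 0, 0 };      /* vs_0 */

  assert (n % VF == 0);             /* sketch: no epilog loop for leftovers */

  /* vectorized loop: vs = vs + vtmp, one lane per j */
  for (size_t i = 0; i < n; i += VF)
    for (size_t j = 0; j < VF; j++)
      vs[j] += a[i + j] * b[i + j];

  /* reduction epilog in the style of whole-vector shifts:
     add the upper half onto the lower half, then repeat */
  for (size_t width = VF / 2; width > 0; width /= 2)
    for (size_t j = 0; j < width; j++)
      vs[j] += vs[j + width];

  return vs[0];
}
```

With a = {0, 1, ..., 7} and b all ones, the partial sums after the loop are {4, 6, 8, 10}, the first epilog step gives {12, 16}, and the result is 28.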
Adding new idioms
tree.def: define the tree-code:
/* Reduction operations.
   Operations that take a vector of elements and "reduce" it to a scalar
   result (e.g. summing the elements of the vector, finding the minimum over
   the vector elements, etc).
   Operand 0 is a vector; the first element in the vector has the result.
   Operand 1 is a vector.  */
DEFTREECODE (REDUC_PLUS_EXPR, "reduc_plus_expr", tcc_unary, 1)
tree-pretty-print.c: dump_generic_node, op_prio, op_symbol
tree-inline.c: estimate_num_insns_1 ()
optabs.h: add a new operator table (optab) index to enum optab_index:
  /* Reduction operations on a vector operand. */
  OTI_reduc_splus,
  OTI_reduc_uplus,
optabs.h: define matching shortcuts:
#define reduc_splus_optab (optab_table[OTI_reduc_splus])
#define reduc_uplus_optab (optab_table[OTI_reduc_uplus])
optabs.c: add selection of appropriate optab in the dispatch function optab_for_tree_code():
  case REDUC_PLUS_EXPR:
    return TYPE_UNSIGNED (type) ? reduc_uplus_optab : reduc_splus_optab;
optabs.c: initialize the new optabs in init_optabs():
  reduc_splus_optab = init_optab (UNKNOWN);
  reduc_uplus_optab = init_optab (UNKNOWN);
genopinit.c: fill in the optabs:
"reduc_splus_optab->handlers[$A].insn_code = CODE_FOR_$(reduc_splus_$a$)",
"reduc_uplus_optab->handlers[$A].insn_code = CODE_FOR_$(reduc_uplus_$a$)",
optab/type columns: qi, hi, si, v8qi, v4hi, v2si
reduc_splus_optab: CODE_FOR_nothing, CODE_FOR_nothing, CODE_FOR_nothing
reduc_uplus_optab: CODE_FOR_nothing, CODE_FOR_nothing, CODE_FOR_nothing
1. tree.def
2. tree-pretty-print.c
3. tree-inline.c
4. optabs.h
5. optabs.c
6. genopinit.c
7. expr.c
8. <target>.md
expr.c: tree-to-rtl expansion:
  case REDUC_PLUS_EXPR:
    {
      op0 = expand_normal (TREE_OPERAND (exp, 0));
      this_optab = optab_for_tree_code (code, type);
      temp = expand_unop (mode, this_optab, op0, target, unsignedp);
      gcc_assert (temp);
      return temp;
    }
<target>.md: RTL instruction definition:
(define_expand "reduc_splus_<mode>"
  [(set (match_operand:VIshort 0 "register_operand" "=v")
        (unspec:VIshort [(match_operand:VIshort 1 "register_operand" "v")]
                        UNSPEC_REDUC_PLUS))]
  "TARGET_ALTIVEC"
  "{
  rtx vzero = gen_reg_rtx (V4SImode);
  rtx vtmp1 = gen_reg_rtx (V4SImode);
  emit_insn (gen_altivec_vspltisw (vzero, const0_rtx));
  emit_insn (gen_altivec_vsum4s<VI_char>s (vtmp1, operands[1], vzero));
  emit_insn (gen_altivec_vsumsws_nomode (operands[0], vtmp1, vzero));
  DONE;
}")
vect-reduc-min.c
#define N 16
int main1 ()
{
int i;
float c[N] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
float min = 10;
for (i = 0; i < N; i++) {
min = min > c[i] ? c[i] : min;
}
/* check results: */
if (min != 0)
abort ();
return 0;
}
gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=4 vect-reduc-min.c
vect-reduc-min.c:14: note: not vectorized: unsupported use in stmt.
vect-reduc-min.c:9: note: vectorized 0 loops in function.
gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=7 vect-reduc-min.c
…
vect-reduc-min.c:14: note: === vect_analyze_scalar_cycles ===
vect-reduc-min.c:14: note: Analyze phi: min_6 = PHI <min_3(6), 1.0e+1(2)>
vect-reduc-min.c:14: note: reduction: not commutative/associative: min_6 > min_7 ? min_7 : min_6
vect-reduc-min.c:14: note: Unknown def-use cycle pattern
…
vect-reduc-min.c:14: note: Unsupported pattern.
vect-reduc-min.c:14: note: not vectorized: unsupported use in stmt.
vect-reduc-min.c:14: note: unexpected pattern.
vect-reduc-min.c:9: note: vectorized 0 loops in function.
gcc -O2 -ftree-vectorize -maltivec vect-reduc-min.c -ftree-vectorizer-verbose=4 -ffast-math
vect-reduc-min.c:14: note: LOOP VECTORIZED.
vect-reduc-min.c:9: note: vectorized 1 loops in function.
Compilation Flow Example
vect-min.c.081t.ifcvt
main1 ()
{
  unsigned int ivtmp.31;
  int pretmp.25;
  float min;
  float c[16];
  int i;
  float D.2429;
  static float C.3[16] = {…};
<bb 2>:
  c = C.3;
  # ivtmp.31_2 = PHI <ivtmp.31_3(4), 16(2)>
  # min_15 = PHI <min_7(4), 1.0e+1(2)>
  # i_14 = PHI <i_8(4), 0(2)>
<L0>:;
  D.2429_6 = c[i_14];
  min_7 = MIN_EXPR <D.2429_6, min_15>;
  i_8 = i_14 + 1;
  ivtmp.31_3 = ivtmp.31_2 - 1;
  if (ivtmp.31_3 != 0) goto <L8>; else goto <L2>;
<L8>:;
  goto <bb 3> (<L0>);
  # min_1 = PHI <min_7(3)>
<L2>:;
  if (min_1 != 0.0) goto <L3>; else goto <L4>;
<L3>:;
  abort ();
<L4>:;
  return 0;
}
vect-min.c.004t.gimple
c = C.3;
min = 1.0e+1;
i = 0;
goto <D2425>;
<D2424>:;
i.4 = i;
D.2429 = c[i.4];
min = MIN_EXPR <D.2429, min>;
i = i + 1;
<D2425>:;
if (i <= 15)
{
goto <D2424>;
}
else
{
goto <D2426>;
}
<D2426>:;
if (min != 0.0)
{
abort ();
}
else
{
}
D.2430 = 0;
return D.2430;
-fdump-tree-all -da
vect-min.c.082t.vect
<bb 2>:
c = C.3;
vect_pc.32_5 = (__vector float *) &c;
vect_cst_.40_21 = { 1.0e+1, 1.0e+1, 1.0e+1, 1.0e+1 };
# ivtmp.43_28 = PHI <ivtmp.43_29(4), 0(2)>
# vect_var_.39_19 = PHI <vect_var_.39_20, vect_cst_.40_21>
# ivtmp.37_16 = PHI <ivtmp.37_17(4), vect_pc.32_5(2)>
# ivtmp.31_2 = PHI <ivtmp.31_3(4), 16(2)>
# min_15 = PHI <min_7(4), 1.0e+1(2)>
# i_14 = PHI <i_8(4), 0(2)>
<L0>:;
vect_var_.38_18 = *ivtmp.37_16;
D.2429_6 = c[i_14];
vect_var_.39_20 = MIN_EXPR <vect_var_.38_18, vect_var_.39_19>;
min_7 = MIN_EXPR <D.2429_6, min_15>;
i_8 = i_14 + 1;
ivtmp.31_3 = ivtmp.31_2 - 1;
ivtmp.37_17 = ivtmp.37_16 + 16B;
ivtmp.43_29 = ivtmp.43_28 + 1;
if (ivtmp.43_29 < 4) goto <L8>; else goto <L2>;
<L8>:;
goto <bb 3> (<L0>);
Continued:
# vect_var_.39_22 = PHI <vect_var_.39_20(3)>
# min_1 = PHI <min_7(3)>
<L2>:;
vect_var_.42_23 = vect_var_.39_22 v>> 64;
vect_var_.42_24 =
MIN_EXPR <vect_var_.42_23, vect_var_.39_22>;
vect_var_.42_25 = vect_var_.42_24 v>> 32;
vect_var_.42_26 =
MIN_EXPR <vect_var_.42_25, vect_var_.42_24>;
vect_var_.41_27 =
BIT_FIELD_REF <vect_var_.42_26, 32, 96>;
if (vect_var_.41_27 != 0.0) goto <L3>; else goto <L4>;
<L3>:;
abort ();
<L4>:;
return 0;
}
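The vectorized dump above can be modeled in plain C to make the two phases visible: a loop that keeps four partial minima, and an epilogue that combines them with two whole-vector shifts. This is only a sketch. The `vec4f` type and the helper names are hypothetical, the shift models `v>>` by moving lanes toward higher indices, and the final extract assumes that `BIT_FIELD_REF <…, 32, 96>` selects the last 32-bit lane.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4   /* vectorization factor: four floats per 128-bit vector */

typedef struct { float lane[VF]; } vec4f;   /* hypothetical vector type */

/* Lane-wise minimum, standing in for MIN_EXPR on vectors. */
static vec4f vec_min (vec4f a, vec4f b)
{
  vec4f r;
  for (int i = 0; i < VF; i++)
    r.lane[i] = a.lane[i] < b.lane[i] ? a.lane[i] : b.lane[i];
  return r;
}

/* Whole-vector shift by n lanes: lane i receives lane i - n, and zeros
   are shifted in.  n = 2 models "v>> 64", n = 1 models "v>> 32".  */
static vec4f vec_shift_lanes (vec4f v, int n)
{
  vec4f r;
  for (int i = 0; i < VF; i++)
    r.lane[i] = (i - n >= 0) ? v.lane[i - n] : 0.0f;
  return r;
}

/* Minimum of init and c[0..n-1]; n must be a multiple of VF.  */
float min_reduce (const float *c, size_t n, float init)
{
  assert (n % VF == 0);
  vec4f acc = { { init, init, init, init } };   /* like vect_cst_ */
  for (size_t i = 0; i < n; i += VF) {
    vec4f v;
    for (int l = 0; l < VF; l++)
      v.lane[l] = c[i + l];                     /* one vector load */
    acc = vec_min (v, acc);
  }
  /* Epilogue: fold the four partial minima into the last lane.  */
  acc = vec_min (vec_shift_lanes (acc, 2), acc);
  acc = vec_min (vec_shift_lanes (acc, 1), acc);
  return acc.lane[VF - 1];                      /* the BIT_FIELD_REF step */
}
```

The lanes that receive shifted-in zeros hold meaningless values after the epilogue; only the last lane is read, so they do not affect the result.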
vect-min.c.095t.dse2
c = C.3;
vect_pc.36_4 = (__vector float *) &c;
vect_var_.38_6 = *vect_pc.36_4;
vect_var_.39_1 = MIN_EXPR <vect_var_.38_6, { 1.0e+1, 1.0e+1, 1.0e+1, 1.0e+1 }>;
ivtmp.37_14 = vect_pc.36_4 + 16B;
vect_var_.38_32 = *ivtmp.37_14;
vect_var_.39_33 = MIN_EXPR <vect_var_.39_1, vect_var_.38_32>;
ivtmp.37_34 = ivtmp.37_14 + 16B;
vect_var_.38_39 = *ivtmp.37_34;
vect_var_.39_40 = MIN_EXPR <vect_var_.39_33, vect_var_.38_39>;
ivtmp.37_41 = ivtmp.37_34 + 16B;
vect_var_.38_18 = *ivtmp.37_41;
vect_var_.39_20 = MIN_EXPR <vect_var_.38_18, vect_var_.39_40>;
vect_var_.42_23 = vect_var_.39_20 v>> 64;
vect_var_.42_24 = MIN_EXPR <vect_var_.39_20, vect_var_.42_23>;
vect_var_.42_25 = vect_var_.42_24 v>> 32;
vect_var_.42_26 = MIN_EXPR <vect_var_.42_25, vect_var_.42_24>;
vect_var_.41_27 = BIT_FIELD_REF <vect_var_.42_26, 32, 96>;
if (vect_var_.41_27 != 0.0) goto <L3>; else goto <L4>;
<L3>:;
abort ();
<L4>:;
return 0;
}
vect-min.c.138r.life2
(insn:HI 26 25 27 2 (set (reg:V4SF 138) (mem/u/c/i:V4SF (reg/f:SI 139) [2 S16 A128])) 632
{altivec_lvx_v4sf} ))
(insn:HI 27 26 28 2 (set (reg:V4SF 141) (mem:V4SF (plus:SI (reg/f:SI 113 sfp) (const_int 16 [0x10])) [2 S16 A128])) 632
{altivec_lvx_v4sf} (nil) (nil))
(insn:HI 28 27 29 2 (set (reg:V4SF 126 [ vect_var_.39 ]) (smin:V4SF (reg:V4SF 138) (reg:V4SF 141))) 706 {sminv4sf3}))
(insn:HI 29 28 30 2 (set (reg/f:SI 127 [ ivtmp.37 ]) (plus:SI (reg/f:SI 134) (const_int 16 [0x10]))) 79 {*addsi3_internal1} (nil) (nil))
(insn:HI 30 29 31 2 (set (reg:V4SF 142) (mem:V4SF (plus:SI (reg/f:SI 134) (const_int 16 [0x10])) [2 S16 A128])) 632
{altivec_lvx_v4sf} (nil) (nil)))
(insn:HI 31 30 32 2 (set (reg:V4SF 121 [ vect_var_.50 ]) (smin:V4SF (reg:V4SF 126 [ vect_var_.39 ]) (reg:V4SF 142))) 706 {sminv4sf3} (nil))))
(insn:HI 33 32 34 2 (set (reg:V4SF 143) (mem:V4SF (plus:SI (reg/f:SI 127 [ ivtmp.37 ]) (const_int 16 [0x10])) [2 S16 A128])) 632
{altivec_lvx_v4sf} (nil)) (nil))
(insn:HI 34 33 35 2 (set (reg:V4SF 119 [ vect_var_.53 ]) (smin:V4SF (reg:V4SF 121 [ vect_var_.50 ]) (reg:V4SF 143))) 706 {sminv4sf3} (nil))))
vect-min.c.153r.sched2
(insn:TI 82 84 89 2 (set (reg:V4SF 77 0 [138])
(mem/u/c/i:V4SF (reg/f:SI 9 9 [139]) [2 S16 A128])) 632 {altivec_lvx_v4sf} (nil) (nil))))
(insn 89 82 83 2 (set (reg:SI 9 9) (plus:SI (reg/f:SI 1 1)
(const_int 16 [0x10]))) 79 {*addsi3_internal1} (nil) (nil))
(insn:TI 83 89 90 2 (set (reg:V4SF 78 1 [141])
(mem:V4SF (reg:SI 9 9) [2 S16 A128])){altivec_lvx_v4sf} ))
(insn 90 83 92 2 (set (reg:SI 9 9)
(plus:SI (reg/f:SI 29 29 [orig:127 ivtmp.37 ] [127])
(const_int 16 [0x10]))) 79 {*addsi3_internal1} (nil) (nil))
(insn 92 90 28 2 (set (reg:SI 29 29)
(plus:SI (reg/f:SI 29 29 [orig:127 ivtmp.37 ] [127])
(const_int 32 [0x20]))) 79 {*addsi3_internal1} (nil) (nil))
(insn:TI 28 92 33 2 (set (reg:V4SF 77 0[orig:126 vect_var.39] [126]) (smin:V4SF (reg:V4SF 77 0 [138])
(reg:V4SF 78 1 [141]))) 706 {sminv4sf3} (nil) (nil)))
(insn 33 28 35 2 (set (reg:V4SF 78 1 [143])
(mem:V4SF (reg:SI 9 9) [2 S16 A128])){altivec_lvx_v4sf}
(insn 35 33 93 2 (set (reg:V4SF 89 12 [144])
(mem:V4SF (reg:SI 29 29) [2 S16 A128])) {altivec_lvx_v4sf}))
vect-min.s
main1:
stwu 1,-128(1)
lis 4,.LANCHOR0@ha
mflr 0
la 4,.LANCHOR0@l(4)
li 5,64
stw 29,116(1)
stw 0,132(1)
addi 29,1,16
mr 3,29
bl memcpy
addi 9,29,16
addi 29,29,16
lvx 13,0,9
lis 9,.LC0@ha
la 9,.LC0@l(9)
lvx 0,0,9
addi 9,1,16
lvx 1,0,9
addi 9,29,16
addi 29,29,32
vminfp 0,0,1
lvx 1,0,9
lvx 12,0,29
addi 9,1,108
vminfp 0,0,13
vminfp 0,0,1
vminfp 0,0,12
vsldoi 13,0,0,8
vminfp 0,0,13
vsldoi 1,0,0,12
vminfp 1,1,0
stvewx 1,0,9
lis 9,.LC1@ha
lfs 13,108(1)
lfs 0,.LC1@l(9)
fcmpu 7,13,0
bne- 7,.L7
lwz 0,132(1)
lwz 29,116(1)
li 3,0
addi 1,1,128
mtlr 0
blr
Using the Vectorizer – Programming Hints
Don't unroll the loop by hand. Avoid:
    for (i=0; i<N; i+=4){ a[i] = x; a[i+1] = x; a[i+2] = x; a[i+3] = x;}
  Write instead:
    for (i=0; i<N; i++)
      a[i] = x;
Use countable loops with no side effects
  No function calls in the loop (distribute them into a separate loop)
  No 'break'/'continue'
Avoid aliasing problems
  Use __restrict__ qualified pointers, e.g.:
    foo (float * __restrict__ p, float * __restrict__ q)
Keep the memory access pattern simple
  Don't use an array of structures, e.g.:
    struct {int f1; int f2;} a[N]; for (i=0; i<N; i++) a[i].f1 = x;
  Use separate arrays instead:
    int af1[N], af2[N];
    for (i=0; i<N; i++) af1[i] = x;
  Use a constant increment, i.e., don't use the following:
    for (i=0; i<N; i+=incr) a[i] = x;
Alignment
  Use alignment attributes
  If a loop has more than one misaligned store, distribute it into separate loops (currently the vectorizer peels the loop to align a misaligned store).
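Several of these hints can be combined in one small example. The sketch below is illustrative only: the names `fields`, `set_f1` and `add_into` are made up for this example, and `__restrict__` and `__attribute__ ((aligned))` are GCC extensions. Whether a given compiler version actually vectorizes these loops still has to be checked with the vectorizer's diagnostic output.

```c
#include <assert.h>
#include <stddef.h>

#define N 256

/* Separate arrays instead of an array of structures, so that each
   loop touches one unit-stride stream.  The alignment attribute
   tells the compiler that each array starts on a 16-byte boundary. */
struct fields {
  int f1[N] __attribute__ ((aligned (16)));
  int f2[N] __attribute__ ((aligned (16)));
};

/* Countable loop, constant increment of 1, no calls, no break/continue. */
void set_f1 (struct fields *s, int x)
{
  assert (s != NULL);
  for (int i = 0; i < N; i++)
    s->f1[i] = x;
}

/* __restrict__ promises that p and q do not overlap, so the compiler
   need not assume a dependence between the load and the store. */
void add_into (float *__restrict__ p, const float *__restrict__ q, int n)
{
  for (int i = 0; i < n; i++)
    p[i] = p[i] + q[i];
}
```

Callers of `add_into` must really pass non-overlapping arrays; the qualifier is a promise to the compiler, not a check.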
Using the Vectorizer – Tuning Hints
-ffast-math: if operating on floats in a reduction computation (to allow the vectorizer to change the order of the computation)
    float *b, *c, diff, min, max;
    for (i = 0; i < N; i++) {
      diff += (b[i] - c[i]);
    }
    for (i = 0; i < N; i++) {
      max = max < c[i] ? c[i] : max;
    }
    for (i = 0; i < N; i++) {
      min = min > c[i] ? c[i] : min;
    }
-fwrapv: if operating on signed subword integers (to avoid casts to int that currently confuse the vectorizer)
    signed char *b, *c, diff;
    for (i = 0; i < N; i++) {
      diff += (signed char)(b[i] - c[i]);
    }
--param min-vect-loop-bound=[X]: if loops have a short trip count
-fno-tree-vect-loop-version: if worried about code size (disables loop versioning)
    for (i=0; i<N; i++){
      p[i] = q[i];
    }
  Loop versioning:
    if (q is aligned) {
      for (i=0; i<N; i++){
        x = q[i]; // q is aligned
        p[i] = x;
      }
    } else {
      for (i=0; i<N; i++){
        x = q[i]; // q's alignment unknown
        p[i] = x;
      }
    }
-funroll-loops -fvariable-expansion-in-unroller --param max-variable-expansions-in-unroller=[X]: for improved scheduling of summation (breaking the accumulation into X+1 accumulators to increase ILP).
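The effect of variable expansion on a summation can be written out by hand. The sketch below assumes X=1, that is, two accumulators; the function name `sum_diff_expanded` is hypothetical, and the code shows the shape of the transformation rather than the compiler's actual output.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of b[i] - c[i], with the single accumulator split in two so that
   consecutive additions no longer depend on each other. */
float sum_diff_expanded (const float *b, const float *c, size_t n)
{
  assert (b != NULL && c != NULL);
  float acc0 = 0.0f, acc1 = 0.0f;
  size_t i = 0;
  for (; i + 1 < n; i += 2) {
    acc0 += b[i] - c[i];           /* independent of the acc1 chain */
    acc1 += b[i + 1] - c[i + 1];   /* independent of the acc0 chain */
  }
  if (i < n)                       /* leftover element when n is odd */
    acc0 += b[i] - c[i];
  return acc0 + acc1;              /* combine the partial sums */
}
```

Splitting the accumulator changes the order of the floating-point additions, so the result can differ in the last bits from the serial sum. That reordering is what -ffast-math gives the compiler permission to do.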
More information
Vectorizer:
  http://gcc.gnu.org/projects/tree-ssa/vectorization.html
  http://gcc.gnu.org/wiki/VectorizationTasks
  Summit papers
  - http://www.gccsummit.org/2006/2006-GCC-Summit-Proceedings.pdf
  - ftp://gcc.gnu.org/pub/gcc/summit/2004/Autovectorization.pdf
General:
  http://gcc.gnu.org/onlinedocs/gccint/
  http://gcc.gnu.org/wiki
  Summit papers
Happy Hacking!
The End
for (i = 0; i < n; i++) {
sum += ((int) in[i] * (int) in[i+off]) >> scale;
}
Non-consecutive access patterns
Data in Memory: a b c d e f g h i j k l m n o p
Scalar operations: OP(a), OP(f), OP(k), OP(p)
[Figure: vector registers VR1..VR4 hold {a b c d}, {e f g h}, {i j k l}, {m n o p}; the elements a, f, k, p are gathered into VR5 = {a f k p}]
Vector operation: VOP( a, f, k, p ) VR5
A[i], i={0,5,10,15,…}; access_fn(i) = (0,+,5)
Basic unpacking and packing operations for strided access
Use two pairs of inverse operations widely supported on SIMD platforms:
extract_even, extract_odd:
interleave_high, interleave_low:
Use them recursively to support strided accesses with power-of-2 strides
Support several data types
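The four operations can be modeled on plain arrays to show why they form two inverse pairs. The definitions below are one common convention and are an assumption of this sketch: which half counts as "high" or "low" varies between targets, and the `vec4i` type and `load4` helper are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int lane[4]; } vec4i;   /* hypothetical 4-element vector */

/* Load four consecutive ints into a vector. */
vec4i load4 (const int *p)
{
  assert (p != NULL);
  vec4i r = { { p[0], p[1], p[2], p[3] } };
  return r;
}

/* Even-indexed elements of the concatenation a|b. */
vec4i extract_even (vec4i a, vec4i b)
{
  vec4i r = { { a.lane[0], a.lane[2], b.lane[0], b.lane[2] } };
  return r;
}

/* Odd-indexed elements of the concatenation a|b. */
vec4i extract_odd (vec4i a, vec4i b)
{
  vec4i r = { { a.lane[1], a.lane[3], b.lane[1], b.lane[3] } };
  return r;
}

/* Interleave the first halves of a and b. */
vec4i interleave_high (vec4i a, vec4i b)
{
  vec4i r = { { a.lane[0], b.lane[0], a.lane[1], b.lane[1] } };
  return r;
}

/* Interleave the second halves of a and b. */
vec4i interleave_low (vec4i a, vec4i b)
{
  vec4i r = { { a.lane[2], b.lane[2], a.lane[3], b.lane[3] } };
  return r;
}
```

With these definitions, interleaving the even and odd vectors reproduces the two original vectors, so loads can use the extract pair and stores the interleave pair.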
S1: a = x [8*i]
S2: b = x [8*i+1]
S3: c = x [8*i+2]
S4: d = x [8*i+3]
S5: e = x [8*i+4]
S6: f = x [8*i+5]
S7: g = x [8*i+6]
S8: h = x [8*i+7]
S9: y [2*i] = k = f (a,…,h)
S10: y [2*i+1] = l = g (a,…,h)
[Figure: the 32 loaded elements (indices 0..31) are rearranged by successive rounds of extract_even/extract_odd until each vector holds the elements of one of a..h; the results k and l are interleaved for the store to y.]
δ=8 VF=4
load δ*VF elements
generate δ*log δ extracts (odd/even)
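The counts above can be checked with a small model for δ=8 and VF=4. The sketch below applies log2(δ) rounds of even/odd extraction to δ vectors; the pairing scheme and every name in it are assumptions made for this illustration, not the vectorizer's code.

```c
#include <assert.h>
#include <string.h>

#define VF 4       /* elements per vector */
#define STRIDE 8   /* delta: must be a power of 2 */

typedef struct { int lane[VF]; } vec;   /* hypothetical vector type */

static vec extract_even (vec a, vec b)
{
  vec r = { { a.lane[0], a.lane[2], b.lane[0], b.lane[2] } };
  return r;
}

static vec extract_odd (vec a, vec b)
{
  vec r = { { a.lane[1], a.lane[3], b.lane[1], b.lane[3] } };
  return r;
}

/* On entry v[0..STRIDE-1] hold STRIDE*VF consecutive elements.
   On exit v[k] holds the elements at offsets k, k+STRIDE, k+2*STRIDE, ...
   Returns the number of extract operations performed. */
int deinterleave (vec v[STRIDE])
{
  assert (v != NULL);
  int extracts = 0;
  for (int span = STRIDE; span > 1; span /= 2) {   /* log2(STRIDE) rounds */
    vec out[STRIDE];
    for (int j = 0; j < STRIDE / 2; j++) {
      out[j] = extract_even (v[2 * j], v[2 * j + 1]);
      out[j + STRIDE / 2] = extract_odd (v[2 * j], v[2 * j + 1]);
      extracts += 2;
    }
    memcpy (v, out, sizeof out);                   /* next round's input */
  }
  return extracts;
}
```

Each round performs δ extracts and there are log2(δ) rounds, which gives the δ*log δ total stated above (24 for δ=8).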
Strided Accesses (Interleaved Data)
Very common in real-world computations:
  Complex data
  rgba images (alpha blend)
  multi-channel audio streams (down mix)
Viterbi decoder: 5x improvement on entire benchmark
PLDI 2006
Mixed data types
short b[N];
int a[N];
for (i=0; i<N; i++)
  a[i] = (int) b[i];
Unpack
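A plain-C model of the unpack step may help here. One vector of eight shorts widens into two vectors of four ints, so a single vector load feeds two vector stores. The type and function names are hypothetical, and which half a target calls "high" or "low" is left open.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { short lane[8]; } vec8s;   /* eight 16-bit elements */
typedef struct { int lane[4]; } vec4i;     /* four 32-bit elements */

/* Widen the first four shorts of v to ints. */
static vec4i unpack_first (vec8s v)
{
  vec4i r;
  for (int i = 0; i < 4; i++)
    r.lane[i] = (int) v.lane[i];
  return r;
}

/* Widen the last four shorts of v to ints. */
static vec4i unpack_second (vec8s v)
{
  vec4i r;
  for (int i = 0; i < 4; i++)
    r.lane[i] = (int) v.lane[i + 4];
  return r;
}

/* a[i] = (int) b[i] for i in [0, n); n must be a multiple of 8. */
void widen_copy (int *a, const short *b, size_t n)
{
  assert (n % 8 == 0);
  for (size_t i = 0; i < n; i += 8) {
    vec8s vb;
    for (int l = 0; l < 8; l++)
      vb.lane[l] = b[i + l];              /* one vector load */
    vec4i first = unpack_first (vb);
    vec4i second = unpack_second (vb);
    for (int l = 0; l < 4; l++) {         /* two vector stores */
      a[i + l] = first.lane[l];
      a[i + 4 + l] = second.lane[l];
    }
  }
}
```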
Multiple Data-Types & Type Conversions
Scalar statements:
S1: x_int = memref
S2: z_int = x_int + 1
S3: y_char = (char) z_int
….
VF = 16; "unroll" each statement by VF/units, where units is the number of elements per vector (4 for S1 and S2, 16 for S3)
Vectorized statements, with V1 = {1, 1, 1, 1}:
VS1.0: vx0 = memref0
VS1.1: vx1 = memref1
VS1.2: vx2 = memref2
VS1.3: vx3 = memref3
VS2.0: vz0 = vx0 + V1
VS2.1: vz1 = vx1 + V1
VS2.2: vz2 = vx2 + V1
VS2.3: vz3 = vx3 + V1
VS3.0: vy0 = vpack (vz0, vz1)
VS3.1: vy1 = vpack (vz2, vz3)
VS3: vy = vpack (vy0, vy1)
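The statement sequence above can be traced in a plain-C model. Four int vectors are loaded and incremented, packed pairwise into two short vectors, and packed again into one char vector of 16 elements. All type and function names are hypothetical, and `vpack` is modeled as a truncating conversion, which is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int lane[4]; } vec4i;
typedef struct { short lane[8]; } vec8s;
typedef struct { signed char lane[16]; } vec16c;

/* Pack two int vectors into one short vector (values assumed in range). */
static vec8s vpack_int (vec4i a, vec4i b)
{
  vec8s r;
  for (int i = 0; i < 4; i++) {
    r.lane[i] = (short) a.lane[i];
    r.lane[i + 4] = (short) b.lane[i];
  }
  return r;
}

/* Pack two short vectors into one char vector (values assumed in range). */
static vec16c vpack_short (vec8s a, vec8s b)
{
  vec16c r;
  for (int i = 0; i < 8; i++) {
    r.lane[i] = (signed char) a.lane[i];
    r.lane[i + 8] = (signed char) b.lane[i];
  }
  return r;
}

/* y[i] = (char) (x[i] + 1) for i in [0, n); n must be a multiple of 16. */
void add_one_narrow (signed char *y, const int *x, size_t n)
{
  assert (n % 16 == 0);
  for (size_t i = 0; i < n; i += 16) {
    vec4i vz[4];
    for (int k = 0; k < 4; k++)            /* VS1.k and VS2.k */
      for (int l = 0; l < 4; l++)
        vz[k].lane[l] = x[i + 4 * k + l] + 1;
    vec8s vy0 = vpack_int (vz[0], vz[1]);  /* VS3.0 */
    vec8s vy1 = vpack_int (vz[2], vz[3]);  /* VS3.1 */
    vec16c vy = vpack_short (vy0, vy1);    /* VS3 */
    for (int l = 0; l < 16; l++)
      y[i + l] = vy.lane[l];               /* one vector store */
  }
}
```

One iteration of the vector loop therefore covers 16 scalar iterations: the int statements are each replicated four times, while the char statement appears once.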
Multiple Data-Types & Type Conversions
Very common in multimedia computations:
  Video: unsigned chars and shorts
  Audio: signed shorts and ints
  Filters, autocorrelation, dot product, alpha-blending…
Autocorrelation: 6x improvement on benchmark
for (i = 0; i < n; i++) {
  acc += ((int) short_in1[i] * (int) short_in2[i+lag]) >> Scale;
}