Date posted: 20-Dec-2015

Software Performance Tuning Project

Flake

Prepared by: Meni Orenbach, Roman Kaplan

Advisors: Zvika Guz, Kobi Gottlieb

• FLAC – Free Lossless Audio Codec.

• FLAC is specially designed for efficient packing of audio data.

• It can achieve compression ratios of 30%–50% for most music.

Flake – FLAC encoder

Platform and Benchmark Used

• Platform: Intel 64-bit Core 2 Duo, 2.4 GHz, 2 GB of RAM, running a Windows XP operating system.

• Benchmark: a 238 MB song; original encoding duration: 105.156 sec.

Algorithm description

• The input file is read frame by frame.

• Every frame contains a constant number of channels.

• Each channel is encoded independently with Rice codes (Golomb-style entropy codes).

Flake – Data flow

• Encoding every frame.

• Encoding the error for every channel.

• Using the LPC algorithm.

Flake – Optimization method

• Dealing with the most time-consuming functions.

• Two approaches were taken:
  – Multi-threading.
  – SIMD.

Optimization Method 1: Threads

• Flake was originally managed by a single thread.

• Parallelization lets independent work proceed simultaneously.

• While parallelizing Flake we considered:
  – The algorithm.
  – The data flow.

Encoding Process In Flake

[Diagram: the input file is split into frames 1…n; each frame i contains channels 1–8, and each channel is encoded independently (the per-channel encode steps are marked "MultiThread Here!"), producing encoded frames 1…n that are written to the output file.]

Conclusions: Possible Ways to Parallelize Flake

1. Parallelize the reads and writes from the file.

2. Parallelize the encoding phase for each frame separately.

3. Parallelize the encoding phase for each channel separately.

• Or a combination of the above.

Our Resolution

• We chose to parallelize the channel encoding.

• Our reasons: the channel count gives a natural bound on the number of threads, while the alternatives suffer from:
  – Limited access to a shared device (the disk) for I/O.
  – Multiple reads of the file needed for frame encoding.
  – A higher synchronization rate needed for frame encoding.

Implementing The Solution, First Try

• Create as many threads as channels.

• Every thread encodes and then terminates.

• This solution achieved a speedup of x1.68.

• The overhead comes from repeatedly opening and closing threads.

VTune Thread Profiler, First Try

Implementing The Solution, Second Try

• Create as many threads as channels.

• Every thread encodes and waits for a signal.

• Save the thread handles so the same threads can be reused.

• Saving time by not closing the threads!

• Gaining a bigger speedup!

VTune Thread Profiler, Second Try

Note: in our benchmark there are only 2 channels.

SpeedUp Gained Through MultiThreading

• Total speedup from using MT: x1.85!

[Bar chart: encoding duration in seconds (scale 0–200) for the Original, MT1, and MT2 runs.]

Optimization Method 2: SIMD

• Mainly used SSE and SSE2 instructions.

• Operations on double-precision FP values and on integers.

• Two main functions we used SSE on:
  – calc_rice_params().
  – compute_autocorr().

calc_rice_params() – Improvements

• Logic operations on integers.

• The original loop was unrolled by 4.

• The input and output arrays were aligned to prevent 'split loads'.

calc_rice_params() – The code

Old code:

for (i = 0; i < n; i++) {
    udata[i] = (2*data[i]) ^ (data[i]>>31);
}

New code:

for (i = 0; i < n; i += 4) {
    temp1 = _mm_load_si128((__m128i*)(data+i));
    temp2 = _mm_slli_epi32(temp1, 1);    /* 2*data[i] */
    temp3 = _mm_srai_epi32(temp1, 31);   /* data[i] >> 31 */
    temp1 = _mm_xor_si128(temp2, temp3);
    _mm_store_si128((__m128i*)(udata+i), temp1);
}

Callouts: _mm_srai_epi32 — arithmetic shift right by 31 bits; _mm_xor_si128 — bitwise XOR.

SIMD – compute_autocorr()

• Contains another inline function named apply_welch_window(), which is the first to do calculations.

• Speedup will be calculated for both functions together.

• Old code vs. new code:

for (i = 0; i < (len >> 1); i++) {
    w = 1.0 - ((c-i) * (c-i));
    w_data[i] = data[i] * w;
    w_data[len-1-i] = data[len-1-i] * w;
}

Conversion to FP and Multiplication

apply_welch_window()

iup_align = _mm_load_si128((__m128i*)(data+i));
fpup = _mm_cvtepi32_pd(iup_align);
fpup = _mm_mul_pd(fpup, w_d_low);
_mm_store_pd(w_data+i, fpup);
iup_align = _mm_shuffle_epi32(iup_align, _MM_SHUFFLE(1,0,3,2));
fpup = _mm_cvtepi32_pd(iup_align);
fpup = _mm_mul_pd(fpup, w_d_high);
_mm_store_pd(w_data+i+2, fpup);

Loading 4 integers at once cuts the number of load operations by 50%.

compute_autocorr()

• Uses the output array from apply_welch_window().

Loop unrolling steps

1. Every ‘Inner Loop’ unrolled by 2.

2. ‘Main Loop’ unrolled by 2 - every Inner Loop unrolled by 4.

compute_autocorr() – The code

for (i = 0; i <= lag; ++i) {
    temp = 1.0;
    temp2 = 1.0;

    for (j = 0; j <= lag-i; ++j)
        temp += data1[j+i] * data1[j];

    for (j = lag+1; j <= len-1; j += 2) {
        temp += data1[j] * data1[j-i];
        temp2 += data1[j+1] * data1[j+1-i];
    }

    autoc[i] = temp + temp2;
}

Short ‘Inner loop’

‘Main loop’

if (lag % 2 == 0) {
    a_high = a_low = _mm_loadu_pd(data1+j);
    b_low  = _mm_loadu_pd(data1+j-i);
    b_high = _mm_load_pd(data1+j-i-1);
} else {
    a_high = a_low = _mm_load_pd(data1+j);
    b_low  = _mm_load_pd(data1+j-i);
    b_high = _mm_loadu_pd(data1+j-i-1);
}
a_low  = _mm_mul_pd(a_low, b_low);
c_low  = _mm_add_pd(a_low, c_low);
a_high = _mm_mul_pd(a_high, b_high);
c_high = _mm_add_pd(a_high, c_high);
}

Long ‘Inner loop’ (unrolled in the original code)

Using as many aligned loads (and stores) as we can; multiplying and adding the results into the two accumulators.

compute_autocorr() – Speedup

Speedups using SIMD, summary:

• calc_rice_params() local speedup: x1.14. Overall speedup: x1.04.

• compute_autocorr() local speedup: x1.92! Overall speedup: x1.03.

• Total speedup using SIMD: x1.07.

Intel Tuning Assistant

• When using aligned arrays, split loads didn't occur.

• No micro-architectural problems were found in the optimized code.

Final Results

A total speedup of x1.985 was achieved by using only MT and SIMD.

[Bar chart: encoding duration in seconds (scale 0–200) for Original, SIMD, MT, All Together, and the Intel Compiler build.]
