Post on 21-Jan-2017
transcript
Optimizing computer vision problems on mobile platforms
Looksery.com
Fedor Polyakov
Software Engineer, CIO, Looksery, Inc.
fedor@looksery.com
+380 97 5900009 (mobile)
www.looksery.com
Optimize algorithm first
• If your algorithm is suboptimal, "technical" optimizations won't be as effective as fixing the algorithm itself
• When you change the algorithm, you'll probably have to redo your technical optimizations too
• Single instruction, multiple data
• On NEON, 16 128-bit-wide registers (each holds up to 4 int32_t's/floats or 2 doubles)
• Uses a bit more cycles per instruction, but operates on much more data
• Can ideally give a performance boost of up to 4x (typically ~2-3x in my practice)
• Can be used for many image processing algorithms
• Especially useful for various linear algebra problems
SIMD operations
• The easiest way: you just use a library and it does everything for you
• Eigen: great header-only library for linear algebra
• Ne10: NEON-optimized library for some image processing/DSP on Android
• Accelerate.framework: lots of image processing/DSP on iOS
• OpenCV, unfortunately, is quite weakly optimized for ARM SIMD (though they've optimized ~40 low-level functions in OpenCV 3.0)
• There are also some commercial libraries
• + Everything is done without any effort on your part
• - You should still profile and analyze the ASM code to verify that everything is vectorized as you expect
Using computer vision/algebra/DSP libraries
using v4si = int __attribute__ ((vector_size (VECTOR_SIZE_IN_BYTES)));
v4si x, y;

• All common operations on x are now vectorized
• Written once, works on all architectures
• Supported operations: +, -, *, /, unary minus, ^, |, &, ~, %, <<, >>, comparisons
• Loading from memory: x = *((v4si*)ptr);
• Storing back to memory: *((v4si*)ptr) = x;
• Supports the subscript operator for accessing individual elements
• Not all SIMD operations are supported
• May produce suboptimal code
GCC/clang vector extensions
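As a minimal sketch of this approach (a 16-byte vector of four ints; the helper name `add4` is illustrative, and this compiles with GCC or Clang on any architecture):

```cpp
#include <cstring>

// Four 32-bit ints in one 16-byte SIMD register (GCC/Clang vector extension).
using v4si = int __attribute__((vector_size(16)));

// Sum two int arrays of length 4 with a single vectorized add.
inline void add4(const int* a, const int* b, int* out) {
    v4si x, y;
    std::memcpy(&x, a, sizeof(x));   // load from memory
    std::memcpy(&y, b, sizeof(y));
    x = x + y;                       // one instruction covers all 4 lanes
    std::memcpy(out, &x, sizeof(x)); // store back to memory
}
```

The same source generates NEON on ARM and SSE on x86, which is the "written once, works everywhere" point above.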
• Provide custom data types and a set of C functions to vectorize code
• Example: float32x4_t vrsqrtsq_f32(float32x4_t a, float32x4_t b);
• Generally similar to the previous approach, but give you better control and the full instruction set
• Cons:
• Have to write separate code for each platform
• In all the above approaches, the compiler may inject instructions that could be avoided in hand-crafted code
• The compiler might generate code that doesn't use the pipeline efficiently
SIMD intrinsics
• Gives you the most control: you know exactly what code will be generated
• If written carefully, can sometimes be up to 2x faster than compiler-generated code from the previous approaches (usually 10-15%, though)
• You need to write separate code for each architecture :(
• Needs to be learned
• Harder to create
• Some additional steps may be required to get the maximum possible performance
Handcrafted ASM code
• Reduce data types to as small as possible
• If you can change double to int16_t, you'll get more than a 4x performance boost
• Try the pld instruction; it "hints" the CPU to load data into the caches that will be used in the near future (accessible as __builtin_prefetch)
• If you use intrinsics, watch out for extra loads/stores you may be able to get rid of
• Use loop unrolling
• Interleave load/store instructions with arithmetic operations
• Use proper memory alignment: misalignment can cause crashes or slow performance
Some other tricks
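Two of these tricks (narrow data types plus prefetching) can be sketched portably; the function name and the prefetch distance of 64 elements here are illustrative choices, not the talk's code:

```cpp
#include <cstdint>
#include <cstddef>

// Sum an array of int16_t. Being 4x narrower than double, 4x more values
// fit in each SIMD register, and the compiler can auto-vectorize the loop.
int64_t sum_i16(const int16_t* data, size_t n) {
    int64_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + 64 < n)
            __builtin_prefetch(data + i + 64); // hint: pull future data into cache
        total += data[i];
    }
    return total;
}
```

The prefetch distance needs tuning per workload; too short and the data arrives late, too long and it may be evicted before use.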
• Sum of matrix rows
• Matrices are 128x128, the test is repeated 10^5 times
Some benchmarks
// Non-vectorized code
for (int i = 0; i < matSize; i++) {
    for (int j = 0; j < matSize; j++) {
        rowSum[j] += testMat[i][j];
    }
}
// Vectorized code
for (int i = 0; i < matSize; i++) {
    for (int j = 0; j < matSize; j += vectorSize) {
        VectorType x = *(VectorType*)(testMat[i] + j);
        VectorType y = *(VectorType*)(rowSum + j);
        y += x;
        *(VectorType*)(rowSum + j) = y;
    }
}
Some benchmarks
Tested on an iPhone 5; results on other phones show pretty much the same picture
[Chart: time in seconds for the Simple and Vectorized versions, for int, float, and short data types]
Got more than a 2x performance boost. Mission completed?
Some benchmarks
[Chart: time in seconds for the Simple, Vectorized, and Loop unroll versions, for int, float, and short data types]
Got another ~15%
for (int i = 0; i < matSize; i++) {
    auto ptr = testMat[i];
    for (int j = 0; j < matSize; j += 4 * xSize) {
        auto ptrStart = ptr + j;
        VT x1 = *(VT*)(ptrStart + 0 * xSize);
        VT y1 = *(VT*)(rowSum + j + 0 * xSize);
        y1 += x1;
        VT x2 = *(VT*)(ptrStart + 1 * xSize);
        VT y2 = *(VT*)(rowSum + j + 1 * xSize);
        y2 += x2;
        VT x3 = *(VT*)(ptrStart + 2 * xSize);
        VT y3 = *(VT*)(rowSum + j + 2 * xSize);
        y3 += x3;
        VT x4 = *(VT*)(ptrStart + 3 * xSize);
        VT y4 = *(VT*)(rowSum + j + 3 * xSize);
        y4 += x4;
        *(VT*)(rowSum + j + 0 * xSize) = y1;
        *(VT*)(rowSum + j + 1 * xSize) = y2;
        *(VT*)(rowSum + j + 2 * xSize) = y3;
        *(VT*)(rowSum + j + 3 * xSize) = y4;
    }
}
Some benchmarks
Let’s take a look at the profiler
Some benchmarks
// Non-vectorized code
for (int i = 0; i < matSize; i++) {
    for (int j = 0; j < matSize; j++) {
        rowSum[i] += testMat[j][i];
    }
}
// Vectorized, loop-unrolled code
for (int i = 0; i < matSize; i += 4 * xSize) {
    VT y1 = *(VT*)(rowSum + i);
    VT y2 = *(VT*)(rowSum + i + xSize);
    VT y3 = *(VT*)(rowSum + i + 2 * xSize);
    VT y4 = *(VT*)(rowSum + i + 3 * xSize);
    for (int j = 0; j < matSize; j++) {
        VT x1 = *(VT*)(testMat[j] + i);
        VT x2 = *(VT*)(testMat[j] + i + xSize);
        VT x3 = *(VT*)(testMat[j] + i + 2 * xSize);
        VT x4 = *(VT*)(testMat[j] + i + 3 * xSize);
        y1 += x1;
        y2 += x2;
        y3 += x3;
        y4 += x4;
    }
    *(VT*)(rowSum + i) = y1;
    *(VT*)(rowSum + i + xSize) = y2;
    *(VT*)(rowSum + i + 2 * xSize) = y3;
    *(VT*)(rowSum + i + 3 * xSize) = y4;
}
Some benchmarks
[Chart: time in seconds for the Simple, Vectorized, Vect + Loop, and Vect + Loop + changed order versions, for int, float, and short data types]
Some benchmarks
[Chart: time in seconds for the Simple, Vectorized, Vect + Loop, Eigen, SumOrder, and Asm versions, float data type]
Using GPGPU
• Around 1.5 orders of magnitude higher theoretical performance
• On iPhone 5, the CPU has ~800 MFLOPS, the GPU has 28.8 GFLOPS
• On iPhone 5S, the CPU has ~1.5 GFLOPS, the GPU has 76.4 GFLOPS!
• Can be very hard to utilize efficiently
• CUDA, obviously, isn't available on mobile devices
• OpenCL isn't available on iOS and is hardly available on Android
• On iOS, Metal is available for GPGPU, but only starting with the iPhone 5S
• On Android, Google promotes RenderScript for GPGPU
• So the only cross-platform way is to use OpenGL ES (2.0)
Common usage of shaders for GPGPU
[Diagram: Image/Data → Shader 1 → texture containing processed data → Shader 2 → … → Data/Results, which are either displayed on screen or read back to the CPU]
Common problems
• Textures were designed to hold RGBA8 data
• On almost all phones from 2012 onward, half-float and float textures are supported as input
• Efficient bilinear filtering for float textures may be unsupported or inefficient
• On many devices, writing from a fragment shader to half-float (16-bit) textures is supported
• Emulating fixed-point arithmetic is pretty straightforward
• Emulating floating point is possible, but a bit tricky and requires more operations
• Changing OpenGL state may be expensive
• For-loops with a non-constant number of iterations are not supported on older devices
• Reading from GPU to CPU is very expensive
• There are some platform-dependent ways to make it faster
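"Straightforward" here can be illustrated by packing a 16-bit 8.8 fixed-point value into two 8-bit channels, the way one would split it across a texture's bytes; the function names and the scale factor of 256 are illustrative choices, not the talk's code:

```cpp
#include <cstdint>

// Encode a value in [0, 256) as 8.8 fixed point split across two
// 8-bit "channels", as one might store it in an RGBA8 texture.
void encode_fixed(float v, uint8_t& hi, uint8_t& lo) {
    uint16_t fixed = static_cast<uint16_t>(v * 256.0f + 0.5f);
    hi = fixed >> 8;    // integer part   -> e.g. the R channel
    lo = fixed & 0xFF;  // fractional part -> e.g. the G channel
}

// Reassemble the two channels back into a float.
float decode_fixed(uint8_t hi, uint8_t lo) {
    return ((hi << 8) | lo) / 256.0f;
}
```

In a shader the decode is the same idea: combine the sampled channels with a dot product against the scale factors, losing at most one step of the 1/256 precision.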
Tasks that can be solved on OpenGL ES
• Image processing
• Image binarization
• Edge detection (Sobel, Canny)
• Hough transform (though some parts can't be implemented on the GPU)
• Histogram equalization
• Gaussian blur / other convolutions
• Colorspace conversions
• Many more examples in the GPUImage library for iOS
• For other tasks, it depends on many factors
• We tried to implement our tracking on the GPU, but didn't get the expected performance boost
Questions?
Thanks for your attention!