Post on 21-Dec-2015
A Case for Source-Level Transformations in MATLAB
Vijay Menon and Keshav Pingali
Cornell University
The MaJic Project at Illinois/Cornell
• George Almasi
• Luiz De Rose
• David Padua
MATLAB
• High-Level Interpreted Language for Numerical Computing
  • Matrix is 1st class type
  • Library of numerical functions
• Application Domains
  • Image Processing
  • Structural Mechanics
  • Computational Finance
The Problem
• Development is fast: ~10X as concise as C/Fortran
• Performance is slow: ~10X as slow as C/Fortran
• Conventional Approach:
  • Rewrite
  • Compile
Our Approach: Source-Level Optimization
• Apply high-level transformations directly on MATLAB codes
• Significant performance benefit for:
  • interpreted code
  • compiled code
Outline
• Overheads in MATLAB
• Conventional Compilation
• Source-Level Optimization
• Comparison
• Implementation Status
Outline
• Overheads in MATLAB
  • Type/Shape Checking
  • Memory Management
  • Array Bounds Checking
• Conventional Compilation
• Source-Level Optimization
• Comparison
• Implementation Status
Type/Shape Checking
• MATLAB has no type/shape declarations
• Consider: A * B
• Interpreter checks operands to perform multiply (*)
  • Shape: Scalar*Scalar, Scalar*Matrix, Matrix*Matrix
  • Type: Real*Real, Real*Complex, Complex*Complex
Type/Shape Checking
• Consider:
    for i = 1:n
        y = y + a * x(i)
    end
• Loops:
  • perform redundant checks on every iteration
  • magnify interpreter overhead
Memory Management: Dynamic Resizing
• Consider:
    x(10) = 10;
• C/Fortran: x must have >= 10 elements
• MATLAB: x is resized if needed
  • Memory reallocated
  • Data copied
Memory Management: Dynamic Resizing
• MATLAB dynamically grows arrays:
    for i = 1 : 1000
        x(i) = i;
    end
• Every iteration triggers resize!
  • 1,000 memory allocations
  • ~500,000 elements copied
• Execution Time:
  • x is undefined: 14.2 seconds
  • x is already defined: 0.37 seconds
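The allocation and copy counts above follow from simple arithmetic. The sketch below assumes that each resize grows the array by exactly one element and copies every existing element into the new block; that cost model is an assumption consistent with the slide's figures, not a statement about MATLAB's actual allocator.

```matlab
% Assumed cost model: growing x from k-1 to k elements allocates a new
% block and copies the k-1 elements that already exist.
n = 1000;
allocations = n;          % one allocation per loop iteration
copied = sum(0:n-1);      % 0 + 1 + ... + 999 = 499500, i.e. ~500,000
fprintf('%d allocations, %d elements copied\n', allocations, copied);
```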
Array Bounds Checking
• Consider array indexing:
    x(i) = y(i);
• Failed Bounds Check on
  • x(i) can trigger resize
  • y(i) can trigger error
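A minimal sketch of the two outcomes, using small hypothetical arrays: an out-of-range write grows the array, while an out-of-range read raises an error.

```matlab
x = [1 2 3];
y = [4 5 6];

x(5) = y(1);          % write past the end: x grows, x(4) is zero-filled
grew_to = numel(x);   % x now has 5 elements

read_failed = false;
try
    v = y(5);         % read past the end: MATLAB raises an error
catch
    read_failed = true;
end
```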
Array Bounds Checking
• In a loop:
    for i = 3:100
        x(i) = x(i-1) + x(i-2);
    end
• Interpreter performs redundant checks
• Compiler work:
  • Nonresizable arrays: Gupta PLDI’90
  • Resizable arrays: more difficult
Common Theme
• Loops magnify overheads
  • every iteration: redundant checks, resizes, …
• MATLAB interprets naively
  • computes as is
  • no reorganization to optimize
Outline
• Overheads in MATLAB
• Conventional Compilation
  • Compile to C/Fortran
  • Rely on C/Fortran compiler for optimization
• Source-Level Optimization
• Comparison
• Implementation Status
MATLAB Compilers
• Compile to C/C++/Fortran
  • MCC -> C (The MathWorks)
  • MATCOM -> C++ (Mathtools)
  • FALCON -> F90 (U of Illinois)
• Native compiler generates executable code:
  • Link back into MATLAB environment
  • Run as stand-alone program
The MCC Compiler
• Safe Optimization: Type Inference (no declarations in MATLAB)
  • Eliminate Type Checks / Reduce Storage
  • Specialize for real input variables
  • Always legal!
• Unsafe Optimization:
  • Assume all data is real
  • Eliminate all bounds checks (disallow resizing)
  • User must ensure legality!
Falcon Benchmarks
• Collected by DeRose from MATLAB users at Illinois/NCSA
• Element/Loop Intensive
  • CN - Crank-Nicolson PDE Solver
  • Di - Dirichlet PDE Solver
  • FD - Finite Difference PDE Solver
  • Ga - Galerkin PDE Solver
  • IC - Incomplete Cholesky Factorization
• Memory Intensive
  • AQ - Adaptive Quadrature w/ Simpson’s Rule
  • EC - Euler-Cromer 2 body problem
  • RK - Runge-Kutta 2 body problem
• Library Intensive
  • CG - Conjugate Gradients Iterative Solver
  • Mei - 3D surface Generation
  • QMR - Quasi-Minimal Residual
  • SOR - Successive Over-Relaxation
MCC: Safe Optimizations
[Bar chart: Execution Time (s) for AQ, CG, CN, Di, FD, Ga, IC, Mei, EC, RK, QMR, SOR; series: Interpreted, MCC Safe]
MCC: Unsafe Optimizations
[Bar chart: Execution Time (s) for CG, Di, FD, IC, QMR, SOR; series: Interpreted, MCC Safe, MCC Unsafe All]
Note: User must ensure legality!
Outline
• Overheads in MATLAB
• Conventional Compilation
• Source-Level Optimization
  • Vectorization
  • Preallocation
  • Expression Optimization
• Comparison
• Implementation Status
Vectorization
• Loops are expensive
  • Overheads are magnified
• Idea: Eliminate Loops
  • Map loops to higher-level matrix operations
  • Interpreter uses efficient libraries: BLAS, LINPACK/EISPACK
Example of Vectorization
In Galerkin, 98% of execution time is spent in:
    for i = 1:N
        for j = 1:N
            phi(k) = phi(k) + a(i,j)*x(i)*y(j);
        end
    end
Vectorized Code
In Optimized Galerkin:
    phi(k) = phi(k) + x*a*y';
• Fragment Speedup: 260
• Program Speedup: 110
• Note: Not always possible!
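A small check of the equivalence, assuming the loop body accumulates a(i,j)*x(i)*y(j) (the sum that x*a*y' computes) and that x and y are row vectors. The inputs are hypothetical and use integers so the comparison is exact.

```matlab
% Hypothetical inputs: a is N-by-N, x and y are 1-by-N row vectors.
N = 4;
a = magic(N);
x = 1:N;
y = N:-1:1;

% Loop form: accumulate a(i,j)*x(i)*y(j) over all i and j.
s_loop = 0;
for i = 1:N
    for j = 1:N
        s_loop = s_loop + a(i,j)*x(i)*y(j);
    end
end

% Vectorized form: a vector-matrix product followed by a dot product.
s_vec = x*a*y';

assert(s_loop == s_vec);
```

The vectorized form replaces N^2 interpreted loop iterations with two library calls, which is where the speedup on the slide comes from.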
Effect of Vectorization
[Bar chart: Execution Time (s) for CN, Di, FD, Ga, IC; series: Original, Vectorized]
Preallocation
• Eliminate Dynamic Resizing
  • Try to predict eventual size of array
  • Insert early allocation when possible:
        x = zeros(1000,1);
  • Resizing will not be triggered
Example of Preallocation
In Euler-Cromer, 87% of time is spent in:
    for i = 1:N
        r(i) = …
        th(i) = …
        t(i) = …
        k(i) = …
        p(i) = …
        …
    end
Preallocated Code
In Optimized Euler-Cromer:
    r = zeros(1,N);
    ...
    for i = 1:N
        r(i) = …
        …
    end
• Fragment Speedup: 7
• Program Speedup: 4
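A runnable sketch of the same pattern, with a hypothetical loop body (i^2) standing in for the elided Euler-Cromer updates. The timings it prints depend heavily on the MATLAB version and machine, so no particular ratio is implied.

```matlab
N = 1000;                 % hypothetical problem size

clear r_grow
tic;
for i = 1:N
    r_grow(i) = i^2;      % r_grow does not exist yet, so it grows each iteration
end
t_grow = toc;

tic;
r_pre = zeros(1,N);       % allocate the full array once, up front
for i = 1:N
    r_pre(i) = i^2;       % index is always in range, so no resize is triggered
end
t_pre = toc;

assert(isequal(r_grow, r_pre));   % same result either way
fprintf('grown: %.4f s, preallocated: %.4f s\n', t_grow, t_pre);
```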
Effect of Preallocation
[Bar chart: Execution Time (s) for CN, Ga, EC, RK; series: Original, Preallocated]
Expression Optimization
• MATLAB interprets expressions naïvely, in left-to-right order
• Simple restructuring can significantly affect execution time, e.g.:
  • A*B*x : O(n^3) flops
  • A*(B*x) : O(n^2) flops
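A sketch of why the parenthesization matters, assuming dense n-by-n matrices A and B and an n-by-1 vector x (the size and the random data are hypothetical). The flop counts are rough multiply-add estimates, not measurements.

```matlab
n = 200;                          % hypothetical size
A = rand(n);  B = rand(n);  x = rand(n,1);

y_left  = A*B*x;                  % evaluated as (A*B)*x: matrix-matrix product first
y_right = A*(B*x);                % two matrix-vector products

% Rough multiply-add counts for dense operands.
flops_left  = n^3 + n^2;          % A*B costs ~n^3, then one ~n^2 product
flops_right = 2*n^2;              % two ~n^2 products

% Both orders give the same vector up to floating-point rounding.
rel_err = norm(y_left - y_right) / norm(y_left);
```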
Example of Expression Optimization
In QMR, 70% of execution time is spent in:
    w = A'*q;
• A : 420x420 matrix
• q, w : 420x1 vectors
• A' = transpose(A)
Expression Optimized Code
In Optimized QMR: A'*q == (q'*A)'
    w = (q'*A)';
• Transpose 2 vectors instead of 1 matrix
• Fragment Speedup: 20
• Program Speedup: 3
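The identity can be checked numerically. The sketch below uses random data at the sizes given for QMR; the data itself is hypothetical.

```matlab
A = rand(420);            % 420x420 matrix
q = rand(420,1);          % 420x1 vector

w_orig = A'*q;            % transposes the full matrix
w_opt  = (q'*A)';         % transposes two vectors instead

% The two results agree up to floating-point rounding.
rel_err = norm(w_orig - w_opt) / norm(w_orig);
```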
Effect of Expression Optimization
[Bar chart: Execution Time (s) for EC, RK, QMR; series: Original, Expr. Optimized]
Summary: Source-Level
[Bar chart: Execution Time (s) for AQ, CG, CN, Di, FD, Ga, IC, Mei, EC, RK, QMR, SOR; series: Original, Source Optimized]
Comparison
[Bar chart: Execution Time (s) for AQ, CG, CN, Di, FD, Ga, IC, Mei, EC, RK, QMR, SOR; series: Interpreted, MCC Safe, MCC Best, Opt. Interpreted, Opt. MCC Safe, Opt. MCC Best]
Point #1:
Source optimizations can outperform MCC
[Bar chart: Execution Time (s) for FD, Ga, IC, QMR; series: Interpreted, MCC Safe, MCC Best, Opt. Interpreted, Opt. MCC Safe, Opt. MCC Best]
Point #2:
Source optimizations complement MCC
[Bar chart: Execution Time (s) for CN, FD, Ga, IC, EC; series: Interpreted, MCC Safe, MCC Best, Opt. Interpreted, Opt. MCC Safe, Opt. MCC Best]
Benefits of Source-Level Optimizations
• Vectorization
  • Directly eliminates loop overhead
  • Moves work to hand-optimized BLAS
• Preallocation
  • Eliminates resizing overhead
  • Enables MCC array bounds elimination
• Expression Optimization
  • Uses algebraic info unavailable in C/Fortran
Implementation Status
• Illinois/Cornell MaJic system
  • Just-in-time MATLAB interpreter/compiler
  • Incorporates Source-Level Transformation
• Semantic Optimization (Menon/Pingali ICS’99)
  • Vectorization/BLAS call generation
  • Expression Optimization
• Preallocation/Bounds Check Optimization (Work in progress)
Conclusion
Source-level optimizations are important for enhancing the performance of MATLAB, whether code is just interpreted or later compiled
THE END
Unsafe Type Check Removal
[Bar chart: Execution Time (s) per benchmark; series: Interpreted, MCC Safe, MCC Unsafe Type]
Correct on 11/12 Codes
Unsafe Bounds Check Removal
[Bar chart: Execution Time (s) for CG, Di, FD, IC, Mei, QMR, SOR; series: Interpreted, MCC Safe, MCC Unsafe Bounds]
Correct on 7/12 Codes