Date post: | 29-Jan-2018 |
Category: |
Technology |
Upload: | oracle-hardware |
Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
Maximizing Your SPARC T4 Oracle Solaris Application Performance
Darryl Gove
Senior Principal Software Engineer
Program Agenda
§ Hardware
§ Correctness
§ Performance
§ Parallelism
More Information
§ Download, technical articles and more: http://oracle.com/goto/solarisstudio
OpenWorld Sessions
§ Mon, Oct 1, 10:45 - 11:45 AM: Maximizing Your SPARC T4 Oracle Solaris Application Performance, CON 6382 (Marriott Marquis - Golden Gate)
§ Mon, Oct 1, 3:15 - 4:15 PM: Technical Panel: Developing High Performance Applications on Oracle Solaris, CON 7196 (Marriott Marquis - Golden Gate)
Hands-on Lab
§ Wed, Oct 3, 1:15 - 2:15 PM: Develop C/C++ Applications for the Cloud with Oracle Tuxedo and Oracle Solaris Studio, HOL 10276 (Marriott Marquis - Salon 5/6)
JavaOne Sessions
§ Mon, Oct 1, 8:30 – 9:30 AM: Mixed-Language Development: Leveraging Native Code from Java, CON 6714 (Hilton San Francisco - Continental Ballroom 6)
§ Tues, Oct 2, 1:00 – 2:00 PM: Take Performance Tuning of Your Enterprise Java Applications to the Next Level, CON 10213 (Hilton San Francisco - Continental Ballroom 6)
Oracle Solaris Studio
Analysis Suite
Performance Analyzer provides unparalleled insight into your app, allowing you to identify bottlenecks and improve performance by orders of magnitude
Code Analyzer (new) ensures app reliability by detecting app vulnerabilities, including memory leaks and memory access violations
Thread Analyzer simplifies complex parallel programming errors by detecting hard-to-pinpoint race and deadlock conditions
Compiler Suite
C, C++ Compilers utilize advanced code generation technology to optimize apps for highest performance on SPARC & x86
Fortran Compiler optimizes compute-intensive app performance
Debugger ensures app stability with event handling & multi-thread support
Performance Library maximizes compute-intensive app performance using advanced numeric solver libraries
Integrated Development Environment increases developer efficiency
Oracle Solaris Studio 12.3 Highlights
Accelerate Performance
Ø 3x faster code on SPARC T4 than GCC; 40% faster than Sun Studio 12
Ø 1.5x faster code on Intel x86 than GCC; 20% faster than Sun Studio 12
Gain Extreme Observability
Ø New Code Analyzer for more reliable applications; reports common coding & memory access errors faster than competitive alternatives
Ø Enhanced Performance Analyzer with system-wide performance analysis
Improve Productivity
Ø Remote access to Solaris Studio tools from local desktop (Oracle Solaris, Linux, Microsoft Windows, Mac)
Ø Streamlined Oracle DB application development
Ø Simplify Oracle Tuxedo development with IDE plug-in
Ø IPS distribution on Solaris 11 for simplified management
Ø 20% faster compile time
SPARC T4 Hardware
SPARC T4 - Overview
§ Not like T1 – T3 (only shares the T-series name)
§ Single thread performance
§ Multithread throughput
SPARC T4 - Details
§ 1 to 4 chips per system
§ 8 cores per chip
● Dual issue
● Out-of-order
§ 8 threads per core
§ 3.0 GHz clock
● 48 B instructions / sec / chip (3.0 GHz * 8 cores * 2 instructions per cycle)
SPARC T4 - Capacity
§ Chip capacity: 48 B instructions / sec
§ For fully active threads:
● Single thread on a core: 6 B instructions / sec
● Each of eight threads on a core: 0.75 B instructions / sec
§ Threads rarely fully active:
● I/O wait
● Processor stall (fetch from memory = 300-400 cycles)
Developing for T4
§ Make it correct
§ Remove obvious performance issues
§ Make it scale (correctly)
Application Correctness
Debug information
§ Always use -g
§ No optimisation flags:
● Full debug
● Lower performance
§ Optimised binaries:
● Best effort debug
● No/minimal performance impact
§ Debug what you ship!
Automatic Error Detection
§ Static/compile time error detection
● Code Analyzer
§ Dynamic/runtime memory access error detection
● Discover
Code Analyzer
§ Static analysis for common coding errors
● Uninitialised variables, etc.
§ Compile with:
● -xanalyze=code
§ View results with:
● code-analyzer <a.out>
Code Analyzer – example output
Memory Error Detection - discover
§ Common memory allocation and use errors:
● Uninitialised memory
● Access past bounds
● Memory leaks
§ Usage:
● discover <a.out> (instrument the binary)
● <a.out> (run the instrumented binary)
● Default = html output
Example of discover
$ ./a.out
ERROR 1 (ABR): reading memory beyond array bounds at address 0xffbff278 (8 bytes) on the stack at:
    average() + 0x228 <disc.c:8>
        6:  for (int i=1; i<=len; i++)
        7:  {
        8:=>  total+=array[i];
        9:  }
    _start() + 0xd8
...
double array[20];
...
printf(" Average = %f\n", average(array,20) );
Application Performance
Optimisation – the Basics
§ No optimisation flags == no optimisation
§ Good optimisation: -O
§ Advanced optimisations:
● Guided by profile of application
● Knowledge of deployment systems
Profiling
§ Profiling with the Performance Analyzer
● collect <a.out>
● collect -P <pid>
● analyzer test.1.er
§ Report generation with spot
● spot <a.out>
● spot -P <pid>
Performance Analyzer
§ Demo
Aggressive Optimisation
§ One stop flag: -fast
§ Enables multiple optimisations
● Build machine = deployment machine
● Floating point simplification and optimisation
● Pointers to different types do not alias
● Function inlining
§ Investigate performance gain
Profile Drives Flag Selection: Floating Point
§ Significant time in floating point computation:
● Floating point simplification
● -fsimple=2
§ Significant time in floating point library code:
● Optimised floating point libraries
● -xlibmopt, -xlibmil
§ Use FP optimisations if performance improves and FP optimisations are acceptable
Profile Drives Flag Selection: Flat Profile
§ Many hot small functions
● At least -xO4 optimisation level
● -xipo for cross-file optimisations
§ Conditional code or inlining
● Profile feedback
● -xprofile=collect:
● Training run of application
● -xprofile=use:
Profile Drives Flag Selection: Pointers
§ Pointers inhibit compiler optimisations
§ Compiler needs more information
§ restrict qualified pointers in C
● Localised action
§ Flags:
● -xrestrict (restrict qualified pointers passed into functions)
● -xalias_level=std [C]
● -xalias_level=compatible [C++]
● Actions at file level
Processor Specific Optimisations
§ Default: -xtarget=generic often good enough
§ T4 has useful instructions
● Compare and branch
● Floating point multiply add
§ One stop flag: -xtarget=T4
§ Schedules for T4, uses entire T4 instruction set
§ Only runs on T4 (or later) processors
SPARC Instruction Sets
Multi-threaded Applications
Multi-thread or Multi-process
§ Multiprocess:
● Isolation
● Independence
● Large virtual memory footprint
● Potentially high synchronisation costs
§ Multithread:
● Low synchronisation costs
● Minimal memory footprint
Throughput
Latency
Multi-threaded Application Development
§ POSIX threads (C11, C++11)
● Low level: Great control, significant complexity
§ OpenMP
● High abstraction: Easy to use, flexible
§ Automatic parallelisation
● Trivial to use: -xautopar -xreduction
● Works best for loop-intensive code (typically FP)
OpenMP Parallel For
§ Distributes iterations across CPUs
#pragma omp parallel for
for (int i=0; i<length; i++)
{
// Do work
}
OpenMP Tasks
§ Distributes work across CPUs
for (int i=0; i<length; i++)
{
#pragma omp task
{
// Do work for task ‘i’
}
}
Parallel Program Correctness
§ Distributes work across CPUs
int total=0;
#pragma omp parallel for
for (int i=0; i<length; i++)
{
total += i;
}
§ Data race: Multiple threads updating the same variable
Thread Analyzer
§ Instrument application
● Compiler flag: -xinstrument=datarace
● Binary instrumentation: discover -i datarace <a.out>
§ Gather data:
● collect -r on <a.out>
§ View data:
● tha tha.1.er
Thread Analyzer - Example
§ Demo
Scaling to Many Threads
§ Minimise serial code
● Amdahl’s Law
§ Minimise lock contention
§ Minimise writes of shared data
§ Evenly distribute work
Scaling to Many Threads
§ Demo
Limits of Performance
§ Threads
● vmstat
§ Instruction Issue Width
● pgstat / cputrack / cpustat / ripc
§ Bandwidth
● busstat / bw
Conclusion: Optimising for T4
§ Step 1: Profile and remove inefficient code
§ Step 2: Explore benefits of increased optimisation
§ Step 3: Identify opportunities for parallelisation
§ Step 4: Profile and tune parallel code
§ Step 5: Watch for hitting hardware limits