
White Paper

Best Known Methods for Using OpenMP* on

Intel® Many Integrated Core (Intel® MIC) Architecture

Abstract

This paper is a supplement to the Intel® Composer XE for Linux* documentation. It serves as a

compendium of best known methods for using the OpenMP* extensions to C/C++ and Fortran when

programming offload and native programs for the Intel Many Integrated Core (Intel MIC) Architecture.


Volume 1a - January 29, 2013 - 2 -

Overview

Best Known Methods (BKMs) are techniques, explanations and hints that empower users to gain

maximum benefit in their use of systems. The BKMs described here are particular tips for implementing

and optimizing programs running on the Intel® Many Integrated Core (Intel® MIC) Architecture and using

the versatile language extension for parallel programming, OpenMP. This article and the ones to follow

describe OpenMP techniques and tips for operating in both offloaded and native execution environments

on Intel MIC Architecture. OpenMP offers a simple-to-learn model with well-chosen defaults that make the

simple forms easy to understand. Those forms also have rich options that can be customized to tune for a

myriad of variations. These articles will explore some of those options to show how they might most

effectively be applied. Each example has been verified against the Intel Composer XE 2013 compilers.

Best Known Methods for the use of OpenMP* on Intel® Many Integrated Core

(Intel® MIC) Architecture

Environment settings to set configuration for use on Intel MIC Architecture

BKM: Use MIC_ENV_PREFIX in offload to automatically propagate Host environment to Target

When offloading execution to the target coprocessor, users can limit the host environment exposed

on the target and selectively propagate settings defined in the host environment to the target

through the use of the MIC_ENV_PREFIX environment variable.

Applicability: Offloaded C, C++ and Fortran.

Details: If MIC_ENV_PREFIX is set on the host, normally to "MIC", only the host-side user-defined shell

environment variables that start with the defined string followed by an underscore ("_") will be

propagated to the target environment. The one exception to this rule is MIC_LD_LIBRARY_PATH, which

has a unique meaning on the host (the location of target libraries on the host to be searched by the

offloader); there will be no LD_LIBRARY_PATH in the target environment of offloaded programs.

Examples using MIC_ENV_PREFIX are scattered throughout this article, but please note that the current

behavior is to always append an underscore ("_") to the defined prefix before using it to match host

environment variables for import to the target.

If the MIC_ENV_PREFIX is not set on the host, nearly the entire host environment is copied to the

environment of the offload process. This can be confusing, especially if you are using OpenMP on

both the target and the host, since you will likely want properties such as affinity and thread count

to be different, reflective of the specific architectures.

Basic Use of OpenMP directives in combination with Offload directives

BKM: Offload Essential Details

Offload declarations can start simply but can be modified with options to handle a range of

capabilities.

Applicability: Offloaded C, C++ and Fortran.


Details: The simplest portable offload statement looks something like this (C and Fortran):

#pragma offload target (mic:0)
   // OpenMP for loop or function call ...

!DIR$ OFFLOAD TARGET (mic:0)
CALL offloadfunc(...)

These "minimal" examples still include an optional component, the ":0", which is not required but selects

a specific card unit number, in this case card 0. If the unit number is unspecified, i.e. "mic" by itself,

the scheduler will always use card 0, which results in more predictable performance behavior. If you

want to balance the load across several cards, you have to do this manually by specifying a target

of "mic:n" in your offloads, where n is the card to which you are offloading. Once synchronous offload

is understood, the next thing users usually want to do is manage data: copying them out, copying

them back, allocating them on the target, or some combination. The key to understanding offload

data is the recognition that the IN, OUT, INOUT and NOCOPY specifiers do not expand the normal scope

of data; they only restrict it. If a lot of data are visible but not all of them are actually used,

restricting how much gets copied between host and target may yield significant performance

improvements.

With a mechanism in place to filter data to and from the target, there are more things you can

specify about the data, such as their length, alignment, and dynamic allocation behavior. The

example below shows the use of the INOUT specifier to name a data pointer (memp), give its length

(200 elements whose size is determined by the type of memp) and its data alignment (8-byte). INOUT

does not change the offload behavior of identifiers in its charge, but provides a bit of syntax within

which the other qualities of the offloaded data can be specified:

#pragma offload target(mic:0), inout(memp : length(200) align(8))

The fourth keyword, NOCOPY, is most commonly used in programs employing pointers located

exclusively on the target side, along with the directives LENGTH, ALLOC_IF and FREE_IF. There is also a

mechanism using IN and LENGTH(0) directives to maintain the host-side connection to offloaded data,

which empowers the programmer to use data persistent on the target across multiple offloads

without having to recopy them at each offload. (More information on the use of these keywords is

available in the compiler reference documents, and at the Intel® Xeon Phi™ Community portal site.)

Asynchronous offload features are also available, wherein the strict, sequential, non-overlapping

execution between host and offload-target can be relaxed, using the SIGNAL and WAIT specifiers to

synchronize offloads and the OFFLOAD_TRANSFER and OFFLOAD_WAIT pragmas to handle data transfers

separately from offloading computation. These are also described in more

detail in the compiler reference documents.

More about these topics, including examples to show them in practical use, will appear in other

articles.

BKM: Use -openmp-report to observe how the compiler is realizing your OpenMP regions and loops

-openmp-report is a command-line switch that generates a report to stderr providing diagnostic

messages about the compiler's handling of OpenMP constructs processed during C/C++ or Fortran

source compilation. Use the setting -openmp-report=1 to learn about the handling of parallel loops,

regions and sections within the compilation unit. Make that a 2 to add diagnostics about various

synchronization constructs. Also look at -vec-report.


Applicability: Offloaded and native, C, C++ and Fortran.

Example: Be sure to enable -openmp when you employ the -openmp-report switch. From code

equipped with OpenMP parallel region and loop constructs you should get reports like these, or

others indicating failing loops or regions; the line numbers let you connect reports back to specific

OpenMP constructs.

MICFtest.F90(104): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
MICFtest.F90(110): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
MICFtest.F90(172): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
MICFtest.F90(186): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
MICFtest.F90(199): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
MICtest.cpp(534): (col. 5) remark: *MIC* OpenMP DEFINED REGION WAS PARALLELIZED.
MICtest.cpp(538): (col. 5) remark: *MIC* OpenMP DEFINED LOOP WAS PARALLELIZED.
MICtest.cpp(568): (col. 5) remark: *MIC* OpenMP DEFINED REGION WAS PARALLELIZED.
MICtest.cpp(582): (col. 5) remark: *MIC* OpenMP DEFINED LOOP WAS PARALLELIZED.
MICtest.cpp(594): (col. 2) remark: *MIC* OpenMP DEFINED LOOP WAS PARALLELIZED.

BKM: Uncertain whether that offload section is running on the target or the host? Count the number of

threads.

If you are uncertain whether the code you’re testing is actually running on the right target, it is fairly

easy to add a piece of temporary code that returns the number of threads in the thread team.

Applicability: Offloaded C, C++ and Fortran.

Example: Here are code samples demonstrating a way to use omp_get_num_threads() to report the

number of threads running in a parallel section. Remember to make the omp_get_num_threads() call

within a parallel section, but avoid extra work by ensuring that only one thread makes the call.

C/C++

using namespace std;
#pragma offload_attribute(push,target(mic)) //{
#include <iostream>
#include <omp.h>

void testThreadCount() {
    int thread_count;
    #pragma omp parallel
    {
        #pragma omp single
        thread_count = omp_get_num_threads();
    }
    cout << "Thread count: " << thread_count << endl;
}
#pragma offload_attribute(pop) //}

int main (int argc, char **argv)
{
    #pragma offload target(mic), if (argc > 1)
    testThreadCount();
}

Fortran

!DIR$ ATTRIBUTES OFFLOAD : mic :: testThreadCount
subroutine testThreadCount()
    use omp_lib
    integer :: thread_count
    !$omp parallel
    !$omp single
    thread_count = OMP_GET_NUM_THREADS()
    !$omp end single
    !$omp end parallel
    WRITE (*, '(A,I4)') "The number of threads is ", thread_count
end subroutine testThreadCount

program threadcnt
    integer :: arg_num
    arg_num = COMMAND_ARGUMENT_COUNT()
    !DIR$ ATTRIBUTES OFFLOAD : mic :: testThreadCount
    !DIR$ OFFLOAD target(mic), if (arg_num .ge. 1)
    call testThreadCount()
end program threadcnt

OMP_GET_NUM_THREADS() operates on thread state within the "innermost enclosing parallel region,"

but there are OMP calls that don't require such a context: you could use OMP_GET_MAX_THREADS() to

get an upper bound on the thread count without requiring a parallel region. Are you just interested in

where the code is running and don't care about thread counts at all? Even easier! Just replace the body

of testThreadCount() with the following code, which will be selectively emitted in the two copies of the

function:

#ifdef __MIC__ //{
    cout << "running on target" << endl;
#else //}{
    cout << "running on host" << endl;
#endif //}

Differences in OpenMP use in Native versus Offload environment

BKM: Default values of OpenMP parameters vary between offloaded and native execution

Because the system reserves a core for its own processing when running programs invoked via the

offload compiler (versus programs cross-compiled and run natively on the target), some of the

parameters available to OpenMP vary between these two execution environments.

Applicability: Offloaded and native C, C++ and Fortran.

Details: A simple expansion of the sample program shown above for getting the available number of

threads can reveal these differences. When run as an offload program, it produces numbers like the

following:


Thread count: 124
Threads_max: 124
Proc count: 124
Is dynamic: 0
Is nested: 0
Sched kind: 1
Sched modifier: 0
Thread limit: 2147483647
Max Active levels: 2147483647
Active Level: 1

While running the same program on the same machine as a native program produces the following:

Thread count: 128
Threads_max: 128
Proc count: 128
Is dynamic: 0
Is nested: 0
Sched kind: 1
Sched modifier: 0
Thread limit: 2147483647
Max Active levels: 2147483647
Active Level: 1

The variations here are all due to the availability of all the cores when running natively. This is

implemented in the kernel thread table and mimicked by an affinity map set up in the OpenMP

runtime environment. You can peek at this affinity map by setting the environment variable

KMP_AFFINITY=verbose prior to running an offloaded program. This will generate a lot of output,

some of which looks like this:

OMP: Info #156: KMP_AFFINITY: 124 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 31 cores/pkg x 4 threads/core (31 total cores)
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124}

This last line is repeated for each of the available threads. In the above example, there are 31 cores

available, giving 124 threads. Logging into the card and setting KMP_AFFINITY=verbose before

running the code as a native program shows how the above mapping has been altered:

OMP: Info #156: KMP_AFFINITY: 128 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 32 cores/pkg x 4 threads/core (32 total cores)
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127}

Here the full set of 32 cores is available and the extra HW threads show up in the affinity map as

numbers 0, 125, 126 and 127. This odd numbering is due to the way the HW threads are

enumerated by the Advanced Programmable Interrupt Controller (APIC), as can be seen in the

contents of /proc/cpuinfo (filtered down to the relevant bits):

processor : 0
apicid : 124
processor : 1
apicid : 0
processor : 2
apicid : 1
processor : 3
apicid : 2
...
processor : 123
apicid : 122
processor : 124
apicid : 123
processor : 125
apicid : 125
processor : 126
apicid : 126
processor : 127
apicid : 127

This structure of HW thread mapping is present regardless of the number of cores on any particular

device: using offload causes the system to remove one core from the affinity map. The core

removed from the map is the one whose HW thread numbers are not adjacent—always the highest

three plus HW thread 0. This core is used by the Intel® Coprocessor Offload Infrastructure (Intel® COI)

and other offload-related services.

Between offloaded and native execution OpenMP programs can find themselves within very

different environments. On the native side a typical environment is very simple. Using a little

environ table dumper, it’s easy to print out the environment of the main process (LD_LIBRARY_PATH

was added to provide a path to the OpenMP runtime object).

~ # /tmp/getmyenvnat
ENV: "USER=root"
ENV: "LD_LIBRARY_PATH=/tmp"
ENV: "HOME=/"
ENV: "SSH_TTY=/dev/pts/0"
ENV: "LOGNAME=root"
ENV: "TERM=ansi"
ENV: "PATH=/usr/bin:/bin"
ENV: "SHELL=/bin/sh"
ENV: "PWD=/"
ENV: "SSH_CONNECTION=192.168.1.99 48111 192.168.1.100 22"

However, without user action to restrict propagation, the offloaded environment is nearly a complete

copy of the host environment. The low-level communications libraries, Intel® COI and Intel® Symmetric

Communications Infrastructure (Intel® SCI), each insert an identifier, while LD_LIBRARY_PATH is not

copied, but the host PATH is passed intact, with "/usr/bin:/bin" appended to it. It is easy to clear

away all this host environment clutter by defining MIC_ENV_PREFIX in the host environment:

$ export MIC_ENV_PREFIX=MIC
$ ./getmyenv mic
ENV: "COI_LOG_PORT=65535"
ENV: "ENV_PREFIX=MIC"
ENV: "PATH=/usr/bin:/bin"
ENV: "SCIF_SOURCE_NODE=0"
ENV: "__KMP_REGISTERED_LIB_1017=blahblahblahblah-libiomp5.so"

With MIC_ENV_PREFIX defined, the target environment is isolated from any weird side effects the

host environment might trigger, but particular variables can be passed through by adding a prefix to

the host variable name comprising the user-defined MIC_ENV_PREFIX plus an underscore, as shown

above. You can see that it is working, as evidenced by the ENV_PREFIX definition. Change

MIC_ENV_PREFIX to something else and that variable will go away.

Using this mechanism, it is possible to have separate environment values defined for runs on the

host or on the target. The one place this doesn’t work is the excluded variable, LD_LIBRARY_PATH.

If you look at your host environment after running the compiler initialization script, you’ll find both

LD_LIBRARY_PATH and MIC_LD_LIBRARY_PATH defined, the former identifying the host library

environment and the latter locating the target library paths for the offloaded code.

Topical Advice on OpenMP on Intel MIC Architecture

BKM: When modifying OpenMP stack size (OMP_STACKSIZE), take care to note the different options,

defaults and effects with high thread counts

The careful reader will note that there are a number of controls available to adjust stack size for

running threads on host and target, and that these controls interact with each other but don’t share

the same defaults.

Applicability: Offloaded and native C, C++ and Fortran.

Details: First there is KMP_STACKSIZE, whose values can be suffixed with B/K/M/G/T to mark bytes,

kilobytes, megabytes, etc.:

export KMP_STACKSIZE=4M

If the units are unspecified for KMP_STACKSIZE, the number is assumed to be in bytes.

KMP_STACKSIZE overrides any OMP_STACKSIZE setting, and if both are set the runtime will

generate a warning message to the user. OMP_STACKSIZE uses the same B/K/M/G/T suffix notation,

but if no suffix is given, OMP_STACKSIZE units default to kilobytes. To be safe, always specify a units

suffix with these parameters.

If offloading computation from the host and MIC_ENV_PREFIX is not defined, the stack-size

environment variables are copied from host to target environment when the target process is

spawned (along with the rest of the environment). With MIC_ENV_PREFIX defined, users can define

separate settings in the host environment for both host and target values:

export MIC_ENV_PREFIX=MIC
export OMP_STACKSIZE=8M
export MIC_OMP_STACKSIZE=2M

Caution: the Intel® Xeon Phi™ coprocessor can normally have over 60 cores, each with four HW threads,

for over 240 hardware threads in total. The default OpenMP stack size is 4 MB, so by default the threads

on this platform could allocate nearly a gigabyte of local memory just for their stacks! (The example

above would cut this resource limit in half on the target while doubling the default limit on the host side.)

In addition to these environment variables, the runtime libraries also respond to the host

environment setting of MIC_STACKSIZE. This controls the size of the stack in the target process

wherein offloaded code is run and so only applies to offloaded code. The default size of this stack is

12 MB but there’s only one of them. Likewise, native applications run in a process whose default

stack size is 8 MB.

BKM: Processor affinity may have a significant impact on the performance of your algorithm, so

understanding the affinity types available and the behaviors of your algorithm can help you adapt

affinity on the target architecture to maximize performance.

Hardware threads are grouped together into cores, within which they share resources. This sharing

of resources can benefit or harm particular algorithms, depending on their behaviors. Understanding

the behavior of your algorithms will guide you in selecting the optimal affinity. Affinity can be

specified in the environment (KMP_AFFINITY) or via a function call (kmp_set_affinity).

Applicability: Offloaded and native C, C++ and Fortran.

Details: Affinity starts with the process that runs the main OpenMP thread on the target. Offloaded

programs inherit an affinity map that hides the last core, which is dedicated to offload system

functions. Native programs can use all the cores, making the calculations required for balancing the

threads slightly different.

Some algorithms exploit data sharing between threads and can take advantage of shared caches to

speed computation. Other algorithms demand a lot of data that are visited only once, requiring the

threads to be spread out to make maximum use of the available bandwidth. And some algorithms lie

somewhere in between. OMP_NUM_THREADS and KMP_AFFINITY (and their functional

counterparts) let users and programmers configure their threads as appropriate:

export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=60
export MIC_KMP_AFFINITY=verbose,granularity=fine,scatter

On a 31-core target processor with an offloaded application the above sequence of host

environment settings would limit the HW threads to half the available number and distribute them as

broadly as possible across the cores. The example sets MIC_ENV_PREFIX to be able to selectively

set the number of threads and their distribution only for the target in the lines that follow.

Granularity is set to fine so each OpenMP thread is constrained to a single HW thread. Alternatively,

setting core granularity groups the OpenMP threads assigned to a core into little quartets with free

rein over their respective cores.

The affinity types COMPACT and SCATTER either clump OpenMP threads together on as few cores as

possible or spread them apart so that adjacent thread numbers are on different cores. Sometimes

though, it is advantageous to enable a limited number of threads distributed across the cores in a

manner that leaves adjacent thread numbers on the same cores. For this, there is a new affinity

type available, BALANCED, which does exactly that. Using the verbose setting shown above, you can

determine how the OpenMP thread numbers are dispersed. Just run a program with an offloaded

OpenMP component to trigger the affinity report. We can run that for several affinity-types and see

how they vary:

AFFINITY: fine, compact
AFFINITY: fine, scatter
AFFINITY: fine, balanced

[The verbose affinity map listings produced for each of these three affinity types are not reproduced here.]

Because MIC_OMP_NUM_THREADS was set to 60 but there are 31 cores in this example, the last

couple of cores only get a single OpenMP thread. Alternatively, setting MIC_OMP_NUM_THREADS to

62 in this example would move thread 59 to core 29 and leave threads 60 and 61 on core 30.

In the case of native execution, all the same rules apply, though we don't need to set

MIC_ENV_PREFIX or preface the environment variables with "MIC_". We may want to acknowledge the

availability of another core and bump OMP_NUM_THREADS up to 64 in order to populate all the

cores with two threads in the balanced affinity-type. Any irregularities in the underlying HW thread

numbering scheme for the last core are hidden by the affinity mask.

Beyond these affinity-types, more exotic mappings can be achieved through the low-level

affinity interface. Mention of these low-level affinity APIs also brings up a caution: it's very unlikely

that code being ported to Intel MIC Architecture that already has an explicit affinity map will have


the correct map for the new architecture; better to reset affinity back to some basic mapping until

experiments can be run to re-optimize affinity.

About the Author

Robert Reed is a software performance engineer in the Technical Computing, Analyzers and Runtimes

organization within the Intel Developer Products Division. Robert honed his programming skills over

many years working at Tektronix. After a migrant middle period (contracting and consulting for a

variety of companies), Robert settled at Intel where he has done a lot of performance optimization

and analysis on everything from architecture-specific functional simulations of DGEMM to playing with

fluids and cloth in high-end 3D editors. Robert has been working with the Intel MIC Architecture

program since before it was called that. When he’s not pushing bits, Robert likes to sing and see live

theatre, of which there is plenty in Portland.


Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. 
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go

to: http://www.intel.com/design/literature.htm

Intel, the Intel logo, VTune, Cilk and Xeon are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others

Copyright© 2013 Intel Corporation. All rights reserved.

Optimization Notice

http://software.intel.com/en-us/articles/optimization-notice/

Performance Notice

For more complete information about performance and benchmark results, visit

www.intel.com/benchmarks.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

*Other names and brands may be claimed as the property of others


Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

