

## Programming and Tuning for Intel® Xeon® Processors

Dr.-Ing. Michael Klemm, Software and Services Group, Intel (michael.klemm@intel.com)



# Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804



Copyright © 2015, Intel Corporation. All rights reserved. \*Other brands and names are the property of their respective owners.

## My Personal Disclaimer

Take this presentation with a grain of salt.

Some recipes, switches, settings, or some advice may or may not work on your system.

Please consult the manual and ask the operations team of the machine before you shoot yourself in the foot.





## Agenda

Intel<sup>®</sup> Platforms for HPC

Intel® Xeon® E5-2600v3 Processor Series

Programming for Intel Architecture

Controlling FP Arithmetic with Intel® Composer XE

Using Intel MPI for Performance



# The Book of the Year... 😳 (Or: A Shameless Plug)

Authors: Alexander Supalov, Andrey Semin, Michael Klemm, Chris Dahnken

Table of Contents:

- Foreword by Bronis de Supinski (CTO LLNL)
- Preface
- Chapter 1: No Time to Read this Book?
- Chapter 2: Overview of Platform Architectures
- Chapter 3: Top-Down Software Optimization
- Chapter 4: Addressing System Bottlenecks
- Chapter 5: Addressing Application Bottlenecks: Distributed Memory
- Chapter 6: Addressing Application Bottlenecks: Shared Memory
- Chapter 7: Addressing Microarchitecture Bottlenecks
- Chapter 8: Application Design Implications

> ISBN-13 (pbk): 978-1-4302-6496-5 ISBN-13 (electronic): 978-1-4302-6497-2

### Order now at http://www.clusterbook.info









# Intel Technologies for HPC

### Processors

Intel<sup>®</sup> Xeon<sup>®</sup> Processor



Intel® Many Integrated Core Coprocessor



Network & Fabric



I/O & Storage



Software & Services







# Transforming the Economics of HPC



### Executing to Moore's Law

Predictable silicon track record: alive and well at Intel. Enabling new devices with higher performance and functionality while controlling power, cost, and size.







## **Driving Innovation and Integration**

Enabled by Leading Edge Process Technologies



### Integrated Today



### Coming in the Future






# **The Magic of Integration**

Moore's Law at Work & Architecture Innovations







- 1970s: CRAY-1, **150 MFLOPS**
- 2015: 2S Intel® Xeon® Processor, **1,000,000 MFLOPS**








## Intel<sup>®</sup> Xeon Processor Architecture

# Intel "Tick-Tock" Roadmap – Part I

| Intel® Core™<br>Microarchitecture |                                          | Microarchitecture<br>Codename "Nehalem" |                                          | 2<sup>nd</sup> Generation<br>Intel® Core™<br>Microarchitecture | 3<sup>rd</sup> Generation<br>Intel® Core™<br>Microarchitecture |
|-----------------------------------|------------------------------------------|-----------------------------------------|------------------------------------------|----------------------------------------------------------------|----------------------------------------------------------------|
| Merom                             | Penryn                                   | Nehalem                                 | Westmere                                 | Sandy Bridge                                                   | Ivy Bridge                                                     |
| NEW<br>Microarchitecture<br>65nm  | NEW<br>Process Technology<br><b>45nm</b> | NEW<br>Microarchitecture<br><b>45nm</b> | NEW<br>Process Technology<br><b>32nm</b> | NEW<br>Microarchitecture<br><b>32nm</b>                        | NEW<br>Process Technology<br><b>22nm</b>                       |
| TOCK                              | TICK                                     | TOCK                                    | TICK                                     | TOCK                                                           | TICK                                                           |
| 2006<br>SSSE3                     | 2007<br>SSE4.1                           | 2008<br>SSE4.2                          | 2009<br>AES                              | 2011<br>AVX                                                    | 2012<br>RDRAND<br>etc.                                         |





## Intel "Tick-Tock" Roadmap – Part II

Future release dates and features are subject to change without notice!






# Recap: Sandy Bridge and Ivy Bridge Execution Units







# Haswell Core at a Glance



### Next generation branch prediction

- Improves performance and saves wasted work

### Improved front-end

- Initiate TLB and cache misses speculatively
- Handle cache misses in parallel to hide latency
- Leverages improved branch prediction

### Deeper buffers

- Extract more instruction parallelism
- More resources when running a single thread

### More execution units, shorter latencies

- Power down when not in use

### More load/store bandwidth

- Better prefetching, better cache line split latency & throughput, double L2 bandwidth

### No pipeline growth

- Same branch misprediction latency
- Same L1/L2 cache latency




## Haswell Execution Unit Overview



## Haswell Buffer Sizes

### Extract more parallelism in every generation

|                       | Nehalem   | Sandy Bridge | Haswell |
|-----------------------|-----------|--------------|---------|
| Out-of-order Window   | 128       | 168          | 192     |
| In-flight Loads       | 48        | 64           | 72      |
| In-flight Stores      | 32        | 36           | 42      |
| Scheduler Entries     | 36        | 54           | 60      |
| Integer Register File | N/A       | 160          | 168     |
| FP Register File      | N/A       | 144          | 168     |
| Allocation Queue      | 28/thread | 28/thread    | 56      |





# Core Cache Size/Latency/Bandwidth

| Metric               | Nehalem                                            | Sandy Bridge                                      | Haswell                                           |
|----------------------|----------------------------------------------------|---------------------------------------------------|---------------------------------------------------|
| L1 Instruction Cache | 32K, 4-way                                         | 32K, 8-way                                        | 32K, 8-way                                        |
| L1 Data Cache        | 32K, 8-way                                         | 32K, 8-way                                        | 32K, 8-way                                        |
| Fastest Load-to-use  | 4 cycles                                           | 4 cycles                                          | 4 cycles                                          |
| Load bandwidth       | 16 Bytes/cycle                                     | 32 Bytes/cycle<br>(banked)                        | 64 Bytes/cycle                                    |
| Store bandwidth      | 16 Bytes/cycle                                     | 16 Bytes/cycle                                    | 32 Bytes/cycle                                    |
| L2 Unified Cache     | 256K, 8-way                                        | 256K, 8-way                                       | 256K, 8-way                                       |
| Fastest load-to-use  | 10 cycles                                          | 11 cycles                                         | 11 cycles                                         |
| Bandwidth to L1      | 32 Bytes/cycle                                     | 32 Bytes/cycle                                    | 64 Bytes/cycle                                    |
| L1 Instruction TLB   | 4K: 128, 4-way<br>2M/4M: 7/thread                  | 4K: 128, 4-way<br>2M/4M: 8/thread                 | 4K: 128, 4-way<br>2M/4M: 8/thread                 |
| L1 Data TLB          | 4K: 64, 4-way<br>2M/4M: 32, 4-way<br>1G: fractured | 4K: 64, 4-way<br>2M/4M: 32, 4-way<br>1G: 4, 4-way | 4K: 64, 4-way<br>2M/4M: 32, 4-way<br>1G: 4, 4-way |
| L2 Unified TLB       | 4K: 512, 4-way                                     | 4K: 512, 4-way                                    | 4K+2M shared: 1024,<br>8-way                      |






# New Instructions in Haswell

| Group                             |                                                | Description                                                                                      | Count * |
|-----------------------------------|------------------------------------------------|--------------------------------------------------------------------------------------------------|---------|
| AVX2                              | SIMD Integer Instructions promoted to 256 bits | Adding vector integer operations to 256-bit                                                      | 170/124 |
| AVX2                              | Gather                                         | Load elements using a vector of indices, vectorization enabler                                   |         |
| AVX2                              | Shuffling / Data<br>Rearrangement              | Blend, element shift and permute instructions                                                    |         |
| FMA                               |                                                | Fused Multiply-Add operation forms (FMA-3)                                                       | 96/60   |
| Bit Manipulation and Cryptography |                                                | Improving performance of bit stream manipulation and decode, large integer arithmetic and hashes | 15/15   |
| TSX = RTM+HLE                     |                                                | Transactional Memory                                                                             | 4/4     |
| Others                            |                                                | MOVBE: Load and Store of Big Endian forms<br>INVPCID: Invalidate processor context ID            | 2/2     |

#### \* Total instructions / different mnemonics





# FMA: Fused Multiply Add Instruction

Improves accuracy and performance for commonly used class of algorithms



| Micro-<br>architecture | Instruction Set       | SP FLOPs<br>per cycle | DP FLOPs per<br>cycle |
|------------------------|-----------------------|-----------------------|-----------------------|
| Nehalem                | SSE (128-bits)        | 8                     | 4                     |
| Sandy Bridge           | AVX (256-bits)        | 16                    | 8                     |
| Haswell                | AVX2 (FMA) (256-bits) | 32                    | 16                    |

### 2x peak FLOPs/cycle (throughput)

| Latency (clocks)               | E5 v2 | E5 v3 | Ratio * |
|--------------------------------|-------|-------|---------|
| Multiply                       | 5     | 5     |         |
| Add                            | 3     | 3     |         |
| Multiply + Add (FMA on E5 v3)  | 8     | 5     | 0.625   |

>37% reduced latency (5-cycle FMA latency same as an FP multiply)
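The peak-rate and latency figures above follow from simple arithmetic. As a rough sketch, assuming each Haswell core has two 256-bit FMA units (this is not stated on the slide, although it is the commonly documented configuration) and counting one fused multiply-add as two floating-point operations:

$$
\begin{aligned}
\text{DP FLOPs/cycle} &= 2~\text{FMA units} \times \frac{256~\text{bits}}{64~\text{bits}} \times 2 = 16 \\
\text{SP FLOPs/cycle} &= 2~\text{FMA units} \times \frac{256~\text{bits}}{32~\text{bits}} \times 2 = 32 \\
\frac{\text{FMA latency}}{\text{multiply latency} + \text{add latency}} &= \frac{5}{5 + 3} = 0.625
\end{aligned}
$$

The last line reads the 5 and 3 in the latency table as the multiply and add latencies. A ratio of 0.625 corresponds to a 37.5% reduction, which matches the ">37%" quoted above.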

Increased performance potential for Technical Computing workloads like Structural Analysis, CFD, EMF computation, Cosmology, ...




\* Lower is better



## Intel<sup>®</sup> Xeon Processor Platforms



Figure: Intel® Xeon® E5 (2x QPI, PCIe3, memory) and E3 (PCIe3, memory) platform diagrams.

## Intel<sup>®</sup> Xeon<sup>®</sup> Processors








## Intel® Xeon® Processors & Platforms










# Intel<sup>®</sup> Xeon<sup>®</sup> E5-2600v3 Processor Overview









# Key Differences Between E5-2600 v2 & E5-2600 v3

|                           | Xeon E5-2600 v2                                             | Xeon E5-2600 v3                                                                     |
|---------------------------|-------------------------------------------------------------|-------------------------------------------------------------------------------------|
| Core Count                | Up to 12 Cores                                              | Up to 18 Cores                                                                      |
| Frequency                 | TDP & Turbo Frequencies                                     | TDP & Turbo Freq<br>AVX & AVX Turbo Freq                                            |
| AVX Support               | AVX 1<br>8 DP Flops/Clock/Core                              | AVX 2<br>16 DP Flops/Clock/Core                                                     |
| Memory Type               | 4xDDR3 channels<br>RDIMM, UDIMM, LRDIMM                     | 4xDDR4 channels<br>RDIMM, LRDIMM                                                    |
| Memory Frequency<br>(MHz) | 1866 (1DPC), 1600, 1333, 1066                               | RDIMM: 2133 (1DPC), 1866 (2DPC), 1600<br>LRDIMM: 2133 (1&2DPC), 1600                |
| QPI Speed                 | Up to 8.0 GT/s                                              | Up to 9.6 GT/s                                                                      |
| TDP                       | Up to 130W Server, 150W<br>Workstation                      | Up to 145W Server, 160W Workstation<br>(Increase due to Integrated VR)              |
| Power<br>Management       | Same P-states for all cores<br>Same core & uncore frequency | Per-core P-states<br>Independent uncore frequency scaling<br>Energy Efficient Turbo |





# On-Die Interconnect Enhancements







# Haswell EP Die Configurations



#### Not representative of actual die-sizes, orientation and layouts - for informational use only.

| Chop | Columns | Home Agents | Cores | Power (W) | Transistors (B) | Die Area (mm²) |
|------|---------|-------------|-------|-----------|-----------------|----------------|
| HCC  | 4       | 2           | 14-18 | 110-145   | 5.69            | 662            |
| MCC  | 3       | 2           | 6-12  | 65-160    | 3.84            | 492            |
| LCC  |         | 1           | 4-8   | 55-140    | 2.60            | 354            |

# Haswell Processor Improvements

| Агеа                              | Change                                                                                                                     | Benefit                                                                                                                                                                                  |
|-----------------------------------|----------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| On-die interconnect               | Two Fully Buffered Rings                                                                                                   | Enables higher core counts and provides higher bandwidth<br>per core.                                                                                                                    |
| Home Agent / Memory<br>Controller | <ul><li>DDR4</li><li>Two Home Agents in more SKUs</li><li>Directory Cache</li></ul>                                        | <ul> <li>Increased memory bandwidth and power efficiency</li> <li>Greater socket BW with more outstanding requests</li> <li>Lower average memory latency</li> </ul>                      |
| LLC                               | <ul> <li>Cluster On Die (COD) mode</li> <li>Improved LLC allocation policy</li> <li>Cache Allocation Monitoring</li> </ul> | <ul> <li>Increased performance, reduced latency</li> <li>Enables improved performance by better application placement in a virtualized environment</li> </ul>                            |
| Power Management                  | <ul> <li>Separate clock and voltage domains for each core<br/>and uncore (enables PCPS, UFS)</li> </ul>                    | <ul><li>Better performance per watt</li><li>Lower socket idle (package C6) power.</li></ul>                                                                                              |
| QPI 1.1                           | Increase to 9.6GT/s                                                                                                        | Multi-socket coherence performance                                                                                                                                                       |
| Integrated<br>IO-Hub (IIO)        | <ul> <li>LLC cache tracks IIO cache line ownership</li> <li>Increased PCIe buffers and credits</li> </ul>                  | <ul> <li>Improves PCIe bandwidth under conflicts (concurrent accesses to the same cache line).</li> <li>Increase PCIe bandwidth and latency tolerance</li> </ul>                         |
| PCI Express 3.0                   | <ul> <li>DualCast - Allows a single write transaction to multiple targets.</li> <li>Relaxed ordering</li> </ul>            | <ul> <li>Utilized to minimize memory channel bandwidth – data can<br/>be sent to memory and on the NTB port. Storage<br/>applications are typically memory bandwidth limited.</li> </ul> |





# DDR4 Benefits

### **Lower Power**

- Lower voltage (1.5v -> 1.2v) DIMMs
- Smaller page size (1024 -> 512) for x4 devices
- Initial results show savings of ~2W per DIMM at the wall.

### Improved RAS

- Command/Address Parity error recovery

### Higher bandwidth

- 14% higher STREAM results (DDR4-2133 vs. DDR3-1866)
- Increased DIMM frequency when multiple DIMMs per channel are installed

| DIMMs / Channel | DDR3 1.5v | DDR3 1.35v | DDR4 RDIMM | DDR4 LRDIMM |
|------------|-----------|------------|------------|-------------|
| 1          | 1866      | 1600       | 2133       | 2133        |
| 2          | 1600      | 1333       | 1866       | 2133        |
| 3          | 1066      | 800        | 1600       | 1600        |






# Cluster on Die (COD) Mode

- Supported on 1S & 2S SKUs with 2 Home Agents (10+ cores)
- In-memory directory bits and directory cache used on 2S to reduce coherence traffic and cache-to-cache transfer latencies
- Targeted at NUMA-optimized workloads where latency is more important than sharing across Caching Agents
  - Reduces average LLC hit and local memory latencies
  - HA sees most requests from reduced set of threads potentially offering higher effective memory bandwidth
- OS/VMM own NUMA and process affinity decisions

### COD Mode for 18C E5-2600 v3







# Intel® Turbo Boost Technology 2.0 and Intel® AVX\*

- Intel<sup>®</sup> Turbo Boost Technology 2.0 automatically allows processor cores to run faster than the Rated and AVX base frequencies if they're operating below power, current, and temperature specification limits.
- Amount of turbo frequency achieved depends on the type of workload, number of active cores, estimated current & power consumption, and processor temperature
- Due to workload dependency, separate AVX base & turbo frequencies will be defined for Xeon<sup>®</sup> processors starting with E5 v3 product family



Figure panels: Previous Generations; E5 v3 & Future Generations.



# How does frequency change with AVX workloads?

### Core detects presence of AVX instructions

AVX instructions draw more current & higher voltage is needed to sustain operating conditions

Core signals to Power Control Unit (PCU) to provide additional voltage & core slows the execution of AVX instructions

- Need to maintain TDP limits, so increasing voltage may cause frequency drop
- Amount of frequency drop will depend on workload power & AVX frequency limits

PCU signals that the voltage has been adjusted & core returns to full execution throughput

PCU returns to regular (non-AVX) operating mode 1 ms after AVX instructions are completed







# Programming for Intel Architecture

# Highly Parallel Applications



Efficient vectorization, threading, and parallel execution drive higher performance for suitable scalable applications



# Parallel Programming for Intel® Architecture

| Level       | Approach                                                                                              |
|-------------|-------------------------------------------------------------------------------------------------------|
| NODES       | Use Intel® MPI, Co-Array Fortran                                                                      |
| CORES       | Use threads directly (pthreads) or via OpenMP*, C++11<br>Use tasking, Intel® TBB / Cilk™ Plus         |
| VECTORS     | Intrinsics, auto-vectorization, vector-libraries<br>Language extensions for vector programming (SIMD) |
| BLOCKING    | Use caches to hide memory latency<br>Organize memory access for data reuse                            |
| DATA LAYOUT | Structure of arrays facilitates vector loads / stores, unit stride<br>Align data for vector accesses  |

Parallel programming to utilize the hardware resources, in an abstracted and portable way
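As a small illustration of how the CORES and VECTORS levels combine, the following sketch uses the OpenMP 4.0 combined construct to split a loop across threads and run each chunk with SIMD instructions. The function name `saxpy` and its signature are hypothetical and not taken from the slides. OpenMP has to be enabled at compile time (`-qopenmp` for ICC, `-fopenmp` for GCC); without it the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example kernel: y[i] = alpha * x[i] + y[i].
 * "parallel for" distributes iterations across threads (CORES),
 * "simd" asks for vector code inside each thread's chunk (VECTORS). */
void saxpy(size_t n, float alpha, const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

The unit-stride accesses to `x` and `y` also fit the DATA LAYOUT row: each thread streams through contiguous memory.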




## Programming for Intel Processors






# Preparing Code for SIMD








# Data Layout – Common Layouts

## Array-of-Structs (AoS)

| x | y | z | x | y | z |
|---|---|---|---|---|---|
| x | y | z | x | y | z |
| x | y | z | x | y | z |

- Pros: Good locality of {x, y, z}.
   1 memory stream.
- Cons: Potential for gather/scatter.

## Struct-of-Arrays (SoA)



- Pros: Contiguous load/store.
- Cons: Poor locality of {x, y, z}.
   3 memory streams.

## Hybrid (AoSoA)

| x | x | y | y | z | z |
|---|---|---|---|---|---|
| x | x | y | y | z | z |
| x | x | y | y | z | z |

- Pros: Contiguous load/store.
   1 memory stream.
- Cons: Not a "normal" layout.
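The three layouts can be written down as C types. This is a minimal sketch: the type names, the array length `NPOINTS`, and the block width `WIDTH` are illustrative assumptions, not definitions from the slides. The three `sum_x_*` functions compute the same value but touch memory with different strides.

```c
#include <assert.h>
#include <stddef.h>

#define NPOINTS 8   /* illustrative number of points */
#define WIDTH   4   /* illustrative AoSoA block width */

/* Array-of-Structs: x, y, z of one point are adjacent in memory. */
struct point_aos { float x, y, z; };

/* Struct-of-Arrays: all x values are contiguous, then all y, then all z. */
struct points_soa { float x[NPOINTS], y[NPOINTS], z[NPOINTS]; };

/* Hybrid (AoSoA): an array of small SoA blocks of WIDTH points each. */
struct block_aosoa { float x[WIDTH], y[WIDTH], z[WIDTH]; };

/* AoS: consecutive x values are 3 floats apart (strided access). */
float sum_x_aos(const struct point_aos *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) s += p[i].x;
    return s;
}

/* SoA: consecutive x values are adjacent (unit stride). */
float sum_x_soa(const struct points_soa *p, size_t n)
{
    assert(n <= NPOINTS);
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) s += p->x[i];
    return s;
}

/* AoSoA: unit stride inside each block, one memory stream overall. */
float sum_x_aosoa(const struct block_aosoa *b, size_t nblocks)
{
    float s = 0.0f;
    for (size_t k = 0; k < nblocks; ++k)
        for (size_t i = 0; i < WIDTH; ++i) s += b[k].x[i];
    return s;
}
```

Choosing `WIDTH` as a multiple of the vector length (for example 8 floats for 256-bit AVX) lets the inner AoSoA loop map onto whole vector loads; whether that pays off depends on the access pattern of the real application.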



## Data Layout – Why It's Important

#### Instruction-Level

- Hardware is optimized for contiguous loads/stores.
- Support for non-contiguous accesses differs with hardware (e.g., AVX2/KNC gather).

#### Memory-Level

- Contiguous memory accesses are cache-friendly.
- Number of memory streams can place pressure on prefetchers.



# Data Alignment – Why It's Important



#### Aligned Load

- Address is aligned.
- One cache line.
- One instruction.

#### Unaligned Load

- Address is not aligned.
- Potentially multiple cache lines.
- Potentially multiple instructions.



# Data Alignment – Sample Applications

## 1) Align Memory

\_mm\_malloc(bytes, 64) / !dir\$ attributes align:64

## 2) Access Memory in an Aligned Way

for (i = 0; i < N; i++) { array[i] ... }

## 3) Tell the Compiler

- #pragma vector aligned / !dir\$ vector aligned
- \_\_assume\_aligned(p, 16) / !dir\$ assume\_aligned (p, 16)
- \_\_assume(i % 16 == 0) / !dir\$ assume (mod(i, 16) .eq. 0)
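The three steps can be sketched in portable C. This sketch deliberately swaps the Intel-specific calls named above for standard equivalents: C11 `aligned_alloc` stands in for `_mm_malloc`, and the OpenMP `aligned` clause stands in for `#pragma vector aligned` and `__assume_aligned`. The helper names `alloc_floats` and `scale` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGNMENT 64  /* bytes, the same value passed to _mm_malloc above */

/* 1) Align memory. aligned_alloc expects the size to be a multiple of
 *    the alignment, so the request is rounded up. Release with free(). */
float *alloc_floats(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    if (rounded == 0) rounded = ALIGNMENT;
    return aligned_alloc(ALIGNMENT, rounded);
}

/* 2) Access memory from index 0 with unit stride, and
 * 3) tell the compiler that both pointers are aligned. */
void scale(float *dst, const float *src, float factor, size_t n)
{
    /* The aligned clause is a promise; check it in debug builds. */
    assert((uintptr_t)dst % ALIGNMENT == 0);
    assert((uintptr_t)src % ALIGNMENT == 0);
    #pragma omp simd aligned(dst, src : ALIGNMENT)
    for (size_t i = 0; i < n; ++i) {
        dst[i] = factor * src[i];
    }
}
```

Promising alignment that does not hold can crash the program or produce wrong results, which is why the sketch checks the addresses before the loop.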























## Implicit Vectorization

Very powerful, but a compiler cannot make unsafe assumptions.

```
int* g_size;

void not_vectorizable
(float* a, float* b, float* c, int* ind) {
    for (int i = 0; i < *g_size; i++) {
        int j = ind[i];
        c[j] += a[i] + b[i];
    }
}
```

- Unsafe Assumptions:
  - a, b and c point to different arrays.
  - Value of global g\_size is loop-invariant.
  - ind[i] is a one-to-one mapping.



## Use the Compiler's Optimization Report

```
Begin optimization report for: not_vectorizable(float *, float *, float *, int *)

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (not_vectorizable(float *, float *, float *, int *)) [1] vectorize.cc(4,63)

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

LOOP BEGIN at vectorize.cc(5,9)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed ANTI dependence between line 7 and line 7
   remark #25439: unrolled with remainder by 2
LOOP END

LOOP BEGIN at vectorize.cc(5,9)
<Remainder>
LOOP END
```



## Implicit Vectorization

Very powerful, but a compiler cannot make unsafe assumptions.

```
int* g_size;

void vectorizable
(float* restrict a, float* restrict b, float* restrict c, int* restrict ind) {
    int size = *g_size;
    #pragma ivdep
    for (int i = 0; i < size; i++) {
        int j = ind[i];
        c[j] += a[i] + b[i];
    }
}
```

- Safe Assumptions:
  - a, b and c point to different arrays. (restrict)
  - Value of global g\_size is loop-invariant. (pointer dereference outside loop)
  - ind[i] is a one-to-one mapping. (#pragma ivdep)



## Implicit Vectorization – Improving Performance

Getting code to vectorize is only half the battle

- "LOOP WAS VECTORIZED" != "the code is optimal"
- Vectorized code can be slower than the scalar equivalent.

Compiler will always choose correctness over performance

- "Hints" and pragmas can't possibly cover all the situations...
- ... but we can usually rewrite loop bodies to assist the compiler.
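One possible rewrite of the indirect update loop from the earlier example is to split it in two, so that the part without dependences can be vectorized on its own. This is only a sketch of the idea: the function name `add_scatter_split` and the scratch buffer `tmp` are assumptions introduced here, and whether the split is faster than the `#pragma ivdep` version depends on the hardware and the data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical rewrite of c[ind[i]] += a[i] + b[i].
 * tmp must point to caller-provided scratch space for n floats. */
void add_scatter_split(float *c, const float *restrict a,
                       const float *restrict b, const int *ind,
                       float *restrict tmp, int n)
{
    assert(tmp != NULL);

    /* Part 1: unit stride, no dependence between iterations.
     * A straightforward candidate for auto-vectorization. */
    for (int i = 0; i < n; i++) {
        tmp[i] = a[i] + b[i];
    }

    /* Part 2: the indirect update stays a plain scalar loop, so it
     * remains correct even when ind contains duplicate indices. */
    for (int i = 0; i < n; i++) {
        c[ind[i]] += tmp[i];
    }
}
```

Unlike `#pragma ivdep`, this version makes no promise that `ind[i]` is a one-to-one mapping, at the cost of an extra buffer and a second pass over the data.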





## Explicit Vectorization

#### **Compiler Responsibilities**

- Allow programmer to declare that code can and should be run in SIMD.
- Generate the code the programmer asked for.

#### **Programmer Responsibilities**

- Correctness (e.g., no dependencies, no invalid memory accesses).
- Efficiency (e.g., alignment, loop order, masking).



## Explicit Vectorization – Motivating Example 1

```
float sum = 0.0f;
float *p = a;
int step = 4;
#pragma omp simd reduction(+:sum) linear(p:step)
for (int i = 0; i < N; ++i) {
    sum += *p;
    p += step;
}
```

- The two += operators have different meaning from each other.
- The programmer should be able to express those differently.
- The compiler has to generate different code.
- The variables *i*, *p* and *step* have different "meaning" from each other.



## Explicit Vectorization – Motivating Example 2

```
#pragma omp declare simd simdlen(16)
uint32_t mandel(fcomplex c)
{
    uint32_t count = 1; fcomplex z = c;
    for (int32_t i = 0; i < max_iter; i += 1) {
        z = z * z + c;
        int t = cabsf(z) < 2.0f;
        count += t;
        if (!t) { break;}
    }
    return count;
}
```

- mandel() function is called from a loop over X/Y points.
- We would like to vectorize that outer loop.
- Compiler creates a vectorized function that acts on a vector of N values of c.
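The calling side might look like the sketch below. To keep the block self-contained it uses separate real and imaginary `float` arguments instead of the `fcomplex` type, a fixed `MAX_ITER` instead of `max_iter`, and a squared-magnitude test instead of `cabsf`; the names `mandel_point` and `mandel_row` are hypothetical. The `simdlen` clause is omitted so the compiler picks the vector length.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_ITER 64  /* illustrative iteration limit */

/* Ask the compiler to also generate a SIMD variant of this function. */
#pragma omp declare simd
uint32_t mandel_point(float re, float im)
{
    uint32_t count = 1;
    float zr = re, zi = im;
    for (int32_t i = 0; i < MAX_ITER; i += 1) {
        float nr = zr * zr - zi * zi + re;   /* real part of z*z + c */
        float ni = 2.0f * zr * zi + im;      /* imaginary part of z*z + c */
        zr = nr;
        zi = ni;
        int t = (zr * zr + zi * zi) < 4.0f;  /* |z| < 2, without the sqrt */
        count += (uint32_t)t;
        if (!t) { break; }
    }
    return count;
}

/* Outer loop over the X points of one image row. With OpenMP SIMD
 * enabled, each vector iteration can call the SIMD variant above. */
void mandel_row(uint32_t *out, int width, float x0, float dx, float y)
{
    assert(width >= 0);
    #pragma omp simd
    for (int ix = 0; ix < width; ++ix) {
        out[ix] = mandel_point(x0 + (float)ix * dx, y);
    }
}
```

Lanes that leave the inner loop early are typically masked off while the remaining lanes keep iterating, so the speedup depends on how similar the iteration counts of neighbouring points are.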






## Explicit Vectorization – Performance Impact



M. Klemm, A. Duran, X. Tian, H. Saito, D. Caballero, and X. Martorell, "Extending OpenMP with Vector Constructs for Modern Multicore SIMD Architectures," in Proc. of the Intl. Workshop on OpenMP, pages 59-72, Rome, Italy, June 2012. LNCS 7312.





## Controlling FP Arithmetic with Intel<sup>®</sup> Composer XE

## Standard Compiler Switches

| GCC                | ICC          | Effect                                                        |
|--------------------|--------------|---------------------------------------------------------------|
| -O0                | -O0          | Disable (almost all) optimization.                            |
| -O1                | -O1          | Optimize for speed (no code size increase for ICC)            |
| -O2                | -O2          | Optimize for speed and enable vectorization (default for ICC) |
| -O3                | -O3          | Turn on high-level optimizations                              |
| -flto              | -ipo         | Enable interprocedural optimization                           |
| -ftree-vectorize   | -vec         | Enable auto-vectorization (auto-enabled with -O2 and -O3)     |
| -fprofile-generate | -prof-gen    | Generate runtime profile for optimization                     |
| -fprofile-use      | -prof-use    | Use runtime profile for optimization                          |
|                    | -parallel    | Enable auto-parallelization                                   |
| -fopenmp           | -qopenmp     | Enable OpenMP                                                 |
| -g                 | -g           | Emit debugging symbols                                        |
|                    | -qopt-report | Generate the optimization report                              |
|                    | -ansi-alias  | Enable ANSI aliasing rules for C/C++                          |
| -march=corei7      | -xSSE4.1     | Generate code for Intel processors with SSE 4.1 instructions  |
| -march=corei7-avx  | -xAVX        | Generate code for Intel processors with AVX1 instructions     |
| -march=core-avx2   | -xCORE-AVX2  | Generate code for Intel processors with AVX2 instructions     |
| -march=native      | -xHost       | Generate code for the current machine used for compilation    |

# Users Frequently Want Consistent FP Results (which is not necessarily the "most accurate" result)

Root cause for variations in results

- Floating-point numbers → order of computation matters!
- Single-precision arithmetic example: $(a+b)+c \ne a+(b+c)$
  - $2^{26} - 2^{26} + 1 = 1$ (infinitely precise result)
  - $(2^{26} - 2^{26}) + 1 = 1$ (correct IEEE single precision result)
  - $2^{26} - (2^{26} - 1) = 0$ (correct IEEE single precision result)
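The single-precision example above can be checked with a few lines of C. This is a minimal sketch; the function names are made up for illustration, and `volatile` is used only to force each intermediate to be rounded to `float` and to keep the compiler from folding the expressions at compile time.

```c
/* Single-precision illustration that (a + b) + c != a + (b + c). */
float sum_left_first(void)
{
    volatile float big = 67108864.0f;   /* 2^26 */
    volatile float t = big - big;       /* exactly 0 */
    return t + 1.0f;                    /* (2^26 - 2^26) + 1 */
}

float sum_right_first(void)
{
    volatile float big = 67108864.0f;   /* 2^26 */
    volatile float t = big - 1.0f;      /* 2^26 - 1 is not representable
                                           and rounds back to 2^26 */
    return big - t;                     /* 2^26 - (2^26 - 1) */
}
```

Both functions follow IEEE single precision correctly, yet they return different values because only the grouping differs.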

Conditions that affect the order of computations

- Different code branches (e.g., x87 versus SSE2 or AVX)
- Memory alignment (scalar or vector code)
- Dynamic parallel task/thread/rank scheduling

#### Bitwise repeatable/reproducible results

- repeatable = results the same as last run (same conditions)
- reproducible = results the same as results in other environments
- environments = OS / CPU / architecture / # threads / # processes / BIOS / pinning

Example: repeated runs of the same computation may alternate between slightly different results, e.g., 4.012345678901111 and 4.012345678902222.




## Intel64 Register Set



- General-purpose registers (eax ... edi): fourteen 32-bit registers; hold scalar data and addresses; direct access to registers.
- x87 FP / MMX registers: eight 80/64-bit registers; hold data only; direct access to MM0..MM7; no MMX<sup>™</sup> Technology / FP interoperability.
- AVX registers (ymm): sixteen 256-bit registers; hold data only (8 x single FP numbers or 4 x double FP numbers); overlap with the 128-bit SSE registers.

AVX-512 will extend ymm[0..15] to zmm[0..31] with 512 bits each.




## FP Model Summary

| Key                                     | Value Safety  | Expression<br>Evaluation               | FPU Environ.<br>Access | Precise FP<br>Exceptions |
|-----------------------------------------|---------------|----------------------------------------|------------------------|--------------------------|
| precise<br>source<br>double<br>extended | Safe          | Varies<br>Source<br>Double<br>Extended | No                     | No                       |
| strict                                  | Safe          | Varies                                 | Yes                    | Yes                      |
| fast=1<br>(default)                     | Unsafe        | Unknown                                | No                     | No                       |
| fast=2                                  | Very Unsafe   | Unknown                                | No                     | No                       |
| except<br>except-                       | \*/\*\*<br>\* | \*<br>\*                               | \*<br>\*               | Yes<br>No                |

\* These modes are unaffected. -fp-model except[-] only affects the precise FP exceptions mode.

\*\* It is illegal to specify -fp-model except in an unsafe value safety mode.


## Value Safety

In SAFE mode, the compiler may not make any transformations that could affect the result; e.g., all of the following are prohibited:

| Transformation            | Why it is not value safe                                |
|---------------------------|---------------------------------------------------------|
| x / x ⇔ 1.0               | x could be 0.0, ∞, or NaN                               |
| x – y ⇔ –(y – x)          | If x equals y, $x - y$ is +0.0 while $-(y - x)$ is -0.0 |
| x – x ⇔ 0.0               | x could be ∞ or NaN                                     |
| x \* 0.0 ⇔ 0.0            | x could be -0.0, ∞, or NaN                              |
| x + 0.0 ⇔ x               | x could be -0.0                                         |
| (x + y) + z ⇔ x + (y + z) | General reassociation is not value safe                 |
| (x == x) ⇔ true           | x could be NaN                                          |

- UNSAFE mode is the default
- VERY UNSAFE mode enables riskier transformations
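A few of the identities in the table can be probed directly. The sketch below assumes IEEE 754 arithmetic with the default rounding mode and a value-safe compilation (no fast-math style options); the function names are hypothetical. Each function evaluates the left-hand side of one identity and reports whether the "obvious" simplification would have been correct for the given input.

```c
#include <math.h>   /* INFINITY, NAN, signbit */

/* `volatile` keeps the compiler from simplifying the expressions itself. */
int x_minus_x_is_zero(double v) { volatile double x = v; return (x - x) == 0.0; }
int x_equals_x(double v)        { volatile double x = v; return x == x; }
int x_over_x_is_one(double v)   { volatile double x = v; return (x / x) == 1.0; }

/* Returns 1 if x + 0.0 has the same sign bit as x, 0 otherwise. */
int x_plus_zero_keeps_sign(double v)
{
    volatile double x = v;
    volatile double y = x + 0.0;
    return (signbit(x) != 0) == (signbit(y) != 0);
}
```

For ordinary finite inputs every function returns 1; the special values named in the table (NaN, ∞, 0.0, -0.0) are the inputs for which the simplification would change the result.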





## Reassociation Example: Reductions
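As an illustration of why reassociating a reduction is not value safe, the following sketch sums the same array once in source order and once with two interleaved partial sums, roughly the way a 2-lane vectorized reduction might group the additions. The function names and the test data are assumptions; `volatile` only forces rounding to `float` after every addition.

```c
/* Sum in source order: ((a[0] + a[1]) + a[2]) + ... */
float sum_sequential(const float *a, int n)
{
    volatile float s = 0.0f;
    for (int i = 0; i < n; ++i) {
        s = s + a[i];
    }
    return s;
}

/* Sum with two interleaved partial sums (n is assumed to be even):
 * lane0 collects the even elements, lane1 the odd ones. */
float sum_two_lanes(const float *a, int n)
{
    volatile float lane0 = 0.0f, lane1 = 0.0f;
    for (int i = 0; i < n; i += 2) {
        lane0 = lane0 + a[i];
        lane1 = lane1 + a[i + 1];
    }
    return lane0 + lane1;
}
```

For the input {2^26, 1, -2^26, 1} the sequential sum loses the first 1 when it is added to 2^26, while the two-lane version keeps both ones in a separate partial sum, so the two functions disagree.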





# Use of FMA Instructions [1]

**Potential issue:** Because an FMA does not round the intermediate product, the final result may differ from that on older (non-FMA) CPUs

- For QA comparisons to older processors, FMAs in compiled code can be disabled explicitly by
  - -no-fma (/Qfma-)
  - -fp-model strict (disables much more besides)
- FMAs can be disabled at function level by
  - #pragma fp\_contract (off) (C/C++); !DIR\$ NOFMA (Fortran)





# Use of FMA Instructions [2]

Putting multiply & add on separate lines does not disable FMA

```
t = a*b;
result = t + c;   // may still generate FMA

t = a*b;
_mm_mfence();
result = t + c;   // no FMA
```

- FMAs are not (completely) disabled by -fp-model precise
- None of the above disables FMA usage in the math library
  - requires -fimf-arch-consistency=true
- Results may change on "Haswell" with respect to "Sandy Bridge" even without recompilation!
  - the math library may take a different path at run-time
- Useful for debugging: the fma() and fmaf() functions from math.h give the FMA result with a single rounding via a libm call, even on processors with no FMA instruction



## Sample of FMA Rounding Difference

```
double sub(double a, double b, double c, double d )
{
    c = -a;
    d = b;
    return (a*b + c*d);
}
```

- Without FMA, should evaluate to zero
- With FMA, it may not evaluate to zero
  - Returns FMA(a, b, (c\*d)) or FMA(c, d, (a\*b))
  - Each has different rounding; it is unspecified which grouping the compiler will generate

This behavior is not suppressed by '-fp-model precise'!
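The effect can be reproduced without FMA hardware by emulating the single rounding. The sketch below is an illustration under stated assumptions, not what the compiler emits: it works in single precision so that the exact product fits in a `double`, it fixes c = -a and d = b as in the sample, and the function names are hypothetical.

```c
/* Without FMA: both products are rounded to float before the addition,
 * so a*b and (-a)*b cancel exactly. */
float sub_no_fma(float a, float b)
{
    volatile float p = a * b;       /* rounded product a*b */
    volatile float q = (-a) * b;    /* rounded product c*d, equals -p */
    return p + q;
}

/* Emulated FMA(a, b, c*d): the product a*b is kept exact (a float*float
 * product always fits in a double); only c*d is rounded beforehand. */
float sub_emulated_fma(float a, float b)
{
    volatile float q = (-a) * b;             /* rounded product c*d */
    double exact = (double)a * (double)b;    /* unrounded product a*b */
    return (float)(exact + (double)q);
}
```

With a = b = 1 + 2^-12, the exact product is 1 + 2^-11 + 2^-24, which rounds to 1 + 2^-11 in single precision. The non-FMA version therefore returns zero, while the emulated FMA returns the rounding error 2^-24.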





## FP Model and FMA Summary

| Key                                     | Value Safety  | Expression<br>Evaluation               | FPU Environ.<br>Access | Precise FP<br>Exceptions | FMA Use  |
|-----------------------------------------|---------------|----------------------------------------|------------------------|--------------------------|----------|
| precise<br>source<br>double<br>extended | Safe          | Varies<br>Source<br>Double<br>Extended | No                     | No                       | Yes      |
| strict                                  | Safe          | Varies                                 | Yes                    | Yes                      | No       |
| fast=1<br>(default)                     | Unsafe        | Unknown                                | No                     | No                       | Yes      |
| fast=2                                  | Very Unsafe   | Unknown                                | No                     | No                       | Yes      |
| except<br>except-                       | \*/\*\*<br>\* | \*<br>\*                               | \*<br>\*               | Yes<br>No                | \*<br>\* |

\* These modes are unaffected. -fp-model except[-] only affects the precise FP exceptions mode.

\*\* It is illegal to specify **-fp-model except** in an unsafe value safety mode.


## All Libraries Optimized for HSW

Math libraries detect the target processor on their own, independent of the code generated by the compiler

- e.g., the HSW-optimized version will be executed on HSW even if the binary is compiled for Sandy Bridge (-xavx)
- Can be disabled by the switch -fimf-arch-consistency=true for libimf and libsvml, and by the CBWR API (conditional bitwise reproducibility) for Intel<sup>®</sup> MKL-VML
- HSW optimization does not necessarily imply use of FMA!

| Library | Used for                                                                                | Comment                         |
|---------|-----------------------------------------------------------------------------------------|---------------------------------|
| libimf  | Library routines for single elements –<br>libm replacement                              |                                 |
| libsvml | Small vector math library: Used by vectorizer to replace math calls in vectorized loops | Optimizations still<br>on-going |
| MKL-VML | Vector math library component of MKL                                                    |                                 |



## Sample Performance Data for SVML Double Precision – Cycles per Element

| Routine | Sandy Bridge HA | Sandy Bridge LA | Sandy Bridge EP | Haswell HA | Haswell LA | Haswell EP |
|---------|-----------------|-----------------|-----------------|------------|------------|------------|
| sqrt    | 10.51           | 10.51           | 10.51           | 7.08       | 7.08       | 7.08       |
| exp     | 11.20           | 8.48            | 8.06            | 7.12       | 5.13       | 4.62       |
| sin     | 16.91           | 16.11           | 8.21            | 11.15      | 6.95       | 4.03       |

HA, LA, and EP denote the high-accuracy, low-accuracy, and enhanced-performance variants.

See here for complete data of MKL 11.2 comparing VML execution on Haswell (desktop processor), Westmere and Sandy Bridge EP: <https://software.intel.com/sites/products/documentation/doclib/mkl_sa/112/vml/functions/_performanceall.html>

The VML code is similar to SVML, but loop unrolling etc. accelerates the computation by working on multiple (vector) computations simultaneously





## Using Intel MPI for Performance

## Use Best Possible Communication Fabric

| Supported<br>I_MPI_FABRICS | Description                                                                                 |
|----------------------------|---------------------------------------------------------------------------------------------|
| shm                        | Shared-memory only; intra-node default                                                      |
| tcp                        | TCP/IP-capable network fabrics, such as<br>Ethernet and InfiniBand* (through IPoIB*)        |
| dapl                       | DAPL-capable network fabrics, such as<br>InfiniBand*, iWarp*, and XPMEM* (through<br>DAPL*) |
| ofa                        | OFA-capable network fabric including<br>InfiniBand* (through OFED* verbs)                   |
| tmi                        | TMI-capable network fabrics including Qlogic*,<br>Myrinet* (through Tag Matching Interface) |

Intel MPI will select the fastest available fabric by default (shared memory within a node and InfiniBand\* across nodes, i.e., shm:dapl)

If using the OpenFabrics Enterprise Distribution (OFED\*) software stack, select shm:ofa




## Disable Fallback for Benchmarking (and Production)

The Intel MPI Library falls back from the 'dapl' or 'shm:dapl' fabric to 'tcp' or 'shm:tcp' if DAPL provider initialization fails

Set I\_MPI\_FALLBACK to 'disable' to be sure that the required fast fabric is working

- Fallback is disabled by default if I\_MPI\_FABRICS is set

Same result is achieved with the command line option:

```
$ mpirun -genv I_MPI_FALLBACK 0 ...
```



## Use Connectionless Communication

The connectionless feature works for the 'dapl' and 'tmi' fabrics only

Provides better scalability

Significantly reduces memory requirements by reducing the number of receive queues

Generally advised for large jobs

\$ export I\_MPI\_FABRICS=shm:dapl
\$ export I\_MPI\_DAPL\_UD=enable



## Use lightweight statistics

- Set I\_MPI\_STATS to a non-zero integer value to gather MPI communication statistics (max value is 10)
- Manipulate the results with I\_MPI\_STATS\_SCOPE to increase effectiveness of the analysis
- Example below: Gromacs, rank 0, with the suggested values
- Suggested values:
  - \$ export I\_MPI\_STATS=3
  - \$ export I\_MPI\_STATS\_SCOPE=coll

| Collectives<br>Operation | Context | Algo | Comm size | Message size | Calls | Cost(%) |
|--------------------------|---------|------|-----------|--------------|-------|---------|
| Allreduce                |         |      |           |              |       |         |
| 1                        | 58      | 1    | 4         | 24           | 1     | 0.00    |
| 2                        | 58      | 1    | 4         | 4            | 8     | 0.00    |
| 3                        | 58      | 1    | 4         | 8            | 12    | 0.03    |
| 4                        | 58      | 1    | 4         | 1376         | 181   | 0.04    |
| 5                        | 58      |      | 4         | 1344         | 19    | 0.01    |
| 6                        | 58      |      | 4         | 1216         |       |         |
| 7                        | 58      | 1    |           | 224          | 1     | 0.00    |
| 8                        | 58      | 5    | 192       | 224          | 2     | 0.00    |
| 9                        | 0       | 5    | 192       | 968          | 1     | 0.00    |
| 10                       | 0       | 5    | 192       | 288          | 2     | 0.01    |
| 11                       | 0       | 5    | 192       | 768          | 2     | 0.00    |
| Barrier                  |         |      |           |              |       |         |
| 1                        | 62      | 5    | 160       | 0            | 1     | 0.00    |
| 2                        | 0       | 5    | 192       | 0            | 1     | 0.00    |
| Bcast                    |         |      |           |              |       |         |
| ...                      |         |      |           |              |       |         |
| Gather                   |         |      |           |              |       |         |
| 1                        | 52      | 3    |           | 32           | 25    | 0.01    |
| 2                        | 54      | 3    |           | 36           | 25    | 0.00    |
| 3                        | 56      | 3    |           | 28           | 25    | 0.01    |
| Reduce                   |         |      |           |              |       |         |
| 1                        | 60      | 1    | 40        | 24           | 1     | 0.00    |
| 2                        | 60      | 1    | 40        | 4            | 8     | 0.00    |
| 3                        | 60      | 1    | 40        | 8            | 12    | 0.00    |
| 4                        | 60      | 1    | 40        | 1376         | 181   | 0.21    |
| 5                        | 60      | 1    | 40        | 1344         | 19    | 0.03    |
| 6                        | 60      | 1    | 40        | 1216         | 1     | 0.00    |
| 7                        | 60      | 1    | 40        | 224          | 1     | 0.00    |
| Scatter                  |         |      |           |              |       |         |
| 1                        | 62      | 1    | 160       | 8            | 1     | 0.00    |
| Scatterv                 |         |      |           |              |       |         |
| 1                        | 62      | 1    | 160       | 315840       | 2     | 0.03    |
| 2                        | 62      | 1    | 160       | 52640        | 1     | 0.08    |



## Choose the best collective algorithm

Use one of the `I_MPI_ADJUST_<opname>` knobs to change the algorithm

**Recommendations:** 

- Focus on the most critical collective operations (see stats output)
- Run the Intel MPI Benchmarks by selecting various algorithms to find out the right protocol switchover points for hot collective operations
- ... or use the mpitune tool

`$ mpirun -genv I_MPI_ADJUST_REDUCE <algorithm #> ...`



## Select Proper Process Layout

Default process layout is that all physical cores will be used

If running hybrid applications, you might want to reduce the number of ranks/node

Set I\_MPI\_PERHOST or use the -perhost (/-ppn) option to override the default process layout:

`$ mpirun -ppn <#processes per node> -n <#processes> ...`

Same can be achieved using a "machinefile"

On batch scheduler environments, the Intel MPI library respects the scheduler settings

To override the batch scheduler settings (at your own risk ☺):

\$ export I\_MPI\_JOB\_RESPECT\_PROCESS\_PLACEMENT=0



## Use Proper Process Pinning

Default pinning options are suitable for most cases

Use I\_MPI\_PIN\_PROCESSOR\_LIST to define a custom mapping of MPI processes to CPU cores

The 'cpuinfo' utility of the Intel MPI Library shows the processor topology

Placing the processes on physical cores:

\$ export I\_MPI\_PIN\_PROCESSOR\_LIST=allcores

To avoid sharing of common resources by adjacent MPI processes, use the "map=scatter" setting:

\$ export I\_MPI\_PIN\_PROCESSOR\_LIST=allcores,map=scatter

Choose to share resources by setting "map=bunch":

\$ export I\_MPI\_PIN\_PROCESSOR\_LIST=allcores,map=bunch



## Use Proper Hybrid Process Pinning

Link with thread safe library (-qopenmp / -mt\_mpi)

Choose MPI threading model (SINGLE / FUNNELED / SERIALIZED / MULTIPLE ) - either using MPI\_Init\_thread(...) or env. var. I\_MPI\_THREAD\_LEVEL\_DEFAULT

\$ export I\_MPI\_THREAD\_LEVEL\_DEFAULT=SINGLE

Choose distribution of MPI ranks & threads – (ranks x threads = cores)

`$ mpirun -n <#ranks> -genv OMP_NUM_THREADS <#threads>`

Pin MPI ranks using I\_MPI\_PIN\_DOMAIN (e.g., "omp" sizes each domain according to the number of OpenMP threads):

\$ export I\_MPI\_PIN\_DOMAIN=omp

Pin threads, e.g., with KMP\_AFFINITY:

\$ export KMP\_AFFINITY=compact

If you want a nicer and more portable syntax, use OpenMP places introduced with OpenMP 4.0



## Adjust the eager / rendezvous protocol threshold

Two communication protocols:

"Eager" sends data immediately, regardless of whether a receive request is available, and is used for short messages

"Rendezvous" notifies the receiving side that data is pending and transfers it once the receive request is posted

`$ export I_MPI_EAGER_THRESHOLD=<#bytes>`
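The role of the threshold can be pictured with a toy model. This is a conceptual sketch only, not Intel MPI code: the type `protocol_t` and the function `choose_protocol` are hypothetical, and the assumption that messages of exactly the threshold size still go out eagerly should be checked against the library reference.

```c
#include <stddef.h>

typedef enum { PROTO_EAGER, PROTO_RENDEZVOUS } protocol_t;

/* Toy model of the protocol switch: messages of up to `threshold` bytes
 * are sent eagerly, larger ones wait for the receiver (rendezvous).
 * Assumption: the comparison is "<=". */
protocol_t choose_protocol(size_t message_bytes, size_t threshold)
{
    return (message_bytes <= threshold) ? PROTO_EAGER : PROTO_RENDEZVOUS;
}
```

Raising the threshold moves more message sizes to the eager path, which saves the handshake but requires buffer space on the receiving side for messages that arrive before the receive is posted.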





## Using the MPI Performance Snapshot Tool

1. Install Intel<sup>®</sup> Trace Analyzer and Collector
2. Set up your environment

\$ source /opt/intel/itac/9.1/bin/mpi\_perf\_snapshot\_vars.sh

3. Run with the MPI Performance Snapshot enabled

\$ mpirun -mps -n 1024 ./exe

4. Analyze your results

\$ mpi\_perf\_snapshot ./stats.txt ./app\_stat.txt



## Focus on Memory & Counter Usage

The new collector displays summary info, including HW counters and memory usage, immediately after the end of the application run:

| Metric                       | Value       |
|------------------------------|-------------|
| WallClock (all processes)    | 284.274 sec |
| WallClock MIN (rank 0)       | 31.998 sec  |
| WallClock MAX (rank 7)       | 35.534 sec  |
| GFlops                       | 9.563       |
| MPI                          | 11.28%      |
| NON_MPI                      | 88.72%      |
| Floating-point instructions  | 45.77%      |
| Vectorized DP instructions   | 24.69%      |
| Memory access instructions   | 42.35%      |
| Memory usage (all processes) | 256.740 MB  |
| Memory usage MIN (process 7) | 30.608 MB   |
| Memory usage MAX (process 1) | 33.136 MB   |




## Find your MPI & OpenMP Imbalance hotspots

| Metric               | Time        | Share  | Scope                       |
|----------------------|-------------|--------|-----------------------------|
| MPI Imbalance        | 207.847 sec | 73.12% | All processes               |
| MPI Imbalance MIN    | 23.044 sec  | 64.85% | rank 6                      |
| MPI Imbalance MAX    | 30.113 sec  | 88.57% | rank 1                      |
| OpenMP Regions       | 228.631 sec | 80.43% | 56 region(s), all processes |
| OpenMP Regions MIN   | 25.348 sec  | 71.33% | 7 region(s), rank 7         |
| OpenMP Regions MAX   | 33.124 sec  | 97.42% | 7 region(s), rank 1         |
| OpenMP Imbalance     | 103.924 sec | 36.56% | All processes               |
| OpenMP Imbalance MIN | 11.522 sec  | 32.43% | rank 3                      |
| OpenMP Imbalance MAX | 15.057 sec  | 44.29% | rank 2                      |



# Easy-to-read HTML output helps you categorize performance issues

### MPI Performance Snapshot Summary

- Application: ./hybrid
- Ranks: 4
- Used statistics: pcs\_r4\_f020.txt

### Overview

### Performance by Metric

- WallClock time: 30.00 sec. Total application lifetime. The time is the elapsed time for the slowest process. This metric is the sum of the MPI Time and the Computation Time below.
- MPI Time: 14.70 sec (48.99%). Time spent inside the MPI library. High values are usually bad. This value is HIGH. The application is communication-bound.
- MPI Imbalance: 14.69 sec (48.98%). Mean unproductive wait time per process spent in MPI library calls when a process is waiting for data. This time is part of the MPI Time above. High values are usually bad. This value is HIGH. The application workload is NOT well balanced between MPI ranks.
- Computation Time: 15.30 sec. Mean time per process spent in the application code. This is the sum of the OpenMP Time and the Serial Time. High values are usually good. This value is AVERAGE. The application is computation-bound.
- OpenMP Time: 14.90 sec. Mean time per process spent in the OpenMP parallel regions. High values are usually good and indicate that the application is well threaded. This value is AVERAGE.
- OpenMP Imbalance: 7.37 sec (24.57%). Mean unproductive wait time per process spent in OpenMP parallel regions (normally at synchronization barriers). High values are usually bad. This value is HIGH. The application's OpenMP work sharing is NOT well load-balanced.
- Serial Time: 0.40 sec. Mean application time per process spent outside OpenMP parallel regions. High values may be good or bad depending on the application algorithm. This value is NEGLIGIBLE. This application is well parallelized via OpenMP directives.

### Memory Usage

- Peak memory consumption (rank 1): 0.79 MB
- Mean memory consumption: 0.75 MB. Per-process memory usage affects the application scalability.

### Application Analysis to Guide Development Efforts

## Full MPI Profiling via Intel® Trace Analyzer and Collector




Imbalance Diagram



The "Haswell" microarchitecture makes several performance improvements

SIMD-parallel programming is key to performance

Use implicit or explicit SIMD coding to exploit SIMD units

Tune MPI for optimal performance

Use MPI Performance Snapshot or ITAC to find MPI bottlenecks




