

# Software Tools for Software Developers and Programming Models

James Reinders
Director, Evangelist, Intel Software
james.r.reinders@intel.com







- "Port of Choice"
  - Help IA continue as the biggest and best eco-system, a benefit for owners of IA as well as for software developers
  - This means: we want everything to run on IA, and run well
- Support for open standards
  - Leadership in compliance, participation and strength of implementation
- Add value, do not detract from value
  - Provide Intel solutions and leadership where we believe we have unique technology and/or value to the industry





Picture credit: wikimedia.org



- Tools NEW announcement today of Intel® Parallel Studio XE 2011 SP1
- Support for standards
  - For instance: radix 10 floating point support
- Tackling the TOUGH issues for parallelism
  - High scalability to HUGE machines
  - Programming models that scale forward






## Intel® Parallel Studio Philosophy



- All-in-one toolset for the software development lifecycle
- Multiplatform
- Standards based

intel.com/go/parallel



#### Intel® Parallel Studio XE 2011 Service Pack 1

Intel continues to be the best choice for C/C++/Fortran development tools

#### Performance

- Updated compilers and libraries produce industry leading performance.
  - Up to 47% faster for C/C++ compiler, or more?
  - Up to 24% faster for Fortran compiler, or more?
- Intel C++ Compiler 12.1 is first compiler for IA to support IEEE 754-2008 radix-10 and the related C++ TR 24732. And... High performance!
- The most popular Analysis Tools<sup>1</sup> just got better

#### Forward scaling

- Intel® Threading Building Blocks 4.0, commercially supported. Code using TBB scales exceptionally well.
- Intel® Cilk™ Plus v1.1 implemented with commercial support; simplifies going parallel
- Advanced tools to develop code for Intel® Xeon® Processors (today), easily extends to Intel® MIC architecture (future)
- Tools that developers count on
  - Expanded standards support
    - OpenMP\* 3.1
    - Leading support for key parts of the latest Fortran and C++ standards
  - Enhanced compatibility
    - Visual Studio\* 2010 Shell for Visual Fortran\*





#### Updated compilers and libraries produce industry-leading performance

- Intel v12.1 compilers improve performance compared with:
  - Competitive compilers
  - Previous version Intel compilers

| Benchmark | Intel v12.1 compiler on Windows\* vs. nearest competitor | Intel v12.1 compiler on Linux\* vs. nearest competitor | Intel v12.1 compiler on Windows vs. v12.0 | Intel v12.1 compiler on Linux vs. v12.0 |
|-----------|------------|------------|------------|------------|
| C/C++ Integer <sup>1</sup> | 47% faster | 12% faster | 11% faster | 6% faster |
| C/C++ Floating Point <sup>1</sup> | 21% faster | 9% faster | 3% faster | 1% faster |
| Fortran <sup>2</sup> | 24% faster | 17% faster | 22% faster | 27% faster |

#### Notes:

<sup>1</sup>C/C++ performance measured using SPECint®\_base2006 estimated RATE benchmark running on a 64 bit operating system

<sup>2</sup> Fortran performance measured using Polyhedron\* benchmark running on a 64 bit operating system. In this performance measurement, "faster" refers to percent reduction in time-to-completion.
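The definition of "faster" in note 2 (percent reduction in time-to-completion) can be made concrete with a small helper. This is a minimal sketch, not Intel's measurement harness; the function name `percent_faster` and the sample timings mentioned afterwards are hypothetical.

```cpp
#include <cassert>

// Percent reduction in time-to-completion, the sense of "faster" used in note 2.
// t_base and t_new are hypothetical wall-clock times in the same unit.
double percent_faster(double t_base, double t_new) {
    assert(t_base > 0.0);  // a baseline time must be positive
    return (t_base - t_new) / t_base * 100.0;
}
```

By this definition, a run that drops from 100 s to 76 s is 24% faster, which corresponds to a speedup of roughly 1.32x. The two figures are easy to confuse, so it helps to state which one a table reports.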



Configuration Info - SW Versions: Intel® C/C++ version 12.1; Hardware: Intel® Xeon® CPU X5670 @ 2.93GHz, 2x2.93GHz, RAM 48GB, cache 12288KB; Operating System: Windows 2008 x64 SP2; Benchmark Source: Intel Corp.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, refer to www.intel.com/performance/resources/benchmark\_limitations.htm. \* Other brands and names are the property of their respective owners.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

#### Industry Leading Performance using the Intel® Fortran Compiler

Chart: Intel® Core™ i7 Processor running on Windows\* 64 (lower is better)








Advanced tools to develop code for Intel® Xeon® Processors today that easily extends to Intel® MIC





"By just utilizing standard programming on both Intel® Xeon processor and Intel® MIC architecture based platforms, the performance met multi-threading scalability expectations and we observed near-theoretical linear performance scaling with the number of threads." – Hongsuk Yi, Heterogeneous Computing Team Leader, KISTI Supercomputing Center



"SGI understands the significance of interprocessor communications, power, density and usability when architecting for exascale. Intel has made the leap towards exaflop computing with the introduction of Intel® Many Integrated Core (MIC) architecture. Future Intel® MIC products will satisfy all four of these priorities, especially with their expected ten times increase in compute density coupled with their familiar X86 programming environment." – Dr. Eng Lim Goh, SGI CTO



# Intel® Threading Building Blocks 4.0, commercially supported. Code using TBB scales exceptionally well

- Flow Graph API
  - Extends applicability of Intel® TBB to event-driven/reactive programming models
- Concurrent Unordered Set
  - Thread-safe container to store and access user objects
- Memory Pools
  - Enables greater flexibility and performance through thread-safe and scalable object allocation
- Generic GCC\* Atomics Support
  - Library portability enables development of Intel® TBB-based solutions on a broader range of platforms



Configuration Info - SW Versions: Intel® C++ Intel® 64 Compiler, Version 12.1, Intel® Threading Building Blocks 4.0; Hardware: 4 x Intel® Xeon® CPU E7-4850 @ 2.27GHz (40 cores), 256GB Main Memory; Operating System: Linux, Red Hat\* Enterprise Server\* release 5.4, kernel 2.6.18-194.11.4.el5; Benchmark Source: Intel Corp.






# Intel® Cilk™ Plus v1.1 implemented with commercial support; simplifies going parallel

- Enhanced performance and utilization of future Intel CPU features
- SIMD pragma loops, vector length, and elemental functions support
- Mac OS\* X support

Parallel loops:

```
cilk_for (int i=0; i<n; ++i) {
   Foo(a[i]);
}
```
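For readers without a Cilk Plus compiler, the idea behind `cilk_for` can be approximated in standard C++. The following is a minimal sketch under stated assumptions: `parallel_for_sketch` and its `workers` parameter are names invented here, and the range is split into fixed chunks, whereas the Cilk Plus runtime balances load dynamically through work stealing. Iterations must be independent of one another in both cases.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: call body(i) for every i in [0, n), splitting the
// range into contiguous chunks with at most one chunk per worker thread.
template <typename Body>
void parallel_for_sketch(std::size_t n, unsigned workers, Body body) {
    assert(workers > 0);
    std::vector<std::thread> pool;
    const std::size_t chunk = (n + workers - 1) / workers;  // ceiling division
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(n, w * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // no iterations left to hand out
        pool.emplace_back([=] {
            for (std::size_t i = begin; i < end; ++i) body(i);
        });
    }
    for (auto& t : pool) t.join();  // wait for every chunk to finish
}
```

A call such as `parallel_for_sketch(a.size(), 4, [&a](std::size_t i) { Foo(a[i]); });` plays the role of the `cilk_for` loop above, assuming `Foo` is safe to call concurrently on distinct elements.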

Turn serial code:

```
int fib(int n)
{
    if (n <= 2)
        return n;
    else {
        int x, y;
        x = fib(n-1);
        y = fib(n-2);
        return x+y;
    }
}
```

Into parallel code:

```
int fib(int n)
{
    if (n <= 2)
        return n;
    else {
        int x, y;
        x = cilk_spawn fib(n-1);
        y = fib(n-2);
        cilk_sync;
        return x+y;
    }
}
```
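The spawn/sync pattern above is a fork-join: one branch may run in parallel while the caller continues, and the sync waits for it. A rough standard C++ analog, offered as a sketch rather than as how Cilk Plus is implemented, uses `std::async` for the spawn and `future::get` for the sync. The names `fib_serial`, `fib_async` and the `depth` cutoff are assumptions introduced here; the cutoff exists only because each `std::async` call may create a thread, which is far heavier than a Cilk spawn.

```cpp
#include <cassert>
#include <future>

// Serial version with the same recurrence as the slide (fib(1)=1, fib(2)=2).
int fib_serial(int n) {
    assert(n >= 0);
    if (n <= 2) return n;
    return fib_serial(n - 1) + fib_serial(n - 2);
}

// Fork-join version. `depth` is a hypothetical cutoff that limits how many
// levels of the recursion launch new asynchronous tasks.
int fib_async(int n, int depth = 3) {
    if (n <= 2) return n;
    if (depth <= 0) return fib_serial(n);
    // Fork: this branch may run on another thread (compare cilk_spawn).
    auto x = std::async(std::launch::async, fib_async, n - 1, depth - 1);
    int y = fib_async(n - 2, depth - 1);
    // Join: wait for the forked branch before combining (compare cilk_sync).
    return x.get() + y;
}
```

Both functions return the same value for the same `n`; only the scheduling differs.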

Open spec at: cilkplus.org



### Pricing and availability



| Product                                                        | Includes C/C++ compiler | Includes Fortran compiler | Price for Linux\* | Price for Windows\* |
|----------------------------------------------------------------|-------------------------|---------------------------|-------------------|---------------------|
| Intel® Parallel Studio XE 2011 SP1                             | •                       | •                         | \$2249            | \$1899              |
| Intel® C++ Studio XE 2011 SP1                                  | •                       |                           | \$1499            | \$1499              |
| Intel® Fortran Studio XE 2011 SP1                              |                         | •                         | \$1799            | \$1599              |
| Intel® Visual Fortran Composer XE 2011 with IMSL\* for Windows\* |                         | •                         | NA                | \$1699              |

Additional configurations including floating and academic are available at www.intel.com/software/products



#### Support for standards

OpenMP\* 3.1, C++11, Fortran 2003, Fortran 2008, LAPACK for C, OpenCL\*, IEEE FP...



# IEEE 754-2008 and ISO/IEC TR 24732:2009

0.1 (decimal)

0.0001100110011001100110011... (binary)



### Example from the CASE FILE for "Floating-point disasters"

Patriot missile accident. On February 25, 1991, an American Patriot missile battery failed to track and intercept an incoming Iraqi Scud missile. The Scud struck an Army barracks, killing 28 Americans. The cause was later determined to be an inaccurate time calculation: the system counted time in tenths of a second, and 0.1 cannot be represented exactly in binary (single-precision floating point). The error accumulated over about 100 hours of operation before the incident.
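The underlying effect is easy to reproduce. The sketch below is not a model of the Patriot software; it only illustrates how a clock that repeatedly adds a binary approximation of 0.1 drifts away from one that counts tenths exactly and converts once. The function names and the tick limit are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Elapsed seconds after `ticks` clock ticks of 0.1 s, accumulated in binary
// single precision. Every addition rounds, so the error builds up.
float elapsed_float(std::uint32_t ticks) {
    assert(ticks <= 4000000u);  // arbitrary guard to keep the demo loop short
    float t = 0.0f;
    for (std::uint32_t i = 0; i < ticks; ++i) t += 0.1f;
    return t;
}

// Same clock kept as an exact integer count of tenths, converted once.
double elapsed_exact(std::uint32_t ticks) {
    return static_cast<double>(ticks) / 10.0;
}

// Difference between the two clocks, in seconds.
double drift_seconds(std::uint32_t ticks) {
    return static_cast<double>(elapsed_float(ticks)) - elapsed_exact(ticks);
}
```

Calling `drift_seconds(3600000)` covers 100 hours of tenth-of-a-second ticks and yields a drift far larger than one second on IEEE 754 hardware. A radix-10 type, as defined by IEEE 754-2008, represents 0.1 exactly and avoids this class of error.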



Photo credit: U.S. Dept. of Defense (http://www.defense.gov/photos/newsphoto.aspx?newsphotoid=685) Story credit: Federation of American Scientists (http://www.fas.org/spp/starwars/gao/im92026.htm)








#### Driving Scalability in Intel MPI

New v4 architecture leads to very high scalability

- Fast startup and shutdown of large runs
- Reduced memory footprint
- Dynamic, progressive "connections"
- Remains binary compatible
- Maintains network independence
- Forward-looking focus:
  - Extreme scalability
  - Performance in every dimension
  - Tracking emerging standards








#### **Scaling Programmability**



Standard Programming Models Democratize Usage ... Avoid Costly Detours






## There are many parallel programming models for C, C++ and Fortran.

--- support all established standards ---

**Intel® Parallel Building Blocks...**

- Intel® Cilk™ Plus: language extension to simplify task, data and vector parallelism.
- Intel® Threading Building Blocks: widely used C++ template library for data and task parallelism.

**Domain Specific Libraries**

- Intel® Integrated Performance Primitives
- Intel® Math Kernel Library

**Established Standards**

- Message Passing Interface (MPI)
- OpenMP\*
- Coarray Fortran
- OpenCL\*

**Exploration**

- Intel® Concurrent Collections
- Offload Extensions
- Intel Array Building Blocks


## Intel® Threading Building Blocks (TBB)

- Outfits C++ for parallelism
- More popular than any other abstraction for parallelism
- Created by Intel
- Open Specification
- Open Source
- Adopted by industry
- Supported by community



### Intel® Cilk™ Plus

- Augments TBB three ways:
  - 1. Addresses needs of C programmers (and C++)
  - 2. Compiler can help, because keywords used
  - 3. Data parallelism is made explicit (important!)
- Created by Intel
- Open Specification
- Open Source (NEW)

Simplicity of only 3 new keywords is surprisingly powerful.

# We are really onto something here (again)! Watch: Cilk™ Plus










### Legal Disclaimer

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, refer to www.intel.com/software/products.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.

\*Other names and brands may be claimed as the property of others.

Copyright © 2011. Intel Corporation.



#### **Optimization Notice**


Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804





