How to Write Your Own Blazingly Fast Library of Special Functions for Intel Xeon Phi Coprocessors

by Vadim Karpusenko 3. May 2013 17:56

Complete paper: Colfax_Static_Libraries_Xeon_Phi.pdf (425.6 kb)

Statically-linked libraries are used in business and academia for security, encapsulation, and convenience reasons. Static libraries with functions offloadable to Intel Xeon Phi coprocessors must contain executable code for both the host and the coprocessor architecture. Furthermore, for library functions used in data-parallel contexts, vectorized versions of the functions must be produced at the compilation stage.

This white paper shows how to design and build statically-linked libraries with functions offloadable to Intel Xeon Phi coprocessors. In addition, it illustrates how special functions with scalar syntax (e.g., y=f(x)) can be implemented in such a way that user applications can use them in thread- and data-parallel contexts. The second part of the paper demonstrates some optimization methods that improve the performance of functions with scalar syntax on the multi-core and the many-core platforms: precision control, strength reduction, and algorithmic optimizations.
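As a minimal sketch of a function with scalar syntax that can be called in a data-parallel loop, consider the C fragment below. The function `exp_taylor5` and its array wrapper are hypothetical illustrations, not code from the paper. The OpenMP `declare simd` pragma is used here as a portable stand-in for whatever annotations and build flags the paper actually uses. The example also shows strength reduction (Horner's scheme instead of calls to a power function) and precision control (single precision literals throughout).

```c
#include <stddef.h>

/* Hypothetical special function with scalar syntax, y = f(x):
   the degree-5 Taylor polynomial of exp(x) around x = 0.
   "declare simd" asks an OpenMP 4.0+ compiler to also generate a
   vectorized version; compilers without OpenMP ignore the pragma. */
#pragma omp declare simd notinbranch
float exp_taylor5(float x)
{
    /* Reciprocal factorials 1/2!, 1/3!, 1/4!, 1/5!; the divisions are
       between constants, so the compiler folds them at compile time. */
    const float c2 = 0.5f;
    const float c3 = 1.0f / 6.0f;
    const float c4 = 1.0f / 24.0f;
    const float c5 = 1.0f / 120.0f;

    /* Horner's scheme: five multiplications and five additions,
       with no explicit powers of x. */
    return 1.0f + x * (1.0f + x * (c2 + x * (c3 + x * (c4 + x * c5))));
}

/* Data-parallel use of the scalar function on an array. */
void exp_taylor5_array(const float *x, float *y, size_t n)
{
#pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = exp_taylor5(x[i]);
}
```

A user application would call `exp_taylor5(x)` for a single value or `exp_taylor5_array(x, y, n)` for an array. The polynomial is only accurate near zero; it is chosen for brevity, not as a recommended approximation.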


Test-driving Intel® Xeon Phi™ coprocessors with a basic N-body simulation

by Andrey Vladimirov 7. January 2013 18:37

Complete paper:  Colfax_Nbody_Xeon_Phi.pdf (1.21 mb)

Addendum (correction):  Colfax_Nbody_Xeon_Phi-addendum.pdf (509.35 kb)

Intel® Xeon Phi™ coprocessors are capable of delivering more performance and better energy efficiency than Intel® Xeon® processors for certain parallel applications. In this paper, we investigate the porting and optimization of a test problem for the Intel Xeon Phi coprocessor. The test problem is a basic N-body simulation, which is the foundation of a number of applications in computational astrophysics and biophysics. Using common code in the C language for the host processor and for the coprocessor, we benchmark the N-body simulation. The simulation runs 2.3x to 5.4x faster on a single Intel Xeon Phi coprocessor than on two Intel Xeon E5 series processors. The performance depends on the accuracy settings for transcendental arithmetic. We also study the assembly code produced by the compiler from the C code. This allows us to pinpoint some strategies for designing C/C++ programs that result in efficient automatically vectorized applications for Intel Xeon family devices.

Figures: C Code, Assembly Listing, Performance Chart

The visualization shown below demonstrates the results and the performance of the N-body simulation on Intel Xeon processors and Intel Xeon Phi coprocessors. The code running the visualization has the same force calculation algorithm as the code presented in the paper.

CORRECTION

Thanks to Georg Hager for pointing out the missing compiler argument -xAVX for the host version of the code! The corrected result is reported in the addendum (509 kb). The performance with -xhost (equivalent to -xAVX on our system) is shown in the last set of bars in the plot below.

correction-plot


Arithmetics on Intel’s Sandy Bridge and Westmere CPUs: not all FLOPs are created equal

by Andrey Vladimirov 30. April 2012 19:47

Complete paper:  Colfax_FLOPS.pdf (195.45 kb)

This paper presents a new arithmetic efficiency benchmark and uses it to compare the performance of the Intel Sandy Bridge E5-2680 CPU to that of the Intel Westmere X5690 CPU. The efficiency is measured for single and double precision floating point operations (addition, multiplication, division, square root and the exponential function) and for 32- and 64-bit integer operations (addition, multiplication and division). The SSE2 and AVX instruction sets, as well as scalar operations, are covered in single-threaded and multi-threaded modes. This benchmark eliminates the effects of memory bandwidth and latency by fitting the calculation in the L1 cache. The bandwidths of the L1 cache and main memory (RAM) are estimated for reference, and the LINPACK benchmark result is reported.
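The idea of keeping the working set in the L1 cache can be sketched in C as follows. This is not the benchmark from the paper: the names `ARRAY_LEN`, `time_additions`, and `gflops` are hypothetical, the array size is an assumed value chosen to fit comfortably in a 32 KiB L1 data cache, and a real benchmark would need additional care (timer resolution, threading, and preventing the compiler from simplifying the loops).

```c
#include <stddef.h>
#include <time.h>

/* 1024 floats = 4 KiB, well below a 32 KiB L1 data cache (assumed size). */
enum { ARRAY_LEN = 1024 };

/* Add `inc` to every element of a small array `passes` times and return
   the elapsed CPU time in seconds. Because the array stays resident in
   L1, the timing is dominated by arithmetic rather than memory traffic. */
double time_additions(float *a, float inc, size_t passes)
{
    const clock_t start = clock();

    for (size_t p = 0; p < passes; p++)
        for (size_t i = 0; i < ARRAY_LEN; i++)
            a[i] += inc;

    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Convert a timing into billions of additions per second.
   The caller must make sure `seconds` is greater than zero. */
double gflops(size_t passes, double seconds)
{
    return (double)passes * ARRAY_LEN / seconds * 1e-9;
}
```

Swapping the addition for a multiplication, division, or square root gives the corresponding per-operation measurement. The number of passes should be large enough that the elapsed time is well above the timer resolution.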

Results show that the E5-2680 CPU performs floating point addition and multiplication dramatically faster (up to 2.6x) than the X5690 model. However, floating point division and square root are the new model’s weak spots. With AVX, floating point addition and multiplication are up to 2.0x faster than with SSE2; however, AVX provides no performance gain for division and square root. 32-bit integer arithmetic operations, despite the lack of AVX integer intrinsics, are up to 3.5x faster on the E5-2680. At the same time, the Sandy Bridge CPU showed 1.15x better L1 cache performance and 2.4x greater memory bandwidth than the Westmere model.

These results lead to the conclusion that the edge of the 8-core, 2.70 GHz Sandy Bridge CPU over the 6-core, 3.46 GHz Westmere processor will be most significant in both single and double precision for linear algebra and other tasks based on addition and multiplication. Re-compilation of codes performing addition and multiplication-based tasks with AVX intrinsics instead of SSE2 should lead to additional performance benefits on Sandy Bridge. However, CPU-bound calculations heavily using the division operation and transcendental functions are likely to experience a smaller speedup from using the Sandy Bridge processor in place of Westmere. Likewise, they will benefit less from the migration from SSE2 to AVX.

ADDENDUM

1. Note that pipelining effects come into play when arithmetic operations are combined in the same code. For instance, better performance may be obtained when additions are alternated with multiplications, as opposed to code that performs only additions or only multiplications. See the follow-up article about this effect at http://research.colfaxinternational.com/post/2012/07/31/CPI.aspx.

2. The Linpack benchmark result reported in the paper "Arithmetics on Intel's Sandy Bridge..." was obtained using the precompiled binaries optimized for the Xeon 64-bit architecture and employing the Intel OpenMP library for shared-memory parallelization. These results are sub-optimal for this system. Running the MPI-based benchmark yielded a higher Linpack score for the dual-socket E5-2680 CPU system: 292 GFLOP/s. The key parameters of this benchmark are: 32 processes, N=39936, NB=112, PMAP=0, P=4, Q=8. Even higher scores may be possible; see Intel's publication on this subject.

Auto-Vectorization with the Intel Compilers: is Your Code Ready for Sandy Bridge and Knights Corner?

by Andrey Vladimirov 12. March 2012 13:01

Complete paper:  Colfax_Sandy_Bridge_AVX.pdf (632.23 kb)

One of the features of Intel’s Sandy Bridge-E processor released this month is the support for the Advanced Vector Extensions (AVX) instruction set. Codes suitable for efficient auto-vectorization by the compiler will be able to take advantage of AVX without any code modification, with only re-compilation.

This paper explains the guidelines for code design suitable for auto-vectorization by the compiler (elimination of vector dependence, implementation of unit-stride data access and proper address alignment) and walks the reader through a practical example of code development with auto-vectorization. The resulting code is compiled and executed on two computer systems: a Westmere CPU-based system with SSE 4.2 support, and a Sandy Bridge-based system with AVX support. The benefit of vectorization is more significant in the AVX version, provided that the code is designed efficiently. An ‘elegant’ but inefficient solution is also provided and discussed.
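The three guidelines named above can be illustrated with a short C sketch. It is not the example developed in the paper: the helper `alloc_aligned_floats`, the kernel `scale_and_add`, and the 32-byte alignment value are assumptions chosen for illustration (32 bytes matches the width of a 256-bit AVX register).

```c
#include <stddef.h>
#include <stdlib.h>

/* 32 bytes = 256 bits, the width of one AVX register. */
enum { ALIGNMENT = 32 };

/* Proper address alignment: allocate n floats on a 32-byte boundary.
   C11 aligned_alloc() expects the size to be a multiple of the
   alignment, so the byte count is rounded up. Release with free(). */
float *alloc_aligned_floats(size_t n)
{
    size_t bytes = n * sizeof(float);
    bytes = (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    return aligned_alloc(ALIGNMENT, bytes);
}

/* No vector dependence: `restrict` promises the compiler that x and y
   do not overlap, and each iteration touches only its own element.
   Unit-stride access: consecutive iterations read and write
   consecutive memory locations. */
void scale_and_add(size_t n, float a,
                   const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

A loop such as `y[i] = y[i - 1] + x[i]` would, by contrast, carry a dependence from one iteration to the next and could not be vectorized in this straightforward way. Whether a given loop was vectorized is best confirmed from the compiler's vectorization report rather than assumed.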

In addition, the paper provides a comparative benchmark of the Sandy Bridge and Westmere systems, based on the discussed algorithm. Implications of auto-vectorization methods for Intel’s future Many Integrated Core technology based on the Knights Corner chip are discussed at the end.


About Colfax Research

Colfax International provides an arsenal of novel computational tools, which need to be leveraged in order to harness their full power. We are collaborating with researchers in science and industry, including our customers, to produce case studies and white papers and to develop a broad knowledge base of the applications of current and future computational technologies.

This blog will contain a variety of information, from hardware benchmarks and HPC news highlights to discussions of programming issues and reports on research projects carried out in our collaborations. In addition to our in-house research, we will present contributions from authors in academia, industry and finance, as well as software developers. Our hope is that this information will be useful to a wide audience interested in innovative computing technologies and their applications.

Author Profiles

Andrey Vladimirov, PhD, is a physicist with a longstanding interest in high performance computing. His research topics include computer simulations of cosmic ray production and propagation and collisionless plasma modeling. Andrey is a postdoctoral scholar at Stanford University.


Vadim Karpusenko, PhD, is a Research Associate at Colfax International. His research interests are in the area of physical modeling with HPC clusters, highly parallel architectures, and code optimization. Vadim holds a PhD in Physics from North Carolina State University for his computational research of the free energy and stability of helical secondary structures of proteins.
