Accelerated Simulations of Cosmic Dust Heating Using the Intel Many Integrated Core Architecture

by Andrey Vladimirov 7. June 2013 11:57

Slides from the talk: PDF logo Vladimirov_HEATCODE_MIC_UCSC.pdf (4.03 mb)

Cosmic dust absorbs starlight in the optical and ultraviolet ranges, and re-emits it in the infrared range. This process is crucial for radiative transport in our Galaxy. I am participating in a project to develop a computational tool for Galactic radiative transport simulation with stochastic light absorption and re-emission on small dust grains. This project has resulted in the development of a library called HEATCODE (HEterogeneous Architecture library for sTochastic COsmic Dust Emissivity) for fast calculation of the stochastic dust heating process using Intel Xeon Phi coprocessors.

I presented HEATCODE and shared my experiences with the development and optimization of applications for Xeon Phi coprocessors in a talk at the Applied Mathematics and Statistics Department at UCSC. The slides from this talk can be downloaded here (see below). The full source code of the application, along with a detailed description of the optimization process, will soon be submitted for peer-reviewed publication, and will become publicly available.

Slides from the talk: PDF logo Vladimirov_HEATCODE_MIC_UCSC.pdf (4.03 mb)

Tags: , , , , , , , ,

How to Write Your Own Blazingly Fast Library of Special Functions for Intel Xeon Phi Coprocessors

by Vadim Karpusenko 3. May 2013 17:56

Complete paper: PDF logo Colfax_Static_Libraries_Xeon_Phi.pdf (425.6 kb)

Statically-linked libraries are used in business and academia for security, encapsulation, and convenience reasons. Static libraries with functions offloadable to Intel Xeon Phi coprocessors must contain executable code for both the host and the coprocessor architecture. Furthermore, for library functions used in data-parallel contexts, vectorized versions of the functions must be produced at the compilation stage.

This white paper shows how to design and build statically-linked libraries with functions offloadable to Intel Xeon Phi coprocessors. In addition, it illustrates how special functions with scalar syntax (e.g., y=f(x)) can be implemented in such a way that user applications can use them in thread- and data-parallel contexts. The second part of the paper demonstrates some optimization methods that improve the performance of functions with scalar syntax on the multi-core and the many-core platforms: precision control, strength reduction, and algorithmic optimizations.

Complete paper: PDF logo Colfax_Static_Libraries_Xeon_Phi.pdf (425.6 kb)

Tags: , , , , , ,

Cache Traffic Optimization on Intel Xeon Phi Coprocessors for Parallel In-Place Square Matrix Transposition with Intel Cilk Plus and OpenMP

by Andrey Vladimirov 25. April 2013 10:05

Complete paper: PDF logo Colfax_Transposition_Xeon_Phi.pdf (602 kb)

Numerical algorithms sensitive to the performance of processor caches can be optimized by increasing the locality of data access. Loop tiling and recursive divide-and-conquer are common methods for cache traffic optimization. This paper studies the applicability of these optimization methods in the Intel Xeon Phi architecture for the in-place square matrix transposition operation. Optimized implementations in the Intel Cilk Plus and OpenMP frameworks are presented and benchmarked. Cache-oblivious nature of the recursive algorithm is compared to the tunable character of the tiled method. Results show that Intel Xeon Phi coprocessors transpose large matrices faster than the host system, however, smaller matrices are more efficiently transposed by the host. On the coprocessor, the Intel Cilk Plus framework excels for large matrix sizes, but incurs a significant parallelization overhead for smaller sizes. Transposition of smaller matrices on the coprocessor is faster with OpenMP.

Complete paper: PDF logo Colfax_Transposition_Xeon_Phi.pdf (602 kb)

Tags: , , , , , , ,

Test-driving Intel® Xeon Phi™ coprocessors with a basic N-body simulation

by Andrey Vladimirov 7. January 2013 18:37

Complete paper:  Colfax_Nbody_Xeon_Phi.pdf (1.21 mb)

Addendum (correction):  Colfax_Nbody_Xeon_Phi-addendum.pdf (509.35 kb)

Intel® Xeon Phi™ coprocessors are capable of delivering more performance and better energy efficiency than Intel® Xeon® processors for certain parallel applications. In this paper, we investigate the porting and optimization of a test problem for the Intel Xeon Phi coprocessor. The test problem is a basic N-body simulation, which is the foundation of a number of applications in computational astrophysics and biophysics. Using common code in the C language for the host processor and for the coprocessor, we benchmark the N-body simulation. The simulation runs 2.3x to 5.4x times faster on a single Intel Xeon Phi coprocessor than on two Intel Xeon E5 series processors. The performance depends on the accuracy settings for transcendental arithmetics. We also study the assembly code produced by the compiler from the C code. This allows us to pinpoint some strategies for designing C/C++ programs that result in efficient automatically vectorized applications for Intel Xeon family devices.

C Code Assembly Listing Performance Chart

The visualization shown below demonstrates the results and the performance of the N-body simulation on Intel Xeon processors and Intel Xeon Phi coprocessors. The code running the visualization has the same force calculation algorithm as the code presented in the paper.

CORRECTION

Thanks to Georg Hager for pointing out the missing compiler argument -xAVX for the host version of the code! The corrected result is reported in the  addendum (509 kb). The performance with -xhost (equivalent to -xAVX on our system) is shown in the last set of bars in the plot below (click to enlarge).

correction-plot

Complete paper:  Colfax_Nbody_Xeon_Phi.pdf (1.21 mb)

Addendum (correction):  Colfax_Nbody_Xeon_Phi-addendum.pdf (509.35 kb)

Tags: , , , , , ,

Squeezing More Instructions per Cycle out of the Intel Sandy Bridge CPU Pipeline

by Andrey Vladimirov 31. July 2012 20:50

Complete paper:  Colfax_CPI.pdf (253.26 kb)

Parallelism in modern CPU architectures is supported at hardware level by multiple cores, vector registers, and pipelines. While the utilization of the former two is a shared responsibility of the programmer and the compiler, pipelining is handled completely by the processor. It is, however, useful for the developer to know what types of workloads optimize pipeline utilization. This paper shows one example where a specific workload improves the number of instructions executed per clock cycle, boosting arithmetic performance. This workload is comprised of two independent data processing tasks, one performing the AVX addition instruction and the other — the AVX multiplication instruction. Even though these tasks are executed sequentially on one core, alternating additions and multiplications in the code allows the CPU to complete the task 40% faster than when a sequence of additions is followed by a sequence of multiplications. Such workloads are common in linear algebraic applications. Examples in the paper illustrate how improved performance can be achieved in portable C code using the Intel C/C++ compiler. Performance benchmarking with the Intel Vtune Parallel Amplifier is illustrated.

c code c code c code

Complete paper:  Colfax_CPI.pdf (253.26 kb)

Tags: , , , , ,

Scientific Computing in a Web Browser: GALPROP WebRun

by Andrey Vladimirov 30. June 2012 17:52

Complete paper:  Colfax_Galprop_WebRun.pdf (1.86 mb)

As scientific software tools become increasingly complex and computationally demanding, sharing the source code of a scientific project with the community may be insufficient to support peer interest and ensure the appropriate use of the tools. In order to facilitate the use of astrophysical code GALPROP, our group has launched a public online service named GALPROP WebRun. This service, live since August 2010, includes: the ability to configure GALPROP computing tasks in a Web browser; access to a dedicated computing cluster and precompiled binaries for code execution; and user support in the form of online documentation, automated validation tools, and forum/bug reporting online software. This paper reports the details and status of the GALPROP WebRun project as well as our experience with it.

Complete paper:  Colfax_Galprop_WebRun.pdf (1.86 mb)

Tags: , , , , , , , , , ,

Avoiding communication saves time and energy (if you are an algorithm)

by Andrey Vladimirov 30. May 2012 14:18

In this post, I would like to reflect on a seminar that I recently attended at Stanford University's Institute for Computational and Mathematical Engineering. The talk was given by Prof. James Demmel, who leads the research on communication avoiding algorithms at the UC Berkley Computer Science department. The lessons I took home from this talk are two: first, the research in communication avoiding algorithms has brought about amazing optimization possibilities, which reduce the time and energy usage of a number of computing problems; and second, the trend of hardware upgrades in the academic HPC arena goes in the direction that is counter-productive for these novel methods.

Why avoiding communication is important

It is common knowledge that arithmetic capabilities of computing systems progress much faster than the bandwidth and latency of computer networks and random-access memory. An explanation of this trend offered by Mark Hoemmen, a student of Demmel, is that "Flops are cheap, bandwidth is money, latency is physics". The consequence of the skyrocketing arithmetic capabilities, paraphrasing Prof. Demmel's words, is that even if one's algorithm is not bottlenecked by the communication of data today, it will be in the near future, when the number of FLOP/s supported by available clusters has grown much more than the network throughput. And this is where communication-avoiding algorithms come in. Indeed, this area of research has born fruit in the last few years (see, e.g., Prof. Demmel's publication list on DBLP). Even for common operations, such as matrix multiplication, LU decomposition or the N-body problem, novel algorithms can boost speed and reduce energy usage; Prof. Demmel's talk showed examples where improvements were as large as an order of magnitude compared to traditional methods!

More memory + Novel algorithm = Less communication

At this point, it is valid to ask the question, "Ok, what's the catch?" The answer is one word: memory. The work "Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms" by Solomonik and Demmel explains it in this way:

Extra memory allows parallel matrix multiplication to be done with asymptotically less communication than Cannon’s algorithm and be faster in practice. "3D" algorithms arrange the p processors in a 3D array, and store redundant copies of the matrices on each of p layers. "2D" algorithms such as Cannon's algorithm store a single copy of the matrices on a 2D array of processors. We generalize these 2D and 3D algorithms by introducing a new class of "2.5D algorithms". For matrix multiplication, we take advantage of any amount of extra memory to store c copies of the data, for any c ∈ {1, 2, ..., [p]}, to reduce the bandwidth cost of Cannon's algorithm by a factor of c½ and the latency cost by a factor c3/2...

According to Prof. Demmel's presentation, storing additional copies of the data to reduce the communication cost has proven to be advantageous for a number of O(n3) and Strassen-like problems; "s-step" Krylov subspace methods; and general programs with array references where the subscripts are linear functions of the loop references (e.g., the N-body problem). The optimum performance produces nearly perfect scaling in time and energy (i.e., the computing time scales as T=p−1 and energy consumption scales as E=const).

Is the academic HPC world up to speed?

There is no doubt that communication avoiding algorithms will play a very important role in the future of HPC. In fact, these methods were mentioned in President Obama's Energy Budget Request to Congress (see also this press-release for page numbers and quotes). However, what has the trend in HPC hardware developments been so far? After the Prof. Demmel's talk I had a very informative discussion with another attendee of the seminar, SLAC scientist Ki Hwan Lee, who has brought to my attention that the recent upgrade of state-of-the art HOPPER cluster resulted in less memory per core than before the upgrade. At the moment, HOPPER's 6,000 compute nodes have 32 GB and 384 nodes feature 64 GB of RAM shared between 24 processor cores.

The lack of large shared-memory computing resources available to academia is not news; however, these resources do exist elsewhere. As an example, allow me to refer to the recent Colfax Research publication, which mentions an astrophysical computing effort involving terabyte-RAM servers. These servers were provided to the Fermi Large Area Telescope researchers by Colfax International on a temporary basis, and similar resources could not be found in academic institutions. In that publication, I have already advocated for the utility of large memory compute nodes from the perspective of efficiency:

  • first, they allow in-core solution of exotic computational problems, such as the huge 1D FFTs necessary for our astrophysical project, which eliminates network traffic completely;
  • second, for memory-bound batch calculations, they allow to shift from thread-level parallelism to process-level parallelism, eliminating the multithreading overhead and thus increasing scalability;
  • and finally, they may replace several small-memory compute nodes, reducing the amount, cost and maintenance effort of cluster infrastructure.
While these large-memory machines are routinely ordered by Colfax's customers as stand-alone compute nodes, they are generally not used as a part of an HPC solution. In fact, this technology is so alien to the scientific HPC community that I often have to repeat "a terabyte of RAM" twice in discussions with experts in scientific computing.

Conclusions

What does one take away from this information? In my opinion, with teraflop-capable GPUs overtaking the arithmetically-intensive workloads and the Intel MIC on the way for similar tasks, the logical direction in which the application of CPUs should shift is memory-intensive problems. Memory-intensive problems is something that GPUs are not good at by design. It is therefore surprising that the recent advances in hardware configuration are made in the opposite direction, shrinking the amount of memory per core. While it may take some time for big computing to assess the financial and environmental impact of large-memory computers and adjust the strategies for HPC development, today's users and owners of smaller computing clusters can benefit from large memory machines in a number of ways, with communication avoiding algorithms being one of the most lucrative.

Tags: , , ,

Arithmetics on Intel’s Sandy Bridge and Westmere CPUs: not all FLOPs are created equal

by Andrey Vladimirov 30. April 2012 19:47

Complete paper:  Colfax_FLOPS.pdf (195.45 kb)

This paper presents a new arithmetic efficiency benchmark and uses it to compare the Intel Sandy Bridge E5-2680 CPU to the Intel Westmere X5690 CPU performance. The efficiency is measured for single and double precision floating point operations: addition, multiplication, division, square root and the exponential function, and for 32- and 64-bit integer operations: addition, multiplication and division. The SSE2 and AVX instruction sets, as well as scalar operations, in single-threaded and multi-threaded modes are covered. This benchmark eliminates the effects of memory bandwidth and latency by fitting the calculation in the L1 cache. The bandwidth of the L1 cache and main memory (RAM) are estimated for reference, and the LINPACK benchmark result is reported.

Results show that the E5-2680 CPU performs floating point addition and multiplication dramatically faster (up to 2.6x) than the X5690 model. However, the floating point division and square root are the new model’s weak spots. AVX floating point operations addition and multiplication are up to 2.0x faster than the SSE2; however, AVX provides no performance gain for division and square root. 32-bit integer arithmetic operations, despite the lack of AVX integer intrinsics, are up to 3.5x faster on E5-2680. At the same time, the Sandy Bridge CPU showed a 1.15x better L1 cache performance and 2.4x greater memory bandwidth than the Westmere model.

These results lead to the conclusion that the edge of the 8-core, 2.70 GHz Sandy Bridge CPU over the 6-core, 3.46 GHz Westmere processor will be most significant in both single and double precision for linear algebra and other tasks based on addition and multiplication. Re-compilation of codes performing addition and multiplication-based tasks with AVX intrinsics instead of SSE2 should lead to additional performance benefits on Sandy Bridge. However, CPU- bound calculations heavily using the division operation and transcendental functions are likely to experience a smaller speedup from using the Sandy Bridge processor in place of Westmere. Likewise, they will benefit less from the migration from SSE2 to AVX.

Complete paper:  Colfax_FLOPS.pdf (195.45 kb)

ADDENDUM

1. Note that pipelining effects come into play when arithmetic operations are combined in a code. For instance, better performance may be obtained when additions are alternated with multiplications, as opposed to a code that performs only additions or only multiplications. See follow-up article about this effect at http://research.colfaxinternational.com/post/2012/07/31/CPI.aspx.

2. The Linpack benchmark result reported in the paper "Arithmetics on Intel's Sandy Bridge..." was obtained using the precompiled binaries optimized for the Xeon 64-bit architecture and employing the Intel OpenMP library for shared-memory parallelization. These results are sub-optimal for this system. Running the MPI-based benchmark yielded a higher Linpack score for the dual-socket E5-2680 CPU system: 292 GFLOP/s. The key parameters of this benchmark are: 32 processes, N=39936, NB=112, PMAP=0, P=4, Q=8. Even higher scores may be possible, see Intel's publication on this subject.

Tags: , , , , , , ,

Auto-Vectorization with the Intel Compilers: is Your Code Ready for Sandy Bridge and Knights Corner?

by Andrey Vladimirov 12. March 2012 13:01

Complete paper:  Colfax_Sandy_Bridge_AVX.pdf (632.23 kb)

One of the features of Intel’s Sandy Bridge-E processor released this month is the support for the Advanced Vector Extensions (AVX) instruction set. Codes suitable for efficient auto-vectorization by the compiler will be able to take advantage of AVX without any code modification, with only re-compilation.

This paper explains the guidelines for code design suitable for auto-vectorization by the compiler (elimination of vector dependence, implementation of unit-stride data access and proper address alignment) and walks the reader through a practical example of code development with auto-vectorization. The resulting code is compiled and executed on two computer systems: a Westmere CPU-based system with SSE 4.2 support, and a Sandy Bridge-based system with AVX support. The benefit of vectorization is more significant in the AVX version, if the code is designed efficiently. An ‘elegant’, but inefficient solution is also provided and discussed.

In addition, the paper provides a comparative benchmark of the Sandy Bridge and Westmere systems, based on the discussed algorithm. Implications of auto-vectorization methods for Intel’s future Many Integrated Core technology based on the Knights Corner chip are discussed at the end.

Complete paper:  Colfax_Sandy_Bridge_AVX.pdf (632.23 kb)

Tags: , , , , , ,

Large Fast Fourier Transforms with FFTW 3.3 on Terabyte-RAM NUMA Servers

by Andrey Vladimirov 2. February 2012 12:54

Complete paper:  Colfax_Benchmark_Large_1D_FFTW_NUMA.pdf (294.13 kb)

This paper presents the results of a Fast Fourier Transform (FFT) benchmark of the FFTW 3.3 library on Colfax's 4-CPU, large memory servers. Unlike other published benchmarks of this library, we study two distinct cases of FFT usage: sequential and concurrent computation of multithreaded transforms. In addition, this paper provides results for very large (up to N = 231) and massively parallel (up to 80 threads) shared memory transforms, which have not yet been reported elsewhere.

The FFT calculation is discussed: parallelization techniques and hardware-specific implementations; motivation for a specific astrophysical research is given. Results presented here include: dependence of performance on the transform size and on the number of threads, memory usage of multithreaded 1D FFTs, estimates of the FFT planning time. The paper shows how to optimize the performance of concurrent independent calculations on these large memory systems by setting an efficient NUMA policy. This policy partitions the machine’s resources, reducing the average memory latency. Such optimization is not specific to FFT algorithms, and can be useful for a variety of applications in large memory NUMA systems. Our conclusion is that the FFTW implementation of multithreaded one-dimensional FFTs scales very well with the number of threads for large transforms, but worse for small transforms. Having a large amount of shared memory in the system is beneficial for the performance of large concurrent FFTs, as it allows to reduce instruction-level parallelism in favor of task-level parallelism.

Complete paper:  Colfax_Benchmark_Large_1D_FFTW_NUMA.pdf (294.13 kb)

Tags: , , , , , , ,

About Colfax Research

Colfax International provides an arsenal of novel computational tools, which need to be leveraged in order to harness their full power. We are collaborating with researchers in science and industry, including our customers, to produce case studies, white papers, and develop a wide knowledge base of the applications of current and future computational technologies.

This blog will contain a variety of information, from hardware benchmarks and HPC news highlights, to discussions of programming issues and reports on research projects carried out in our collaborations. In addition to our in-house research, we will present contributions from authors in the academia, industry and finance, as well as software developers. Our hope is that this information will be useful to a wide audience interested in innovative computing technologies and their applications.

Author Profiles

Andrey Vladimirov, PhD, is a physicist with a longstanding interest in high performance computing. His research topics include computer simulations of cosmic ray production and propagation and collisionless plasma modeling. Andrey is a postdoctoral scholar at Stanford University.

All posts by this author...

Author Profiles

Vadim Karpusenko, PhD, is a Research Associate at Colfax International. His research interests are in the area of physical modeling with HPC clusters, highly parallel architectures, and code optimization. Vadim holds a PhD in Physics from North Carolina State University for his computational research of the free energy and stability of helical secondary structures of proteins.

All posts by this author...