by Andrey Vladimirov
30. April 2012 19:47
Complete paper:
Colfax_FLOPS.pdf (195.45 kb)
This paper presents a new arithmetic efficiency benchmark and uses it to compare the performance of the Intel Sandy Bridge E5-2680 CPU with that of the Intel Westmere X5690 CPU. The efficiency is measured for single and double precision floating point operations (addition, multiplication, division, square root and the exponential function) and for 32- and 64-bit integer operations (addition, multiplication and division). The SSE2 and AVX instruction sets, as well as scalar operations, are covered in single-threaded and multi-threaded modes. This benchmark eliminates the effects of memory bandwidth and latency by fitting the calculation in the L1 cache. The bandwidths of the L1 cache and main memory (RAM) are estimated for reference, and the LINPACK benchmark result is reported.
Results show that the E5-2680 CPU performs floating point addition and multiplication dramatically faster (up to 2.6x) than the X5690 model. However, floating point division and square root are the new model’s weak spots. With AVX, floating point addition and multiplication are up to 2.0x faster than with SSE2; however, AVX provides no performance gain for division and square root. 32-bit integer arithmetic operations, despite the lack of AVX integer intrinsics, are up to 3.5x faster on the E5-2680. At the same time, the Sandy Bridge CPU showed 1.15x better L1 cache performance and 2.4x greater memory bandwidth than the Westmere model.
These results lead to the conclusion that the edge of the 8-core, 2.70 GHz Sandy Bridge CPU over the 6-core, 3.46 GHz Westmere processor will be most significant, in both single and double precision, for linear algebra and other tasks based on addition and multiplication. Re-compilation of codes performing addition and multiplication-based tasks with AVX intrinsics instead of SSE2 should lead to additional performance benefits on Sandy Bridge. However, CPU-bound calculations that rely heavily on division and transcendental functions are likely to experience a smaller speedup from using the Sandy Bridge processor in place of Westmere. Likewise, they will benefit less from the migration from SSE2 to AVX.
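The idea of keeping the calculation inside the L1 cache can be illustrated with a minimal sketch. This is not the benchmark code from the paper: the function name `l1_resident_kernel`, the 32 KiB working-set limit and the multiply-add update are assumptions chosen for illustration. The kernel sweeps repeatedly over a small array, so that after the first pass the arithmetic units, not the memory system, should limit the speed.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed L1 data cache size per core (32 KiB); adjust for the target CPU. */
#define L1_BYTES (32 * 1024)

/*
 * Applies x[i] = x[i] * m + a to every element, `reps` times over.
 * The array must fit in L1, so repeated sweeps do not touch main memory.
 * Returns the number of floating point operations performed
 * (one multiplication and one addition per element per sweep).
 * Timing is left to the caller: divide the returned count by the
 * measured wall-clock time to estimate the arithmetic throughput.
 */
double l1_resident_kernel(float *x, size_t n, int reps, float m, float a)
{
    assert(n * sizeof(float) <= L1_BYTES);  /* working set must fit in L1 */
    for (int r = 0; r < reps; r++) {
        for (size_t i = 0; i < n; i++) {
            x[i] = x[i] * m + a;
        }
    }
    return 2.0 * (double)n * (double)reps;
}
```

A real measurement would also have to keep the compiler from optimizing the loop away, for example by using the final array contents after the timed region.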
by Andrey Vladimirov
12. March 2012 13:01
Complete paper:
Colfax_Sandy_Bridge_AVX.pdf (632.23 kb)
One of the features of Intel’s Sandy Bridge-E processor released this month is the support for the Advanced Vector Extensions (AVX) instruction set. Codes suitable for efficient auto-vectorization by the compiler will be able to take advantage of AVX without any code modification, with only re-compilation.
This paper explains the guidelines for code design suitable for auto-vectorization by the compiler (elimination of vector dependence, implementation of unit-stride data access and proper address alignment) and walks the reader through a practical example of code development with auto-vectorization. The resulting code is compiled and executed on two computer systems: a Westmere CPU-based system with SSE 4.2 support, and a Sandy Bridge-based system with AVX support. The benefit of vectorization is more significant in the AVX version, if the code is designed efficiently. An ‘elegant’, but inefficient solution is also provided and discussed.
In addition, the paper provides a comparative benchmark of the Sandy Bridge and Westmere systems, based on the discussed algorithm. Implications of auto-vectorization methods for Intel’s future Many Integrated Core technology based on the Knights Corner chip are discussed at the end.
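The three guidelines named above (no vector dependence, unit-stride access, proper address alignment) can be sketched on a simple loop. This is an illustrative example, not the algorithm developed in the paper; the function name `scale_add_aligned` and the 32-byte alignment check (the width of an AVX register) are assumptions made here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Computes y[i] = a * x[i] + y[i] in a form a compiler can auto-vectorize:
 *  - `restrict` promises that x and y do not overlap, so there is no
 *    vector dependence between iterations;
 *  - both arrays are traversed with unit stride (index i, step 1);
 *  - both pointers are required to be 32-byte aligned, which matches
 *    256-bit AVX registers (16 bytes would suffice for SSE).
 * Aligned buffers can be obtained, for example, with C11 aligned_alloc.
 */
void scale_add_aligned(size_t n, float a,
                       const float *restrict x, float *restrict y)
{
    assert((uintptr_t)x % 32 == 0);
    assert((uintptr_t)y % 32 == 0);
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Whether the loop is actually vectorized depends on the compiler and its flags; most compilers can print a vectorization report that confirms it.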
by Andrey Vladimirov
2. February 2012 12:54
Complete paper:
Colfax_Benchmark_Large_1D_FFTW_NUMA.pdf (294.13 kb)
This paper presents the results of a Fast Fourier Transform (FFT) benchmark of the FFTW 3.3 library on Colfax's 4-CPU, large memory servers. Unlike other published benchmarks of this library, we study two distinct cases of FFT usage: sequential and concurrent computation of multithreaded transforms. In addition, this paper provides results for very large (up to N = 2^31) and massively parallel (up to 80 threads) shared memory transforms, which have not yet been reported elsewhere.
The paper discusses the FFT calculation, including parallelization techniques and hardware-specific implementations, and gives the motivation from a specific astrophysical research problem. Results presented here include the dependence of performance on the transform size and on the number of threads, the memory usage of multithreaded 1D FFTs, and estimates of the FFT planning time. The paper shows how to optimize the performance of concurrent independent calculations on these large memory systems by setting an efficient NUMA policy. This policy partitions the machine’s resources, reducing the average memory latency. Such optimization is not specific to FFT algorithms, and can be useful for a variety of applications in large memory NUMA systems. Our conclusion is that the FFTW implementation of multithreaded one-dimensional FFTs scales very well with the number of threads for large transforms, but worse for small transforms. Having a large amount of shared memory in the system is beneficial for the performance of large concurrent FFTs, as it makes it possible to reduce instruction-level parallelism in favor of task-level parallelism.
by Andrey Vladimirov
4. January 2012 21:16
Complete paper:
Colfax_Large_Memory_Servers_Memory_Bandwidth_Benchmark.pdf (435.43 kb)
Colfax International produces servers capable of supporting up to 1 TB of RAM and up to 4 Intel Xeon CPUs. This paper reports the memory bandwidth benchmark of these servers obtained using the STREAM code.
Our benchmark includes comprehensive statistical data: the mean, standard deviation, extrema and the distribution of bandwidth measurements. The distribution of measurements reveals several modes of RAM performance, including an above-average bandwidth mode. By default, the mode realized by any given benchmark depends on an unpredictable runtime pattern of thread and memory binding to the physical cores. The paper shows how to optimize memory traffic for bandwidth and consistently achieve the fastest mode. This is done by controlling the code’s thread affinity, and results in a bandwidth increase of around 20% over the average unoptimized performance.
Without optimization, the measured RAM bandwidth with one thread is 5.79±0.06 GB/s (the ‘copy’ test), and it scales almost linearly with the number of threads until it peaks at 67±6 GB/s at 20 threads. Optimized code shows a maximum bandwidth of up to 78.9±0.3 GB/s. A list of references for the NUMA architecture tools is provided.
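For reference, the ‘copy’ test and the way a bandwidth figure follows from it can be sketched as below. This is a simplified illustration, not the STREAM source: the function names are hypothetical, and 1 GB is assumed to mean 10^9 bytes. STREAM counts both the bytes read and the bytes written, so a copy of n doubles moves 2 * n * 8 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* The 'copy' kernel: one read and one write per element. */
void stream_copy(size_t n, const double *restrict a, double *restrict c)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i];
    }
}

/*
 * Converts a measured copy time into bandwidth in GB/s.
 * Bytes moved = 2 * n * sizeof(double): each element is read once
 * and written once. Assumes 1 GB = 1e9 bytes.
 */
double copy_bandwidth_gbs(size_t n, double seconds)
{
    assert(seconds > 0.0);
    double bytes = 2.0 * (double)n * (double)sizeof(double);
    return bytes / seconds / 1e9;
}
```

For a meaningful RAM measurement the arrays must be much larger than the last-level cache, and the time should be taken over several repetitions, as the statistics quoted above suggest.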