Configuration and Benchmarks of Peer-to-Peer Communication over Gigabit Ethernet and InfiniBand in a Cluster with Intel Xeon Phi Coprocessors

by Vadim Karpusenko 11. March 2014 14:05

Complete Paper: Colfax_InfiniBand_for_MIC.pdf (2.32 mb)

Intel Xeon Phi coprocessors allow symmetric heterogeneous clustering models, in which MPI processes are run fully on coprocessors, as opposed to offload-based clustering. These symmetric models are attractive, because they allow effortless porting of CPU-based applications to clusters with manycore computing accelerators.

However, with the default software configuration and without specialized networking hardware, peer-to-peer communication between coprocessors in a cluster is suppressed by orders of magnitude relative to the capabilities of the Gigabit Ethernet networking hardware. This situation is remedied by InfiniBand interconnects and the software stack that supports them.

In this paper we demonstrate the procedures for configuring a cluster with Intel Xeon Phi coprocessors connected with Gigabit Ethernet as well as InfiniBand interconnects. We measure and discuss the latencies and bandwidths of MPI messages with and without the advanced InfiniBand-enabled configuration. The paper also contains a discussion of MPI application tuning in an InfiniBand-enabled cluster with Intel Xeon Phi coprocessors, a case study of the impact of the InfiniBand protocol, and a set of recommendations for accommodating the non-uniform RDMA performance across the PCIe bus in high performance computing applications.
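The message latencies and bandwidths discussed in the paper can be probed with a simple MPI ping-pong micro-benchmark between two ranks, for instance one running on a host CPU and one running natively on a coprocessor. The following is a minimal sketch of such a measurement, not the benchmark code used in the paper; the message sizes, repetition count, and output format are illustrative assumptions, and the program must be launched with at least two ranks.

/* Minimal MPI ping-pong sketch (illustrative, not the paper's benchmark). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  const int nreps = 1000;                                /* repetitions per message size */
  for (long size = 1; size <= (1L << 22); size <<= 1) {  /* 1 byte ... 4 MiB */
    char *buf = malloc(size);
    memset(buf, 0, size);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < nreps; i++) {
      if (rank == 0) {         /* rank 0 sends, then waits for the echo */
        MPI_Send(buf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      } else if (rank == 1) {  /* rank 1 echoes the message back */
        MPI_Recv(buf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
      }
    }
    double dt = MPI_Wtime() - t0;
    if (rank == 0)  /* report one-way latency and round-trip bandwidth */
      printf("%8ld bytes: latency %9.2f us, bandwidth %8.2f MB/s\n",
             size, 0.5e6 * dt / nreps, 2.0 * size * nreps / dt / 1.0e6);
    free(buf);
  }
  MPI_Finalize();
  return 0;
}

Placing one of the two ranks on a coprocessor and the other on the host (or on another coprocessor) exposes exactly the peer-to-peer paths whose performance depends on the Gigabit Ethernet versus InfiniBand configuration.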

Complete Paper: Colfax_InfiniBand_for_MIC.pdf (2.32 mb)


Primer on Computing with Intel Xeon Phi Coprocessors

by Andrey Vladimirov 6. March 2014 14:35

Presentation slides: Colfax-G4-XeonPhi-Presentation.pdf (10.31 mb)

Geant4 is a high energy physics application package for simulation of elementary particle transport through matter. It is used in fundamental physics experiments, as well as in industrial and medical applications. For example, the ATLAS detector at LHC and the Fermi Gamma-Ray Space Telescope rely on Geant4 simulations, DNA damage due to ionizing radiation is studied by a derivative project Geant4-DNA, and radiotherapy planning can benefit from calculations with Geant4.

Geant4 has long employed distributed-memory parallelism in the MPI framework. However, due to the trend toward an increasing ratio of core count to memory size in modern computing systems, and due to the need to process larger geometry models, Geant4 is undergoing modernization through the inclusion of thread parallelism in shared memory. This effort is led by SLAC researchers Dr. Makoto Asai and Dr. Andrea Dotti (see, e.g., slides 1 and slides 2).

A beneficial by-product of such modernization is the possibility to use the Intel Many Integrated Core (MIC) architecture of Intel Xeon Phi coprocessors for Geant4 calculations. This possibility is being actively investigated by Dr. Dotti, who has extensively discussed his work on this project with us at Colfax.

I was fortunate to be invited to give a presentation for Geant4 users at the SLAC Geant4 Tutorial 2014 held at Stanford University. The talk discusses the Intel Many Integrated Core Architecture and points to the resources for learning about optimization of computing applications for Intel Xeon Phi coprocessors. The slides of the talk can be downloaded from this page.

Presentation slides: Colfax-G4-XeonPhi-Presentation.pdf (10.31 mb)


"Heterochromic" Computer and Finding the Optimal System Configuration for Medical Device Engineering

by Andrey Vladimirov 27. January 2014 09:55

Report: Carestream_HPC_Study-December2013.pdf (4.01 mb)

Designing a computing system configuration for optimal performance of a given task is always challenging, especially if the acquisition budget is fixed. It is difficult, if not impossible, to analytically resolve all of the following questions:

  • How well does the application scale across multiple cores?
  • What is the efficiency and scalability of the application with accelerators (GPGPUs or coprocessors)?
  • Should measures be taken to prevent I/O bottlenecks?
  • Is it more efficient to scale up a single task or partition the system for multiple tasks?
  • What combination of CPU models, accelerator count, and per-core software licenses gives the best return on investment?

Rigorous benchmarking is the most reliable method of ensuring the "best bang for buck"; however, it requires access to the computing systems of interest. Colfax takes pride in being able to offer interested customers the opportunity to determine the optimal configuration for their specific tasks.

Recently we received a request from Peter Newman, Systems Engineer at Carestream Health, for evaluating the performance of the software tool ANSYS Mechanical on Colfax's computing solutions. His goal was to find the optimum number of computing accelerators (if any) and software licenses that he needed to purchase in order to achieve the best performance of specific calculations in ANSYS.

In order to allow Mr. Newman to seamlessly benchmark a variety of system configurations, we provided him access to a unique machine built by Colfax, based on an Intel Xeon E5 CPU, and supporting four Nvidia Tesla K40 GPGPUs and four Intel Xeon Phi 7120P coprocessors. Normally, this system is built either with eight GPGPUs as the CXT9000, or outfitted with eight Xeon Phi coprocessors as the CXP9000. However, the "heterochromic" (i.e., featuring both Nvidia's and Intel's accelerators) configuration that we produced for this project allowed the customer to benchmark the ANSYS software on both the Nvidia Tesla and Intel Xeon Phi platforms with minimal logistical effort. Indeed, the software had to be installed only once, and the benchmark scripts and data collection scripts could all be retained in one place.

The methodology of the study was developed by Peter Newman, who also executed the benchmarks, collected and analyzed the data, and summarized findings in a comprehensive report. Mr. Jason Zbick of SimuTech Group, an ANSYS distributor, participated in the study and provided support for ANSYS Mechanical installation and configuration. Colfax's involvement included custom system configuration, maintenance of secure remote access to the system and assistance with automated result collection.

The result of the testing, pertinent to the current state of the software and the specific models used at Carestream, allowed Mr. Newman to empirically find the best way to spend the funds allocated for the improvement of his computing infrastructure. His feedback was:
"This study greatly changed my plan for what to purchase with the budget I had to work with... Having all the hardware at once made the testing very efficient. That was the best part."

The customer has generously shared with us the report on his thorough research. The report can be downloaded below.

Report: Carestream_HPC_Study-December2013.pdf (4.01 mb)


Parallel Computing in the Search for New Physics at LHC

by Andrey Vladimirov 2. December 2013 15:35

Manuscript of Publication: http://arxiv.org/pdf/1310.7556 (submitted to JINST)

Feature in International Journal of Innovation: p36-38_Valerie_Halyo-LR.pdf (724.08 kb)

In the past few months we have had the pleasure of collaborating with Prof. Valerie Halyo of Princeton University on modernization of a high energy physics application for the needs of the Large Hadron Collider (LHC). The objective of our project is to improve the performance of the trigger at LHC, so as to enable real-time detection of exotic collision event products, such as black holes or jets.

For the numerical algorithm of the new trigger software, the Hough transform was chosen. This method allows fast detection of straight or curved tracks in a set of points (detector hits), which could be the traces of new exotic particles. The numerical Hough transform is highly parallelizable by nature; however, existing implementations either did not use hardware parallelism or used it sub-optimally.
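To illustrate why the method parallelizes well, the sketch below shows a straight-line Hough transform accumulator written in C with OpenMP. It is a simplified outline under assumed bin counts, with each thread owning complete rows of the accumulator; it is not the trigger code developed in this project.

/* Illustrative straight-line Hough transform (not the project's trigger code). */
#include <math.h>

#define NTHETA 1024   /* assumed number of angle bins over [0, pi)           */
#define NRHO   2048   /* assumed number of distance bins over [-rmax, +rmax] */

/* Vote for lines rho = x*cos(theta) + y*sin(theta) through the detector hits.
   Each thread processes complete theta rows of the accumulator, so the vote
   counters require no synchronization. */
void hough_lines(const float *x, const float *y, int nhits,
                 int accum[NTHETA][NRHO], float rmax) {
#pragma omp parallel for schedule(static)
  for (int t = 0; t < NTHETA; t++) {
    const float theta = (float)M_PI * t / NTHETA;
    const float c = cosf(theta), s = sinf(theta);
    for (int i = 0; i < nhits; i++) {
      const float rho = x[i] * c + y[i] * s;
      const int r = (int)((rho + rmax) / (2.0f * rmax) * (NRHO - 1));
      if (r >= 0 && r < NRHO) accum[t][r]++;   /* one vote per (hit, angle) pair */
    }
  }
}

Peaks in the accumulator correspond to candidate tracks; curved tracks are handled analogously with additional parameters in the accumulator.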

Colfax's role in the project was to optimize a thread-parallel implementation of the Hough transform for multi-core processors. The result of our involvement was a code capable of detecting 5000 tracks in a synthetic dataset 250x faster than prior art on a multi-core desktop CPU. By benchmarking the application on a server based on multi-core Intel Xeon E5 processors, we obtained a further 5x performance gain. The techniques used for optimization, briefly discussed in the paper (see below), are featured in our book on parallel programming and in our developer training program. They focus on code portability across multi- and many-core platforms, with an emphasis on future-proofing the optimized application.

Our results are reported in a publication submitted for peer review to JINST (see link at the top and bottom of this post). Prof. Halyo's work was also featured in an article in International Journal of Innovation, available for download here (courtesy of Prof. Halyo).

Manuscript of Publication: http://arxiv.org/pdf/1310.7556 (submitted to JINST)

Feature in International Journal of Innovation: p36-38_Valerie_Halyo-LR.pdf (724.08 kb)


Accelerating Public Domain Applications: Lessons from Models of Radiation Transport in the Milky Way Galaxy

by Andrey Vladimirov 25. November 2013 10:53

Slides: SC13-Intel-Theater-Talk-Colfax.pdf (2.62 mb)

Manuscript: http://arxiv.org/pdf/1311.4627 (submitted to Computer Physics Communications)

Last week I had the privilege of giving a talk at the Intel Theater at SC'13. I presented a case study done with Stanford University on using Intel Xeon Phi coprocessors for accelerating a new astrophysical library HEATCODE (HEterogeneous Architecture library for sTochastic COsmic Dust Emissivity).

If this talk could be summarized in one sentence, it would be: "One high performance code for two platforms is a reality." Indeed, the optimizations performed to prepare HEATCODE for the MIC architecture led to a tremendous performance increase on the CPU platform. As a consequence, we have developed a high performance library that can be employed and modified both by users who have access to Xeon Phi coprocessors and by those using only multi-core CPUs.

The paper introducing the HEATCODE library, with details of the optimization process, is under review at Computer Physics Communications. The preliminary manuscript can be obtained from arXiv, and the slides of the talk are available on this page (see links above and below). The open source code will be made available upon acceptance of the paper.

Slides: SC13-Intel-Theater-Talk-Colfax.pdf (2.62 mb)

Manuscript: http://arxiv.org/pdf/1311.4627 (submitted to Computer Physics Communications)


Heterogeneous Clustering with Homogeneous Code: Accelerate MPI Applications Without Code Surgery Using Intel Xeon Phi Coprocessors

by Andrey Vladimirov 17. October 2013 11:55

Complete Paper: Colfax_Heterogeneous_Clustering_Xeon_Phi.pdf (442.87 kb)

This paper reports on our experience with a heterogeneous cluster execution environment, in which a distributed parallel application utilizes two types of compute devices: those employing general-purpose processors, and those based on computing accelerators known as Intel Xeon Phi coprocessors.

Unlike general-purpose graphics processing units (GPGPUs), Intel Xeon Phi coprocessors are able to execute native applications. In this mode, the application runs in the coprocessor's operating system and does not require a host process executing on the CPU and offloading data to the accelerator (coprocessor). Therefore, for an application in the MPI framework, it is possible to run MPI processes directly on coprocessors. In this case, coprocessors behave like independent compute nodes in the cluster, with an MPI rank, peer-to-peer communication capability, and access to a network-shared file system. With such a configuration, there is no need to instrument data offload in the application in order to utilize a heterogeneous system comprising processors and coprocessors. As a result, an MPI application designed for a CPU-only cluster can be used on coprocessor-enabled clusters without code modification.
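The following sketch illustrates the symmetric model: the same MPI source code runs unmodified whether a rank is placed on a CPU or on a coprocessor. This is an illustration rather than code from the paper, and the host and coprocessor names mentioned in the comment are hypothetical.

/* Illustrative symmetric MPI program (not the paper's code). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks, len;
  char name[MPI_MAX_PROCESSOR_NAME];
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);
  MPI_Get_processor_name(name, &len);
  /* In a symmetric launch, some ranks report host names (e.g., node01) and
     others report coprocessor names (e.g., node01-mic0), yet the source code
     is identical; only the target architecture of the build differs. */
  printf("Rank %d of %d running on %s\n", rank, nranks, name);
  MPI_Finalize();
  return 0;
}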

We discuss the issues of portable code design, load balancing and system configuration (networking and MPI) necessary in order for such a setup to be efficient. An example application used for this study carries out a Monte Carlo simulation for Asian option pricing. The paper includes the performance metrics of this application with CPU-only and heterogeneous cluster configurations.
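For reference, a Monte Carlo estimator for an arithmetic-average Asian call option can be sketched in a few dozen lines of C with OpenMP. The version below is a simplified illustration assuming geometric Brownian motion and a basic per-thread random number generator; it is not the benchmark application from the paper, and all parameter values are arbitrary.

/* Illustrative Monte Carlo pricing of an arithmetic-average Asian call option. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

double asian_call_mc(double S0, double K, double r, double sigma,
                     double T, int nsteps, long npaths, unsigned seed) {
  const double dt    = T / nsteps;
  const double drift = (r - 0.5 * sigma * sigma) * dt;
  const double vol   = sigma * sqrt(dt);
  double payoff_sum  = 0.0;
#pragma omp parallel reduction(+ : payoff_sum)
  {
    unsigned state = seed + 7919u * omp_get_thread_num();  /* per-thread RNG stream */
#pragma omp for
    for (long p = 0; p < npaths; p++) {
      double S = S0, avg = 0.0;
      for (int i = 0; i < nsteps; i++) {
        /* Box-Muller transform for a standard normal variate */
        double u1 = (rand_r(&state) + 1.0) / ((double)RAND_MAX + 2.0);
        double u2 = (rand_r(&state) + 1.0) / ((double)RAND_MAX + 2.0);
        double z  = sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
        S   *= exp(drift + vol * z);          /* geometric Brownian motion step */
        avg += S;
      }
      avg /= nsteps;
      payoff_sum += (avg > K) ? (avg - K) : 0.0;
    }
  }
  return exp(-r * T) * payoff_sum / npaths;   /* discounted average payoff */
}

int main(void) {
  printf("Asian call price ~ %.4f\n",
         asian_call_mc(100.0, 100.0, 0.05, 0.2, 1.0, 252, 1000000L, 12345u));
  return 0;
}

In a heterogeneous run, each MPI rank (whether on a CPU or on a coprocessor) would execute such a kernel on its share of the paths, which is where load balancing between the two device types becomes important.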

A visualization based on the paper was exhibited by Colfax at SC'13 at the Intel corporate booth.

Complete Paper: Colfax_Heterogeneous_Clustering_Xeon_Phi.pdf (442.87 kb)


Multithreaded Transposition of Square Matrices with Common Code for Intel Xeon Processors and Intel Xeon Phi Coprocessors

by Andrey Vladimirov 12. August 2013 11:44

Complete Paper: Colfax_Transposition-7110P.pdf (513 kb)

In-place matrix transposition, a standard operation in linear algebra, is a memory bandwidth-bound operation. The theoretical maximum performance of transposition is the memory copy bandwidth. However, due to non-contiguous memory access in the transposition operation, practical performance is usually lower. The ratio of the transposition rate to the memory copy bandwidth is a measure of the transposition algorithm efficiency.

This paper demonstrates and discusses an efficient C language implementation of parallel in-place square matrix transposition. For large matrices, it achieves a transposition rate of 49 GB/s (82% efficiency) on Intel Xeon CPUs and 113 GB/s (67% efficiency) on Intel Xeon Phi coprocessors. The code is tuned with pragma-based compiler hints and compiler arguments. Thread parallelism in the code is handled by OpenMP, and vectorization is automatically implemented by the Intel compiler. This approach allows the same C code to be used for both the CPU and the MIC architecture executables, both demonstrating high efficiency. For the benchmarks, an Intel Xeon Phi 7110P coprocessor is used.
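A simplified sketch of a tiled in-place transposition kernel in C with OpenMP is given below. The tile size and scheduling are illustrative assumptions, and the code is not the tuned implementation described in the paper.

/* Illustrative tiled in-place transposition of an n x n matrix (not the paper's code). */
#define TILE 32   /* assumed tile size, chosen so a pair of tiles fits in cache */

void transpose_inplace(float *A, int n) {
  /* Each pair of mirror tiles (ib,jb) and (jb,ib) is swapped element-wise;
     diagonal tiles are transposed in place. Tile pairs are distributed over threads. */
#pragma omp parallel for schedule(dynamic) collapse(2)
  for (int ib = 0; ib < n; ib += TILE)
    for (int jb = 0; jb < n; jb += TILE) {
      if (jb < ib) continue;                    /* visit each tile pair once */
      const int imax = (ib + TILE < n) ? ib + TILE : n;
      const int jmax = (jb + TILE < n) ? jb + TILE : n;
      for (int i = ib; i < imax; i++)
        for (int j = (ib == jb ? i + 1 : jb); j < jmax; j++) {
          const float tmp    = A[i * (long)n + j];
          A[i * (long)n + j] = A[j * (long)n + i];
          A[j * (long)n + i] = tmp;
        }
    }
}

The same source can be compiled for the host with a standard build and for the coprocessor as a native executable (for example, with the Intel compiler's -mmic flag), which is one way a single C code can serve both architectures.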

Complete Paper: Colfax_Transposition-7110P.pdf (513 kb)


Accelerated Simulations of Cosmic Dust Heating Using the Intel Many Integrated Core Architecture

by Andrey Vladimirov 7. June 2013 11:57

Slides from the talk: Vladimirov_HEATCODE_MIC_UCSC.pdf (4.03 mb)

Cosmic dust absorbs starlight in the optical and ultraviolet ranges, and re-emits it in the infrared range. This process is crucial for radiative transport in our Galaxy. I am participating in a project to develop a computational tool for Galactic radiative transport simulation with stochastic light absorption and re-emission on small dust grains. This project has resulted in the development of a library called HEATCODE (HEterogeneous Architecture library for sTochastic COsmic Dust Emissivity) for fast calculation of the stochastic dust heating process using Intel Xeon Phi coprocessors.

I presented HEATCODE and shared my experiences with the development and optimization of applications for Xeon Phi coprocessors in a talk at the Applied Mathematics and Statistics Department at UCSC. The slides from this talk can be downloaded here (see below). The full source code of the application, along with a detailed description of the optimization process, will soon be submitted for peer-reviewed publication, and will become publicly available.

Slides from the talk: Vladimirov_HEATCODE_MIC_UCSC.pdf (4.03 mb)


How to Write Your Own Blazingly Fast Library of Special Functions for Intel Xeon Phi Coprocessors

by Vadim Karpusenko 3. May 2013 17:56

Complete paper: Colfax_Static_Libraries_Xeon_Phi.pdf (425.6 kb)

Statically-linked libraries are used in business and academia for security, encapsulation, and convenience reasons. Static libraries with functions offloadable to Intel Xeon Phi coprocessors must contain executable code for both the host and the coprocessor architecture. Furthermore, for library functions used in data-parallel contexts, vectorized versions of the functions must be produced at the compilation stage.

This white paper shows how to design and build statically-linked libraries with functions offloadable to Intel Xeon Phi coprocessors. In addition, it illustrates how special functions with scalar syntax (e.g., y=f(x)) can be implemented in such a way that user applications can use them in thread- and data-parallel contexts. The second part of the paper demonstrates some optimization methods that improve the performance of functions with scalar syntax on the multi-core and the many-core platforms: precision control, strength reduction, and algorithmic optimizations.
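The following sketch outlines the general structure of such a library under the assumption that the Intel compiler's offload attribute (__attribute__((target(mic)))) and the Cilk Plus elemental function attribute (__attribute__((vector))) are used. The function name, its body, and the build commands in the comments are illustrative and are not taken from the paper.

/* mylib.h -- declaration visible to user applications (illustrative) */
__attribute__((target(mic)))   /* emit code for both the host and the coprocessor */
__attribute__((vector))        /* also emit a vectorized (elemental) version       */
float my_special_function(float x);

/* mylib.c -- library implementation (illustrative) */
#include <math.h>

__attribute__((target(mic)))
__attribute__((vector))
float my_special_function(float x) {
  /* placeholder body; an actual special function implementation would go here */
  return expf(-x * x) * (1.0f + 0.5f * x);
}

/* Build steps, shown as comments and assuming the Intel toolchain:
 *   icc -c mylib.c -o mylib.o      # fat object containing host and MIC code
 *   xiar rcs libmylib.a mylib.o    # the Intel archiver preserves offload sections
 *
 * A user application could then call the function in an offloaded,
 * data-parallel loop, for example:
 *
 *   #pragma offload target(mic) in(x : length(n)) out(y : length(n))
 *   #pragma omp parallel for
 *   for (int i = 0; i < n; i++)
 *     y[i] = my_special_function(x[i]);
 */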

Complete paper: Colfax_Static_Libraries_Xeon_Phi.pdf (425.6 kb)


Cache Traffic Optimization on Intel Xeon Phi Coprocessors for Parallel In-Place Square Matrix Transposition with Intel Cilk Plus and OpenMP

by Andrey Vladimirov 25. April 2013 10:05

Complete paper: Colfax_Transposition_Xeon_Phi.pdf (602 kb)

Follow-up publication: http://research.colfaxinternational.com/post/2013/08/12/Trans-7110.aspx

Numerical algorithms sensitive to the performance of processor caches can be optimized by increasing the locality of data access. Loop tiling and recursive divide-and-conquer are common methods for cache traffic optimization. This paper studies the applicability of these optimization methods in the Intel Xeon Phi architecture to the in-place square matrix transposition operation. Optimized implementations in the Intel Cilk Plus and OpenMP frameworks are presented and benchmarked. The cache-oblivious nature of the recursive algorithm is compared to the tunable character of the tiled method. Results show that Intel Xeon Phi coprocessors transpose large matrices faster than the host system; however, smaller matrices are more efficiently transposed by the host. On the coprocessor, the Intel Cilk Plus framework excels for large matrix sizes, but incurs a significant parallelization overhead for smaller sizes. Transposition of smaller matrices on the coprocessor is faster with OpenMP.
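As an illustration of the recursive divide-and-conquer approach, the sketch below expresses a cache-oblivious in-place transposition with task parallelism. It uses OpenMP tasks rather than the Intel Cilk Plus spawns benchmarked in the paper, assumes the matrix dimension is a power of two, and is not the code evaluated in the study.

/* Illustrative cache-oblivious recursive in-place transposition (OpenMP tasks). */
#define CUTOFF 64  /* below this block size, fall back to simple loops */

/* Swap the size x size block at (r0,c0) with the transpose of the block at (r1,c1). */
static void swap_transpose(float *A, long n, long r0, long c0,
                           long r1, long c1, long size) {
  if (size <= CUTOFF) {
    for (long i = 0; i < size; i++)
      for (long j = 0; j < size; j++) {
        const float tmp          = A[(r0 + i) * n + c0 + j];
        A[(r0 + i) * n + c0 + j] = A[(r1 + j) * n + c1 + i];
        A[(r1 + j) * n + c1 + i] = tmp;
      }
    return;
  }
  const long h = size / 2;   /* recurse on corresponding quadrant pairs */
#pragma omp task
  swap_transpose(A, n, r0,     c0,     r1,     c1,     h);
#pragma omp task
  swap_transpose(A, n, r0,     c0 + h, r1 + h, c1,     h);
#pragma omp task
  swap_transpose(A, n, r0 + h, c0,     r1,     c1 + h, h);
  swap_transpose(A, n, r0 + h, c0 + h, r1 + h, c1 + h, h);
#pragma omp taskwait
}

/* Transpose the size x size diagonal block starting at (r,r) in place. */
static void transpose_diag(float *A, long n, long r, long size) {
  if (size <= CUTOFF) {
    for (long i = 0; i < size; i++)
      for (long j = i + 1; j < size; j++) {
        const float tmp        = A[(r + i) * n + r + j];
        A[(r + i) * n + r + j] = A[(r + j) * n + r + i];
        A[(r + j) * n + r + i] = tmp;
      }
    return;
  }
  const long h = size / 2;
#pragma omp task
  transpose_diag(A, n, r, h);
#pragma omp task
  transpose_diag(A, n, r + h, h);
  swap_transpose(A, n, r, r + h, r + h, r, h);  /* off-diagonal quadrants */
#pragma omp taskwait
}

void transpose_recursive(float *A, long n) {   /* n assumed to be a power of two */
#pragma omp parallel
#pragma omp single
  transpose_diag(A, n, 0, n);
}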

COMMENTS:

  1. If you are interested in this paper, make sure to also read a follow-up publication (improved results, downloadable public code) available at http://research.colfaxinternational.com/post/2013/08/12/Trans-7110.aspx
  2. Note that in the present paper, the rate of transposition in GB/s is calculated as the matrix size divided by the transposition time. In the follow-up paper and most other publications on the subject, this ratio is further multiplied by 2x. Multiply the transposition rate reported in this paper by 2x in order to compare it to the follow-up results.

Complete paper: Colfax_Transposition_Xeon_Phi.pdf (602 kb)

Follow-up publication: http://research.colfaxinternational.com/post/2013/08/12/Trans-7110.aspx


About Colfax Research

Colfax International provides an arsenal of novel computational tools, which need to be leveraged in order to harness their full power. We are collaborating with researchers in science and industry, including our customers, to produce case studies and white papers, and to develop a broad knowledge base of the applications of current and future computational technologies.

This blog will contain a variety of information, from hardware benchmarks and HPC news highlights to discussions of programming issues and reports on research projects carried out in our collaborations. In addition to our in-house research, we will present contributions from authors in academia, industry, and finance, as well as from software developers. Our hope is that this information will be useful to a wide audience interested in innovative computing technologies and their applications.

Author Profiles

Andrey Vladimirov, PhD, is the Head of HPC Research at Colfax International. His primary research interest is the application of modern computing technologies to computationally demanding scientific problems. Prior to joining Colfax, Andrey was involved in theoretical astrophysics research at the Ioffe Institute (Russia), North Carolina State University, and Stanford University (USA), where he studied cosmic rays, collisionless plasmas and the interstellar medium using computer simulations. 

All posts by this author...

Author Profiles

Vadim Karpusenko, PhD, is a Principal HPC Research Engineer at Colfax International. His research interests are in the area of physical modeling with HPC clusters, highly parallel architectures, and code optimization. Vadim holds a PhD in Physics from North Carolina State University for his computational research of the free energy and stability of helical secondary structures of proteins.

All posts by this author...