Slide Deck for Intel Xeon Phi Coprocessor Programming Training

by Andrey Vladimirov 13. October 2014 15:57

We are making publicly available the slide deck (280 pages) of the Colfax developer training titled "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors" (click for download).

This training is an intensive course for developers wishing to leverage the Intel MIC architecture. It is also useful for multi-core processor programming. The course is based on a book of the same name, which contains targeted exercises ("labs") for hands-on practicum.

This year, "Parallel Programming and Optimization..." has been taught at over 40 venues across the United States, including research institutions, government labs, and universities, as well as at regional trainings. Over 700 students attended the course. Many of these events were free to attendees thanks to Intel's sponsorship.

Check back with us at the end of the year for a schedule of additional regional trainings, and for the second edition of the book featuring information on future manycore architectures, cluster configuration, networking on Xeon Phi with InfiniBand, usage of the latest compilers and driver stack, new practical exercises, computer display-friendly page format, and more.


Intel Cilk Plus for Complex Parallel Algorithms: "Enormous Fast Fourier Transforms" (EFFT) Library

by Ryo Asai 18. September 2014 15:42

Electronic preprint: Colfax_EFFT.pdf (724 KB), arXiv:1409.5757

In this paper we demonstrate a methodology for parallelizing the computation of large one-dimensional discrete fast Fourier transforms (DFFTs) on multi-core Intel Xeon processors. Implementations of DFFTs based on the recursive Cooley-Tukey method must manage cache utilization, memory bandwidth, and vector hardware usage, while at the same time scaling across multiple threads or compute nodes. Our method builds on the single-threaded Intel Math Kernel Library (MKL) implementation of DFFT, and uses the Intel Cilk Plus framework for thread parallelism. We demonstrate the ability of Intel Cilk Plus to handle parallel recursion with nested loop-centric parallelism without tuning the code to the number of cores or cache metrics. The result of our work is a library called EFFT that performs 1D DFTs of size 2^N for N>=21 faster than the corresponding Intel MKL parallel DFT implementation by up to 1.5x, and faster than FFTW by up to 2.5x. The code of EFFT is available for free download under the GPLv3 license. This work provides a new efficient DFFT implementation, and at the same time serves as an educational example of how computer science problems with complex parallel patterns can be optimized for high performance using the Intel Cilk Plus framework.
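The recursive structure being parallelized can be sketched as follows. This is a textbook radix-2 Cooley-Tukey transform for illustration only, not the EFFT code itself; the comment marks where a Cilk Plus implementation could use cilk_spawn to expose parallel recursion.

```cpp
// Minimal radix-2 Cooley-Tukey DFT sketch (illustrative, not EFFT code).
// In a Cilk Plus version, the two recursive calls could be spawned with
// cilk_spawn to expose the parallel recursion discussed in the paper.
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

void fft(std::vector<cd>& a) {
    const std::size_t n = a.size();   // must be a power of two
    if (n == 1) return;
    const double PI = 3.14159265358979323846;
    // Split into even- and odd-indexed halves
    std::vector<cd> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i]  = a[2 * i + 1];
    }
    fft(even);   // candidate for cilk_spawn in a parallel version
    fft(odd);
    // Combine halves with the twiddle factors
    for (std::size_t k = 0; k < n / 2; ++k) {
        cd t = std::polar(1.0, -2.0 * PI * k / n) * odd[k];
        a[k]         = even[k] + t;
        a[k + n / 2] = even[k] - t;
    }
}
```

This naive version copies into temporary arrays at every level; the cache- and bandwidth-aware decomposition is precisely what the EFFT library adds on top of this structure.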


Installing Intel MPSS 3.3 in Arch Linux

by Andrey Vladimirov 20. August 2014 16:35

Complete Paper: Colfax_MPSS_in_Arch_Linux.pdf (97 KB)

This technical publication provides instructions for installing the Intel Manycore Platform Software Stack (MPSS) version 3.3 in the Arch Linux operating system. Intel MPSS is a suite of tools necessary for the operation of Intel Xeon Phi coprocessors. The instructions provided here enable offload and networking functionality for coprocessors in Arch Linux. The procedure described in this paper is completely reversible via an uninstallation script.


Product                                                        Direct Link
Intel MPSS 3.3 (page, archive)                                 mpss-3.3-linux.tar (~400 MB)
Linux Kernel 3.10 LTS (AUR)                                    linux-lts310.tar.gz (78 KB)
TRee Installation Generator (TRIG)                             (3 KB)
RHEL networking utilities                                      rhnet.tgz (33 KB)
Offload functionality test                                     (347 B)
GNU Public License v2 (applies to TRIG and RHEL utilities)     page


Make sure to read important additional information by clicking "Comments" below ↓


File I/O on Intel Xeon Phi Coprocessors: RAM disks, VirtIO, NFS and Lustre

by Andrey Vladimirov 28. July 2014 13:10

Complete Paper: Colfax_File_IO_on_Intel_Xeon_Phi_Coprocessors.pdf (2.27 MB)

The key innovation brought about by Intel Xeon Phi coprocessors is the possibility to port most HPC applications to manycore computing accelerators without code modification. One of the reasons why this is possible is support for file input/output (I/O) directly from applications running on coprocessors. These facilities allow seamless usage of manycore accelerators in common HPC tasks such as application initialization from file data, saving running output, checkpointing and restarting, and data post-processing and visualization, among others.
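As a hypothetical illustration of this point (the function names and the binary format below are ours, not from the paper), a checkpoint/restart helper written in plain standard C++ runs unchanged whether it executes on the host CPU or natively on the coprocessor:

```cpp
// Illustrative checkpoint/restart sketch: ordinary standard C++ file I/O,
// which works identically on the host and on an Intel Xeon Phi coprocessor.
#include <cstddef>
#include <fstream>
#include <vector>

// Write the simulation state to a binary file; returns false on I/O error.
bool save_checkpoint(const char* path, const std::vector<double>& state) {
    std::ofstream f(path, std::ios::binary);
    f.write(reinterpret_cast<const char*>(state.data()),
            state.size() * sizeof(double));
    return bool(f);
}

// Read n doubles back from a binary checkpoint file.
std::vector<double> load_checkpoint(const char* path, std::size_t n) {
    std::vector<double> state(n);
    std::ifstream f(path, std::ios::binary);
    f.read(reinterpret_cast<char*>(state.data()), n * sizeof(double));
    return state;
}
```

Which file system backs `path` (a RAM disk, a virtualized local drive, or NFS/Lustre storage) is exactly the choice the paper's benchmarks inform.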

This paper provides the information and benchmarks necessary to choose the best file system for a given application from the available options:

  • RAM disks,
  • virtualized local hard drives, and
  • distributed storage shared with NFS or Lustre.

We report benchmarks of I/O performance and parallel scalability on Intel Xeon Phi coprocessors, and discuss the strengths and limitations of each option. In addition, the paper presents the system administration procedures necessary for using each file system on coprocessors, including bridged networking and InfiniBand configuration, software installation, and MPSS image modifications. We also discuss the applicability of each storage option to common HPC tasks.


Colfax Research papers translated to Japanese

by Andrey Vladimirov 14. July 2014 12:21

With the help of our partners at Intel, some of our articles on Intel Xeon Phi coprocessor programming were translated to the Japanese language.

インテル社の協力で、我が社のインテル(R) Xeon Phi(TM) コプロセッサーのプログラミングについての論文の一部が日本語に翻訳されました。


Cluster-Level Tuning of a Shallow Water Equation Solver on the Intel MIC Architecture

by Andrey Vladimirov 12. May 2014 09:55

Complete Paper: Colfax_Shallow_Water.pdf (810.35 KB), arXiv:1408.1727

The paper demonstrates the optimization of the execution environment of a hybrid OpenMP+MPI computational fluid dynamics code (shallow water equation solver) on a cluster enabled with Intel Xeon Phi coprocessors. The discussion includes:

  1. Controlling the number and affinity of OpenMP threads to optimize access to memory bandwidth;
  2. Tuning the inter-operation of OpenMP and MPI to partition the problem for better data locality;
  3. Ordering the MPI ranks in a way that directs some of the traffic into faster communication channels;
  4. Using efficient peer-to-peer communication between Xeon Phi coprocessors based on the InfiniBand fabric.

With this tuning, the application achieves 90% efficiency of parallel scaling up to 8 Intel Xeon Phi coprocessors in 2 compute nodes. For larger problems, scalability is even better because of the greater computation-to-communication ratio; however, problems of that size do not fit in the memory of one coprocessor.
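For reference, parallel efficiency here has its usual meaning relative to a single-coprocessor baseline, so 90% efficiency on N = 8 coprocessors corresponds to a measured speedup of about 7.2x (simple arithmetic from the figure quoted above):

```latex
E(N) = \frac{T(1)}{N\,T(N)}, \qquad
E(8) = 0.9 \;\Longrightarrow\; S(8) = E(8) \times 8 = 7.2
```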

The performance of the solver on one Intel Xeon Phi coprocessor 7120P exceeds the performance on a dual-socket Intel Xeon E5-2697 v2 CPU by a factor of 1.6x. In a 2-node cluster with 4 coprocessors per compute node, the MIC architecture yields 5.8x more performance than the CPUs.

Only one line of legacy Fortran code had to be changed in order to achieve the reported performance on the MIC architecture (not counting changes to the command-line interface).

The methodology discussed in this paper is directly applicable to other bandwidth-bound stencil algorithms utilizing a hybrid OpenMP+MPI approach.


Configuration and Benchmarks of Peer-to-Peer Communication over Gigabit Ethernet and InfiniBand in a Cluster with Intel Xeon Phi Coprocessors

by Vadim Karpusenko 11. March 2014 14:05

Complete Paper: Colfax_InfiniBand_for_MIC.pdf (2.32 MB)

Intel Xeon Phi coprocessors allow symmetric heterogeneous clustering models, in which MPI processes are run fully on coprocessors, as opposed to offload-based clustering. These symmetric models are attractive, because they allow effortless porting of CPU-based applications to clusters with manycore computing accelerators.

However, with the default software configuration and without specialized networking hardware, peer-to-peer communication between coprocessors in a cluster falls short of the capabilities of even Gigabit Ethernet networking hardware by orders of magnitude. This situation is remedied by InfiniBand interconnects and the software supporting them.

In this paper we demonstrate the procedures for configuring a cluster with Intel Xeon Phi coprocessors connected with Gigabit Ethernet as well as InfiniBand interconnects. We measure and discuss the latencies and bandwidths of MPI messages with and without the advanced configuration with InfiniBand support. The paper contains a discussion of MPI application tuning in an InfiniBand-enabled cluster with Intel Xeon Phi Coprocessors, a case study of the impact of InfiniBand protocol, and a set of recommendations for accommodating the non-uniform RDMA performance across the PCIe bus in high performance computing applications.


Primer on Computing with Intel Xeon Phi Coprocessors

by Andrey Vladimirov 6. March 2014 14:35

Presentation slides: Colfax-G4-XeonPhi-Presentation.pdf (10.31 MB)

Geant4 is a high energy physics application package for simulation of elementary particle transport through matter. It is used in fundamental physics experiments, as well as in industrial and medical applications. For example, the ATLAS detector at the LHC and the Fermi Gamma-Ray Space Telescope rely on Geant4 simulations, DNA damage due to ionizing radiation is studied by the derivative project Geant4-DNA, and radiotherapy planning can benefit from calculations with Geant4.

Geant4 has long employed distributed-memory parallelism through the MPI framework. However, due to the trend of an increasing ratio of core count to memory size in modern computing systems, and due to the need to process larger geometry models, Geant4 is undergoing modernization through the inclusion of thread parallelism in shared memory. This effort is led by SLAC researchers Dr. Makoto Asai and Dr. Andrea Dotti (see, e.g., slides 1 and slides 2).

A beneficial by-product of such modernization is the possibility to use the Intel Many Integrated Core (MIC) architecture of Intel Xeon Phi coprocessors for Geant4 calculations. This possibility is being actively investigated by Dr. Dotti, who has extensively discussed his work on this project with us at Colfax.

I was fortunate to be invited to give a presentation for Geant4 users at the SLAC Geant4 Tutorial 2014 held at Stanford University. The talk discusses the Intel Many Integrated Core Architecture and points to the resources for learning about optimization of computing applications for Intel Xeon Phi coprocessors. The slides of the talk can be downloaded from this page.


"Heterochromic" Computer and Finding the Optimal System Configuration for Medical Device Engineering

by Andrey Vladimirov 27. January 2014 09:55

Report: Carestream_HPC_Study-December2013.pdf (4.01 MB)

Designing a computing system configuration for optimal performance of a given task is always challenging, especially if the acquisition budget is fixed. It is difficult, if not impossible, to analytically resolve all of the following questions:

  • How well does the application scale across multiple cores?
  • What is the efficiency and scalability of the application with accelerators (GPGPUs or coprocessors)?
  • Should measures be taken to prevent I/O bottlenecks?
  • Is it more efficient to scale up a single task or partition the system for multiple tasks?
  • What combination of CPU models, accelerator count, and per-core software licenses gives the best return on investment?

Rigorous benchmarking is the most reliable method of ensuring the best "bang for the buck"; however, it requires access to the computing systems of interest. Colfax takes pride in being able to offer interested customers opportunities for deducing the optimal configuration for specific tasks.

Recently we received a request from Peter Newman, Systems Engineer at Carestream Health, for evaluating the performance of the software tool ANSYS Mechanical on Colfax's computing solutions. His goal was to find the optimum number of computing accelerators (if any) and software licenses that he needed to purchase in order to achieve the best performance of specific calculations in ANSYS.

In order to allow Mr. Newman to seamlessly benchmark a variety of system configurations, we provided him access to a unique machine built by Colfax, based on an Intel Xeon E5 CPU, and supporting four Nvidia Tesla K40 GPGPUs and four Intel Xeon Phi 7120P coprocessors. Normally, this system is built either with eight GPGPUs as CXT9000, or outfitted with eight Xeon Phi coprocessors as CXP9000. However, the "heterochromic" (i.e., featuring both Nvidia's and Intel's accelerators) configuration that we produced for this project allowed the customer to benchmark the ANSYS software on both the Nvidia Tesla and Intel Xeon Phi platforms with minimal logistic effort. Indeed, the software had to be installed only once, and the benchmark scripts and data collection scripts could all be retained in one place.

The methodology of the study was developed by Peter Newman, who also executed the benchmarks, collected and analyzed the data, and summarized findings in a comprehensive report. Mr. Jason Zbick of SimuTech Group, an ANSYS distributor, participated in the study and provided support for ANSYS Mechanical installation and configuration. Colfax's involvement included custom system configuration, maintenance of secure remote access to the system and assistance with automated result collection.

The results of the testing, pertinent to the current state of the software and the specific models used at Carestream, allowed Mr. Newman to empirically find the best way to spend the funds allocated for the improvement of his computing infrastructure. His feedback was:
"This study greatly changed my plan for what to purchase with the budget I had to work with... Having all the hardware at once made the testing very efficient. That was the best part."

The customer has generously shared with us the report on his thorough research. The report can be downloaded below.


Parallel Computing in the Search for New Physics at LHC

by Andrey Vladimirov 2. December 2013 15:35

Manuscript of Publication: PDF (submitted to JINST)

Publication in JINST: doi:10.1088/1748-0221/9/04/P04005

Feature in International Journal of Innovation: p36-38_Valerie_Halyo-LR.pdf (724.08 KB)

In the past few months we have had the pleasure of collaborating with Prof. Valerie Halyo of Princeton University on modernization of a high energy physics application for the needs of the Large Hadron Collider (LHC). The objective of our project is to improve the performance of the trigger at LHC, so as to enable real-time detection of exotic collision event products, such as black holes or jets.

For the numerical algorithm of the new trigger software, the Hough transform was chosen. This method allows fast detection of straight or curved tracks in a set of points (detector hits), which could be the traces of new exotic particles. The numerical Hough transform is highly parallelizable by nature; however, existing implementations either did not use hardware parallelism or used it sub-optimally.
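As a purely illustrative sketch (this is not the optimized trigger code; the function names and binning parameters are hypothetical), the voting structure of a straight-line Hough transform looks like the following. The outer loop over detector hits is the natural target for thread parallelism:

```cpp
// Illustrative Hough-transform voting sketch for straight-line detection.
// Each point (hit) votes for all (theta, r) line parameters consistent
// with it, using the normal form r = x*cos(theta) + y*sin(theta).
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Peak { std::size_t theta_bin, r_bin; };

Peak hough_lines(const std::vector<std::pair<double, double>>& pts,
                 std::size_t n_theta, std::size_t n_r, double r_max) {
    const double PI = 3.14159265358979323846;
    std::vector<int> acc(n_theta * n_r, 0);   // (theta, r) accumulator
    for (const auto& p : pts) {               // parallelizable over hits
        for (std::size_t t = 0; t < n_theta; ++t) {
            double theta = PI * t / n_theta;
            double r = p.first * std::cos(theta) + p.second * std::sin(theta);
            long rb = std::lround((r + r_max) / (2.0 * r_max) * (n_r - 1));
            if (rb >= 0 && rb < static_cast<long>(n_r))
                ++acc[t * n_r + static_cast<std::size_t>(rb)];
        }
    }
    // The most-voted bin corresponds to the detected track parameters
    std::size_t best = 0;
    for (std::size_t i = 1; i < acc.size(); ++i)
        if (acc[i] > acc[best]) best = i;
    return { best / n_r, best % n_r };
}
```

Since points vote independently and the accumulator combines by addition, the loop over hits can be distributed across threads with per-thread accumulators reduced at the end, which is the pattern a thread-parallel implementation exploits.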

Colfax's role in the project was to optimize a thread-parallel implementation of the Hough transform for multi-core processors. The result of our involvement was a code capable of detecting 5000 tracks in a synthetic dataset 250x faster than prior art, on a multi-core desktop CPU. By benchmarking the application on a server based on multi-core Intel Xeon E5 processors, we obtained a further 5x performance gain. The techniques used for optimization, briefly discussed in the paper (see below), are featured in our book on parallel programming and in our developer training program. They focus on code portability across multi- and many-core platforms, with an emphasis on future-proofing the optimized application.

Our results are reported in a publication submitted for peer review to JINST (see link at the top and bottom of this post). Prof. Halyo's work was also featured in an article in International Journal of Innovation, available for download here (courtesy of Prof. Halyo).



Receive our monthly Newsletter to be notified about new Colfax Research publications, educational materials and news on parallel programming. It is completely free, and you can unsubscribe any time.

About Colfax Research

Colfax International provides an arsenal of novel computational tools, which need to be leveraged in order to harness their full power. We collaborate with researchers in science and industry, including our customers, to produce case studies and white papers, and to develop a wide knowledge base of the applications of current and future computational technologies.

This blog will contain a variety of information, from hardware benchmarks and HPC news highlights to discussions of programming issues and reports on research projects carried out in our collaborations. In addition to our in-house research, we will present contributions from authors in academia, industry, and finance, as well as from software developers. Our hope is that this information will be useful to a wide audience interested in innovative computing technologies and their applications.

Author Profiles

Andrey Vladimirov, PhD, is the Head of HPC Research at Colfax International. His primary research interest is the application of modern computing technologies to computationally demanding scientific problems. Prior to joining Colfax, Andrey was involved in theoretical astrophysics research at the Ioffe Institute (Russia), North Carolina State University, and Stanford University (USA), where he studied cosmic rays, collisionless plasmas and the interstellar medium using computer simulations. 

All posts by this author...

Vadim Karpusenko, PhD, is a Principal HPC Research Engineer at Colfax International. His research interests are in the area of physical modeling with HPC clusters, highly parallel architectures, and code optimization. Vadim holds a PhD in Physics from North Carolina State University for his computational research of the free energy and stability of helical secondary structures of proteins.

All posts by this author...