Second Edition of "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors"

by Andrey Vladimirov 19. May 2015 10:48

We did it! The second edition of our book, "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors", is now available.


Interview with James Reinders: future of Intel MIC architecture, parallel programming, education

by Vadim Karpusenko 5. March 2015 15:35

A few weeks ago we recorded our conversation with James Reinders, the Director and Chief Evangelist at Intel Corporation. We discussed the future of parallel programming and of Intel MIC architecture products: Intel Xeon Phi coprocessors, Knights Landing (KNL), and the future 3rd generation, Knights Hill (KNH). We also talked about how students can learn parallel programming and optimization for high performance applications.

Watch the whole interview by clicking the player above, or jump straight to one of the questions in the list below.

  1. James Reinders and his role at Intel. - 00:47
  2. Why are parallel programming and code modernization important? - 01:49
  3. Brief introduction to MIC architecture and Xeon Phi coprocessors. - 04:03
  4. What type of applications benefit from MIC architecture? - 07:16
  5. How to approach porting your code for MIC architecture? - 09:58
  6. What is new in Knights Landing? - 15:24
  7. Details of chip design of Knights Landing. - 19:54
  8. 3rd MIC generation - Knights Hill. - 21:16
  9. How to future-proof my code? - 23:15
  10. High bandwidth memory on KNL. - 27:35
  11. Details on James Reinders’ books. - 29:59
  12. Future of Parallel Programming. - 34:37
  13. New parallel programming languages? - 38:16
  14. Future of the parallel libraries. - 40:01
  15. How to learn parallel programming? - 45:22
  16. Colfax Developer Training. - 48:20


Scientific Computing with Intel Xeon Phi Coprocessors

by Andrey Vladimirov 4. February 2015 07:56

I had the privilege of giving a presentation at the HPC Advisory Council Stanford Conference 2015. Thanks to insideHPC, a recording of this presentation is available on YouTube. Slides are available here.

If you are interested in individual case studies mentioned in the talk, here they are:

Paper: 2013a, 2013b Papers: 2013, 2014 Paper: 2013 Paper: 2014


Fluid Dynamics with Fortran on Intel Xeon Phi coprocessors

by Andrey Vladimirov 4. February 2015 07:06

In this demonstration, a Colfax ProEdge™ SXP8400 workstation runs a shallow water flow solver to show CFD acceleration with Intel Xeon Phi coprocessors. The key feature is that exactly the same source code is used to compile the MPI executables for the Intel Xeon E5-2697 V3 processor and for Intel Xeon Phi 7120A coprocessors. The code is written in Fortran with OpenMP and MPI. For performance results with this code in a MIC-enabled cluster, see the companion paper.


Performance to Power and Performance to Cost Ratios with Intel Xeon Phi Coprocessors (and why 1x Acceleration May Be Enough)

by Andrey Vladimirov 27. January 2015 16:45

Complete paper: Colfax_1x.pdf (321.39 kb)

The paper studies two performance metrics of systems enabled with Intel Xeon Phi coprocessors: the ratio of performance to consumed electrical power and the ratio of performance to system purchase cost, both under the assumption of linear parallel scalability of the application.

Performance to power values are measured for three workloads: a compute-bound workload (DGEMM), a memory bandwidth-bound workload (STREAM), and a latency-limited workload (small matrix LU decomposition). Performance to cost ratios are computed, using system configurations and prices available at Colfax International, as functions of the acceleration factor and of the number of coprocessors per system. The cost study considers hypothetical applications with acceleration factors from 0.35x to 2x.

In all studies, systems with Intel Xeon Phi coprocessors yield better metrics than systems with only Intel Xeon processors. That applies even with an acceleration factor of 1x, as long as the application can be distributed between the CPU and the coprocessor.
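The reasoning behind the 1x result can be illustrated with a minimal sketch. It assumes, as the paper does, linear scalability, and it further assumes that the acceleration factor means the performance of one coprocessor relative to the CPU-only system. The function name and both prices are hypothetical and are not taken from the paper.

```python
def perf_per_cost(n_coprocessors, acceleration, base_cost, coprocessor_cost):
    """Relative performance per unit cost under linear scaling.

    The CPU-only system is assigned performance 1.0, and each
    coprocessor is assumed to add `acceleration` times that value.
    """
    performance = 1.0 + n_coprocessors * acceleration
    cost = base_cost + n_coprocessors * coprocessor_cost
    return performance / cost


# Hypothetical prices for illustration only (not from the paper).
BASE_COST = 10000.0        # CPU-only system
COPROCESSOR_COST = 3000.0  # one coprocessor

cpu_only = perf_per_cost(0, 1.0, BASE_COST, COPROCESSOR_COST)
one_card_1x = perf_per_cost(1, 1.0, BASE_COST, COPROCESSOR_COST)

# In this simple model, adding a coprocessor pays off whenever
# acceleration > coprocessor_cost / base_cost.
break_even = COPROCESSOR_COST / BASE_COST
```

With these made-up prices, a coprocessor that merely matches the CPU (1x) doubles the performance while raising the cost by less than a factor of two, so the ratio improves. The actual figures depend on the real configurations and prices discussed in the paper.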



Fine-Tuning Vectorization and Memory Traffic on Intel Xeon Phi Coprocessors: LU Decomposition of Small Matrices

by Andrey Vladimirov 27. January 2015 16:30

Complete paper: Colfax_LU.pdf (604.37 kb)

Common techniques for fine-tuning the performance of automatically vectorized loops in applications for Intel Xeon Phi coprocessors are discussed. These techniques include strength reduction, regularizing the vectorization pattern, data alignment and aligned-data hints, and pointer disambiguation. In addition, the loop tiling technique for tuning memory traffic is shown. The optimization methods are illustrated on an example of single-threaded LU decomposition of a single-precision matrix of size 128x128.

Benchmarks show that the discussed optimizations improve the performance on the coprocessor by a factor of 2.8 compared to the unoptimized code, and by a factor of 1.7 on the multi-core host system, achieving roughly the same performance on the host and on the coprocessor.
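One of the listed techniques, strength reduction, can be sketched on a textbook LU decomposition. The example below is an illustration in Python, not the code from the paper: it uses the Doolittle scheme without pivoting, and the function names are hypothetical. The division by the pivot is replaced with a single reciprocal per column followed by multiplications.

```python
def lu_in_place(a):
    """Doolittle LU decomposition without pivoting, overwriting `a`.

    Afterwards the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle, including the diagonal, holds U.
    Assumes all pivots are nonzero.
    """
    n = len(a)
    for k in range(n):
        # Strength reduction: one division per column instead of
        # one division per row below the pivot.
        inv_pivot = 1.0 / a[k][k]
        row_k = a[k]
        for i in range(k + 1, n):
            row_i = a[i]
            row_i[k] *= inv_pivot  # multiply instead of divide
            l_ik = row_i[k]
            # Innermost loop with unit stride: the kind of loop a
            # compiler would vectorize in a compiled language.
            for j in range(k + 1, n):
                row_i[j] -= l_ik * row_k[j]
    return a


def lu_product(lu):
    """Multiply the packed L and U factors back together (for checking)."""
    n = len(lu)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]
                total += l_ik * lu[k][j]
            result[i][j] = total
    return result
```

Calling `lu_product(lu_in_place(copy_of_matrix))` should reproduce the original matrix up to rounding. Multiplying by a reciprocal can differ from true division in the last bits, which is the usual trade-off of this optimization.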

The code discussed in the paper can be freely downloaded from this page.



Crash Course on Programming and Optimization with Intel Xeon Phi Coprocessors at SC14

by Andrey Vladimirov 16. November 2014 07:33

Colfax-Intro.pdf (10.25 mb) - Part 1: Introduction, Programming Models

Colfax-Optimization.pdf (9.01 mb) - Part 2: Optimization Techniques

Programming and optimization of applications for Intel Xeon Phi processors will be discussed in more than ten presentations in four concurrent track sessions at the Intel HPC Developer Conference at SC14 in New Orleans, LA on November 16, 2014.

Colfax has contributed two of these presentations: one is a crash course on the applicability domain and programming models for Intel Xeon Phi coprocessors, and the other is a demonstration of optimizing an N-body simulation for coprocessors at the node level and the cluster level. Slides of our presentations can be downloaded from this page. Stay tuned for an upcoming Colfax Research paper with downloadable code for the example demonstrated in our slides.

If you are attending SC14 in New Orleans, visit us at Colfax's booth 1047 and also at the Intel Channel Pavilion.



Pilot episode of video course "Parallel programming and optimization with Intel Xeon Phi coprocessors". We want your feedback!

by Vadim Karpusenko 31. October 2014 13:45

We are preparing an online video version of our course "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors". Below you will find a pilot episode of this course. It comes in two parts. In the first part, we give a theoretical overview of using the strip-mining technique to make automatic vectorization possible. In the second part, we demonstrate a practical exercise where we use this technique to optimize an example application for the MIC architecture.
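For readers unfamiliar with strip-mining, the loop transformation itself can be sketched as follows. This is an illustration in Python rather than the course's exercise code: the function names are hypothetical, and the strip width of 16 is an assumption matching the number of single-precision lanes in a 512-bit vector register. A single loop is split into an outer loop over strips and a short inner loop over the elements of one strip.

```python
STRIP = 16  # assumed strip width: 16 single-precision lanes in 512 bits


def scale_plain(x, factor):
    """Reference version: one flat loop over all elements."""
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = factor * x[i]
    return out


def scale_strip_mined(x, factor, strip=STRIP):
    """Strip-mined version: same result, two nested loops."""
    n = len(x)
    out = [0.0] * n
    # Outer loop: one iteration per strip.
    for start in range(0, n, strip):
        # The last strip may be shorter when n is not a multiple of strip.
        end = min(start + strip, n)
        # Inner loop: short and contiguous, the part a compiler
        # would be expected to vectorize in a compiled language.
        for i in range(start, end):
            out[i] = factor * x[i]
    return out
```

Both functions return the same values for any input length. In a compiled language, the benefit comes from giving the compiler a short, regular inner loop to vectorize while other work (for example, thread parallelism or reductions per strip) is attached to the outer loop.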

We would like to hear your feedback on the pilot episode as well as general suggestions on developing such a course to better fit the needs of the HPC community. If you watch the video, please take a few minutes to leave your comments here:

Thank you!

Lecture part:

Practical part:


Slide Deck for Intel Xeon Phi Coprocessor Programming Training

by Andrey Vladimirov 13. October 2014 15:57

We are making publicly available the slide deck (280 pages) of the Colfax developer training titled "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors" (click for download).

This training is an intensive course for developers wishing to leverage the Intel MIC architecture. It is also useful for multi-core processor programming. The course is based on a book of the same name, which contains targeted exercises ("labs") for hands-on practicum.

This year, "Parallel Programming and Optimization..." has visited over 40 locations across the United States: research institutions, government labs, universities, and regional trainings. Over 700 students attended the course. Many of these events were free to attendees thanks to Intel's sponsorship.

Check back with us at the end of the year for a schedule of additional regional trainings, and for the second edition of the book featuring information on future manycore architectures, cluster configuration, networking on Xeon Phi with InfiniBand, usage of the latest compilers and driver stack, new practical exercises, computer display-friendly page format, and more.


Intel Cilk Plus for Complex Parallel Algorithms: "Enormous Fast Fourier Transforms" (EFFT) Library

by Ryo Asai 18. September 2014 15:42

Electronic preprint: Colfax_EFFT.pdf (724 kb), arXiv:1409.5757

In this paper we demonstrate the methodology for parallelizing the computation of large one-dimensional discrete fast Fourier transforms (DFFTs) on multi-core Intel Xeon processors. DFFTs based on the recursive Cooley-Tukey method have to control cache utilization, memory bandwidth and vector hardware usage, and at the same time scale across multiple threads or compute nodes. Our method builds on the single-threaded Intel Math Kernel Library (MKL) implementation of DFFT and uses the Intel Cilk Plus framework for thread parallelism. We demonstrate the ability of Intel Cilk Plus to handle parallel recursion with nested loop-centric parallelism without tuning the code to the number of cores or cache metrics. The result of our work is a library called EFFT that performs 1D DFTs of size 2^N for N>=21 faster than the corresponding Intel MKL parallel DFT implementation by up to 1.5x, and faster than FFTW by up to 2.5x. The code of EFFT is available for free download under the GPLv3 license. This work provides a new efficient DFFT implementation and, at the same time, serves as an educational example of how computer science problems with complex parallel patterns can be optimized for high performance using the Intel Cilk Plus framework.
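The recursive structure that the abstract refers to can be seen in a textbook radix-2 Cooley-Tukey transform. The sketch below is a serial Python illustration, not the EFFT algorithm (which builds on MKL and Cilk Plus); the function names are hypothetical, and the input length is assumed to be a power of two. The two recursive calls work on disjoint halves of the data, which is the kind of independent work a task-parallel framework could run concurrently.

```python
import cmath


def fft_recursive(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    # The two sub-transforms are independent of each other, so in a
    # task-parallel setting they are candidates for parallel recursion.
    even = fft_recursive(x[0::2])
    odd = fft_recursive(x[1::2])
    half = n // 2
    out = [0j] * n
    # Butterfly loop: combines the two half-size results.
    for k in range(half):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + half] = even[k] - twiddle
    return out


def dft_naive(x):
    """Direct O(n^2) discrete Fourier transform, used as a reference."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
```

Comparing `fft_recursive` against `dft_naive` on a small input is a quick sanity check. The sketch ignores the cache, bandwidth, and vectorization concerns that the paper addresses.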




About Colfax Research

Colfax International provides an arsenal of novel computational tools, which need to be leveraged skillfully in order to harness their full power. We are collaborating with researchers in science and industry, including our customers, to produce case studies and white papers and to develop a wide knowledge base of the applications of current and future computational technologies.

This blog will contain a variety of information, from hardware benchmarks and HPC news highlights to discussions of programming issues and reports on research projects carried out in our collaborations. In addition to our in-house research, we will present contributions from authors in academia, industry and finance, as well as from software developers. Our hope is that this information will be useful to a wide audience interested in innovative computing technologies and their applications.

Author Profile: Andrey Vladimirov

Andrey Vladimirov, PhD, is the Head of HPC Research at Colfax International. His primary research interest is the application of modern computing technologies to computationally demanding scientific problems. Prior to joining Colfax, Andrey was involved in theoretical astrophysics research at the Ioffe Institute (Russia), North Carolina State University, and Stanford University (USA), where he studied cosmic rays, collisionless plasmas and the interstellar medium using computer simulations. 


Author Profile: Vadim Karpusenko

Vadim Karpusenko, PhD, is a Principal HPC Research Engineer at Colfax International. His research interests are in the area of physical modeling with HPC clusters, highly parallel architectures, and code optimization. Vadim holds a PhD in Physics from North Carolina State University for his computational research of the free energy and stability of helical secondary structures of proteins.


Author Profile: Ryo Asai

Ryo Asai is a Researcher at Colfax International. He develops optimization methods for scientific applications targeting emerging parallel computing platforms, computing accelerators and interconnect technologies. Ryo holds a B.S. degree in Physics from University of California, Berkeley.
