Cache Traffic Optimization on Intel Xeon Phi Coprocessors for Parallel In-Place Square Matrix Transposition with Intel Cilk Plus and OpenMP

by Andrey Vladimirov 25. April 2013 10:05

Complete paper: PDF logo Colfax_Transposition_Xeon_Phi.pdf (602 kb)

Follow-up publication:

Numerical algorithms sensitive to the performance of processor caches can be optimized by increasing the locality of data access. Loop tiling and recursive divide-and-conquer are common methods for cache traffic optimization. This paper studies the applicability of these optimization methods in the Intel Xeon Phi architecture for the in-place square matrix transposition operation. Optimized implementations in the Intel Cilk Plus and OpenMP frameworks are presented and benchmarked. Cache-oblivious nature of the recursive algorithm is compared to the tunable character of the tiled method. Results show that Intel Xeon Phi coprocessors transpose large matrices faster than the host system, however, smaller matrices are more efficiently transposed by the host. On the coprocessor, the Intel Cilk Plus framework excels for large matrix sizes, but incurs a significant parallelization overhead for smaller sizes. Transposition of smaller matrices on the coprocessor is faster with OpenMP.


  1. If you are interested in this paper, make sure to also read a follow-up publication (improved results, downloadable public code) available at
  2. Note that in the present paper, the rate of transposition in GB/s is calculated as the matrix size divided by the transposition time. In the follow-up paper and most other publications on the subject, this ratio is further multiplied by 2x. Multiply the transposition rate reported in this paper by 2x in order to compare it to the follow-up results.

Complete paper: PDF logo Colfax_Transposition_Xeon_Phi.pdf (602 kb)

Follow-up publication:

Tags: , , , , , , ,

Add comment

  Country flag

  • Comment
  • Preview


Receive our monthly Newsletter to be notified about new Colfax Research publications, educational materials and news on parallel programming. It is completely free, and you can unsubscribe any time.

About Colfax Research

Colfax International provides an arsenal of novel computational tools, which need to be leveraged in order to harness their full power. We are collaborating with researchers in science and industry, including our customers, to produce case studies, white papers, and develop a wide knowledge base of the applications of current and future computational technologies.

This blog will contain a variety of information, from hardware benchmarks and HPC news highlights, to discussions of programming issues and reports on research projects carried out in our collaborations. In addition to our in-house research, we will present contributions from authors in the academia, industry and finance, as well as software developers. Our hope is that this information will be useful to a wide audience interested in innovative computing technologies and their applications.

Author Profile: Andrey Vladimirov

Andrey Vladimirov, PhD, is the Head of HPC Research at Colfax International. His primary research interest is the application of modern computing technologies to computationally demanding scientific problems. Prior to joining Colfax, Andrey was involved in theoretical astrophysics research at the Ioffe Institute (Russia), North Carolina State University, and Stanford University (USA), where he studied cosmic rays, collisionless plasmas and the interstellar medium using computer simulations. 

All posts by this author...

Author Profile:
Vadim Karpusenko

Vadim Karpusenko, PhD, is a Principal HPC Research Engineer at Colfax International. His research interests are in the area of physical modeling with HPC clusters, highly parallel architectures, and code optimization. Vadim holds a PhD in Physics from North Carolina State University for his computational research of the free energy and stability of helical secondary structures of proteins.

All posts by this author...

Author Profile:
Ryo Asai

Ryo Asai is a Researcher at Colfax International. He develops optimization methods for scientific applications targeting emerging parallel computing platforms, computing accelerators and interconnect technologies. Ryo holds a B.S. degree in Physics from University of California, Berkeley.

All posts by this author...