TY - JOUR
T1 - Hybrid OpenMP/AVX acceleration of a higher order quiet direct simulation method for the Euler equations
AU - Smith, Matthew R.
AU - Liu, Ji Yueh
AU - Kuo, Fang An
AU - Wu, Jong Shin
PY - 2013
Y1 - 2013
AB - Presented is the Quiet Direct Simulation (QDS) method applied to parallel computation using a hybrid OpenMP/AVX parallelization paradigm. Owing to the high locality of the QDS scheme, the method has been successfully applied to parallel computation on Graphics Processing Units (GPUs); we show here that the same principles that enable high performance on GPU devices also permit high performance with Advanced Vector Extensions (AVX). Furthermore, since modern CPUs have a large number of cores, performance can be extended further by using AVX on every available CPU core together with shared-memory (OpenMP) parallelization. We present a simple direction-split higher-order extension to the QDS method and then apply AVX to it through intrinsic functions in the flux computation and state computation modules. High performance is obtained by ensuring that all flux computations are performed using only AVX intrinsic functions, so that none are performed serially. With this approach, a single workstation with two Xeon CPUs (16 physical cores) achieves a speedup of over 177 times relative to a single core. By examining the generated assembly code, we also demonstrate that built-in compiler optimization does not fully exploit AVX parallelization.
UR - http://www.scopus.com/inward/record.url?scp=84891707934&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84891707934&partnerID=8YFLogxK
U2 - 10.1016/j.proeng.2013.07.108
DO - 10.1016/j.proeng.2013.07.108
M3 - Conference article
AN - SCOPUS:84891707934
SN - 1877-7058
VL - 61
SP - 152
EP - 157
JO - Procedia Engineering
JF - Procedia Engineering
T2 - 25th International Conference on Parallel Computational Fluid Dynamics, ParCFD 2013
Y2 - 20 May 2013 through 24 May 2013
ER -