LSQR (Sparse Equations and Least Squares) is a widely used Krylov subspace method for solving large-scale linear systems in seismic tomography. This paper presents a parallel MPI-CUDA implementation of the LSQR solver. At the CUDA level, our contributions are: (1) using CUBLAS and CUSPARSE to compute the major steps of LSQR; (2) optimizing memory copies between host memory and device memory; (3) developing a CUDA kernel that performs the transpose SpMV (sparse matrix-vector multiplication) without transposing the matrix in memory or keeping an additional copy of it. At the MPI level, our contributions are: (1) decomposing both the matrix and the vector to increase parallelism; (2) designing a static load balancing strategy. In our experiments, the single-GPU code achieves up to 17.6× speedup with 15.7 GFlops in single precision and 15.2× speedup with 12.0 GFlops in double precision compared with the original serial CPU code. The MPI-GPU code achieves up to 3.7× speedup with 268 GFlops in single precision and 3.8× speedup with 223 GFlops in double precision on 135 MPI tasks compared with the corresponding MPI-CPU code. The MPI-GPU code scales in both strong and weak scaling tests. In addition, our parallel implementations outperform the LSQR subroutine in the PETSc library.
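The idea behind a transpose SpMV that never forms the transposed matrix can be illustrated with a small serial sketch. It assumes the matrix is stored in CSR (compressed sparse row) format, which the abstract does not state, and the function name `csr_transpose_spmv` and its parameters are hypothetical. This is not the authors' CUDA kernel, only a plain Python illustration of computing y = Aᵀx by scattering each row's contributions into the output.

```python
def csr_transpose_spmv(n_cols, row_ptr, col_idx, values, x):
    """Return y = A^T x for a CSR matrix A without building A^T.

    Assumption: A is in CSR form (row_ptr, col_idx, values); the
    storage format used in the paper is not given in the abstract.
    """
    y = [0.0] * n_cols
    n_rows = len(row_ptr) - 1
    for i in range(n_rows):
        xi = x[i]
        # Entry A[i][j] contributes A[i][j] * x[i] to y[j],
        # so each row of A is scattered into y by column index.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[col_idx[k]] += values[k] * xi
    return y


# Example matrix:
# A = [[1, 0, 2],
#      [0, 3, 0]]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
values = [1.0, 2.0, 3.0]

y = csr_transpose_spmv(3, row_ptr, col_idx, values, [10.0, 100.0])
print(y)
```

If rows were processed in parallel, as they would be on a GPU, several rows could update the same entry of `y` at once, so a parallel version would need atomic additions or some other conflict-free scheme. The abstract does not say which approach the authors take.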
|Pages (from - to)||76-85|
|Journal||Procedia Computer Science|
|Publication status||Published - 2012|
|Event||12th Annual International Conference on Computational Science, ICCS 2012 - Omaha, NB, United States|
Duration: 4 Jun 2012 → 6 Jun 2012