An MPI-CUDA implementation and optimization for parallel Sparse Equations and Least Squares (LSQR)

He Huang, Liqiang Wang, En Jui Lee, Po Chen

Research output: Contribution to journal › Conference article › peer-review

25 Citations (Scopus)


LSQR (Sparse Equations and Least Squares) is a widely used Krylov subspace method for solving large-scale linear systems in seismic tomography. This paper presents a parallel MPI-CUDA implementation of the LSQR solver. At the CUDA level, our contributions include: (1) utilizing CUBLAS and CUSPARSE to compute the major steps in LSQR; (2) optimizing memory copies between host memory and device memory; (3) developing a CUDA kernel that performs transpose SpMV without transposing the matrix in memory or keeping an additional copy. At the MPI level, our contributions include: (1) decomposing both the matrix and the vectors to increase parallelism; (2) designing a static load balancing strategy. In our experiments, the single-GPU code achieves up to 17.6× speedup with 15.7 GFlops in single precision and 15.2× speedup with 12.0 GFlops in double precision compared with the original serial CPU code. The MPI-GPU code achieves up to 3.7× speedup with 268 GFlops in single precision and 3.8× speedup with 223 GFlops in double precision on 135 MPI tasks compared with the corresponding MPI-CPU code. The MPI-GPU code scales in both strong and weak scaling tests. In addition, our parallel implementations perform better than the LSQR subroutine in the PETSc library.
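Each LSQR iteration needs both a product with the matrix and a product with its transpose, which is why a transpose SpMV that reuses the stored matrix matters. The following is a minimal CPU-side sketch of that idea, not the authors' CUDA kernel: it assumes the matrix is held in CSR format (the abstract does not state the storage format), and the function names `csr_spmv` and `csr_spmv_transpose` are hypothetical.

```python
def csr_spmv(n_rows, row_ptr, col_idx, vals, x):
    """Compute y = A x for A stored in CSR (one gather per row)."""
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y


def csr_spmv_transpose(n_cols, row_ptr, col_idx, vals, y):
    """Compute x = A^T y from the CSR arrays of A, without building A^T."""
    x = [0.0] * n_cols
    n_rows = len(row_ptr) - 1
    for i in range(n_rows):
        yi = y[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Scatter: entry A[i][j] contributes to x[j].
            # Assumption: a GPU version processing rows concurrently would
            # need an atomic add (or a reduction) for this update.
            x[col_idx[k]] += vals[k] * yi
    return x


# Small illustrative matrix (made up for this sketch):
# A = [[1, 0, 2],
#      [0, 3, 0]]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]

Ax = csr_spmv(2, row_ptr, col_idx, vals, [1.0, 1.0, 1.0])
ATy = csr_spmv_transpose(3, row_ptr, col_idx, vals, [1.0, 2.0])
```

The transpose product walks the same arrays as the forward product and only swaps the roles of the row and column indices, so no second copy of the matrix is stored. The trade-off is that writes become scattered, which is the part a real GPU kernel has to handle carefully.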

Original language: English
Pages (from-to): 76-85
Number of pages: 10
Journal: Procedia Computer Science
Publication status: Published - 2012
Event: 12th Annual International Conference on Computational Science, ICCS 2012 - Omaha, NB, United States
Duration: 2012 Jun 4 to 2012 Jun 6

All Science Journal Classification (ASJC) codes

  • General Computer Science

