Comparative Analysis and Implementation of Parallel Tridiagonal Solvers

Team Members: Tanisha Mehta (50%), May Paek (50%)

Link to GitHub Repository

Project Overview

We implemented and benchmarked multiple parallel algorithms for solving tridiagonal systems of equations, targeting both multi-core CPUs (MPI and OpenMP) and NVIDIA GPUs (CUDA). Our deliverables include an MPI-based Brugnano block-SPIKE solver, an MPI-based recursive doubling solver, and an MPI-based differential equation solver, as well as OpenMP versions of the Brugnano and recursive doubling algorithms. On the GPU side, we developed Cyclic Reduction (CR) and Parallel Cyclic Reduction (PCR) solvers, plus a hybrid PCR/CR solver, in CUDA. We also implemented the sequential Thomas algorithm as a baseline for correctness and performance comparison. We will present performance, scalability, and trade-off analyses of the MPI, OpenMP, and CUDA implementations, highlighting the balance between communication and computation. All implementations were tested on multi-core CPU machines and NVIDIA GPUs (RTX 3080 Ti), and we plan to showcase our results and insights at the parallelism competition.
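For reference, the sequential Thomas baseline is a forward elimination sweep followed by back substitution, running in O(n) time. The C sketch below illustrates the structure all of the parallel solvers are compared against; the array naming convention (a = subdiagonal, b = main diagonal, c = superdiagonal, d = right-hand side) is ours, not necessarily the repository's.

    /* Minimal sketch of the sequential Thomas baseline. Assumes the
     * system is diagonally dominant, so no pivoting is needed; the
     * solution overwrites d, and b is clobbered along the way. */
    void thomas_solve(int n, const double *a, double *b,
                      const double *c, double *d) {
        /* Forward elimination: zero out the subdiagonal. */
        for (int i = 1; i < n; i++) {
            double w = a[i] / b[i - 1];
            b[i] -= w * c[i - 1];
            d[i] -= w * d[i - 1];
        }
        /* Back substitution. */
        d[n - 1] /= b[n - 1];
        for (int i = n - 2; i >= 0; i--) {
            d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
        }
    }

The same routine doubles as the correctness oracle: each parallel solver's output can be checked element-wise against it.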

Detailed Proposal

Project Schedule

Week / Date Range: Tasks
Week 1 (Mar 25–31): Finalize project plan, set up the GitHub repository, and start the sequential Thomas implementation
Week 2 (Apr 1–7): Finish and test the sequential solver; start the Brugnano MPI implementation
Week 3 (Apr 8–14): Complete the MPI implementation; finish the parallel full-recursive-doubling factorization; begin the differential equation implementation; benchmark against the sequential baseline
Week 4 (Apr 15–18):
  • Discretize a sample differential equation using the finite difference method (Tanisha)
  • Define appropriate boundary and initial conditions (Tanisha)
  • Generate the corresponding tridiagonal system and confirm it has the correct form (May); see the model-problem sketch after this list
  • Run and verify numerical output against known PDE solutions (May)
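As an illustration of the discretization tasks above, the sketch below assembles the tridiagonal system for a model problem, -u''(x) = f(x) on (0, 1) with Dirichlet boundary conditions, using second-order central differences. The specific equation, boundary values, and function names are illustrative assumptions, not the project's final choices.

    /* Hedged sketch: build the tridiagonal system for -u'' = f on (0,1)
     * with u(0) = u0 and u(1) = u1, discretized on n interior points.
     * Fills the a/b/c/d arrays in the convention used by the solvers. */
    void build_model_system(int n, double u0, double u1,
                            double (*f)(double),
                            double *a, double *b, double *c, double *d) {
        double h = 1.0 / (n + 1);      /* grid spacing */
        for (int i = 0; i < n; i++) {
            double x = (i + 1) * h;    /* interior grid point */
            a[i] = -1.0;               /* subdiagonal (unused at i = 0) */
            b[i] =  2.0;               /* main diagonal */
            c[i] = -1.0;               /* superdiagonal (unused at i = n-1) */
            d[i] = h * h * f(x);       /* scaled right-hand side */
        }
        d[0]     += u0;                /* fold boundary values into the RHS */
        d[n - 1] += u1;
    }

Choosing an f with a known closed-form solution turns the verification task above into a direct element-wise comparison.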
Week 4 (Apr 19–21):
  • Implement the PTA forward substitution CUDA kernel (May & Tanisha)
  • Implement the PTA backward substitution CUDA kernel (May & Tanisha); see the combined batched sketch after this list
  • Optimize both kernels for coalesced memory access and shared memory usage (Tanisha)
  • Benchmark the kernels on increasing matrix sizes; verify correctness against the sequential solver (May)
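One common way to realize a parallel Thomas algorithm (PTA) on a GPU is to batch many independent systems and assign one thread per system. The kernel below is a minimal sketch under that assumption, before any coalescing or shared-memory optimization; the contiguous per-system layout and the choice to fuse the forward and backward sweeps into one kernel are ours.

    /* Hedged sketch of a batched PTA kernel: thread s owns system s,
     * stored contiguously at offset s * n. The solution overwrites d. */
    __global__ void pta_solve_batch(int n, int batch,
                                    const double *a, double *b,
                                    const double *c, double *d) {
        int s = blockIdx.x * blockDim.x + threadIdx.x;
        if (s >= batch) return;
        int off = s * n;
        /* Forward elimination within this thread's own system. */
        for (int i = 1; i < n; i++) {
            double w = a[off + i] / b[off + i - 1];
            b[off + i] -= w * c[off + i - 1];
            d[off + i] -= w * d[off + i - 1];
        }
        /* Back substitution. */
        d[off + n - 1] /= b[off + n - 1];
        for (int i = n - 2; i >= 0; i--) {
            d[off + i] = (d[off + i] - c[off + i] * d[off + i + 1])
                         / b[off + i];
        }
    }

Contiguous per-system storage is simple but uncoalesced: neighboring threads stride n doubles apart. The optimization task above would interleave the systems so that neighboring threads touch neighboring addresses.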
Week 5 (Apr 22–24):
  • If still ahead of schedule, implement the Cyclic Reduction (CR) forward reduction step as a CUDA kernel (May & Tanisha)
  • Implement the back-substitution step for CR (May & Tanisha)
  • If still ahead of schedule, implement a Parallel Cyclic Reduction (PCR) version with a fully parallel reduction (May & Tanisha); see the PCR step sketch after this list
  • Test correctness and performance of CR and PCR on power-of-2 and non-power-of-2 matrix sizes (May & Tanisha)
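For reference, one PCR reduction step lets every equation i eliminate its coupling to equations i - s and i + s in parallel; after roughly log2(n) steps each equation involves a single unknown and solves directly. The kernel below is a hedged sketch of one such step for a single system; the ping-pong buffer arrangement and the identity-row treatment of out-of-range neighbors are our assumptions.

    /* Hedged sketch of one PCR step at stride s, reading rows from
     * a/b/c/d and writing reduced rows to a2/b2/c2/d2. Out-of-range
     * neighbors act as identity rows (b = 1, other coefficients 0). */
    __global__ void pcr_step(int n, int s,
                             const double *a, const double *b,
                             const double *c, const double *d,
                             double *a2, double *b2,
                             double *c2, double *d2) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        double am = (i - s >= 0) ? a[i - s] : 0.0;
        double bm = (i - s >= 0) ? b[i - s] : 1.0;
        double cm = (i - s >= 0) ? c[i - s] : 0.0;
        double dm = (i - s >= 0) ? d[i - s] : 0.0;
        double ap = (i + s < n) ? a[i + s] : 0.0;
        double bp = (i + s < n) ? b[i + s] : 1.0;
        double cp = (i + s < n) ? c[i + s] : 0.0;
        double dp = (i + s < n) ? d[i + s] : 0.0;
        double alpha = -a[i] / bm;    /* cancels the i - s unknown */
        double beta  = -c[i] / bp;    /* cancels the i + s unknown */
        a2[i] = alpha * am;
        b2[i] = b[i] + alpha * cm + beta * ap;
        c2[i] = beta * cp;
        d2[i] = d[i] + alpha * dm + beta * dp;
    }

On the host side the step is launched repeatedly with s = 1, 2, 4, ..., swapping the two buffer sets between launches. CR applies the same row update only to even-indexed equations, which is what makes a hybrid PCR/CR split natural.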
Week 5 (Apr 25–28):
  • Collect final benchmark results across the sequential, MPI, PTA, CR, and PCR solvers (May & Tanisha)
  • Generate visualizations: timing plots, speedup graphs, and memory usage comparisons (Tanisha)
  • Finalize the technical report, poster content, and GitHub README (May)
  • Submit the final paper and upload all materials (May & Tanisha)