Comparative Analysis and Implementation of Parallel Tridiagonal Solvers

Team Members: Tanisha Mehta, May Paek

Link to GitHub Repository

Project Overview

We are implementing and benchmarking multiple parallel algorithms for solving tridiagonal systems of equations across several targets: a classic serial baseline, multi-core CPUs (MPI), and NVIDIA GPUs (CUDA). Specifically, we will implement the sequential Thomas algorithm, Paramveer Singh's MPI-based method (developed at LLNL), a differential equation-based method, the Parallel Thomas Algorithm (PTA) optimized for coalesced memory access, and, optionally, the Parallel Cyclic Reduction (PCR) method. The goal is to compare the strengths, performance, and trade-offs of each method.
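
For reference, here is a minimal sketch of the sequential Thomas algorithm that serves as our baseline; the in-place convention and array names are illustrative, not a description of our final implementation.

```cpp
#include <vector>

// Solve a tridiagonal system in place with the Thomas algorithm.
// a: sub-diagonal (a[0] unused), b: main diagonal, c: super-diagonal
// (c[n-1] unused), d: right-hand side, overwritten with the solution.
// Assumes the system is diagonally dominant (no pivoting is performed).
void thomasSolve(std::vector<double>& a, std::vector<double>& b,
                 std::vector<double>& c, std::vector<double>& d) {
    const std::size_t n = d.size();
    // Forward elimination: remove the sub-diagonal.
    for (std::size_t i = 1; i < n; ++i) {
        const double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }
    // Back substitution.
    d[n - 1] /= b[n - 1];
    for (std::size_t i = n - 1; i-- > 0; ) {
        d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
    }
}
```

Each step of both sweeps depends on the previous row, which is exactly why the serial algorithm does not parallelize directly and motivates the MPI and CUDA variants above.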

Detailed Proposal

Read the full project proposal here: Detailed Proposal.

Read our halfway milestone report here: Project Milestone Report.

Project Schedule

Week (Date Range): Tasks
Week 1 (Mar 25–31): Finalize project plan, set up GitHub, start the sequential Thomas implementation
Week 2 (Apr 1–7): Finish and test the sequential solver; start the Paramveer MPI implementation
Week 3 (Apr 8–14): Complete the MPI implementation; finish the parallel full-recursive-doubling factorization; begin the differential equation implementation; benchmark against the sequential solver
Week 4 (Apr 15–18)
  • Discretize a sample differential equation using finite difference method (Tanisha)
  • Define appropriate boundary and initial conditions (Tanisha)
  • Generate the corresponding tridiagonal system and confirm correct form (May)
  • Run and verify numerical output on known PDE solutions (May)
Week 4 (Apr 19–21)
  • Implement PTA forward substitution CUDA kernel (May & Tanisha)
  • Implement PTA backward substitution CUDA kernel (May & Tanisha)
  • Optimize both kernels for coalesced memory access and shared memory usage (Tanisha); see the layout sketch after this schedule
  • Benchmark kernel on increasing matrix sizes; verify correctness vs sequential (May)
Week 5 (Apr 22–24)
  • If still ahead of schedule, implement Cyclic Reduction (CR) forward reduction step as CUDA kernel (May & Tanisha)
  • Implement back-substitution step for CR (May & Tanisha)
  • If still ahead of schedule, implement Parallel Cyclic Reduction (PCR) version with fully parallel reduction (May & Tanisha)
  • Test correctness and performance of CR and PCR on power-of-2 and non-power-of-2 sized matrices (May & Tanisha)
Week 5 (Apr 25–28)
  • Collect final benchmark results across sequential, MPI, PTA, CR, and PCR (May & Tanisha)
  • Generate visualizations: timing plots, speedup graphs, memory usage comparisons (Tanisha)
  • Finalize technical report, poster content, and README for GitHub (May)
  • Submit final paper and upload all materials (May & Tanisha)
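
As noted in the Week 4 kernel tasks above, the main PTA optimization we are targeting is the memory layout: if many independent systems are stored interleaved, with element j of system s at index j * numSystems + s, then threads assigned to consecutive systems touch consecutive addresses and their global-memory loads coalesce. The host-side sketch below shows that indexing on the forward-elimination pass; the one-thread-per-system mapping and the names are our assumptions about the eventual CUDA kernel, not its final form.

```cpp
#include <vector>

// Forward-elimination pass over a batch of tridiagonal systems stored
// interleaved: element j of system s lives at index j * numSystems + s.
// On the GPU the outer loop becomes one thread per system, so neighbouring
// threads read neighbouring addresses and the accesses coalesce.  The
// backward-substitution pass follows the same indexing pattern.
void forwardEliminateBatch(int numSystems, int n,
                           std::vector<double>& a, std::vector<double>& b,
                           std::vector<double>& c, std::vector<double>& d) {
    auto at = [numSystems](int j, int s) { return j * numSystems + s; };
    for (int s = 0; s < numSystems; ++s) {      // one CUDA thread per system
        for (int j = 1; j < n; ++j) {
            const double m = a[at(j, s)] / b[at(j - 1, s)];
            b[at(j, s)] -= m * c[at(j - 1, s)];
            d[at(j, s)] -= m * d[at(j - 1, s)];
        }
    }
}
```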

Progress Summary [Past 3 Weeks]

Over the past three weeks, we have made significant progress on our tridiagonal solver project. We finalized the overall project plan, set up our GitHub repository, and finished implementing the sequential Thomas algorithm. We also created a testGenerator.py script to generate several test cases and a validation.py script to ensure correctness against known solutions.
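
Both scripts are in Python; purely for illustration, the C++ sketch below shows the same idea they implement: generate a diagonally dominant system with a known solution, form the right-hand side, and validate a solver's output by its maximum absolute error (names and sizes here are illustrative, not the scripts' actual interface).

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const std::size_t n = 1024;
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);

    // Random tridiagonal system, made diagonally dominant so the Thomas
    // algorithm is stable without pivoting.
    std::vector<double> a(n), b(n), c(n), xTrue(n), d(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = dist(rng);
        c[i] = dist(rng);
        b[i] = 4.0 + dist(rng);   // |b_i| > |a_i| + |c_i|
        xTrue[i] = dist(rng);     // known reference solution
    }
    a[0] = 0.0;
    c[n - 1] = 0.0;

    // Right-hand side d = A * xTrue.
    for (std::size_t i = 0; i < n; ++i) {
        d[i] = b[i] * xTrue[i];
        if (i > 0)     d[i] += a[i] * xTrue[i - 1];
        if (i + 1 < n) d[i] += c[i] * xTrue[i + 1];
    }

    // A solver under test would compute x from (a, b, c, d); here we only
    // show the validation step against the known solution.
    std::vector<double> x = xTrue;            // placeholder for solver output
    double maxErr = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        maxErr = std::max(maxErr, std::fabs(x[i] - xTrue[i]));
    std::printf("max abs error: %.3e\n", maxErr);
    return 0;
}
```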

Building on that foundation, we began work on the MPI-based implementation inspired by the Burango factorization method, with a focus on correct message passing and distributed memory management. We completed the MPI implementation, ensuring that broadcast, gather, and scatter operations functioned correctly for arbitrary matrix sizes. Our results were verified against the sequential implementation, and we incorporated input parsing and runtime measurement using MPI's timing facilities and std::chrono.
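
The skeleton below shows only the data-distribution and timing scaffolding of that MPI structure, not the factorization itself; it assumes the system size divides evenly across the ranks, and all names and sizes are illustrative.

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Root owns the full system; broadcast the problem size to every rank.
    int n = 1 << 20;                        // illustrative size
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    const int local = n / size;             // assumes size divides n

    std::vector<double> fullRhs;            // only meaningful on rank 0
    if (rank == 0) fullRhs.assign(n, 1.0);
    std::vector<double> localRhs(local);

    const double t0 = MPI_Wtime();
    // Scatter contiguous blocks of the right-hand side to the ranks.
    MPI_Scatter(fullRhs.data(), local, MPI_DOUBLE,
                localRhs.data(), local, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // ... local elimination and the reduced inter-rank system go here ...

    // Gather the local solution blocks back on rank 0.
    MPI_Gather(localRhs.data(), local, MPI_DOUBLE,
               fullRhs.data(), local, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    const double t1 = MPI_Wtime();

    if (rank == 0) std::printf("solve time: %.6f s on %d ranks\n", t1 - t0, size);
    MPI_Finalize();
    return 0;
}
```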

Although the Burango factorization method demonstrated better speedup, it introduced a slight loss in accuracy, a tradeoff noted in the original paper. Because of this, we decided to also implement an MPI-based parallel full recursive doubling factorization method. While this approach offers somewhat lower speedup due to communication overhead, it preserves full accuracy, matching the sequential solver's results and making it the preferred choice for correctness-critical applications.
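
The property we rely on, as we understand recursive doubling, is that each forward-elimination step can be encoded as a small matrix, and all prefixes of the resulting matrix chain can then be combined in O(log n) doubling passes instead of one long serial sweep. The serial sketch below shows only that combining pattern on 2x2 matrices; the formulation and names are illustrative, and our MPI version distributes the prefix products across ranks rather than looping over them.

```cpp
#include <array>
#include <cstdio>
#include <vector>

// 2x2 matrix stored row-major: {m00, m01, m10, m11}.
using Mat2 = std::array<double, 4>;

// Product L * R of two 2x2 matrices.
Mat2 mul(const Mat2& L, const Mat2& R) {
    return {L[0] * R[0] + L[1] * R[2], L[0] * R[1] + L[1] * R[3],
            L[2] * R[0] + L[3] * R[2], L[2] * R[1] + L[3] * R[3]};
}

// Recursive-doubling prefix products: p initially holds the individual
// matrices; on return, p[i] is the product of the original entries i, i-1,
// ..., 0 (in that order).  Each pass doubles the span every entry covers, so
// only O(log n) passes are needed; in the parallel version all i within a
// pass are updated concurrently.
void prefixDoubling(std::vector<Mat2>& p) {
    const std::size_t n = p.size();
    for (std::size_t d = 1; d < n; d *= 2) {
        // Descending order lets us update in place: p[i - d] has not been
        // touched in this pass yet when p[i] reads it.
        for (std::size_t i = n - 1; i >= d; --i) {
            p[i] = mul(p[i], p[i - d]);
        }
    }
}

int main() {
    // Illustrative chain; in the tridiagonal setting each entry would encode
    // one step of the forward-elimination recurrence.
    std::vector<Mat2> p = {{1, 1, 0, 1}, {2, 0, 0, 1}, {1, 0, 1, 1}, {0, 1, 1, 0}};
    prefixDoubling(p);
    for (const Mat2& m : p)
        std::printf("[%g %g; %g %g]\n", m[0], m[1], m[2], m[3]);
    return 0;
}
```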

Additionally, we began implementing a differential equation solver by discretizing the problem into a tridiagonal system. Debugging is ongoing, particularly around multithreaded stability. We also conducted intermediate comparisons between the sequential and MPI implementations across varying numbers of MPI processes to quantify speedup and guide further optimization.
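
As a concrete stand-in for that discretization step (the actual equation in our solver may differ), the sketch below turns the 1D Poisson problem -u''(x) = f(x) with Dirichlet boundary conditions into the tridiagonal form (a, b, c, d) that any of our solvers can consume.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Discretize -u''(x) = f(x) on (0, 1) with u(0) = u0 and u(1) = u1 using the
// standard second-order central difference on n interior grid points.  The
// equation, boundary values, and f are illustrative.
int main() {
    const std::size_t n = 8;                 // interior grid points
    const double h = 1.0 / (n + 1);
    const double u0 = 0.0, u1 = 1.0;         // Dirichlet boundary values
    const double pi = 3.141592653589793;
    auto f = [pi](double x) { return std::sin(pi * x); };

    std::vector<double> a(n), b(n), c(n), d(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double x = (i + 1) * h;
        a[i] = -1.0;                         // sub-diagonal
        b[i] =  2.0;                         // main diagonal
        c[i] = -1.0;                         // super-diagonal
        d[i] = h * h * f(x);                 // right-hand side
    }
    a[0] = 0.0;      d[0]     += u0;         // fold boundary values into d
    c[n - 1] = 0.0;  d[n - 1] += u1;

    // (a, b, c, d) now describe the tridiagonal system; hand it to the
    // sequential Thomas routine or to one of the parallel solvers.
    for (std::size_t i = 0; i < n; ++i)
        std::printf("row %zu: a=%g b=%g c=%g d=%g\n", i, a[i], b[i], c[i], d[i]);
    return 0;
}
```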