Team Members: Tanisha Mehta, May Paek
We are implementing and benchmarking multiple parallel algorithms for solving tridiagonal systems of equations across diverse computing architectures: multi-core CPUs (MPI) and NVIDIA GPUs (CUDA), alongside a classic serial baseline. Specifically, we will implement the sequential Thomas algorithm, Paramveer Singh's MPI-based method (developed at LLNL), a differential-equation-based method, the Parallel Thomas Algorithm (PTA) optimized with coalesced memory access, and, optionally, the Parallel Cyclic Reduction (PCR) method. The goal is to compare the strengths, performance, and trade-offs of each method.
Read the full project proposal here: Detailed Proposal.
Read our halfway milestone report here: Project Milestone Report.
Week / Date Range | Tasks |
---|---|
Week 1 (Mar 25–31) | Finalize project plan, set up GitHub, start sequential Thomas implementation |
Week 2 (Apr 1–7) | Finish and test sequential solver; start Paramveer MPI implementation |
Week 3 (Apr 8–14) | Complete MPI implementation and the parallel full-recursive-doubling factorization; begin differential equation implementation; benchmark against sequential |
Week 4 (Apr 15–18) | |
Week 4 (Apr 19–21) | |
Week 5 (Apr 22–24) | |
Week 5 (Apr 25–28) | |
Over the past three weeks, we have made significant progress on our tridiagonal solver project. We finalized the overall project plan, set up our GitHub repository, and finished implementing the sequential Thomas algorithm. We also created a `testGenerator.py` script to generate several test cases and a `validation.py` script to check correctness against known solutions.
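For reference, a minimal sketch of the sequential Thomas algorithm; the function name and interface here are illustrative, not taken from our repository:

```cpp
#include <vector>

// Solves a tridiagonal system A x = d, where a is the sub-diagonal,
// b the main diagonal, and c the super-diagonal (a[0] and c[n-1] unused).
// Illustrative sketch only.
std::vector<double> thomasSolve(const std::vector<double>& a,
                                const std::vector<double>& b,
                                const std::vector<double>& c,
                                const std::vector<double>& d) {
    const std::size_t n = d.size();
    std::vector<double> cp(n), dp(n), x(n);

    // Forward elimination: eliminate the sub-diagonal.
    cp[0] = c[0] / b[0];
    dp[0] = d[0] / b[0];
    for (std::size_t i = 1; i < n; ++i) {
        const double denom = b[i] - a[i] * cp[i - 1];
        cp[i] = c[i] / denom;                       // cp[n-1] is never read
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom;
    }

    // Back substitution.
    x[n - 1] = dp[n - 1];
    for (std::size_t i = n - 1; i-- > 0; ) {
        x[i] = dp[i] - cp[i] * x[i + 1];
    }
    return x;
}
```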
Building on that foundation, we began work on the MPI-based implementation inspired by the Burango factorization method, focusing on correct message passing and distributed memory management. We completed this implementation, ensuring that broadcast, scatter, and gather operations function correctly for arbitrary matrix sizes, verified the results against the sequential implementation, and added input parsing and runtime measurement using MPI and C++ chrono timers.
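The rough shape of that MPI driver (scatter row blocks, solve, gather, time with MPI and chrono) is sketched below. The problem setup, buffer names, and the assumption that the system size divides evenly among ranks are placeholders for illustration, and the method-specific inter-rank solve is elided:

```cpp
#include <mpi.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Problem size; assume it divides evenly among ranks for this sketch.
    int n = 1 << 20;
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    const int local_n = n / nprocs;

    std::vector<double> b, d;                 // full data on rank 0 only
    if (rank == 0) { b.assign(n, 2.0); d.assign(n, 1.0); }
    std::vector<double> b_loc(local_n), d_loc(local_n);

    const double t0 = MPI_Wtime();
    const auto c0 = std::chrono::steady_clock::now();

    // Distribute contiguous blocks of rows to every rank
    // (only two of the diagonals shown to keep the sketch short).
    MPI_Scatter(b.data(), local_n, MPI_DOUBLE,
                b_loc.data(), local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(d.data(), local_n, MPI_DOUBLE,
                d_loc.data(), local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // ... local factorization / solve and the inter-rank reduction
    //     specific to the chosen method would go here ...

    // Collect the local solution blocks back on rank 0.
    std::vector<double> x;
    if (rank == 0) x.resize(n);
    MPI_Gather(d_loc.data(), local_n, MPI_DOUBLE,
               x.data(), local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    const double t1 = MPI_Wtime();
    const auto c1 = std::chrono::steady_clock::now();
    if (rank == 0) {
        std::printf("MPI_Wtime: %.6f s, chrono: %.6f s\n", t1 - t0,
                    std::chrono::duration<double>(c1 - c0).count());
    }
    MPI_Finalize();
    return 0;
}
```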
Although the Burango factorization method demonstrated better speedup, it introduced a slight loss of accuracy, a trade-off noted in the original paper. Because of this, we decided to implement an MPI-based parallel full-recursive-doubling factorization method as well. While this approach offers somewhat lower speedup due to communication overhead, it preserves full accuracy, making it the preferred choice for correctness-critical applications.
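To illustrate why recursive doubling parallelizes an otherwise serial sweep, here is a sketch of the doubling idea on a simplified first-order recurrence. The full factorization operates on the tridiagonal coefficients and is distributed over MPI ranks, which this single-node sketch does not show:

```cpp
#include <vector>

// Core idea behind recursive doubling, shown on the simplified recurrence
//   x[0] = b[0],  x[i] = a[i]*x[i-1] + b[i],
// which is the kind of dependence a tridiagonal forward sweep reduces to.
// After ceil(log2(n)) doubling steps every x[i] holds its final value, and
// each step's inner loop is fully parallel, which is what an MPI version
// distributes across ranks. Illustrative sketch only, not our actual solver.
std::vector<double> recursiveDoubling(const std::vector<double>& a,
                                      const std::vector<double>& b) {
    const std::size_t n = b.size();
    std::vector<double> x = b;      // partial sums so far
    std::vector<double> coef = a;   // coefficient linking x[i] to x[i - stride]
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        std::vector<double> x_next = x, coef_next = coef;
        for (std::size_t i = stride; i < n; ++i) {   // parallel loop in practice
            x_next[i]    = coef[i] * x[i - stride] + x[i];
            coef_next[i] = coef[i] * coef[i - stride];
        }
        x.swap(x_next);
        coef.swap(coef_next);
    }
    return x;   // x[i] now matches what the sequential recurrence produces
}
```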
Additionally, we began implementing a differential equation solver that discretizes the problem into a tridiagonal system. Debugging is ongoing, particularly around multithreaded stability. We also ran intermediate comparisons between the sequential and MPI implementations at varying process counts to quantify speedup and guide further optimization.
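As an example of the kind of discretization involved, the sketch below sets up a 1-D Poisson-style boundary-value problem with a central difference; the specific equation, boundary conditions, and forcing term are illustrative assumptions rather than the exact problem our solver targets:

```cpp
#include <cmath>
#include <vector>

// Discretizes -u''(x) = f(x) on (0, 1) with u(0) = u(1) = 0 using a
// second-order central difference on n interior points, producing a
// tridiagonal system (sub/super-diagonal -1/h^2, main diagonal 2/h^2)
// that any of the tridiagonal solvers above can consume.
// The forcing term f(x) = sin(pi * x) is just a placeholder choice.
void buildPoissonSystem(int n,
                        std::vector<double>& a,   // sub-diagonal  (a[0] unused)
                        std::vector<double>& b,   // main diagonal
                        std::vector<double>& c,   // super-diagonal (c[n-1] unused)
                        std::vector<double>& d) { // right-hand side
    const double h = 1.0 / (n + 1);
    const double pi = 3.14159265358979323846;
    a.assign(n, -1.0 / (h * h));
    b.assign(n,  2.0 / (h * h));
    c.assign(n, -1.0 / (h * h));
    d.resize(n);
    for (int i = 0; i < n; ++i) {
        const double x = (i + 1) * h;
        d[i] = std::sin(pi * x);
    }
}
```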