Optimizing Parallel Scan Algorithms with Interleaved Tiles
Publication date
Authors
DOI
Document Type
Master Thesis
Metadata
Show full item recordCollections
License
CC-BY-NC-ND
Abstract
Parallel scan is a fundamental primitive used in many parallel algorithms. Several algorithms have been designed to perform parallel scan efficiently. This thesis
focuses on the chained scan algorithm, which improves upon several aspects of older
parallel scan algorithms. Chained scan splits the input data into tiles, each of which
are computed with their own parallel scan, without the need for global synchronization. Although this removes global synchronizations, each tile still requires a local
synchronization during its computation. We present an optimization for chained
scan which interleaves tiles. This optimization aims to reduce the bottleneck of
these local synchronizations by interleaving the work of another tile whenever a tile
would otherwise wait for synchronization.
We implement this algorithm in Rust, and benchmark it against existing scan
algorithms. We also test the effects of different tile sizes when executed on a GPU
architecture. Lastly, we implement the algorithm in the array language Accelerate.
Accelerate features scan as a core primitive used in many parallel functions, and is
well-suited for optimizing parallel scans. We evaluate how the proposed algorithm
performs compared to the existing implementation of scan in Accelerate.
Keywords
Parallelism, Accelerate, Scan, Parallel scan, Optimization, Haskell, Functional programming, Interleaved scan