Optimizing Parallel Scan Algorithms with Interleaved Tiles

Publication date

DOI

Document Type

Master Thesis

Collections

Open Access logo

License

CC-BY-NC-ND

Abstract

Parallel scan is a fundamental primitive used in many parallel algorithms. Several algorithms have been designed to perform parallel scan efficiently. This thesis focuses on the chained scan algorithm, which improves upon several aspects of older parallel scan algorithms. Chained scan splits the input data into tiles, each of which are computed with their own parallel scan, without the need for global synchronization. Although this removes global synchronizations, each tile still requires a local synchronization during its computation. We present an optimization for chained scan which interleaves tiles. This optimization aims to reduce the bottleneck of these local synchronizations by interleaving the work of another tile whenever a tile would otherwise wait for synchronization. We implement this algorithm in Rust, and benchmark it against existing scan algorithms. We also test the effects of different tile sizes when executed on a GPU architecture. Lastly, we implement the algorithm in the array language Accelerate. Accelerate features scan as a core primitive used in many parallel functions, and is well-suited for optimizing parallel scans. We evaluate how the proposed algorithm performs compared to the existing implementation of scan in Accelerate.

Keywords

Parallelism, Accelerate, Scan, Parallel scan, Optimization, Haskell, Functional programming, Interleaved scan

Citation