장명환’s paper has been accepted in


Title: Orchestrating Large-Scale SpGEMMs using Dynamic Block Distribution and Data Transfer Minimization on Heterogeneous Systems
Authors: Taehyeong Park, Seokwon Kang, Myung-Hwan Jang, Sang-Wook Kim, and Yongjun Park
Abstract
Sparse matrix-matrix multiplication (SpGEMM) is one of the most important kernels in many emerging applications such as databases, deep learning, graph analysis, and recommendation systems. Since SpGEMM requires a huge amount of computation, many SpGEMM techniques have been implemented on Graphics Processing Units (GPUs) to fully exploit data parallelism. However, traditional SpGEMM techniques often do not fully utilize the GPU because most non-zero elements of target sparse matrices are concentrated in a few hub nodes, while non-hub nodes have barely any non-zero elements. This data-specific characteristic (the power-law distribution) incurs significant performance degradation due to load imbalance between GPU cores and low utilization of each core. Many recent implementations have tried to solve this challenge with smart pre-/post-processing, but because of their large overheads, net performance hardly improves and sometimes even worsens. Moreover, non-hub nodes are inherently ill-suited to GPU computation even after these optimizations. More importantly, performance is no longer dominated by kernel execution but by data transfers such as device-to-host transfers and file I/O, due to the rapid growth of GPU computing power and input data size. To address these challenges, this paper proposes Dynamic Block Distributor, a novel full-system-level SpGEMM orchestration framework for heterogeneous systems that improves overall performance by enabling efficient CPU-GPU collaboration and minimizing data transfer overhead between all system elements. It first divides the whole matrix into smaller units and then offloads the computation of each unit to an appropriate computing unit, either the GPU or the CPU, based on its workload type and the run-time resource utilization status. It also minimizes data transfer overhead with simple but well-suited techniques: Row Collecting, I/O Overlapping, and I/O Binding. Our experiments show that it speeds up SpGEMM execution latency, including both kernel execution and device-to-host transfers, by 3.17x on average, and overall execution time by 1.84x on average, compared to the state-of-the-art cuSPARSE library.
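To make the block-routing idea in the abstract concrete, the following is a minimal C++ sketch, not the paper's actual implementation: it assumes hypothetical names (BlockStats, route_block, kGpuDensityThreshold) and placeholder values, and simply routes dense "hub-like" row blocks to the GPU while leaving sparse "non-hub" blocks on the CPU, taking a mock run-time GPU utilization reading into account.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical per-block statistics for a sparse matrix partitioned into row blocks.
struct BlockStats {
    std::size_t rows;      // number of rows in this block
    std::size_t nonzeros;  // number of non-zero elements in this block
};

enum class Device { CPU, GPU };

// Toy routing heuristic: dense, regular "hub" blocks go to the GPU,
// sparse "non-hub" blocks stay on the CPU. The threshold and the notion
// of GPU occupancy are illustrative placeholders, not values from the paper.
Device route_block(const BlockStats& b, double gpu_busy_fraction) {
    const double avg_nnz_per_row =
        b.rows ? static_cast<double>(b.nonzeros) / b.rows : 0.0;
    const double kGpuDensityThreshold = 32.0;  // assumed tuning knob

    if (avg_nnz_per_row >= kGpuDensityThreshold && gpu_busy_fraction < 0.9)
        return Device::GPU;
    return Device::CPU;
}

int main() {
    // Synthetic blocks following a power-law-like distribution:
    // one heavy hub block and several light non-hub blocks.
    std::vector<BlockStats> blocks = {
        {1024, 500000},  // hub-like block, dense rows
        {1024, 2000},    // light block
        {1024, 150},     // very light block
    };

    double gpu_busy = 0.2;  // pretend run-time GPU utilization reading
    for (std::size_t i = 0; i < blocks.size(); ++i) {
        Device d = route_block(blocks[i], gpu_busy);
        std::cout << "block " << i << " -> "
                  << (d == Device::GPU ? "GPU" : "CPU") << "\n";
    }
    return 0;
}
```

In a real heterogeneous pipeline such as the one the abstract describes, the dispatch decision would additionally overlap file I/O and device-to-host transfers with kernel execution; this sketch only illustrates the workload-aware CPU/GPU split.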
