TY - GEN
T1 - Towards Optimal Fast Matrix Multiplication on CPU-GPU Platforms
AU - Shao, Senhao
AU - Wang, Yizhuo
AU - Ji, Weixing
AU - Gao, Jianhua
N1 - Publisher Copyright:
© 2022, Springer Nature Switzerland AG.
PY - 2022
Y1 - 2022
N2 - Increasing computing power has become available through the use of GPUs, bringing new opportunities to accelerate fast matrix multiplication on GPUs. Although researchers have proposed several optimization schemes for the Strassen algorithm on the GPU, they have not fully utilized the computing resources of the CPU. In this paper, we propose a CPU-GPU heterogeneous implementation of the Winograd algorithm based on task graph scheduling. It uses a work-stealing scheduler to achieve a balanced load. We also propose two recursive task graph extension strategies: homogeneous and heterogeneous extension. We invoke different execution strategies at different recursion levels and design a predictor based on a random forest regression model to make the decision. Finally, experimental evaluations are performed on a CPU-GPU heterogeneous platform. They show that the improved Winograd algorithm achieves average speedups of 1.6x, 1.5x and 1.4x over cuBLAS, Winograd on the CPU, and Winograd on the GPU, respectively, for matrices with dimension greater than 5000.
AB - Increasing computing power has become available through the use of GPUs, bringing new opportunities to accelerate fast matrix multiplication on GPUs. Although researchers have proposed several optimization schemes for the Strassen algorithm on the GPU, they have not fully utilized the computing resources of the CPU. In this paper, we propose a CPU-GPU heterogeneous implementation of the Winograd algorithm based on task graph scheduling. It uses a work-stealing scheduler to achieve a balanced load. We also propose two recursive task graph extension strategies: homogeneous and heterogeneous extension. We invoke different execution strategies at different recursion levels and design a predictor based on a random forest regression model to make the decision. Finally, experimental evaluations are performed on a CPU-GPU heterogeneous platform. They show that the improved Winograd algorithm achieves average speedups of 1.6x, 1.5x and 1.4x over cuBLAS, Winograd on the CPU, and Winograd on the GPU, respectively, for matrices with dimension greater than 5000.
KW - CPU-GPU heterogeneous architecture
KW - Matrix multiplication
KW - Random forest regression
KW - Winograd algorithm
UR - http://www.scopus.com/inward/record.url?scp=85127649116&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-96772-7_21
DO - 10.1007/978-3-030-96772-7_21
M3 - Conference contribution
AN - SCOPUS:85127649116
SN - 9783030967710
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 223
EP - 236
BT - Parallel and Distributed Computing, Applications and Technologies - 22nd International Conference, PDCAT 2021, Proceedings
A2 - Shen, Hong
A2 - Sang, Yingpeng
A2 - Zhang, Yong
A2 - Xiao, Nong
A2 - Arabnia, Hamid R.
A2 - Fox, Geoffrey
A2 - Gupta, Ajay
A2 - Malek, Manu
PB - Springer Science and Business Media Deutschland GmbH
T2 - 22nd International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2021
Y2 - 17 December 2021 through 19 December 2021
ER -