TY - JOUR
T1 - ConCeal
T2 - A Winograd convolution code template for optimising GCU in parallel
AU - Chen, Tian
AU - Tan, Yu-an
AU - Baker, Thar
AU - Wu, Haokai
AU - Zhang, Qiuyu
AU - Li, Yuanzhang
N1 - Publisher Copyright:
© 2025 Elsevier Inc.
PY - 2025/9
Y1 - 2025/9
N2 - By minimising arithmetic operations, Winograd convolution substantially reduces the computational complexity of convolution, a pivotal operation in the training and inference stages of Convolutional Neural Networks (CNNs). This study leverages the hardware architecture and capabilities of Shanghai Enflame Technology's AI accelerator, the General Computing Unit (GCU). We develop a code template named ConCeal for Winograd convolution with 3 × 3 kernels, employing a set of interrelated optimisations, including task partitioning, memory layout design, and parallelism. These optimisations fully exploit GCU's computing resources by optimising dataflow and parallelising the execution of tasks on GCU cores, thereby enhancing Winograd convolution. Moreover, the integrated optimisations in the template are efficiently applicable to other operators, such as max pooling. Using this template, we implement and assess the performance of four Winograd convolution operators on GCU. The experimental results showcase that ConCeal operators achieve a maximum of 2.04× and an average of 1.49× speedup compared to the fastest GEMM-based convolution implementations on GCU. Additionally, the ConCeal operators demonstrate competitive or superior computing resource utilisation in certain ResNet and VGG convolution layers when compared to cuDNN on an RTX 2080.
KW - Parallel access
KW - Parallel channel
KW - Parallel computing
KW - Parallel Winograd convolution
UR - http://www.scopus.com/inward/record.url?scp=105005742268&partnerID=8YFLogxK
U2 - 10.1016/j.jpdc.2025.105108
DO - 10.1016/j.jpdc.2025.105108
M3 - Article
AN - SCOPUS:105005742268
SN - 0743-7315
VL - 203
JO - Journal of Parallel and Distributed Computing
JF - Journal of Parallel and Distributed Computing
M1 - 105108
ER -