TY - JOUR
T1 - A Multicore Programmable Variable-Precision Near-Memory Accelerator for CNN and Transformer Models
AU - Yang, Yiming
AU - Yuan, Yiyang
AU - Wang, Xinghua
AU - Li, Xiaoran
AU - Wu, Hao
AU - Liu, Qihao
AU - Tang, Weiye
AU - Fu, Xiangqu
AU - Zhang, Feng
N1 - Publisher Copyright:
© 1966-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - The convolutional neural network (CNN) and the transformer are the most popular neural network models in computer vision (CV) and natural language processing (NLP), and it is common to use both models in multimodal scenarios such as text-to-image generation. However, the two models have very different memory mappings, dataflows, and mathematical operators, making it difficult to accelerate both types of models simultaneously. To address the aforementioned challenges, we propose a multi-core programmable near-memory accelerator and introduce an arbitration-free multi-port static random-access memory (SRAM) array to improve storage utilization while maintaining flexibility. To achieve performance comparable to computing-in-memory (CIM) designs, we use near-memory variable-precision multiplier-accumulators (NVMACs) to perform multiply-accumulate (MAC) operations close to the memory, maximizing memory access throughput and supporting mixed-precision neural network inference. We use a fine-grained instruction set architecture (ISA) to support software sparsity and to reduce the overhead caused by coarse-grained non-MAC operations with low utilization. A chip fabricated in a 28 nm process achieves 6.3-to-101.4 TOPS/W energy efficiency for transformer models and 7.3-to-194.6 TOPS/W for CNN models, a 1.2× to 4.2× improvement over other state-of-the-art designs, while efficiently supporting both CNN and transformer workloads.
AB - The convolutional neural network (CNN) and the transformer are the most popular neural network models in computer vision (CV) and natural language processing (NLP), and it is common to use both models in multimodal scenarios such as text-to-image generation. However, the two models have very different memory mappings, dataflows, and mathematical operators, making it difficult to accelerate both types of models simultaneously. To address the aforementioned challenges, we propose a multi-core programmable near-memory accelerator and introduce an arbitration-free multi-port static random-access memory (SRAM) array to improve storage utilization while maintaining flexibility. To achieve performance comparable to computing-in-memory (CIM) designs, we use near-memory variable-precision multiplier-accumulators (NVMACs) to perform multiply-accumulate (MAC) operations close to the memory, maximizing memory access throughput and supporting mixed-precision neural network inference. We use a fine-grained instruction set architecture (ISA) to support software sparsity and to reduce the overhead caused by coarse-grained non-MAC operations with low utilization. A chip fabricated in a 28 nm process achieves 6.3-to-101.4 TOPS/W energy efficiency for transformer models and 7.3-to-194.6 TOPS/W for CNN models, a 1.2× to 4.2× improvement over other state-of-the-art designs, while efficiently supporting both CNN and transformer workloads.
KW - Arbitration-free multi-port static random-access memory (SRAM) array
KW - computing-in-memory (CIM)
KW - processing-near-memory (PNM)
KW - convolutional neural network (CNN)
KW - instruction set architecture (ISA)
KW - near-memory variable-precision multiplier-accumulator (NVMAC)
KW - transformer
UR - https://www.scopus.com/pages/publications/105021526729
U2 - 10.1109/JSSC.2025.3624011
DO - 10.1109/JSSC.2025.3624011
M3 - Article
AN - SCOPUS:105021526729
SN - 0018-9200
JO - IEEE Journal of Solid-State Circuits
JF - IEEE Journal of Solid-State Circuits
ER -