Abstract
Recently, computing in memory (CiM) has been proven to be an energy-efficient and promising architecture for artificial intelligence (AI) algorithms. Yet current CiM schemes generally suffer from limited throughput compared with their digital counterparts, chiefly because a CiM macro calculation must iterate through multiple cycles. Reducing the calculation cycles of the macro while maintaining high energy efficiency, and developing acceleration methods for a universal CiM-based processor, have therefore become major challenges for current CiM architectures. To address these problems, we propose a processor based on a two-cycle CiM macro. Our work makes three main contributions: 1) we present a Radix16-based digital-CiM macro with look-up table (LUT) optimization to reduce dynamic power consumption; 2) we devise a hybrid Winograd microarchitecture and dataflow that supports (2, 3) and (4, 3) Winograd convolution, striking a good compromise between algorithm accuracy and workload reduction; and 3) we propose a macrolevel parallel dual-side sparse CiM core that uses a horizontal-direction compression method to reduce the input cycles of activation data and improve the mapping efficiency of the weight data in the macros. A prototype of the processor is fabricated in a 28-nm CMOS process. It achieves a peak system energy efficiency of 19.9–258.5 TOPS/W over a supply voltage of 0.6–1.1 V and an operating frequency of 78–287 MHz, which is 2.55–7.08× higher than other state-of-the-art CiM processors.
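The workload reduction the abstract attributes to Winograd convolution can be illustrated with the small F(2, 3) case: two outputs of a 3-tap convolution are produced with 4 element-wise multiplications instead of the 6 a direct computation needs. The sketch below uses the standard F(2, 3) transform matrices and is purely illustrative; it does not reproduce the paper's hybrid microarchitecture or its (4, 3) variant.

```python
import numpy as np

# Standard Winograd F(2,3) transform matrices (illustrative only,
# not the paper's hardware dataflow).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)   # input transform
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]])               # filter transform
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)    # output transform

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 outputs."""
    m = (G @ g) * (BT @ d)   # only 4 element-wise multiplications
    return AT @ m

def direct_conv(d, g):
    """Direct 3-tap sliding-window computation (6 multiplications)."""
    return np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                     d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, 1.0, -1.0])
assert np.allclose(winograd_f23(d, g), direct_conv(d, g))
```

The larger F(4, 3) tile trades further multiplication savings for wider-dynamic-range transform constants, which is the accuracy-versus-workload compromise the abstract refers to.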
| Original language | English |
| --- | --- |
| Pages (from-to) | 1-15 |
| Number of pages | 15 |
| Journal | IEEE Journal of Solid-State Circuits |
| DOIs | |
| Publication status | Accepted/In press - 2024 |
Keywords
- Accuracy
- Artificial intelligence (AI)
- Circuits
- CMOS
- computing-in-memory (CiM)
- energy efficiency
- look-up table (LUT)
- multiply-accumulation (MAC)
- neural network (NN)
- Power demand
- Radix16
- Table lookup
- Throughput
- unstructured sparsity
- Winograd convolution