An ASIP for Neural Network Inference on Embedded Devices with 99% PE Utilization and 100% Memory Hidden under Low Silicon Cost

Muxuan Gao; He Chen; Dake Liu

doi:10.3390/s22103841

Muxuan Gao, He Chen^*, Dake Liu

^*Corresponding author for this work

School of Information and Electronics

Beijing Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

The computation efficiency and flexibility of the accelerator hinder deep neural network (DNN) implementation in embedded applications. Although there are many publications on DNN processors, there is still much room for deep optimization to further improve results. Multiple dimensions must be simultaneously considered when designing a DNN processor to reach the performance limit of the architecture, including architecture decision, flexibility, energy efficiency, and silicon cost minimization. Flexibility is defined as the ability to support as many networks as possible and to easily adjust the scale. For energy efficiency, there are huge opportunities for power efficiency optimization, which involves access minimization and memory latency minimization based on on-chip memory minimization. Therefore, this work focused on low-power and low-latency data access with minimized silicon cost. This research was implemented based on an ASIP (application-specific instruction set processor) in which the ISA was based on the Caffe2 inference operator and the hardware design was based on a single instruction, multiple data (SIMD) architecture. The scalability and system performance of our SoC extension scheme were demonstrated. VLIW was used to execute multiple instructions in parallel. All costs for data access time were thus eliminated for the convolution layer. Finally, the processor was synthesized based on TSMC 65 nm technology with a 200 MHz clock, and the SoC extension scheme was analyzed in an experimental model. Our design was tested on several typical neural networks, achieving 196 GOPS at 200 MHz and 241 GOPS/W on the VGG16Net and AlexNet.

Original language	English
Article number	3841
Journal	Sensors
Volume	22
Issue number	10
DOI	https://doi.org/10.3390/s22103841
Publication status	Published - 1 May 2022

Cite this

Gao, M., Chen, H., & Liu, D. (2022). An ASIP for Neural Network Inference on Embedded Devices with 99% PE Utilization and 100% Memory Hidden under Low Silicon Cost. Sensors, 22(10), Article 3841. https://doi.org/10.3390/s22103841

@article{de9ac8263969456c84c32091e892ab0e,

title = "An ASIP for Neural Network Inference on Embedded Devices with 99\% PE Utilization and 100\% Memory Hidden under Low Silicon Cost",

abstract = "The computation efficiency and flexibility of the accelerator hinder deep neural network (DNN) implementation in embedded applications. Although there are many publications on deep neural network (DNN) processors, there is still much room for deep optimization to further improve results. Multiple dimensions must be simultaneously considered when designing a DNN processor to reach the performance limit of the architecture, including architecture decision, flexibility, energy efficiency, and silicon cost minimization. Flexibility is defined as the ability to support as many multiple networks as possible and to easily adjust the scale. For energy efficiency, there are huge opportunities for power efficiency optimization, which involves access minimization and memory latency minimization based on on-chip memory minimization. Therefore, this work focused on low-power and low-latency data access with minimized silicon cost. This research was implemented based on an ASIP (application specific instruction set processor) in which an ISA was based on the caffe2 inference operator and the hardware design was based on a single instruction multiple data (SIMD) architecture. The scalability and system performance of our SoC extension scheme were demonstrated. The VLIW was used to execute multiple instructions in parallel. All costs for data access time were thus eliminated for the convolution layer. Finally, the processor was synthesized based on TSMC 65 nm technology with a 200 MHz clock, and the Soc extension scheme was analyzed in an experimental model. Our design was tested on several typical neural networks, achieving 196 GOPS at 200 MHz and 241 GOPS/W on the VGG16Net and AlexNet.",

keywords = "deep learning processor, deep neural networks, instruction set architecture (ISA), machine learning, scheduling framework",

author = "Muxuan Gao and He Chen and Dake Liu",

note = "Publisher Copyright: {\textcopyright} 2022 by the authors. Licensee MDPI, Basel, Switzerland.",

year = "2022",

month = may,

day = "1",

doi = "10.3390/s22103841",

language = "English",

volume = "22",

journal = "Sensors",

issn = "1424-8220",

publisher = "Multidisciplinary Digital Publishing Institute (MDPI)",

number = "10",

}

TY - JOUR

T1 - An ASIP for Neural Network Inference on Embedded Devices with 99% PE Utilization and 100% Memory Hidden under Low Silicon Cost

AU - Gao, Muxuan

AU - Chen, He

AU - Liu, Dake

PY - 2022/5/1

Y1 - 2022/5/1

N2 - The computation efficiency and flexibility of the accelerator hinder deep neural network (DNN) implementation in embedded applications. Although there are many publications on deep neural network (DNN) processors, there is still much room for deep optimization to further improve results. Multiple dimensions must be simultaneously considered when designing a DNN processor to reach the performance limit of the architecture, including architecture decision, flexibility, energy efficiency, and silicon cost minimization. Flexibility is defined as the ability to support as many multiple networks as possible and to easily adjust the scale. For energy efficiency, there are huge opportunities for power efficiency optimization, which involves access minimization and memory latency minimization based on on-chip memory minimization. Therefore, this work focused on low-power and low-latency data access with minimized silicon cost. This research was implemented based on an ASIP (application specific instruction set processor) in which an ISA was based on the caffe2 inference operator and the hardware design was based on a single instruction multiple data (SIMD) architecture. The scalability and system performance of our SoC extension scheme were demonstrated. The VLIW was used to execute multiple instructions in parallel. All costs for data access time were thus eliminated for the convolution layer. Finally, the processor was synthesized based on TSMC 65 nm technology with a 200 MHz clock, and the Soc extension scheme was analyzed in an experimental model. Our design was tested on several typical neural networks, achieving 196 GOPS at 200 MHz and 241 GOPS/W on the VGG16Net and AlexNet.

AB - The computation efficiency and flexibility of the accelerator hinder deep neural network (DNN) implementation in embedded applications. Although there are many publications on deep neural network (DNN) processors, there is still much room for deep optimization to further improve results. Multiple dimensions must be simultaneously considered when designing a DNN processor to reach the performance limit of the architecture, including architecture decision, flexibility, energy efficiency, and silicon cost minimization. Flexibility is defined as the ability to support as many multiple networks as possible and to easily adjust the scale. For energy efficiency, there are huge opportunities for power efficiency optimization, which involves access minimization and memory latency minimization based on on-chip memory minimization. Therefore, this work focused on low-power and low-latency data access with minimized silicon cost. This research was implemented based on an ASIP (application specific instruction set processor) in which an ISA was based on the caffe2 inference operator and the hardware design was based on a single instruction multiple data (SIMD) architecture. The scalability and system performance of our SoC extension scheme were demonstrated. The VLIW was used to execute multiple instructions in parallel. All costs for data access time were thus eliminated for the convolution layer. Finally, the processor was synthesized based on TSMC 65 nm technology with a 200 MHz clock, and the Soc extension scheme was analyzed in an experimental model. Our design was tested on several typical neural networks, achieving 196 GOPS at 200 MHz and 241 GOPS/W on the VGG16Net and AlexNet.

KW - deep learning processor

KW - deep neural networks

KW - instruction set architecture (ISA)

KW - machine learning

KW - scheduling framework

UR - http://www.scopus.com/inward/record.url?scp=85130395986&partnerID=8YFLogxK

U2 - 10.3390/s22103841

DO - 10.3390/s22103841

M3 - Article

C2 - 35632250

AN - SCOPUS:85130395986

SN - 1424-8220

VL - 22

JO - Sensors

JF - Sensors

IS - 10

M1 - 3841

ER -
