DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices

Guanyu Xu; Zhiwei Hao; Yong Luo; Han Hu; Jianping An; Shiwen Mao

doi:10.1109/TMC.2023.3315138

Guanyu Xu, Zhiwei Hao, Yong Luo, Han Hu^*, Jianping An, Shiwen Mao

^*Corresponding author for this work

Research output: Contribution to journal › Article › Peer-reviewed

5 Citations (Scopus)

Abstract

Recent years have witnessed the great success of vision transformer (ViT), which has achieved state-of-the-art performance on multiple computer vision benchmarks. However, ViT models suffer from vast amounts of parameters and high computation cost, leading to difficult deployment on resource-constrained edge devices. Existing solutions mostly compress ViT models to a compact model but still cannot achieve real-time inference. To tackle this issue, we propose to explore the divisibility of transformer structure, and decompose the large ViT into multiple small models for collaborative inference at edge devices. Our objective is to achieve fast and energy-efficient collaborative inference while maintaining comparable accuracy compared with large ViTs. To this end, we first propose a collaborative inference framework termed DeViT to facilitate edge deployment by decomposing large ViTs. Subsequently, we design a decomposition-and-ensemble algorithm based on knowledge distillation, termed DEKD, to fuse multiple small decomposed models while dramatically reducing communication overheads, and handle heterogeneous models by developing a feature matching module to promote the imitations of decomposed models from the large ViT. Extensive experiments for three representative ViT backbones on four widely-used datasets demonstrate our method achieves efficient collaborative inference for ViTs and outperforms existing lightweight ViTs, striking a good trade-off between efficiency and accuracy. For example, our DeViTs improves end-to-end latency by 2.89× with only 1.65% accuracy sacrifice using CIFAR-100 compared to the large ViT, ViT-L/16, on the GPU server. DeDeiTs surpasses the recent efficient ViT, MobileViT-S, by 3.54% in accuracy on ImageNet-1K, while running 1.72× faster and requiring 55.28% lower energy consumption on the edge device.

Original language	English
Pages (from-to)	5917-5932
Number of pages	16
Journal	IEEE Transactions on Mobile Computing
Volume	23
Issue number	5
DOI	https://doi.org/10.1109/TMC.2023.3315138
Publication status	Published - 1 May 2024

Cite this

Xu, G., Hao, Z., Luo, Y., Hu, H., An, J., & Mao, S. (2024). DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices. IEEE Transactions on Mobile Computing, 23(5), 5917-5932. https://doi.org/10.1109/TMC.2023.3315138

@article{d6db22e82d9f4cf58dad55bd55c2c13f,
title = "DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices",
abstract = "Recent years have witnessed the great success of vision transformer (ViT), which has achieved state-of-the-art performance on multiple computer vision benchmarks. However, ViT models suffer from vast amounts of parameters and high computation cost, leading to difficult deployment on resource-constrained edge devices. Existing solutions mostly compress ViT models to a compact model but still cannot achieve real-time inference. To tackle this issue, we propose to explore the divisibility of transformer structure, and decompose the large ViT into multiple small models for collaborative inference at edge devices. Our objective is to achieve fast and energy-efficient collaborative inference while maintaining comparable accuracy compared with large ViTs. To this end, we first propose a collaborative inference framework termed DeViT to facilitate edge deployment by decomposing large ViTs. Subsequently, we design a decomposition-and-ensemble algorithm based on knowledge distillation, termed DEKD, to fuse multiple small decomposed models while dramatically reducing communication overheads, and handle heterogeneous models by developing a feature matching module to promote the imitations of decomposed models from the large ViT. Extensive experiments for three representative ViT backbones on four widely-used datasets demonstrate our method achieves efficient collaborative inference for ViTs and outperforms existing lightweight ViTs, striking a good trade-off between efficiency and accuracy. For example, our DeViTs improves end-to-end latency by 2.89× with only 1.65% accuracy sacrifice using CIFAR-100 compared to the large ViT, ViT-L/16, on the GPU server. DeDeiTs surpasses the recent efficient ViT, MobileViT-S, by 3.54% in accuracy on ImageNet-1K, while running 1.72× faster and requiring 55.28% lower energy consumption on the edge device.",
keywords = "Collaborative inference, edge computing, model decomposition, vision transformer",
author = "Guanyu Xu and Zhiwei Hao and Yong Luo and Han Hu and Jianping An and Shiwen Mao",
note = "Publisher Copyright: {\textcopyright} 2002-2012 IEEE.",
year = "2024",
month = may,
day = "1",
doi = "10.1109/TMC.2023.3315138",
language = "English",
volume = "23",
pages = "5917--5932",
journal = "IEEE Transactions on Mobile Computing",
issn = "1536-1233",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "5",
}

TY  - JOUR
T1  - DeViT
T2  - Decomposing Vision Transformers for Collaborative Inference in Edge Devices
AU  - Xu, Guanyu
AU  - Hao, Zhiwei
AU  - Luo, Yong
AU  - Hu, Han
AU  - An, Jianping
AU  - Mao, Shiwen
PY  - 2024/5/1
Y1  - 2024/5/1
N2  - Recent years have witnessed the great success of vision transformer (ViT), which has achieved state-of-the-art performance on multiple computer vision benchmarks. However, ViT models suffer from vast amounts of parameters and high computation cost, leading to difficult deployment on resource-constrained edge devices. Existing solutions mostly compress ViT models to a compact model but still cannot achieve real-time inference. To tackle this issue, we propose to explore the divisibility of transformer structure, and decompose the large ViT into multiple small models for collaborative inference at edge devices. Our objective is to achieve fast and energy-efficient collaborative inference while maintaining comparable accuracy compared with large ViTs. To this end, we first propose a collaborative inference framework termed DeViT to facilitate edge deployment by decomposing large ViTs. Subsequently, we design a decomposition-and-ensemble algorithm based on knowledge distillation, termed DEKD, to fuse multiple small decomposed models while dramatically reducing communication overheads, and handle heterogeneous models by developing a feature matching module to promote the imitations of decomposed models from the large ViT. Extensive experiments for three representative ViT backbones on four widely-used datasets demonstrate our method achieves efficient collaborative inference for ViTs and outperforms existing lightweight ViTs, striking a good trade-off between efficiency and accuracy. For example, our DeViTs improves end-to-end latency by 2.89× with only 1.65% accuracy sacrifice using CIFAR-100 compared to the large ViT, ViT-L/16, on the GPU server. DeDeiTs surpasses the recent efficient ViT, MobileViT-S, by 3.54% in accuracy on ImageNet-1K, while running 1.72× faster and requiring 55.28% lower energy consumption on the edge device.
AB  - Recent years have witnessed the great success of vision transformer (ViT), which has achieved state-of-the-art performance on multiple computer vision benchmarks. However, ViT models suffer from vast amounts of parameters and high computation cost, leading to difficult deployment on resource-constrained edge devices. Existing solutions mostly compress ViT models to a compact model but still cannot achieve real-time inference. To tackle this issue, we propose to explore the divisibility of transformer structure, and decompose the large ViT into multiple small models for collaborative inference at edge devices. Our objective is to achieve fast and energy-efficient collaborative inference while maintaining comparable accuracy compared with large ViTs. To this end, we first propose a collaborative inference framework termed DeViT to facilitate edge deployment by decomposing large ViTs. Subsequently, we design a decomposition-and-ensemble algorithm based on knowledge distillation, termed DEKD, to fuse multiple small decomposed models while dramatically reducing communication overheads, and handle heterogeneous models by developing a feature matching module to promote the imitations of decomposed models from the large ViT. Extensive experiments for three representative ViT backbones on four widely-used datasets demonstrate our method achieves efficient collaborative inference for ViTs and outperforms existing lightweight ViTs, striking a good trade-off between efficiency and accuracy. For example, our DeViTs improves end-to-end latency by 2.89× with only 1.65% accuracy sacrifice using CIFAR-100 compared to the large ViT, ViT-L/16, on the GPU server. DeDeiTs surpasses the recent efficient ViT, MobileViT-S, by 3.54% in accuracy on ImageNet-1K, while running 1.72× faster and requiring 55.28% lower energy consumption on the edge device.
KW  - Collaborative inference
KW  - edge computing
KW  - model decomposition
KW  - vision transformer
UR  - http://www.scopus.com/inward/record.url?scp=85171738623&partnerID=8YFLogxK
U2  - 10.1109/TMC.2023.3315138
DO  - 10.1109/TMC.2023.3315138
M3  - Article
AN  - SCOPUS:85171738623
SN  - 1536-1233
VL  - 23
SP  - 5917
EP  - 5932
JO  - IEEE Transactions on Mobile Computing
JF  - IEEE Transactions on Mobile Computing
IS  - 5
ER  -
