TY - GEN
T1 - An Empirical Analysis of Vision Transformer and CNN in Resource-Constrained Federated Learning
AU - Zuo, Xiaojiang
AU - Zhang, Qinglong
AU - Han, Rui
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/9/23
Y1 - 2022/9/23
N2 - Federated learning (FL) is an emerging distributed machine learning method that collaboratively trains a universal model among clients while preserving their data privacy. Recently, several efforts have attempted to introduce vision transformer (ViT) models into FL training. However, deploying and training such ViT models from scratch is not trivial in practice; existing works overlook the existence of clients with low resources (e.g., mobile phones), which is a common and practical FL setting. In this paper, we use low-resolution images as model input to satisfy the resource constraints and investigate several ViT models to explore whether they still outperform CNN models in this setting. Our experiments were performed on CIFAR10 and Fashion MNIST, with both IID and non-IID partitions, and the results demonstrate that ViT models can achieve better global test accuracy than CNN models at a comparable training cost, suggesting that they are well suited for FL training with resource-constrained devices.
AB - Federated learning (FL) is an emerging distributed machine learning method that collaboratively trains a universal model among clients while preserving their data privacy. Recently, several efforts have attempted to introduce vision transformer (ViT) models into FL training. However, deploying and training such ViT models from scratch is not trivial in practice; existing works overlook the existence of clients with low resources (e.g., mobile phones), which is a common and practical FL setting. In this paper, we use low-resolution images as model input to satisfy the resource constraints and investigate several ViT models to explore whether they still outperform CNN models in this setting. Our experiments were performed on CIFAR10 and Fashion MNIST, with both IID and non-IID partitions, and the results demonstrate that ViT models can achieve better global test accuracy than CNN models at a comparable training cost, suggesting that they are well suited for FL training with resource-constrained devices.
KW - CNN
KW - Deep Learning
KW - Federated Learning
KW - Vision Transformer
UR - http://www.scopus.com/inward/record.url?scp=85149943651&partnerID=8YFLogxK
U2 - 10.1145/3568199.3568201
DO - 10.1145/3568199.3568201
M3 - Conference contribution
AN - SCOPUS:85149943651
T3 - ACM International Conference Proceeding Series
SP - 8
EP - 13
BT - Proceedings of MLMI 2022 - 2022 5th International Conference on Machine Learning and Machine Intelligence
PB - Association for Computing Machinery
T2 - 5th International Conference on Machine Learning and Machine Intelligence, MLMI 2022
Y2 - 23 September 2022 through 25 September 2022
ER -