TY - GEN
T1 - An Empirical Analysis of Vision Transformer and CNN in Resource-Constrained Federated Learning
AU - Zuo, Xiaojiang
AU - Zhang, Qinglong
AU - Han, Rui
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/9/23
Y1 - 2022/9/23
N2 - Federated learning (FL) is an emerging distributed machine learning method that collaboratively trains a universal model among clients while preserving their data privacy. Recently, several efforts have attempted to introduce vision transformer (ViT) models into FL training. However, deploying and training such ViT models from scratch is not trivial in practice; existing works overlook the existence of clients with low resources (e.g., mobile phones), which is a common and practical FL setting. In this paper, we use low-resolution images as model input to satisfy the resource constraints and investigate several ViT models to explore whether they still outperform CNN models in this setting. Our experiments were performed on CIFAR10 and Fashion MNIST, with both IID and non-IID partitions, and the results demonstrate that ViT models can achieve better global test accuracy than CNN models at a comparable training cost, suggesting that they are well suited for FL training with resource-constrained devices.
AB - Federated learning (FL) is an emerging distributed machine learning method that collaboratively trains a universal model among clients while preserving their data privacy. Recently, several efforts have attempted to introduce vision transformer (ViT) models into FL training. However, deploying and training such ViT models from scratch is not trivial in practice; existing works overlook the existence of clients with low resources (e.g., mobile phones), which is a common and practical FL setting. In this paper, we use low-resolution images as model input to satisfy the resource constraints and investigate several ViT models to explore whether they still outperform CNN models in this setting. Our experiments were performed on CIFAR10 and Fashion MNIST, with both IID and non-IID partitions, and the results demonstrate that ViT models can achieve better global test accuracy than CNN models at a comparable training cost, suggesting that they are well suited for FL training with resource-constrained devices.
KW - CNN
KW - Deep Learning
KW - Federated Learning
KW - Vision Transformer
UR - http://www.scopus.com/inward/record.url?scp=85149943651&partnerID=8YFLogxK
U2 - 10.1145/3568199.3568201
DO - 10.1145/3568199.3568201
M3 - Conference contribution
AN - SCOPUS:85149943651
T3 - ACM International Conference Proceeding Series
SP - 8
EP - 13
BT - Proceedings of MLMI 2022 - 2022 5th International Conference on Machine Learning and Machine Intelligence
PB - Association for Computing Machinery
T2 - 5th International Conference on Machine Learning and Machine Intelligence, MLMI 2022
Y2 - 23 September 2022 through 25 September 2022
ER -