TY - JOUR
T1 - Data-efficient Large Vision Models through Sequential Autoregression
AU - Hao, Zhiwei
AU - Guo, Jianyuan
AU - Wang, Chengcheng
AU - Tang, Yehui
AU - Wu, Han
AU - Hu, Han
AU - Han, Kai
AU - Xu, Chang
N1 - Publisher Copyright:
Copyright 2024 by the author(s)
PY - 2024
Y1 - 2024
N2 - Training general-purpose vision models on purely sequential visual data, eschewing linguistic inputs, has heralded a new frontier in visual understanding. These models are intended to not only comprehend but also seamlessly transit to out-of-domain tasks. However, current endeavors are hamstrung by an over-reliance on colossal models, exemplified by models with upwards of 3B parameters, and the necessity for an extensive corpus of visual data, often comprising a staggering 400B tokens (Bai et al., 2023). In this paper, we delve into the development of an efficient, autoregression-based vision model, in-novatively architected to operate on a limited dataset. We meticulously demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding during the testing phase. Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint, and a marked decrease in training data requirements, thereby paving the way for more sustainable and accessible advancements in the field of generalist vision models. The code is available at https://github.com/ggjy/DeLVM.
AB - Training general-purpose vision models on purely sequential visual data, eschewing linguistic inputs, has heralded a new frontier in visual understanding. These models are intended to not only comprehend but also seamlessly transit to out-of-domain tasks. However, current endeavors are hamstrung by an over-reliance on colossal models, exemplified by models with upwards of 3B parameters, and the necessity for an extensive corpus of visual data, often comprising a staggering 400B tokens (Bai et al., 2023). In this paper, we delve into the development of an efficient, autoregression-based vision model, in-novatively architected to operate on a limited dataset. We meticulously demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding during the testing phase. Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint, and a marked decrease in training data requirements, thereby paving the way for more sustainable and accessible advancements in the field of generalist vision models. The code is available at https://github.com/ggjy/DeLVM.
UR - http://www.scopus.com/inward/record.url?scp=85203786785&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85203786785
SN - 2640-3498
VL - 235
SP - 17572
EP - 17596
JO - Proceedings of Machine Learning Research
JF - Proceedings of Machine Learning Research
T2 - 41st International Conference on Machine Learning, ICML 2024
Y2 - 21 July 2024 through 27 July 2024
ER -