Skip to main navigation Skip to search Skip to main content

Data-efficient Large Vision Models through Sequential Autoregression

  • Zhiwei Hao
  • , Jianyuan Guo
  • , Chengcheng Wang
  • , Yehui Tang
  • , Han Wu
  • , Han Hu*
  • , Kai Han
  • , Chang Xu*
  • *Corresponding author for this work
  • Beijing Institute of Technology
  • The University of Sydney
  • Huawei Technologies Co., Ltd.

Research output: Contribution to journalConference articlepeer-review

Abstract

Training general-purpose vision models on purely sequential visual data, eschewing linguistic inputs, has heralded a new frontier in visual understanding. These models are intended to not only comprehend but also seamlessly transit to out-of-domain tasks. However, current endeavors are hamstrung by an over-reliance on colossal models, exemplified by models with upwards of 3B parameters, and the necessity for an extensive corpus of visual data, often comprising a staggering 400B tokens (Bai et al., 2023). In this paper, we delve into the development of an efficient, autoregression-based vision model, in-novatively architected to operate on a limited dataset. We meticulously demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding during the testing phase. Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint, and a marked decrease in training data requirements, thereby paving the way for more sustainable and accessible advancements in the field of generalist vision models. The code is available at https://github.com/ggjy/DeLVM.

Original languageEnglish
Pages (from-to)17572-17596
Number of pages25
JournalProceedings of Machine Learning Research
Volume235
Publication statusPublished - 2024
Event41st International Conference on Machine Learning, ICML 2024 - Vienna, Austria
Duration: 21 Jul 202427 Jul 2024

Fingerprint

Dive into the research topics of 'Data-efficient Large Vision Models through Sequential Autoregression'. Together they form a unique fingerprint.

Cite this