RPViT: Vision Transformer Based on Region Proposal

Jing Ge, Qianxiang Wang, Jiahui Tong*, Guangyu Gao

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Vision Transformers continually absorb characteristics of convolutional neural networks to address their shortcomings in translation invariance and scale invariance. However, dividing the image with a simple grid often destroys the position and scale features of the image at the very start of the network. In this paper, we propose RPViT, a vision transformer based on region proposal, which obtains these inductive biases in a simple way. Specifically, RPViT achieves locality and scale invariance by extracting local regions with a traditional region proposal algorithm and rescaling objects of different scales to the same scale with bilinear interpolation. In addition, to enable the network to fully utilize and encode diverse candidate objects, a multi-class-token approach based on orthogonalization is proposed and applied. Experiments on ImageNet demonstrate that RPViT outperforms baseline transformers and related work.
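
As a rough illustration of the pipeline the abstract describes, the Python sketch below crops candidate regions, rescales each to a common size with bilinear interpolation, and then scores how far a set of class tokens is from being mutually orthogonal. It is not the authors' implementation. The boxes are assumed to come from some external region proposal algorithm (the abstract does not name one), the align-corners sampling convention is a choice made here, and `orthogonality_penalty` is one common formulation picked for illustration. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


def crop(image: Image, box: Box) -> Image:
    """Cut the rectangular region described by box out of the image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def bilinear_resize(region: Image, out_h: int, out_w: int) -> Image:
    """Resize a region to out_h x out_w (assumed align-corners sampling)."""
    in_h, in_w = len(region), len(region[0])
    out: Image = []
    for i in range(out_h):
        # Map the output row index to a fractional source row.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        ya = int(y)
        yb = min(ya + 1, in_h - 1)
        wy = y - ya
        row = []
        for j in range(out_w):
            # Map the output column index to a fractional source column.
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            xa = int(x)
            xb = min(xa + 1, in_w - 1)
            wx = x - xa
            # Interpolate along x on the two neighbouring rows, then along y.
            top = region[ya][xa] * (1 - wx) + region[ya][xb] * wx
            bottom = region[yb][xa] * (1 - wx) + region[yb][xb] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def regions_to_patches(image: Image, boxes: Sequence[Box], size: int) -> List[Image]:
    """Turn proposals of different scales into equally sized patches."""
    return [bilinear_resize(crop(image, box), size, size) for box in boxes]


def orthogonality_penalty(tokens: Sequence[Sequence[float]]) -> float:
    """Sum of squared cosine similarities between distinct tokens.

    Assumed formulation: the result is 0 when all tokens are mutually
    orthogonal. Tokens must be non-zero vectors.
    """
    normed = []
    for token in tokens:
        norm = math.sqrt(sum(v * v for v in token))
        normed.append([v / norm for v in token])
    penalty = 0.0
    for a in range(len(normed)):
        for b in range(len(normed)):
            if a != b:
                dot = sum(p * q for p, q in zip(normed[a], normed[b]))
                penalty += dot * dot
    return penalty


# Toy 4x4 image whose pixel value is x + 10 * y.
image = [[float(x + 10 * y) for x in range(4)] for y in range(4)]
# Two hypothetical proposals of different sizes, both mapped to 2x2 patches.
patches = regions_to_patches(image, [(0, 0, 4, 4), (1, 1, 3, 4)], size=2)
```

Because every proposal ends up with the same spatial size, the patches can be embedded as tokens in the same way regardless of the original object scale. A penalty such as the one above could be added to a training loss to push multiple class tokens toward encoding different candidate objects.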

Original language: English
Title of host publication: ICIGP 2022 - Proceedings of the 2022 5th International Conference on Image and Graphics Processing
Publisher: Association for Computing Machinery
Pages: 220-225
Number of pages: 6
ISBN (Electronic): 9781450395465
Publication status: Published - 7 Jan 2022
Event: 5th International Conference on Image and Graphics Processing, ICIGP 2022 - Virtual, Online, China
Duration: 7 Jan 2022 to 9 Jan 2022

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 5th International Conference on Image and Graphics Processing, ICIGP 2022
Country/Territory: China
City: Virtual, Online
Period: 7/01/22 to 9/01/22

Keywords

  • bilinear interpolation
  • locality and scale-invariance
  • orthogonalization
  • region proposal
  • vision transformers
