RPViT: Vision Transformer Based on Region Proposal

Jing Ge; Qianxiang Wang; Jiahui Tong; Guangyu Gao

doi:10.1145/3512388.3512421

Jing Ge, Qianxiang Wang, Jiahui Tong^*, Guangyu Gao

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Vision Transformers continue to absorb characteristics of convolutional neural networks to address their shortcomings in translation invariance and scale invariance. However, dividing the image with a simple grid often destroys position and scale features at the very beginning of the network. In this paper, we propose a vision transformer based on region proposal, which obtains these inductive biases in a simple way. Specifically, RPViT achieves locality and scale invariance by extracting regions that exhibit locality with a traditional region proposal algorithm and resizing objects of different scales to a common scale with bilinear interpolation. In addition, to enable the network to fully utilize and encode diverse candidate objects, a multi-class token approach based on orthogonalization is proposed and applied. Experiments on ImageNet demonstrate that RPViT outperforms baseline transformers and related work.
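The resizing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the region proposal algorithm, the target patch size, or the interpolation convention, so the boxes below are hard-coded stand-ins for proposals, and the function names (`crop`, `bilinear_resize`, `regions_to_patches`) and the align-corners convention are assumptions made for illustration only.

```python
from typing import List, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def crop(image: Image, box: Box) -> Image:
    """Cut the rectangular region described by `box` out of `image`."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def bilinear_resize(region: Image, out_h: int, out_w: int) -> Image:
    """Resize `region` to out_h x out_w with bilinear interpolation."""
    in_h, in_w = len(region), len(region[0])
    # Assumed convention: output corners map exactly onto input corners.
    scale_y = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    scale_x = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
    out: Image = []
    for i in range(out_h):
        y = i * scale_y
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * scale_x
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            upper = region[y0][x0] * (1 - wx) + region[y0][x1] * wx
            lower = region[y1][x0] * (1 - wx) + region[y1][x1] * wx
            row.append(upper * (1 - wy) + lower * wy)
        out.append(row)
    return out


def regions_to_patches(image: Image, boxes: List[Box], size: int) -> List[Image]:
    """Bring every proposed region to the same size x size scale."""
    return [bilinear_resize(crop(image, box), size, size) for box in boxes]


# Hypothetical 4x4 image and two hand-picked boxes of different scales.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
boxes = [(0, 0, 2, 2), (0, 0, 4, 4)]
patches = regions_to_patches(image, boxes, size=3)
```

Both regions end up as 3x3 patches regardless of their original extent, which is the sense in which objects of different scales are brought to a common scale before being fed to the transformer as tokens.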

Original language	English
Title of host publication	ICIGP 2022 - Proceedings of the 2022 5th International Conference on Image and Graphics Processing
Publisher	Association for Computing Machinery
Pages	220-225
Number of pages	6
ISBN (Electronic)	9781450395465
DOIs	https://doi.org/10.1145/3512388.3512421
Publication status	Published - 7 Jan 2022
Event	5th International Conference on Image and Graphics Processing, ICIGP 2022 - Virtual, Online, China
Duration	7 Jan 2022 → 9 Jan 2022

Publication series

Name	ACM International Conference Proceeding Series

Conference

Conference	5th International Conference on Image and Graphics Processing, ICIGP 2022
Country/Territory	China
City	Virtual, Online
Period	7/01/22 → 9/01/22

Keywords

bilinear interpolation
locality and scale-invariance
orthogonalization
region proposal
vision transformers

Access to Document

10.1145/3512388.3512421

Cite this

Ge, J., Wang, Q., Tong, J., & Gao, G. (2022). RPViT: Vision Transformer Based on Region Proposal. In ICIGP 2022 - Proceedings of the 2022 5th International Conference on Image and Graphics Processing (pp. 220-225). (ACM International Conference Proceeding Series). Association for Computing Machinery. https://doi.org/10.1145/3512388.3512421

@inproceedings{3e50012e23ae4527962470aa00d066f1,

title = "RPViT: Vision Transformer Based on Region Proposal",

abstract = "Vision Transformers continue to absorb characteristics of convolutional neural networks to address their shortcomings in translation invariance and scale invariance. However, dividing the image with a simple grid often destroys position and scale features at the very beginning of the network. In this paper, we propose a vision transformer based on region proposal, which obtains these inductive biases in a simple way. Specifically, RPViT achieves locality and scale invariance by extracting regions that exhibit locality with a traditional region proposal algorithm and resizing objects of different scales to a common scale with bilinear interpolation. In addition, to enable the network to fully utilize and encode diverse candidate objects, a multi-class token approach based on orthogonalization is proposed and applied. Experiments on ImageNet demonstrate that RPViT outperforms baseline transformers and related work.",

keywords = "bilinear interpolation, locality and scale-invariance, orthogonalization, region proposal, vision transformers",

author = "Jing Ge and Qianxiang Wang and Jiahui Tong and Guangyu Gao",

note = "Publisher Copyright: {\textcopyright} 2022 ACM.; 5th International Conference on Image and Graphics Processing, ICIGP 2022 ; Conference date: 07-01-2022 Through 09-01-2022",

year = "2022",

month = jan,

day = "7",

doi = "10.1145/3512388.3512421",

language = "English",

series = "ACM International Conference Proceeding Series",

publisher = "Association for Computing Machinery",

pages = "220--225",

booktitle = "ICIGP 2022 - Proceedings of the 2022 5th International Conference on Image and Graphics Processing",

}

Ge, J, Wang, Q, Tong, J & Gao, G 2022, RPViT: Vision Transformer Based on Region Proposal. in ICIGP 2022 - Proceedings of the 2022 5th International Conference on Image and Graphics Processing. ACM International Conference Proceeding Series, Association for Computing Machinery, pp. 220-225, 5th International Conference on Image and Graphics Processing, ICIGP 2022, Virtual, Online, China, 7/01/22. https://doi.org/10.1145/3512388.3512421

RPViT: Vision Transformer Based on Region Proposal. / Ge, Jing; Wang, Qianxiang; Tong, Jiahui et al.
ICIGP 2022 - Proceedings of the 2022 5th International Conference on Image and Graphics Processing. Association for Computing Machinery, 2022. p. 220-225 (ACM International Conference Proceeding Series).

TY - GEN

T1 - RPViT

T2 - 5th International Conference on Image and Graphics Processing, ICIGP 2022

AU - Ge, Jing

AU - Wang, Qianxiang

AU - Tong, Jiahui

AU - Gao, Guangyu

PY - 2022/1/7

Y1 - 2022/1/7

N2 - Vision Transformers continue to absorb characteristics of convolutional neural networks to address their shortcomings in translation invariance and scale invariance. However, dividing the image with a simple grid often destroys position and scale features at the very beginning of the network. In this paper, we propose a vision transformer based on region proposal, which obtains these inductive biases in a simple way. Specifically, RPViT achieves locality and scale invariance by extracting regions that exhibit locality with a traditional region proposal algorithm and resizing objects of different scales to a common scale with bilinear interpolation. In addition, to enable the network to fully utilize and encode diverse candidate objects, a multi-class token approach based on orthogonalization is proposed and applied. Experiments on ImageNet demonstrate that RPViT outperforms baseline transformers and related work.

AB - Vision Transformers continue to absorb characteristics of convolutional neural networks to address their shortcomings in translation invariance and scale invariance. However, dividing the image with a simple grid often destroys position and scale features at the very beginning of the network. In this paper, we propose a vision transformer based on region proposal, which obtains these inductive biases in a simple way. Specifically, RPViT achieves locality and scale invariance by extracting regions that exhibit locality with a traditional region proposal algorithm and resizing objects of different scales to a common scale with bilinear interpolation. In addition, to enable the network to fully utilize and encode diverse candidate objects, a multi-class token approach based on orthogonalization is proposed and applied. Experiments on ImageNet demonstrate that RPViT outperforms baseline transformers and related work.

KW - bilinear interpolation

KW - locality and scale-invariance

KW - orthogonalization

KW - region proposal

KW - vision transformers

UR - http://www.scopus.com/inward/record.url?scp=85127613563&partnerID=8YFLogxK

U2 - 10.1145/3512388.3512421

DO - 10.1145/3512388.3512421

M3 - Conference contribution

AN - SCOPUS:85127613563

T3 - ACM International Conference Proceeding Series

SP - 220

EP - 225

BT - ICIGP 2022 - Proceedings of the 2022 5th International Conference on Image and Graphics Processing

PB - Association for Computing Machinery

Y2 - 7 January 2022 through 9 January 2022

ER -
