TY - GEN
T1 - RPViT
T2 - 5th International Conference on Image and Graphics Processing, ICIGP 2022
AU - Ge, Jing
AU - Wang, Qianxiang
AU - Tong, Jiahui
AU - Gao, Guangyu
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/1/7
Y1 - 2022/1/7
N2 - Vision Transformers constantly absorb characteristics of convolutional neural networks to address their own shortcomings in translation invariance and scale invariance. However, dividing the image by a simple grid often destroys position and scale features at the very beginning of the network. In this paper, we propose a vision transformer based on region proposals, which obtains these inductive biases in a simple way. Specifically, RPViT achieves locality and scale invariance by extracting local regions with a traditional region proposal algorithm and rescaling objects of different sizes to a common scale via bilinear interpolation. In addition, to enable the network to fully utilize and encode diverse candidate objects, a multi-class-token approach based on orthogonalization is proposed and applied. Experiments on ImageNet demonstrate that RPViT outperforms baseline transformers and related work.
AB - Vision Transformers constantly absorb characteristics of convolutional neural networks to address their own shortcomings in translation invariance and scale invariance. However, dividing the image by a simple grid often destroys position and scale features at the very beginning of the network. In this paper, we propose a vision transformer based on region proposals, which obtains these inductive biases in a simple way. Specifically, RPViT achieves locality and scale invariance by extracting local regions with a traditional region proposal algorithm and rescaling objects of different sizes to a common scale via bilinear interpolation. In addition, to enable the network to fully utilize and encode diverse candidate objects, a multi-class-token approach based on orthogonalization is proposed and applied. Experiments on ImageNet demonstrate that RPViT outperforms baseline transformers and related work.
KW - bilinear interpolation
KW - locality and scale-invariance
KW - orthogonalization
KW - region proposal
KW - vision transformers
UR - http://www.scopus.com/inward/record.url?scp=85127613563&partnerID=8YFLogxK
U2 - 10.1145/3512388.3512421
DO - 10.1145/3512388.3512421
M3 - Conference contribution
AN - SCOPUS:85127613563
T3 - ACM International Conference Proceeding Series
SP - 220
EP - 225
BT - ICIGP 2022 - Proceedings of the 2022 5th International Conference on Image and Graphics Processing
PB - Association for Computing Machinery
Y2 - 7 January 2022 through 9 January 2022
ER -