TY - JOUR
T1 - A simple but effective vision transformer framework for visible–infrared person re-identification
AU - Li, Yudong
AU - Zhao, Sanyuan
AU - Shen, Jianbing
N1 - Publisher Copyright:
© 2024 Elsevier Inc.
PY - 2024/12
Y1 - 2024/12
N2 - In the context of visible–infrared person re-identification (VI-ReID), the acquisition of a robust visual representation is paramount. Existing approaches predominantly rely on convolutional neural networks (CNNs), which are guided by intricately designed loss functions to extract features. In contrast, the vision transformer (ViT), a potent visual backbone, has often yielded subpar results in VI-ReID. We contend that the prevailing training methodologies and insights derived from CNNs do not seamlessly apply to ViT, leading to the underutilization of its potential in VI-ReID. One notable limitation is ViT's appetite for extensive data, exemplified by the JFT-300M dataset, to surpass CNNs. Consequently, ViT struggles to transfer its knowledge from visible to infrared images due to inadequate training data. Even the largest available dataset, SYSU-MM01, proves insufficient for ViT to glean a robust representation of infrared images. This predicament is exacerbated when ViT is trained on the smaller RegDB dataset, where slight data flow modifications drastically affect performance—a stark contrast to CNN behavior. These observations lead us to conjecture that the CNN-inspired paradigm impedes ViT's progress in VI-ReID. In light of these challenges, we undertake comprehensive ablation studies to shed new light on ViT's applicability in VI-ReID. We propose a straightforward yet effective framework, named “Idformer”, to train a high-performing ViT for VI-ReID. Idformer serves as a robust baseline that can be further enhanced with carefully designed techniques akin to those used for CNNs. Remarkably, our method attains competitive results even in the absence of auxiliary information, achieving 78.58%/76.99% Rank-1/mAP on the SYSU-MM01 dataset, as well as 96.82%/91.83% Rank-1/mAP on the RegDB dataset. The code will be made publicly accessible.
AB - In the context of visible–infrared person re-identification (VI-ReID), the acquisition of a robust visual representation is paramount. Existing approaches predominantly rely on convolutional neural networks (CNNs), which are guided by intricately designed loss functions to extract features. In contrast, the vision transformer (ViT), a potent visual backbone, has often yielded subpar results in VI-ReID. We contend that the prevailing training methodologies and insights derived from CNNs do not seamlessly apply to ViT, leading to the underutilization of its potential in VI-ReID. One notable limitation is ViT's appetite for extensive data, exemplified by the JFT-300M dataset, to surpass CNNs. Consequently, ViT struggles to transfer its knowledge from visible to infrared images due to inadequate training data. Even the largest available dataset, SYSU-MM01, proves insufficient for ViT to glean a robust representation of infrared images. This predicament is exacerbated when ViT is trained on the smaller RegDB dataset, where slight data flow modifications drastically affect performance—a stark contrast to CNN behavior. These observations lead us to conjecture that the CNN-inspired paradigm impedes ViT's progress in VI-ReID. In light of these challenges, we undertake comprehensive ablation studies to shed new light on ViT's applicability in VI-ReID. We propose a straightforward yet effective framework, named “Idformer”, to train a high-performing ViT for VI-ReID. Idformer serves as a robust baseline that can be further enhanced with carefully designed techniques akin to those used for CNNs. Remarkably, our method attains competitive results even in the absence of auxiliary information, achieving 78.58%/76.99% Rank-1/mAP on the SYSU-MM01 dataset, as well as 96.82%/91.83% Rank-1/mAP on the RegDB dataset. The code will be made publicly accessible.
KW - Cross-modality
KW - Visual infrared person re-identification
KW - ViT
UR - http://www.scopus.com/inward/record.url?scp=85206271981&partnerID=8YFLogxK
U2 - 10.1016/j.cviu.2024.104192
DO - 10.1016/j.cviu.2024.104192
M3 - Article
AN - SCOPUS:85206271981
SN - 1077-3142
VL - 249
JO - Computer Vision and Image Understanding
JF - Computer Vision and Image Understanding
M1 - 104192
ER -