Middle fusion and multi-stage, multi-form prompts for robust RGB-T tracking

Qiming Wang; Yongqiang Bai; Hongxing Song

doi:10.1016/j.neucom.2024.127959

Qiming Wang, Yongqiang Bai^*, Hongxing Song

^*Corresponding author for this work

School of Automation

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

RGB-T tracking, a vital downstream task of object tracking, has made remarkable progress in recent years. Yet it remains hindered by two major challenges: (1) the trade-off between performance and efficiency; (2) the scarcity of training data. To address the latter challenge, some recent methods employ prompts to fine-tune pre-trained RGB tracking models and leverage upstream knowledge in a parameter-efficient manner. However, these methods inadequately explore modality-independent patterns and disregard the dynamic reliability of different modalities in open scenarios. We propose M3PT, a novel RGB-T prompt tracking method that leverages middle fusion and multi-modal, multi-stage visual prompts to overcome these challenges. We pioneer the use of an adjustable middle fusion meta-framework for RGB-T tracking, which helps the tracker balance performance against efficiency to meet varied application demands. Furthermore, building on this meta-framework, we use multiple flexible prompt strategies to adapt the pre-trained model so that it explores uni-modal patterns comprehensively and models fusion-modal features better in diverse modality-priority scenarios, harnessing the potential of prompt learning in RGB-T tracking. Evaluated on six challenging benchmarks, our method surpasses previous state-of-the-art prompt fine-tuning methods while remaining highly competitive with strong full-parameter fine-tuning methods, with only 0.34 M fine-tuned parameters. Our code is available at https://github.com/rainbowsea123/M3PT.

Original language	English
Article number	127959
Journal	Neurocomputing
Volume	596
DOIs	https://doi.org/10.1016/j.neucom.2024.127959
Publication status	Published - 1 Sept 2024

Keywords

Deep learning
Multi-modal fusion
Prompt learning
RGB-T tracking

Cite this

@article{5aac52c5db7142cca318583de8f37576,

title = "Middle fusion and multi-stage, multi-form prompts for robust RGB-T tracking",

abstract = "RGB-T tracking, a vital downstream task of object tracking, has made remarkable progress in recent years. Yet it remains hindered by two major challenges: (1) the trade-off between performance and efficiency; (2) the scarcity of training data. To address the latter challenge, some recent methods employ prompts to fine-tune pre-trained RGB tracking models and leverage upstream knowledge in a parameter-efficient manner. However, these methods inadequately explore modality-independent patterns and disregard the dynamic reliability of different modalities in open scenarios. We propose M3PT, a novel RGB-T prompt tracking method that leverages middle fusion and multi-modal, multi-stage visual prompts to overcome these challenges. We pioneer the use of an adjustable middle fusion meta-framework for RGB-T tracking, which helps the tracker balance performance against efficiency to meet varied application demands. Furthermore, building on this meta-framework, we use multiple flexible prompt strategies to adapt the pre-trained model so that it explores uni-modal patterns comprehensively and models fusion-modal features better in diverse modality-priority scenarios, harnessing the potential of prompt learning in RGB-T tracking. Evaluated on six challenging benchmarks, our method surpasses previous state-of-the-art prompt fine-tuning methods while remaining highly competitive with strong full-parameter fine-tuning methods, with only 0.34 M fine-tuned parameters. Our code is available at https://github.com/rainbowsea123/M3PT.",

keywords = "Deep learning, Multi-modal fusion, Prompt learning, RGB-T tracking",

author = "Qiming Wang and Yongqiang Bai and Hongxing Song",

note = "Publisher Copyright: {\textcopyright} 2024 Elsevier B.V.",

year = "2024",

month = sep,

day = "1",

doi = "10.1016/j.neucom.2024.127959",

language = "English",

volume = "596",

journal = "Neurocomputing",

issn = "0925-2312",

publisher = "Elsevier B.V.",

}

TY - JOUR

T1 - Middle fusion and multi-stage, multi-form prompts for robust RGB-T tracking

AU - Wang, Qiming

AU - Bai, Yongqiang

AU - Song, Hongxing

PY - 2024/9/1

Y1 - 2024/9/1

AB - RGB-T tracking, a vital downstream task of object tracking, has made remarkable progress in recent years. Yet it remains hindered by two major challenges: (1) the trade-off between performance and efficiency; (2) the scarcity of training data. To address the latter challenge, some recent methods employ prompts to fine-tune pre-trained RGB tracking models and leverage upstream knowledge in a parameter-efficient manner. However, these methods inadequately explore modality-independent patterns and disregard the dynamic reliability of different modalities in open scenarios. We propose M3PT, a novel RGB-T prompt tracking method that leverages middle fusion and multi-modal, multi-stage visual prompts to overcome these challenges. We pioneer the use of an adjustable middle fusion meta-framework for RGB-T tracking, which helps the tracker balance performance against efficiency to meet varied application demands. Furthermore, building on this meta-framework, we use multiple flexible prompt strategies to adapt the pre-trained model so that it explores uni-modal patterns comprehensively and models fusion-modal features better in diverse modality-priority scenarios, harnessing the potential of prompt learning in RGB-T tracking. Evaluated on six challenging benchmarks, our method surpasses previous state-of-the-art prompt fine-tuning methods while remaining highly competitive with strong full-parameter fine-tuning methods, with only 0.34 M fine-tuned parameters. Our code is available at https://github.com/rainbowsea123/M3PT.

KW - Deep learning

KW - Multi-modal fusion

KW - Prompt learning

KW - RGB-T tracking

UR - http://www.scopus.com/inward/record.url?scp=85195215076&partnerID=8YFLogxK

U2 - 10.1016/j.neucom.2024.127959

DO - 10.1016/j.neucom.2024.127959

M3 - Article

AN - SCOPUS:85195215076

SN - 0925-2312

VL - 596

JO - Neurocomputing

JF - Neurocomputing

M1 - 127959

ER -
