Abstract
Reliable semantic segmentation is essential for intelligent systems, yet significant problems remain: 1) Existing RGB-Thermal (RGB-T) segmentation models rely mainly on visual features and lack textual information, which may lead to inaccurate segmentation when categories share similar visual characteristics. 2) While the Segment Anything Model (SAM) excels at instance-level segmentation, integrating it with thermal images and text is hindered by modality heterogeneity and computational inefficiency. Motivated by these observations, we introduce AdaptRGB-T, a parameter-efficient fine-tuning framework that uses Low-Rank Adaptation (LoRA) to adapt SAM for RGB-T semantic segmentation. Specifically, we propose an Enhanced Transformer Block (ETB) that freezes SAM's original transformer blocks and incorporates trainable LoRA layers for efficient RGB-T feature fusion. Additionally, we incorporate CLIP-generated text embeddings in the mask decoder to enable semantic alignment, which further rectifies classification errors and improves semantic understanding accuracy. Experimental results across diverse datasets demonstrate that our method achieves superior performance in challenging scenarios with fewer trainable parameters. The code will be available at https://github.com/mengyu212/AdaptRGBT.
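The core idea of the LoRA adaptation described above can be illustrated with a minimal sketch: the pretrained weight is frozen and only a low-rank update is trainable. All names, shapes, and the rank value below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Minimal LoRA sketch (illustrative, not the paper's code):
# the adapted layer computes y = W x + B (A x), where the pretrained
# weight W is frozen and only the low-rank factors A and B are trained.
rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 8, 2  # hypothetical dimensions

W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-init

def lora_forward(x):
    """Frozen path W x plus low-rank update B (A x)."""
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
y = lora_forward(x)
# With B zero-initialized, the adapted layer initially matches the frozen layer,
# so fine-tuning starts from the pretrained model's behavior.
assert np.allclose(y, W @ x)
```

Because B starts at zero, training begins exactly at the pretrained model and only the small factors A and B (rank × d parameters each) need gradients, which is why such adapters are parameter-efficient.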
| Original language | English |
|---|---|
| Article number | 132060 |
| Journal | Neurocomputing |
| Volume | 664 |
| DOIs | |
| Publication status | Published - 1 Feb 2026 |
| Externally published | Yes |
Keywords
- LoRA fine-tuning
- RGB-T semantic segmentation
- Segment anything model
- Textual guidance