AerialCLIP: Lightweight Open-Vocabulary Segmentation for UAV-Based Aerial Images

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The increasing use of unmanned aerial vehicles (UAVs) for remote sensing image segmentation has revolutionized applications such as smart agriculture, disaster monitoring, and urban planning. However, current methods often rely on fully supervised learning, which requires extensive labeled data and offers little zero-shot capability for unseen categories. To address these challenges, we propose AerialCLIP, a lightweight open-vocabulary method for real-time semantic segmentation of UAV-captured remote sensing images, based on the widely used vision-language model (VLM) CLIP. While CLIP excels at zero-shot prediction, its large parameter count prevents direct deployment on UAV platforms with limited computational resources. We therefore introduce a two-stage architecture that incorporates a saliency-based mask proposal generation (SMPG) module to efficiently generate foreground class masks. Additionally, we apply knowledge distillation to reduce the computational overhead of CLIP, enabling deployment on resource-constrained edge devices. Our extensive experiments on multiple UAV-based remote sensing datasets (UAVid, UDD5, and VDD) demonstrate that AerialCLIP achieves significant improvements, with an average mIoU of 44.1%, 51.2%, and 45.9%, respectively, while reducing model parameters by over 50%, showcasing both high accuracy and parameter efficiency.
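For readers unfamiliar with the metric reported in the abstract, mean intersection-over-union (mIoU) averages the per-class ratio of correctly labeled pixels to the union of predicted and ground-truth pixels for that class. The following is a minimal sketch of that standard definition, not the authors' evaluation code. The function name `mean_iou`, the flat label-list inputs, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in pred or target.

    pred, target: equal-length flat sequences of integer class labels
    (hypothetical input format, chosen for illustration).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes

    # Count per-class predicted pixels, ground-truth pixels, and matches.
    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # assumption: ignore classes absent from both inputs
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Toy example with two classes and four pixels.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Evaluation toolkits differ in how they treat ignored labels and absent classes, so values from this sketch are not directly comparable to the figures quoted in the abstract.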

Original language: English
Title of host publication: Proceedings of the 44th Chinese Control Conference, CCC 2025
Editors: Jian Sun, Hongpeng Yin
Publisher: IEEE Computer Society
Pages: 8193-8198
Number of pages: 6
ISBN (Electronic): 9789887581611
DOIs
Publication status: Published - 2025
Event: 44th Chinese Control Conference, CCC 2025 - Chongqing, China
Duration: 28 Jul 2025 to 30 Jul 2025

Publication series

Name: Chinese Control Conference, CCC
ISSN (Print): 1934-1768
ISSN (Electronic): 2161-2927

Conference

Conference: 44th Chinese Control Conference, CCC 2025
Country/Territory: China
City: Chongqing
Period: 28/07/25 to 30/07/25

Keywords

  • open-vocabulary learning
  • Remote sensing
  • semantic segmentation
  • UAVs
  • vision-language models
