TY - GEN
T1 - AerialCLIP
T2 - 44th Chinese Control Conference, CCC 2025
AU - Jia, Puyang
AU - Gao, Yan
AU - Li, Weixing
AU - Gao, Qi
AU - Pan, Feng
N1 - Publisher Copyright:
© 2025 Technical Committee on Control Theory, Chinese Association of Automation.
PY - 2025
Y1 - 2025
N2 - The increasing use of unmanned aerial vehicles (UAVs) for remote sensing image segmentation has revolutionized applications such as smart agriculture, disaster monitoring, and urban planning. However, current methods often rely on fully supervised learning, requiring extensive labeled data and struggling with zero-shot capabilities for unseen categories. To address these challenges, we propose AerialCLIP, a lightweight open-vocabulary method for real-time semantic segmentation of UAV-captured remote sensing images, based on the widely used vision-language model (VLM), i.e., CLIP. While CLIP excels in zero-shot predictions, its large parameter size prevents direct application on UAV platforms with limited computational resources. Therefore, we introduce a two-stage architecture, incorporating a saliency-based mask proposal generation (SMPG) module to efficiently generate foreground class masks. Additionally, we apply knowledge distillation to reduce the computational overhead of CLIP, enabling deployment on resource-constrained edge devices. Our extensive experiments across multiple UAV-based remote sensing datasets (UAVid, UDD5, and VDD) demonstrate that AerialCLIP achieves significant improvements, with an average mIoU of 44.1%, 51.2%, and 45.9%, respectively, while reducing model parameters by over 50%, showcasing both high accuracy and parameter efficiency.
AB - The increasing use of unmanned aerial vehicles (UAVs) for remote sensing image segmentation has revolutionized applications such as smart agriculture, disaster monitoring, and urban planning. However, current methods often rely on fully supervised learning, requiring extensive labeled data and struggling with zero-shot capabilities for unseen categories. To address these challenges, we propose AerialCLIP, a lightweight open-vocabulary method for real-time semantic segmentation of UAV-captured remote sensing images, based on the widely used vision-language model (VLM), i.e., CLIP. While CLIP excels in zero-shot predictions, its large parameter size prevents direct application on UAV platforms with limited computational resources. Therefore, we introduce a two-stage architecture, incorporating a saliency-based mask proposal generation (SMPG) module to efficiently generate foreground class masks. Additionally, we apply knowledge distillation to reduce the computational overhead of CLIP, enabling deployment on resource-constrained edge devices. Our extensive experiments across multiple UAV-based remote sensing datasets (UAVid, UDD5, and VDD) demonstrate that AerialCLIP achieves significant improvements, with an average mIoU of 44.1%, 51.2%, and 45.9%, respectively, while reducing model parameters by over 50%, showcasing both high accuracy and parameter efficiency.
KW - open-vocabulary learning
KW - Remote sensing
KW - semantic segmentation
KW - UAVs
KW - vision-language models
UR - https://www.scopus.com/pages/publications/105020276029
U2 - 10.23919/CCC64809.2025.11179646
DO - 10.23919/CCC64809.2025.11179646
M3 - Conference contribution
AN - SCOPUS:105020276029
T3 - Chinese Control Conference, CCC
SP - 8193
EP - 8198
BT - Proceedings of the 44th Chinese Control Conference, CCC 2025
A2 - Sun, Jian
A2 - Yin, Hongpeng
PB - IEEE Computer Society
Y2 - 28 July 2025 through 30 July 2025
ER -