TY - JOUR
T1 - Improving the Generalization of Visual Classification Models Across IoT Cameras via Cross-Modal Inference and Fusion
AU - Guan, Qing Ling
AU - Zheng, Yuze
AU - Meng, Lei
AU - Dong, Li Quan
AU - Hao, Qun
N1 - Publisher Copyright:
© 2014 IEEE.
PY - 2023/9/15
Y1 - 2023/9/15
AB - The performance of visual classification models across Internet of Things devices is usually limited by changes in local environments, resulting from the diverse appearances of the target objects and differences in lighting conditions and background scenes. To alleviate these problems, existing studies usually introduce multimodal information to guide the learning process of the visual classification models, so that the models extract visual features from discriminative image regions. In particular, cross-modal alignment between visual and textual features has been considered an effective approach to this task, as it learns a domain-consistent latent feature space for the visual and semantic features. However, this approach may suffer from the heterogeneity between modalities, such as the mismatch between the distributions of the multimodal features and the differences in the learned feature values. To alleviate this problem, this article first presents a comparative analysis of the functionality of various alignment strategies and their impacts on improving visual classification. Subsequently, a cross-modal inference and fusion framework (termed CRIF) is proposed to align the heterogeneous features in both their distributions and values. More importantly, CRIF includes a cross-modal information enrichment module that learns the mappings from the visual to the semantic space and improves the final classification. We conduct experiments on four benchmark data sets, i.e., the Vireo-Food172, NUS-WIDE, MSR-VTT, and ActivityNet Captions data sets. We report state-of-the-art results for basic classification tasks on the four data sets and conduct further experiments on feature alignment and fusion. The experimental results verify that CRIF effectively improves the learning ability of visual classification models and that it is a model-agnostic framework that consistently improves the performance of state-of-the-art visual classification models.
KW - Feature alignment
KW - heterogeneous domain
KW - image classification
KW - semantic inference
UR - http://www.scopus.com/inward/record.url?scp=85153401392&partnerID=8YFLogxK
DO - 10.1109/JIOT.2023.3265645
M3 - Article
AN - SCOPUS:85153401392
SN - 2327-4662
VL - 10
SP - 15835
EP - 15846
JO - IEEE Internet of Things Journal
JF - IEEE Internet of Things Journal
IS - 18
ER -