Weakly Supervised Visual Saliency Prediction

Lai Zhou, Tianfei Zhou*, Salman Khan, Hanqiu Sun, Jianbing Shen, Ling Shao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

14 Citations (Scopus)

Abstract

The success of current deep saliency models heavily depends on large amounts of annotated human fixation data to fit the highly non-linear mapping between the stimuli and visual saliency. Such fully supervised data-driven approaches are annotation-intensive and often fail to consider the underlying mechanisms of visual attention. In contrast, in this paper, we introduce a model based on various cognitive theories of visual saliency, which learns visual attention patterns in a weakly supervised manner. Our approach incorporates insights from cognitive science as differentiable submodules, resulting in a unified, end-to-end trainable framework. Specifically, our model encapsulates the following important components motivated by biological vision. (a) As scene semantics are closely related to visually attentive regions, our model encodes discriminative spatial information for scene understanding through spatial visual semantics embedding. (b) To model the objectness factors in visual attention deployment, we incorporate object-level semantics embedding and object relation information. (c) Considering the 'winner-take-all' mechanism in visual stimuli processing, we model the competition mechanism among objects with softmax-based neural attention. (d) Lastly, a conditional center prior is learned to mimic the spatial distribution bias of visual attention. Furthermore, we propose novel loss functions to utilize supervision cues from image-level semantics, saliency prior knowledge, and self-information compression. Experiments show that our method achieves promising results and even outperforms many of its fully supervised counterparts. Overall, our weakly supervised saliency method takes an essential step towards reducing the annotation budget of current approaches, as well as providing a more comprehensive understanding of the visual attention mechanism. Our code is available at: https://github.com/ashleylqx/WeakFixation.git.
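As a rough illustration of the softmax-based competition described in component (c), the PyTorch sketch below pools a set of object-level features with a differentiable winner-take-all attention. This is a minimal sketch only: the class name `ObjectCompetition`, the linear scoring head, and all tensor shapes are hypothetical placeholders and are not taken from the paper or the WeakFixation repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ObjectCompetition(nn.Module):
    """Toy softmax-based 'winner-take-all' attention over object features.

    All names, shapes, and the scoring head are hypothetical; they are not
    taken from the paper or the WeakFixation repository.
    """

    def __init__(self, feat_dim: int):
        super().__init__()
        # Hypothetical scoring head: one saliency logit per object.
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, object_feats: torch.Tensor) -> torch.Tensor:
        # object_feats: (batch, num_objects, feat_dim)
        logits = self.score(object_feats).squeeze(-1)  # (batch, num_objects)
        # Softmax forces objects to compete for a fixed attention budget,
        # a differentiable relaxation of winner-take-all selection.
        weights = F.softmax(logits, dim=-1)
        # Attention-weighted pooling of the competing object features.
        return torch.einsum("bn,bnd->bd", weights, object_feats)


if __name__ == "__main__":
    comp = ObjectCompetition(feat_dim=256)
    feats = torch.randn(2, 5, 256)  # 2 images, 5 detected objects each
    print(comp(feats).shape)        # torch.Size([2, 256])
```

In the full model described in the abstract, such competition weights would act alongside the scene-semantics embedding, object relation information, and the learned conditional center prior; the sketch isolates only the competition step.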

Original language: English
Pages (from-to): 3111-3124
Number of pages: 14
Journal: IEEE Transactions on Image Processing
Volume: 31
Publication status: Published - 2022
Externally published: Yes
