Weakly Supervised Visual Saliency Prediction

Lai Zhou, Tianfei Zhou*, Salman Khan, Hanqiu Sun, Jianbing Shen, Ling Shao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

14 Citations (Scopus)

Abstract

The success of current deep saliency models heavily depends on large amounts of annotated human fixation data to fit the highly non-linear mapping between stimuli and visual saliency. Such fully supervised data-driven approaches are annotation-intensive and often fail to consider the underlying mechanisms of visual attention. In contrast, in this paper, we introduce a model based on various cognitive theories of visual saliency, which learns visual attention patterns in a weakly supervised manner. Our approach incorporates insights from cognitive science as differentiable submodules, resulting in a unified, end-to-end trainable framework. Specifically, our model encapsulates the following important components motivated by biological vision. (a) As scene semantics are closely related to visually attentive regions, our model encodes discriminative spatial information for scene understanding through spatial visual semantics embedding. (b) To model the objectness factors in visual attention deployment, we incorporate object-level semantics embedding and object relation information. (c) Considering the 'winner-take-all' mechanism in visual stimuli processing, we model the competition mechanism among objects with softmax-based neural attention. (d) Lastly, a conditional center prior is learned to mimic the spatial distribution bias of visual attention. Furthermore, we propose novel loss functions to utilize supervision cues from image-level semantics, saliency prior knowledge, and self-information compression. Experiments show that our method achieves promising results, and even outperforms many of its fully supervised counterparts. Overall, our weakly supervised saliency method takes an essential step towards reducing the annotation budget of current approaches, as well as providing a more comprehensive understanding of the visual attention mechanism. Our code is available at: https://github.com/ashleylqx/WeakFixation.git.
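To make component (c) concrete, the sketch below illustrates one common way a softmax-based neural attention module can realize a soft 'winner-take-all' competition among object-level embeddings. This is a minimal, assumption-laden illustration, not the authors' implementation (see the linked repository for that): the module name `ObjectCompetitionAttention`, the scoring MLP, and all dimensions are hypothetical.

```python
# Minimal sketch (NOT the authors' code): softmax attention over per-object
# features, approximating a soft winner-take-all competition. All names and
# shapes here are illustrative assumptions.
import torch
import torch.nn as nn


class ObjectCompetitionAttention(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Small MLP that assigns each object embedding a scalar competition score.
        self.score = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, obj_feats: torch.Tensor):
        # obj_feats: (batch, num_objects, feat_dim) object-level embeddings.
        logits = self.score(obj_feats).squeeze(-1)   # (batch, num_objects)
        weights = torch.softmax(logits, dim=-1)      # weights sum to 1: objects compete
        # Weighted pooling: the highest-scoring ("winning") objects dominate the output.
        pooled = torch.einsum('bn,bnd->bd', weights, obj_feats)
        return pooled, weights


if __name__ == "__main__":
    # Usage example with random tensors standing in for detected-object embeddings.
    attn = ObjectCompetitionAttention(feat_dim=512)
    feats = torch.randn(2, 8, 512)                   # 2 images, 8 objects each
    pooled, weights = attn(feats)
    print(pooled.shape, weights.shape)               # (2, 512) and (2, 8)
```

Because the softmax sharpens differences between object scores, a few salient objects receive most of the attention mass, which is the sense in which such a module mimics winner-take-all processing while remaining fully differentiable and end-to-end trainable.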

Original language: English
Pages (from-to): 3111-3124
Number of pages: 14
Journal: IEEE Transactions on Image Processing
Volume: 31
Publication status: Published - 2022
Externally published: Yes

Keywords

  • Visual attention prediction
  • deep learning
  • saliency prediction
  • weakly supervised learning

