Look, Read and Feel: Benchmarking Ads Understanding with Multimodal Multitask Learning

Huaizheng Zhang, Yong Luo, Qiming Ai, Yonggang Wen, Han Hu*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

10 Citations (Scopus)

Abstract

Given the massive advertising market and the rapid growth of online multimedia content (such as videos), it has become common to promote advertisements (ads) together with multimedia content. However, manually finding relevant ads to match the provided content is labor-intensive, and hence automatic advertising techniques have been developed. Since ads are often hard to understand from their visual appearance alone due to the visual metaphors they contain, other modalities, such as the embedded text, should be exploited for understanding. To further improve user experience, it is necessary to understand both an ad's topic and its sentiment. This motivates us to develop a novel deep multimodal multitask framework that integrates multiple modalities to achieve effective topic and sentiment prediction simultaneously for ads understanding. In particular, in our framework, termed Deep$M^2$Ad, we first extract multimodal information from ads and learn high-level, comparable representations. The visual metaphor of the ad is decoded in an unsupervised manner. The obtained representations are then fed into the proposed hierarchical multimodal attention modules to learn task-specific representations for final prediction. A multitask loss function is also designed to jointly train the topic and sentiment prediction models in an end-to-end manner, where bottom-layer parameters are shared to alleviate over-fitting. We conduct extensive experiments on a large-scale advertisement dataset and achieve state-of-the-art performance on both prediction tasks. The obtained results can serve as a benchmark for ads understanding.
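The abstract's core idea, a shared bottom layer feeding two task heads trained with a joint loss, can be sketched as follows. This is an illustrative NumPy sketch, not the authors' Deep$M^2$Ad code: all layer shapes, the ReLU encoder, and the task-weighting factor `lam` are assumptions for demonstration.

```python
import numpy as np

# Sketch of joint multitask training: one shared "bottom" representation,
# separate topic and sentiment heads, combined loss L = L_topic + lam * L_sentiment.
# All dimensions and names here are illustrative assumptions.

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the true class
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

# Fused multimodal ad representation (e.g. visual + text features), batch of 4
x = rng.normal(size=(4, 16))
topic_labels = np.array([0, 2, 1, 2])      # 3 hypothetical topic classes
sentiment_labels = np.array([1, 0, 1, 1])  # 2 hypothetical sentiment classes

W_shared = rng.normal(size=(16, 8)) * 0.1  # shared bottom-layer parameters
W_topic = rng.normal(size=(8, 3)) * 0.1    # topic prediction head
W_sent = rng.normal(size=(8, 2)) * 0.1     # sentiment prediction head

h = np.maximum(x @ W_shared, 0.0)          # shared representation (ReLU)
loss_topic = cross_entropy(softmax(h @ W_topic), topic_labels)
loss_sent = cross_entropy(softmax(h @ W_sent), sentiment_labels)

lam = 0.5                                   # assumed task-weighting hyperparameter
loss = loss_topic + lam * loss_sent
print(loss > 0.0)
```

Because both heads backpropagate through `W_shared`, each task regularizes the other, which is the over-fitting mitigation the abstract attributes to parameter sharing.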

Original language: English
Title of host publication: MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 430-438
Number of pages: 9
ISBN (Electronic): 9781450379885
DOIs
Publication status: Published - 12 Oct 2020
Event: 28th ACM International Conference on Multimedia, MM 2020 - Virtual, Online, United States
Duration: 12 Oct 2020 - 16 Oct 2020

Publication series

Name: MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia

Conference

Conference: 28th ACM International Conference on Multimedia, MM 2020
Country/Territory: United States
City: Virtual, Online
Period: 12/10/20 - 16/10/20

Keywords

  • ads understanding
  • multimodal learning
  • multitask learning
  • neural networks
  • online advertising
