TY - JOUR
T1 - Learning Cooperative Neural Modules for Stylized Image Captioning
AU - Wu, Xinxiao
AU - Zhao, Wentian
AU - Luo, Jiebo
N1 - Publisher Copyright: © 2022, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2022/9
Y1 - 2022/9
N2 - Recent progress in stylized image captioning has been achieved through the encoder-decoder framework, which generates a sentence in a one-pass decoding process. However, it remains difficult for such a decoding process to simultaneously capture the syntactic structure, infer the semantic concepts, and express the linguistic style. Research in psycholinguistics has revealed that human language production involves multiple stages, starting with several rough concepts and ending with fluent sentences. With this in mind, we propose a novel stylized image captioning approach that generates stylized sentences in a multi-pass decoding process by training three cooperative neural modules under the reinforcement learning paradigm. A low-level neural module, called the syntax module, first generates the overall syntactic structure of the stylized sentence. Next, two high-level neural modules, namely the concept module and the style module, incorporate the words that describe factual content and the words that express linguistic style, respectively. Since the three modules contribute to different aspects of the stylized sentence, i.e., fluency, relevance of the factual content, and style accuracy, we encourage the modules to specialize in their own tasks by designing different rewards for different actions. We also design an attention mechanism to facilitate communication between the high-level and low-level modules. With the help of this attention mechanism, the high-level modules can take the global structure of the sentence into consideration and maintain consistency between the factual content and the linguistic style. Evaluations on several public benchmark datasets demonstrate that our method outperforms existing one-pass decoding methods on multiple evaluation metrics.
KW - Cooperative modular networks
KW - Multi-pass decoding
KW - Reinforcement learning
KW - Stylized image captioning
UR - http://www.scopus.com/inward/record.url?scp=85134645418&partnerID=8YFLogxK
U2 - 10.1007/s11263-022-01636-2
DO - 10.1007/s11263-022-01636-2
M3 - Article
AN - SCOPUS:85134645418
SN - 0920-5691
VL - 130
SP - 2305
EP - 2320
JO - International Journal of Computer Vision
JF - International Journal of Computer Vision
IS - 9
ER -