Content-Attention Representation by Factorized Action-Scene Network for Action Recognition

Jingyi Hou, Xinxiao Wu, Yuchao Sun, Yunde Jia*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

44 Citations (Scopus)

Abstract

During action recognition in videos, irrelevant motions in the background can greatly degrade the performance of recognizing the specific actions that we are actually concerned with. In this paper, a novel deep neural network, called factorized action-scene network (FASNet), is proposed to encode and fuse the most relevant and informative semantic cues for action recognition. Specifically, we decompose the FASNet into two components. One is a newly designed encoding network, named content attention network (CANet), which encodes local spatiotemporal features to learn action representations robust to the noise of irrelevant motions. The other is a fusion network, which integrates the pretrained CANet to fuse the encoded spatiotemporal features with the contextual scene feature extracted from the same video, for learning more descriptive and discriminative action representations. Moreover, different from existing deep learning methods for generic action recognition, which apply the softmax loss function as training guidance, we formulate two loss functions for guiding the proposed model to accomplish more specific action recognition tasks, i.e., the multilabel correlation loss for multilabel action recognition and the triplet loss for complex event detection. Extensive experiments on the Hollywood2 dataset and the TRECVID MEDTest 14 dataset show that our method achieves superior performance compared with state-of-the-art methods.
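As a concrete illustration of the second training objective mentioned above, the following is a minimal sketch of a standard hinge-based triplet loss on video embeddings. It assumes the common formulation (squared Euclidean distances with a fixed margin); the paper's exact formulation, sampling strategy, and margin value may differ, and all names here are illustrative.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-based triplet loss: pull the positive embedding toward the
    anchor and push the negative at least `margin` farther away.
    Illustrative sketch only; not the paper's exact formulation."""
    d_pos = np.sum((anchor - positive) ** 2)  # distance to same-event video
    d_neg = np.sum((anchor - negative) ** 2)  # distance to different-event video
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings: anchor and positive depict the same event,
# negative does not.
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
n = np.array([-1.0, 0.0])
print(triplet_loss(a, p, n))  # zero: the negative is already far enough away
```

Minimizing this loss over many such triplets arranges the embedding space so that videos of the same complex event cluster together, which is what makes nearest-neighbor event detection feasible.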

Original language: English
Pages (from-to): 1537-1547
Number of pages: 11
Journal: IEEE Transactions on Multimedia
Volume: 20
Issue: 6
DOI
Publication status: Published - June 2018
