Predicting image caption by a unified hierarchical model

Lin Bai; Kan Li

doi:10.1109/ICME.2015.7177427

School of Computer Science

Beijing Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

4 citations (Scopus)

Abstract

Automatically describing the content of an image is a challenging task in artificial intelligence. The difficulty is particularly pronounced in recognizing activities and in deriving the image caption from an analysis of the relationships among the activities in the image. This paper presents a unified hierarchical model that models the interaction activity between a human and a nearby object, and then infers the image content by analyzing the logical relationships among the interaction activities. In our model, the first-layer factored three-way interaction machine models the 3D spatial context between the human and the relevant object to directly aid the prediction of human-object interaction activities. The activities are then further processed by the top-layer factored three-way interaction machine, which learns the image content with the help of the 3D spatial context among the activities. Experiments on a joint dataset show that our unified hierarchical model outperforms state-of-the-art methods in predicting human-object interaction activities and in generating image captions.
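The abstract does not give the model's equations. As a rough illustration only, the following sketch shows the generic form of a factored three-way multiplicative interaction, in which a full three-way weight tensor W[i][j][k] is replaced by a sum over F factors of products of three weight matrices, reducing the parameter count from I·J·K to F·(I+J+K). Treating the three inputs as a human feature vector, an object feature vector and a one-hot activity label is an assumption made here for illustration, and all names (`project`, `factored_three_way_score`, `predict_activity`, the toy weights) are hypothetical. This is not the authors' implementation: it omits the 3D spatial-context features, training, and the top layer.

```python
"""Generic factored three-way interaction: an illustrative sketch only.

Not the authors' model. Assumption: the three inputs are a human feature
vector x, an object feature vector y, and a one-hot activity label z.
"""
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # shape: (input_dim, n_factors)


def project(v: Vector, w: Matrix) -> Vector:
    """Return the factor projections [sum_i v[i] * w[i][f] for each factor f]."""
    n_factors = len(w[0])
    return [sum(v[i] * w[i][f] for i in range(len(v))) for f in range(n_factors)]


def factored_three_way_score(x: Vector, y: Vector, z: Vector,
                             wx: Matrix, wy: Matrix, wz: Matrix) -> float:
    """Score = sum_f (x . wx[:, f]) * (y . wy[:, f]) * (z . wz[:, f]).

    This equals sum_{i,j,k} x[i] * y[j] * z[k] * W[i][j][k] for the implicit
    tensor W[i][j][k] = sum_f wx[i][f] * wy[j][f] * wz[k][f], which is never
    materialized.
    """
    px, py, pz = project(x, wx), project(y, wy), project(z, wz)
    return sum(a * b * c for a, b, c in zip(px, py, pz))


def predict_activity(x: Vector, y: Vector,
                     wx: Matrix, wy: Matrix, wz: Matrix) -> int:
    """Return the index k of the one-hot label z = e_k with the highest score."""
    n_labels = len(wz)
    scores = []
    for k in range(n_labels):
        z = [1.0 if j == k else 0.0 for j in range(n_labels)]
        scores.append(factored_three_way_score(x, y, z, wx, wy, wz))
    return max(range(n_labels), key=lambda idx: scores[idx])


if __name__ == "__main__":
    # Hypothetical toy weights: 2-D human and object features,
    # 3 activity labels, 2 factors.
    wx = [[1.0, 0.0], [0.0, 1.0]]
    wy = [[1.0, 1.0], [0.0, 1.0]]
    wz = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    x, y = [1.0, 2.0], [0.5, -1.0]
    print(predict_activity(x, y, wx, wy, wz))  # prints 0
```

With these toy weights the three label scores are 0.5, -1.0 and -0.5, so label 0 is selected. In a hierarchical arrangement like the one the abstract describes, scores of this kind from the lower layer could feed a second interaction layer, but how the paper connects the two layers is not stated here.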

Original language	English
Host publication title	2015 IEEE International Conference on Multimedia and Expo, ICME 2015
Publisher	IEEE Computer Society
ISBN (Electronic)	9781479970827
DOI	https://doi.org/10.1109/ICME.2015.7177427
Publication status	Published - 4 Aug 2015
Event	IEEE International Conference on Multimedia and Expo, ICME 2015 - Turin, Italy. Duration: 29 Jun 2015 → 3 Jul 2015

Publication series

Name	Proceedings - IEEE International Conference on Multimedia and Expo
Volume	2015-August
ISSN (Print)	1945-7871
ISSN (Electronic)	1945-788X

Conference

Conference	IEEE International Conference on Multimedia and Expo, ICME 2015
Country/Territory	Italy
City	Turin
Period	29/06/15 → 3/07/15

Cite this

@inproceedings{3a5fb355cf6b48a8b56b842401961d44,

title = "Predicting image caption by a unified hierarchical model",

abstract = "Automatically describing the content of an image is a challenging task in artificial intelligence. The difficulty is particularly pronounced in activity recognition and the image caption revealed by the relationship analysis of the activities involved in the image. This paper presents a unified hierarchical model to model the interaction activity between human and nearby object, and then speculates the image content by analyzing the logical relationship among the interaction activities. In our model, the first-layer factored three-way interaction machine models the 3D spatial context between human and the relevant object to straightly aid the prediction of human-object interaction activities. Then, the activities are further processed through the top-layer factored three-way interaction machine to learn the image content with the help of 3D spatial context among the activities. Experiments on joint dataset show that our unified hierarchical model outperforms state-of-the-arts in predicting human-object interaction activities and describing the image caption.",

keywords = "3D spatial context, Factored three-way interaction, Human-object interaction activity, Image caption, Unified hierarchical model",

author = "Lin Bai and Kan Li",

note = "Publisher Copyright: {\textcopyright} 2015 IEEE.; IEEE International Conference on Multimedia and Expo, ICME 2015 ; Conference date: 29-06-2015 Through 03-07-2015",

year = "2015",

month = aug,

day = "4",

doi = "10.1109/ICME.2015.7177427",

language = "English",

series = "Proceedings - IEEE International Conference on Multimedia and Expo",

publisher = "IEEE Computer Society",

booktitle = "2015 IEEE International Conference on Multimedia and Expo, ICME 2015",

address = "United States",

}

Bai, L & Li, K 2015, Predicting image caption by a unified hierarchical model. In 2015 IEEE International Conference on Multimedia and Expo, ICME 2015, 7177427, Proceedings - IEEE International Conference on Multimedia and Expo, vol. 2015-August, IEEE Computer Society, IEEE International Conference on Multimedia and Expo, ICME 2015, Turin, Italy, 29/06/15. https://doi.org/10.1109/ICME.2015.7177427

TY  - GEN
T1  - Predicting image caption by a unified hierarchical model
AU  - Bai, Lin
AU  - Li, Kan
PY  - 2015/8/4
Y1  - 2015/8/4
N2  - Automatically describing the content of an image is a challenging task in artificial intelligence. The difficulty is particularly pronounced in activity recognition and the image caption revealed by the relationship analysis of the activities involved in the image. This paper presents a unified hierarchical model to model the interaction activity between human and nearby object, and then speculates the image content by analyzing the logical relationship among the interaction activities. In our model, the first-layer factored three-way interaction machine models the 3D spatial context between human and the relevant object to straightly aid the prediction of human-object interaction activities. Then, the activities are further processed through the top-layer factored three-way interaction machine to learn the image content with the help of 3D spatial context among the activities. Experiments on joint dataset show that our unified hierarchical model outperforms state-of-the-arts in predicting human-object interaction activities and describing the image caption.
AB  - Automatically describing the content of an image is a challenging task in artificial intelligence. The difficulty is particularly pronounced in activity recognition and the image caption revealed by the relationship analysis of the activities involved in the image. This paper presents a unified hierarchical model to model the interaction activity between human and nearby object, and then speculates the image content by analyzing the logical relationship among the interaction activities. In our model, the first-layer factored three-way interaction machine models the 3D spatial context between human and the relevant object to straightly aid the prediction of human-object interaction activities. Then, the activities are further processed through the top-layer factored three-way interaction machine to learn the image content with the help of 3D spatial context among the activities. Experiments on joint dataset show that our unified hierarchical model outperforms state-of-the-arts in predicting human-object interaction activities and describing the image caption.
KW  - 3D spatial context
KW  - Factored three-way interaction
KW  - Human-object interaction activity
KW  - Image caption
KW  - Unified hierarchical model
UR  - http://www.scopus.com/inward/record.url?scp=84946029027&partnerID=8YFLogxK
U2  - 10.1109/ICME.2015.7177427
DO  - 10.1109/ICME.2015.7177427
M3  - Conference contribution
AN  - SCOPUS:84946029027
T3  - Proceedings - IEEE International Conference on Multimedia and Expo
BT  - 2015 IEEE International Conference on Multimedia and Expo, ICME 2015
PB  - IEEE Computer Society
T2  - IEEE International Conference on Multimedia and Expo, ICME 2015
Y2  - 29 June 2015 through 3 July 2015
ER  - 