TY - GEN
T1 - Predicting image caption by a unified hierarchical model
AU - Bai, Lin
AU - Li, Kan
N1 - Publisher Copyright: © 2015 IEEE.
PY - 2015/8/4
Y1 - 2015/8/4
AB - Automatically describing the content of an image is a challenging task in artificial intelligence. The difficulty is particularly pronounced in activity recognition and in deriving the image caption from an analysis of the relationships among the activities in the image. This paper presents a unified hierarchical model that models the interaction activity between a human and a nearby object, and then infers the image content by analyzing the logical relationships among the interaction activities. In our model, the first-layer factored three-way interaction machine models the 3D spatial context between the human and the relevant object to directly aid the prediction of human-object interaction activities. The activities are then processed by the top-layer factored three-way interaction machine, which learns the image content with the help of the 3D spatial context among the activities. Experiments on a joint dataset show that our unified hierarchical model outperforms state-of-the-art methods in predicting human-object interaction activities and generating image captions.
KW - 3D spatial context
KW - Factored three-way interaction
KW - Human-object interaction activity
KW - Image caption
KW - Unified hierarchical model
UR - http://www.scopus.com/inward/record.url?scp=84946029027&partnerID=8YFLogxK
U2 - 10.1109/ICME.2015.7177427
DO - 10.1109/ICME.2015.7177427
M3 - Conference contribution
AN - SCOPUS:84946029027
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
BT - 2015 IEEE International Conference on Multimedia and Expo, ICME 2015
PB - IEEE Computer Society
T2 - IEEE International Conference on Multimedia and Expo, ICME 2015
Y2 - 29 June 2015 through 3 July 2015
ER -