TY - GEN
T1 - Parsing video events with goal inference and intent prediction
AU - Pei, Mingtao
AU - Jia, Yunde
AU - Zhu, Song-Chun
PY - 2011
Y1 - 2011
N2 - In this paper, we present an event parsing algorithm based on Stochastic Context Sensitive Grammar (SCSG) for understanding events, inferring the goals of agents, and predicting their plausible intended actions. The SCSG represents the hierarchical composition of events and the temporal relations between sub-events. The alphabet of the SCSG consists of atomic actions, which are defined by the poses of agents and their interactions with objects in the scene. The temporal relations, learned from training data, are used to distinguish events with similar structures and to interpolate missing portions of events. In comparison with existing methods, our paper makes the following contributions: i) we define atomic actions by a set of relations based on the fluents of agents and their interactions with objects in the scene; ii) our algorithm handles event insertion and multi-agent events, keeps all possible interpretations of the video to preserve ambiguities, and achieves the globally optimal parsing solution in a Bayesian framework; iii) the algorithm infers the goals of agents and predicts their intents by a top-down process; iv) the algorithm improves the detection of atomic actions by using event contexts. We show satisfactory results of event recognition and atomic action detection on a data set we captured, which contains 12 event categories in both indoor and outdoor videos.
UR - http://www.scopus.com/inward/record.url?scp=84856646751&partnerID=8YFLogxK
U2 - 10.1109/ICCV.2011.6126279
DO - 10.1109/ICCV.2011.6126279
M3 - Conference contribution
AN - SCOPUS:84856646751
SN - 9781457711015
T3 - Proceedings of the IEEE International Conference on Computer Vision
SP - 487
EP - 494
BT - 2011 International Conference on Computer Vision, ICCV 2011
T2 - 2011 IEEE International Conference on Computer Vision, ICCV 2011
Y2 - 6 November 2011 through 13 November 2011
ER -