Parsing video events with goal inference and intent prediction

Mingtao Pei; Yunde Jia; Song Chun Zhu

doi:10.1109/ICCV.2011.6126279

Mingtao Pei^*, Yunde Jia, Song Chun Zhu

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

120 Citations (Scopus)

Abstract

In this paper, we present an event parsing algorithm based on Stochastic Context Sensitive Grammar (SCSG) for understanding events, inferring the goal of agents, and predicting their plausible intended actions. The SCSG represents the hierarchical compositions of events and the temporal relations between the sub-events. The alphabet of the SCSG consists of atomic actions, which are defined by the poses of agents and their interactions with objects in the scene. The temporal relations, which are learned from the training data, are used to distinguish events with similar structures and to interpolate missing portions of events. In comparison with existing methods, our paper makes the following contributions: i) We define atomic actions by a set of relations based on the fluents of agents and their interactions with objects in the scene; ii) Our algorithm handles event insertion and multi-agent events, keeps all possible interpretations of the video to preserve the ambiguities, and achieves the globally optimal parsing solution in a Bayesian framework; iii) The algorithm infers the goal of the agents and predicts their intents by a top-down process; iv) The algorithm improves the detection of atomic actions using event contexts. We show satisfactory results of event recognition and atomic action detection on a data set we captured, which contains 12 event categories in both indoor and outdoor videos.

Original language	English
Title of host publication	2011 International Conference on Computer Vision, ICCV 2011
Pages	487-494
Number of pages	8
DOIs	https://doi.org/10.1109/ICCV.2011.6126279
Publication status	Published - 2011
Event	2011 IEEE International Conference on Computer Vision, ICCV 2011 - Barcelona, Spain; Duration: 6 Nov 2011 → 13 Nov 2011

Publication series

Name	Proceedings of the IEEE International Conference on Computer Vision

Conference

Conference	2011 IEEE International Conference on Computer Vision, ICCV 2011
Country/Territory	Spain
City	Barcelona
Period	6 Nov 2011 → 13 Nov 2011

Cite this

@inproceedings{20acbdd6fc4f4efe8658f09f327726fd,

title = "Parsing video events with goal inference and intent prediction",

abstract = "In this paper, we present an event parsing algorithm based on Stochastic Context Sensitive Grammar (SCSG) for understanding events, inferring the goal of agents, and predicting their plausible intended actions. The SCSG represents the hierarchical compositions of events and the temporal relations between the sub-events. The alphabet of the SCSG consists of atomic actions, which are defined by the poses of agents and their interactions with objects in the scene. The temporal relations, which are learned from the training data, are used to distinguish events with similar structures and to interpolate missing portions of events. In comparison with existing methods, our paper makes the following contributions: i) We define atomic actions by a set of relations based on the fluents of agents and their interactions with objects in the scene; ii) Our algorithm handles event insertion and multi-agent events, keeps all possible interpretations of the video to preserve the ambiguities, and achieves the globally optimal parsing solution in a Bayesian framework; iii) The algorithm infers the goal of the agents and predicts their intents by a top-down process; iv) The algorithm improves the detection of atomic actions using event contexts. We show satisfactory results of event recognition and atomic action detection on a data set we captured, which contains 12 event categories in both indoor and outdoor videos.",

author = "Mingtao Pei and Yunde Jia and Zhu, {Song Chun}",

year = "2011",

doi = "10.1109/ICCV.2011.6126279",

language = "English",

isbn = "9781457711015",

series = "Proceedings of the IEEE International Conference on Computer Vision",

pages = "487--494",

booktitle = "2011 International Conference on Computer Vision, ICCV 2011",

note = "2011 IEEE International Conference on Computer Vision, ICCV 2011 ; Conference date: 06-11-2011 Through 13-11-2011",

}

Pei, M, Jia, Y & Zhu, SC 2011, Parsing video events with goal inference and intent prediction. in 2011 International Conference on Computer Vision, ICCV 2011, 6126279, Proceedings of the IEEE International Conference on Computer Vision, pp. 487-494, 2011 IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, 6 Nov 2011. https://doi.org/10.1109/ICCV.2011.6126279

TY - GEN

T1 - Parsing video events with goal inference and intent prediction

AU - Pei, Mingtao

AU - Jia, Yunde

AU - Zhu, Song Chun

PY - 2011

Y1 - 2011

AB - In this paper, we present an event parsing algorithm based on Stochastic Context Sensitive Grammar (SCSG) for understanding events, inferring the goal of agents, and predicting their plausible intended actions. The SCSG represents the hierarchical compositions of events and the temporal relations between the sub-events. The alphabet of the SCSG consists of atomic actions, which are defined by the poses of agents and their interactions with objects in the scene. The temporal relations, which are learned from the training data, are used to distinguish events with similar structures and to interpolate missing portions of events. In comparison with existing methods, our paper makes the following contributions: i) We define atomic actions by a set of relations based on the fluents of agents and their interactions with objects in the scene; ii) Our algorithm handles event insertion and multi-agent events, keeps all possible interpretations of the video to preserve the ambiguities, and achieves the globally optimal parsing solution in a Bayesian framework; iii) The algorithm infers the goal of the agents and predicts their intents by a top-down process; iv) The algorithm improves the detection of atomic actions using event contexts. We show satisfactory results of event recognition and atomic action detection on a data set we captured, which contains 12 event categories in both indoor and outdoor videos.

UR - http://www.scopus.com/inward/record.url?scp=84856646751&partnerID=8YFLogxK

U2 - 10.1109/ICCV.2011.6126279

DO - 10.1109/ICCV.2011.6126279

M3 - Conference contribution

AN - SCOPUS:84856646751

SN - 9781457711015

T3 - Proceedings of the IEEE International Conference on Computer Vision

SP - 487

EP - 494

BT - 2011 International Conference on Computer Vision, ICCV 2011

T2 - 2011 IEEE International Conference on Computer Vision, ICCV 2011

Y2 - 6 November 2011 through 13 November 2011

ER -
