Learning and parsing video events with goal and intent prediction

Mingtao Pei*, Zhangzhang Si, Benjamin Z. Yao, Song Chun Zhu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Abstract

In this paper, we present a framework for parsing video events with a stochastic Temporal And-Or Graph (T-AOG) and for learning the T-AOG from video without supervision. The T-AOG represents a stochastic event grammar. Its alphabet consists of a set of grounded spatial relations, including the poses of agents and their interactions with objects in the scene. The terminal nodes of the T-AOG are atomic actions, each specified by a number of grounded relations over image frames. An And-node represents a sequence of actions; an Or-node represents a number of alternative ways of such concatenation. The And-Or nodes of the T-AOG generate a set of valid temporal configurations of atomic actions, which can be equivalently represented as the language of a stochastic context-free grammar (SCFG). For each And-node we model the temporal relations of its child nodes, both to distinguish events with similar structures but different temporal patterns and to interpolate missing portions of events; this makes the T-AOG grammar context-sensitive.

We propose an unsupervised learning algorithm that learns the atomic actions, the temporal relations, and the And-Or nodes under the information projection principle in a coherent probabilistic framework. We also propose an event parsing algorithm based on the T-AOG that understands events, infers the goals of agents, and predicts their plausible intended actions.

In comparison with existing methods, our paper makes the following contributions. (i) We represent events by a T-AOG with hierarchical compositions of events and temporal relations between the sub-events. (ii) We learn the grammar, including atomic actions and temporal relations, automatically from video data without manual supervision. (iii) Our algorithm infers the goals of agents and predicts their intents by a top-down process, handles event insertion and multi-agent events, keeps all possible interpretations of the video to preserve ambiguities, and achieves the globally optimal parsing solution in a Bayesian framework. (iv) The algorithm uses event context to improve the detection of atomic actions and to segment and recognize objects in the scene. Extensive experiments, covering indoor and outdoor scenes and single- and multi-agent events, are conducted to validate the effectiveness of the proposed approach.
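The T-AOG can be read as a stochastic grammar over atomic actions: Or-nodes carry branching probabilities (the SCFG rule probabilities), And-nodes concatenate their children in time, and terminals are atomic actions. The following minimal Python sketch illustrates this structure; all node and event names are hypothetical and not taken from the paper, and the paper's actual temporal-relation model on And-node children (which makes the grammar context-sensitive) is omitted here.

```python
# Minimal sketch of a Temporal And-Or Graph, in the spirit of the paper.
# All class names and event names are hypothetical illustrations.
import random
from dataclasses import dataclass, field
from typing import List, Tuple, Union

@dataclass
class Terminal:
    """An atomic action, grounded in spatial relations over frames."""
    name: str

@dataclass
class AndNode:
    """A temporal composition: children occur in sequence."""
    name: str
    children: List["Node"] = field(default_factory=list)

@dataclass
class OrNode:
    """Alternative decompositions with branching probabilities."""
    name: str
    branches: List[Tuple[float, "Node"]] = field(default_factory=list)

Node = Union[Terminal, AndNode, OrNode]

def sample(node: Node) -> List[str]:
    """Sample one valid temporal configuration of atomic actions,
    i.e., a string in the language of the equivalent SCFG."""
    if isinstance(node, Terminal):
        return [node.name]
    if isinstance(node, AndNode):
        seq: List[str] = []
        for child in node.children:
            seq.extend(sample(child))
        return seq
    probs, alts = zip(*node.branches)
    choice = random.choices(alts, weights=probs, k=1)[0]
    return sample(choice)

# Hypothetical toy grammar: an agent either gets a drink or works at a desk.
drink = AndNode("drink", [Terminal("approach_dispenser"),
                          Terminal("fetch_cup"),
                          Terminal("drink_water")])
work = AndNode("work", [Terminal("approach_desk"),
                        Terminal("use_computer")])
event = OrNode("office_event", [(0.4, drink), (0.6, work)])

print(sample(event))  # e.g. ['approach_desk', 'use_computer']
```

In this reading, goal inference corresponds to finding the most probable parse of an observed prefix of atomic actions, and intent prediction corresponds to extending that parse top-down to its unobserved continuation; the sketch above only shows the generative direction.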

Original language: English
Pages (from-to): 1369-1383
Number of pages: 15
Journal: Computer Vision and Image Understanding
Volume: 117
Issue number: 10
DOIs
Publication status: Published - 2013

Keywords

  • Event parsing
  • Goal prediction
  • Information projection
  • Temporal And-Or Graph (T-AOG)
  • Unsupervised learning
