Unsupervised learning of event AND-OR grammar and semantics from video

Zhangzhang Si; Mingtao Pei; Benjamin Yao; Song Chun Zhu

doi:10.1109/ICCV.2011.6126223

Zhangzhang Si^*, Mingtao Pei, Benjamin Yao, Song Chun Zhu

^*Corresponding author for this work

School of Computer Science and Technology

University of California at Los Angeles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

68 Citations (Scopus)

Abstract

We study the problem of automatically learning an event AND-OR grammar from videos of a given environment, e.g. an office where students conduct daily activities. We propose to learn the event grammar under the information projection and minimum description length principles in a coherent probabilistic framework, without manual supervision about what events happen and when they happen. First, a predefined set of unary and binary relations is detected for each video frame, e.g. the agent's position, pose, and interaction with the environment. Then their co-occurrences are clustered into a dictionary of simple and transient atomic actions. These actions are recursively grouped into longer and more complex events, resulting in a stochastic event grammar. By modeling time constraints of successive events, the learned grammar becomes context-sensitive. We introduce a new dataset of surveillance-style video in an office, and present a prototype system for video analysis integrating bottom-up detection, grammatical learning and parsing. On this dataset, the learning algorithm is able to automatically discover important events and construct a stochastic grammar, which can be used to accurately parse newly observed video. The learned grammar can be used as a prior to improve the noisy bottom-up detection of atomic actions. It can also be used to infer semantics of the scene. In general, the event grammar is an efficient way to acquire common knowledge from video.
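As a rough illustration of the representation the abstract describes, the sketch below encodes a toy stochastic AND-OR grammar in Python and enumerates the atomic-action sequences it can generate, each with its probability. It is not the authors' implementation: the class names (`And`, `Or`), the functions (`expansions`, `sequence_probability`), the event and action labels, and the branching probabilities are all hypothetical. It also assumes an acyclic grammar and leaves out both the learning step and the time constraints mentioned in the abstract.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple, Union


@dataclass(frozen=True)
class And:
    """AND node: every child occurs, in the listed temporal order."""
    children: Tuple[str, ...]


@dataclass(frozen=True)
class Or:
    """OR node: exactly one branch occurs, chosen with its probability."""
    branches: Tuple[Tuple[str, float], ...]


Grammar = Dict[str, Union[And, Or]]
Expansion = Tuple[Tuple[str, ...], float]


def expansions(symbol: str, grammar: Grammar) -> List[Expansion]:
    """List every atomic-action sequence derivable from `symbol`,
    paired with its probability. Assumes the grammar has no cycles."""
    node = grammar.get(symbol)
    if node is None:
        # A symbol without a rule is a terminal, i.e. an atomic action.
        return [((symbol,), 1.0)]
    if isinstance(node, Or):
        # Weight each alternative by its branching probability.
        return [
            (seq, p * q)
            for child, p in node.branches
            for seq, q in expansions(child, grammar)
        ]
    # AND node: concatenate one expansion of each child, in order.
    results: List[Expansion] = [((), 1.0)]
    for child in node.children:
        child_exps = expansions(child, grammar)
        results = [(s + t, p * q) for s, p in results for t, q in child_exps]
    return results


def sequence_probability(
    actions: Sequence[str], symbol: str, grammar: Grammar
) -> float:
    """Total probability that `symbol` generates exactly `actions`."""
    target = tuple(actions)
    return sum(p for seq, p in expansions(symbol, grammar) if seq == target)


# Hypothetical office grammar; labels and probabilities are made up.
TOY_GRAMMAR: Grammar = {
    "OfficeVisit": And(("Arrive", "Work", "Leave")),
    "Arrive": Or((("walk_to_desk", 0.7), ("GetWaterFirst", 0.3))),
    "GetWaterFirst": And(("use_dispenser", "walk_to_desk")),
    "Work": Or((("use_computer", 0.8), ("read_paper", 0.2))),
    "Leave": And(("stand_up", "walk_to_door")),
}

if __name__ == "__main__":
    for seq, prob in expansions("OfficeVisit", TOY_GRAMMAR):
        print(" -> ".join(seq), round(prob, 3))
```

In this toy setting, scoring an observed action sequence with `sequence_probability` is the simplest way a grammar could act as a prior over noisy bottom-up detections: sequences the grammar cannot generate receive probability zero.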

Original language	English
Title of host publication	2011 International Conference on Computer Vision, ICCV 2011
Pages	41-48
Number of pages	8
DOIs	https://doi.org/10.1109/ICCV.2011.6126223
Publication status	Published - 2011
Event	2011 IEEE International Conference on Computer Vision, ICCV 2011 - Barcelona, Spain
Duration	6 Nov 2011 → 13 Nov 2011

Publication series

Name	Proceedings of the IEEE International Conference on Computer Vision

Conference

Conference	2011 IEEE International Conference on Computer Vision, ICCV 2011
Country/Territory	Spain
City	Barcelona
Period	6 Nov 2011 → 13 Nov 2011

Cite this

@inproceedings{abd929fb276e4cee9a765f2440ecdd0a,
  title = "Unsupervised learning of event AND-OR grammar and semantics from video",
  abstract = "We study the problem of automatically learning an event AND-OR grammar from videos of a given environment, e.g. an office where students conduct daily activities. We propose to learn the event grammar under the information projection and minimum description length principles in a coherent probabilistic framework, without manual supervision about what events happen and when they happen. First, a predefined set of unary and binary relations is detected for each video frame, e.g. the agent's position, pose, and interaction with the environment. Then their co-occurrences are clustered into a dictionary of simple and transient atomic actions. These actions are recursively grouped into longer and more complex events, resulting in a stochastic event grammar. By modeling time constraints of successive events, the learned grammar becomes context-sensitive. We introduce a new dataset of surveillance-style video in an office, and present a prototype system for video analysis integrating bottom-up detection, grammatical learning and parsing. On this dataset, the learning algorithm is able to automatically discover important events and construct a stochastic grammar, which can be used to accurately parse newly observed video. The learned grammar can be used as a prior to improve the noisy bottom-up detection of atomic actions. It can also be used to infer semantics of the scene. In general, the event grammar is an efficient way to acquire common knowledge from video.",
  author = "Zhangzhang Si and Mingtao Pei and Benjamin Yao and Zhu, {Song Chun}",
  year = "2011",
  doi = "10.1109/ICCV.2011.6126223",
  language = "English",
  isbn = "9781457711015",
  series = "Proceedings of the IEEE International Conference on Computer Vision",
  pages = "41--48",
  booktitle = "2011 International Conference on Computer Vision, ICCV 2011",
  note = "2011 IEEE International Conference on Computer Vision, ICCV 2011 ; Conference date: 06-11-2011 Through 13-11-2011",
}

Si, Z, Pei, M, Yao, B & Zhu, SC 2011, Unsupervised learning of event AND-OR grammar and semantics from video. in 2011 International Conference on Computer Vision, ICCV 2011, 6126223, Proceedings of the IEEE International Conference on Computer Vision, pp. 41-48, 2011 IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, 6/11/11. https://doi.org/10.1109/ICCV.2011.6126223

Unsupervised learning of event AND-OR grammar and semantics from video. / Si, Zhangzhang; Pei, Mingtao; Yao, Benjamin et al.
2011 International Conference on Computer Vision, ICCV 2011. 2011. p. 41-48 6126223 (Proceedings of the IEEE International Conference on Computer Vision).

TY  - GEN
T1  - Unsupervised learning of event AND-OR grammar and semantics from video
AU  - Si, Zhangzhang
AU  - Pei, Mingtao
AU  - Yao, Benjamin
AU  - Zhu, Song Chun
PY  - 2011
Y1  - 2011
N2  - We study the problem of automatically learning an event AND-OR grammar from videos of a given environment, e.g. an office where students conduct daily activities. We propose to learn the event grammar under the information projection and minimum description length principles in a coherent probabilistic framework, without manual supervision about what events happen and when they happen. First, a predefined set of unary and binary relations is detected for each video frame, e.g. the agent's position, pose, and interaction with the environment. Then their co-occurrences are clustered into a dictionary of simple and transient atomic actions. These actions are recursively grouped into longer and more complex events, resulting in a stochastic event grammar. By modeling time constraints of successive events, the learned grammar becomes context-sensitive. We introduce a new dataset of surveillance-style video in an office, and present a prototype system for video analysis integrating bottom-up detection, grammatical learning and parsing. On this dataset, the learning algorithm is able to automatically discover important events and construct a stochastic grammar, which can be used to accurately parse newly observed video. The learned grammar can be used as a prior to improve the noisy bottom-up detection of atomic actions. It can also be used to infer semantics of the scene. In general, the event grammar is an efficient way to acquire common knowledge from video.
UR  - http://www.scopus.com/inward/record.url?scp=84856636962&partnerID=8YFLogxK
U2  - 10.1109/ICCV.2011.6126223
DO  - 10.1109/ICCV.2011.6126223
M3  - Conference contribution
AN  - SCOPUS:84856636962
SN  - 9781457711015
T3  - Proceedings of the IEEE International Conference on Computer Vision
SP  - 41
EP  - 48
BT  - 2011 International Conference on Computer Vision, ICCV 2011
T2  - 2011 IEEE International Conference on Computer Vision, ICCV 2011
Y2  - 6 November 2011 through 13 November 2011
ER  -
