Abstract
This paper proposes a hierarchical approach to recognizing person-to-person interactions in indoor scenes from a single view, based on spatial-temporal feature extraction and representation. The dense space-time interest points detected in a video are partitioned into two mutually exclusive sets, one per person, according to the history of how the two human silhouettes evolve and their connectivity. K-means clustering is then performed on the points in the training set to learn the spatial-temporal codebook. For a given set of interest points, a spatial-temporal word is built by letting each point vote softly for the few centers nearest to it and accumulating the scores of all the points. A Conditional Random Field (CRF), whose inputs are the spatial-temporal words, is used to model the primitive actions of each person. Common-sense domain knowledge and weighted first-order logic production rules are employed to learn the structure and the parameters of a Markov Logic Network (MLN). The MLN naturally integrates common-sense reasoning with uncertainty analysis and is therefore capable of dealing with the uncertainty produced by the CRF. Experimental results on an interaction dataset are provided to demonstrate the effectiveness and robustness of the approach.
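The soft-voting step can be illustrated with a minimal sketch. The abstract does not specify the weighting function or the number of nearest centers, so the Gaussian weighting, the parameters `k` and `sigma`, and the function name `soft_vote_word` are all assumptions made here for illustration, not the authors' implementation.

```python
import numpy as np


def soft_vote_word(points, centers, k=3, sigma=1.0):
    """Soft-assignment histogram of interest-point descriptors over a codebook.

    Hypothetical helper: the Gaussian weighting and the defaults for k and
    sigma are illustrative assumptions, not taken from the paper.
    """
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    k = min(k, len(centers))
    hist = np.zeros(len(centers))
    if len(points) == 0:
        return hist

    # Squared Euclidean distance from every point to every codebook center.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

    # Indices and distances of the k nearest centers for each point.
    nearest = np.argsort(d2, axis=1)[:, :k]
    near_d2 = np.take_along_axis(d2, nearest, axis=1)

    # Gaussian weights, normalised so that each point casts one vote in total.
    # Subtracting the row minimum only improves numerical stability.
    shifted = near_d2 - near_d2.min(axis=1, keepdims=True)
    weights = np.exp(-shifted / (2.0 * sigma ** 2))
    weights /= weights.sum(axis=1, keepdims=True)

    # Accumulate the soft votes of all points, then normalise the histogram.
    np.add.at(hist, nearest, weights)
    return hist / hist.sum()


# Toy usage: three 2-D codebook centers and two descriptors.
centers = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
points = [[0.0, 0.0], [10.0, 0.0]]
word = soft_vote_word(points, centers, k=2, sigma=1.0)
```

In practice `centers` would come from K-means run on the training descriptors, and the resulting vector would serve as the observation fed to the CRF.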
Original language | English |
---|---|
Pages (from-to) | 776-784 |
Number of pages | 9 |
Journal | Jisuanji Xuebao/Chinese Journal of Computers |
Volume | 33 |
Issue number | 4 |
Publication status | Published - Apr 2010 |
Keywords
- Action recognition
- Conditional random field
- Interaction analysis
- Markov logic network
- Spatial-temporal feature