A Survey of Human-Object Interaction Detection With Deep Learning

Geng Han, Jiachen Zhao, Lele Zhang, Fang Deng*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Human-object interaction (HOI) detection has attracted significant attention due to its wide range of applications, including human-robot interaction, security monitoring, and automatic sports commentary. HOI detection aims to detect humans, objects, and their interactions in a given image or video, so it requires a higher-level semantic understanding of the image than regular object recognition or detection tasks. It is also more technically challenging because of unique difficulties such as multi-object interactions and the long-tail distribution of interaction categories. Deep learning methods have achieved strong performance in HOI detection, but few reviews describe the recent advances in deep learning-based HOI detection. Moreover, the current stage-based categorization of HOI detection methods causes confusion in community discussion and for beginners. To fill this gap, this paper summarizes, categorizes, and compares deep learning methods for HOI detection from the last nine years. First, we summarize the pipeline of HOI detection methods. Then, we divide existing methods into three categories (two-stage, one-stage, and transformer-based), distinguish them using formulas and schematics, and qualitatively compare their advantages and disadvantages. After that, we review each category of methods in detail, focusing on HOI detection methods for images. We also explore how the use of foundation models for HOI detection has developed, and we quantitatively compare the performance of existing methods on public HOI datasets. Finally, we point out future research directions for HOI detection.

Original language: English
Pages (from-to): 3-26
Number of pages: 24
Journal: IEEE Transactions on Emerging Topics in Computational Intelligence
Volume: 9
Issue number: 1
Publication status: Published - 2025

Keywords

  • attention mechanism
  • deep learning
  • foundation models
  • GNN
  • human-object interaction
  • transformer
  • visual relationship detection
