E3-MOT: An Extended End-to-End Multiple Object Tracking Framework with Camera-LiDAR Fusion

Yang Xu, Chao Wei*, Jibin Hu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

3D multi-object tracking (MOT) is essential for providing stable and reliable motion states of surrounding obstacles in autonomous driving. Existing methods primarily rely on motion-based and appearance-similarity matching. However, their post-processing nature limits the exploitation of multi-modal perception data, hindering their effectiveness. In this work, we propose E3-MOT, an extended end-to-end multi-modal tracking framework that integrates camera and LiDAR information within a shared BEV representation. A two-stage mechanism is designed: the first stage performs end-to-end joint detection and tracking, where track queries represent tracked instances and are transferred and updated across consecutive frames, enabling iterative prediction throughout the tracking process. In the second stage, we design a motion-based tracking filter that enhances robustness through a second-stage association between unmatched detections and trajectories on the image plane, addressing the long-tail distribution challenge. Extensive experiments on the nuScenes dataset demonstrate the effectiveness of the proposed method. E3-MOT achieves 67.4% AMOTA, and under sensor-failure conditions it still maintains 62.5% AMOTA, outperforming multiple representative baselines. Real-world tests on a UGV platform further validate the practicality and robustness of the framework. The source code is available at https://github.com/HITXCI/w-trk.
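The two-stage association described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, thresholds, and cost choices (greedy BEV center-distance matching in stage 1, image-plane pixel-distance matching of the leftovers in stage 2) are all hypothetical stand-ins for the framework's learned matching and motion-based tracking filter.

```python
import numpy as np

def greedy_match(cost, thresh):
    # Greedily pair rows (tracks) with columns (detections) in order of
    # increasing cost, skipping pairs above the gating threshold.
    matches, used_r, used_c = [], set(), set()
    for r, c in sorted(np.ndindex(cost.shape), key=lambda rc: cost[rc]):
        if r not in used_r and c not in used_c and cost[r, c] < thresh:
            matches.append((r, c))
            used_r.add(r)
            used_c.add(c)
    return matches

def two_stage_associate(track_pos, det_pos, track_img, det_img,
                        bev_thresh=2.0, img_thresh=50.0):
    # Stage 1: Euclidean center distance in the shared BEV space.
    cost1 = np.linalg.norm(track_pos[:, None, :] - det_pos[None, :, :], axis=-1)
    m1 = greedy_match(cost1, bev_thresh)
    rem_t = [t for t in range(len(track_pos)) if t not in {t for t, _ in m1}]
    rem_d = [d for d in range(len(det_pos)) if d not in {d for _, d in m1}]
    # Stage 2: re-associate the unmatched leftovers by pixel distance on the
    # image plane (a stand-in for the paper's motion-based tracking filter).
    m2 = []
    if rem_t and rem_d:
        cost2 = np.linalg.norm(
            track_img[rem_t][:, None, :] - det_img[rem_d][None, :, :], axis=-1)
        m2 = [(rem_t[i], rem_d[j]) for i, j in greedy_match(cost2, img_thresh)]
    return m1 + m2
```

In this toy setup, a detection that drifts too far from its track in BEV (e.g. after a missed frame) can still be recovered in stage 2 if its image-plane projection remains close, which is the intuition behind re-associating unmatched detections with existing trajectories.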

Original language: English
Journal: IEEE Sensors Journal
Publication status: Accepted/In press - 2025
Externally published: Yes

Keywords

  • Multi-object tracking
  • end-to-end framework
  • sensor fusion
  • track association
