TY - JOUR
T1 - E3-MOT
T2 - An Extended End-to-End Multiple Object Tracking Framework with Camera-LiDAR Fusion
AU - Xu, Yang
AU - Wei, Chao
AU - Hu, Jibin
N1 - Publisher Copyright:
© 2001-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - 3D multi-object tracking (MOT) is essential for providing stable and reliable motion states of surrounding obstacles in autonomous driving. Existing methods primarily rely on motion-based and appearance-similarity matching. However, their post-processing nature limits the exploitation of multi-modal perception data, hindering their effectiveness. In this work, we propose E3-MOT, an extended end-to-end multi-modal tracking framework that integrates camera and LiDAR information within a shared BEV representation. A two-stage mechanism is designed: the first stage performs end-to-end joint detection and tracking, introducing a track query to represent tracked instances that is transferred and updated across consecutive frames, enabling iterative prediction throughout the tracking process. In the second stage, we design a motion-based tracking filter that enhances robustness through a second-stage association between unmatched detections and trajectories on the image plane, addressing the long-tail distribution challenge. Extensive experiments on the nuScenes dataset demonstrate the effectiveness of the proposed method. E3-MOT achieves 67.4% AMOTA, and under sensor-failure conditions it still maintains 62.5% AMOTA, outperforming multiple representative baselines. Real-world tests on a UGV platform further validate the practicality and robustness of the framework. The source code is available at https://github.com/HITXCI/w-trk.
AB - 3D multi-object tracking (MOT) is essential for providing stable and reliable motion states of surrounding obstacles in autonomous driving. Existing methods primarily rely on motion-based and appearance-similarity matching. However, their post-processing nature limits the exploitation of multi-modal perception data, hindering their effectiveness. In this work, we propose E3-MOT, an extended end-to-end multi-modal tracking framework that integrates camera and LiDAR information within a shared BEV representation. A two-stage mechanism is designed: the first stage performs end-to-end joint detection and tracking, introducing a track query to represent tracked instances that is transferred and updated across consecutive frames, enabling iterative prediction throughout the tracking process. In the second stage, we design a motion-based tracking filter that enhances robustness through a second-stage association between unmatched detections and trajectories on the image plane, addressing the long-tail distribution challenge. Extensive experiments on the nuScenes dataset demonstrate the effectiveness of the proposed method. E3-MOT achieves 67.4% AMOTA, and under sensor-failure conditions it still maintains 62.5% AMOTA, outperforming multiple representative baselines. Real-world tests on a UGV platform further validate the practicality and robustness of the framework. The source code is available at https://github.com/HITXCI/w-trk.
KW - Multi-object tracking
KW - end-to-end framework
KW - sensor fusion
KW - track association
UR - https://www.scopus.com/pages/publications/105025748375
U2 - 10.1109/JSEN.2025.3645777
DO - 10.1109/JSEN.2025.3645777
M3 - Article
AN - SCOPUS:105025748375
SN - 1530-437X
JO - IEEE Sensors Journal
JF - IEEE Sensors Journal
ER -