TY - JOUR
T1 - End-to-End Video Text Spotting with Transformer
AU - Wu, Weijia
AU - Cai, Yuanqiang
AU - Shen, Chunhua
AU - Zhang, Debing
AU - Fu, Ying
AU - Zhou, Hong
AU - Luo, Ping
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
PY - 2024/9
Y1 - 2024/9
N2 - Recent video text spotting methods usually require a three-stage pipeline, i.e., detecting text in individual images, recognizing the localized text, and tracking text streams with post-processing to generate the final results. These methods typically follow the tracking-by-match paradigm and develop sophisticated pipelines, which is not an effective solution. In this paper, rooted in Transformer sequence modeling, we propose a simple yet effective end-to-end trainable video text DEtection, Tracking, and Recognition framework (TransDETR), which formulates the video text spotting (VTS) task as a direct long-range temporal modeling problem. TransDETR offers two main advantages: (1) Unlike the explicit matching paradigm between adjacent frames, the proposed TransDETR tracks and recognizes each text instance implicitly via a dedicated query, termed a ‘text query’, over a long-range temporal sequence (more than 7 frames). (2) TransDETR is the first end-to-end trainable video text spotting framework, which simultaneously addresses the three sub-tasks (i.e., text detection, tracking, and recognition). Extensive experiments on four video text datasets (e.g., ICDAR2013 Video, ICDAR2015 Video) demonstrate that TransDETR achieves state-of-the-art performance, with up to 11.0% improvement on the detection, tracking, and spotting tasks. Code can be found at: https://github.com/weijiawu/TransDETR.
AB - Recent video text spotting methods usually require a three-stage pipeline, i.e., detecting text in individual images, recognizing the localized text, and tracking text streams with post-processing to generate the final results. These methods typically follow the tracking-by-match paradigm and develop sophisticated pipelines, which is not an effective solution. In this paper, rooted in Transformer sequence modeling, we propose a simple yet effective end-to-end trainable video text DEtection, Tracking, and Recognition framework (TransDETR), which formulates the video text spotting (VTS) task as a direct long-range temporal modeling problem. TransDETR offers two main advantages: (1) Unlike the explicit matching paradigm between adjacent frames, the proposed TransDETR tracks and recognizes each text instance implicitly via a dedicated query, termed a ‘text query’, over a long-range temporal sequence (more than 7 frames). (2) TransDETR is the first end-to-end trainable video text spotting framework, which simultaneously addresses the three sub-tasks (i.e., text detection, tracking, and recognition). Extensive experiments on four video text datasets (e.g., ICDAR2013 Video, ICDAR2015 Video) demonstrate that TransDETR achieves state-of-the-art performance, with up to 11.0% improvement on the detection, tracking, and spotting tasks. Code can be found at: https://github.com/weijiawu/TransDETR.
KW - E2E
KW - Temporal modeling
KW - Text spotting
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85198383280&partnerID=8YFLogxK
U2 - 10.1007/s11263-024-02063-1
DO - 10.1007/s11263-024-02063-1
M3 - Article
AN - SCOPUS:85198383280
SN - 0920-5691
VL - 132
SP - 4019
EP - 4035
JO - International Journal of Computer Vision
JF - International Journal of Computer Vision
IS - 9
ER -