TY - GEN
T1 - From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search
T2 - 34th ACM Web Conference, WWW 2025
AU - Sun, Jintao
AU - Fei, Hao
AU - Ding, Gangyi
AU - Zheng, Zhedong
N1 - Publisher Copyright:
© 2025 Copyright held by the owner/author(s).
PY - 2025/4/28
Y1 - 2025/4/28
N2 - In text-based person search, data generation has emerged as a prevailing practice, addressing concerns over privacy preservation and the arduous task of manual annotation. Although the amount of synthesized data can in theory be infinite, the scientific question of how much generated data optimally fuels subsequent model training remains open. We observe that only a subset of the data in these constructed datasets plays a decisive role. Therefore, we introduce a new Filtering-WoRA paradigm, which contains a filtering algorithm to identify this crucial data subset and a WoRA (Weighted Low-Rank Adaptation) learning strategy for light fine-tuning. The filtering algorithm is based on cross-modality relevance and removes the many coarsely matched synthetic pairs. As the amount of data decreases, fine-tuning the entire model is no longer necessary. Therefore, we propose the WoRA learning strategy to efficiently update a minimal portion of model parameters. WoRA streamlines the learning process, enabling heightened efficiency in extracting knowledge from fewer, yet potent, data instances. Extensive experimentation validates the efficacy of pretraining, where our model achieves advanced and efficient retrieval performance on challenging real-world benchmarks. Notably, on the CUHK-PEDES dataset, we achieve a competitive mAP of 67.02% while reducing model training time by 19.82%.
AB - In text-based person search, data generation has emerged as a prevailing practice, addressing concerns over privacy preservation and the arduous task of manual annotation. Although the amount of synthesized data can in theory be infinite, the scientific question of how much generated data optimally fuels subsequent model training remains open. We observe that only a subset of the data in these constructed datasets plays a decisive role. Therefore, we introduce a new Filtering-WoRA paradigm, which contains a filtering algorithm to identify this crucial data subset and a WoRA (Weighted Low-Rank Adaptation) learning strategy for light fine-tuning. The filtering algorithm is based on cross-modality relevance and removes the many coarsely matched synthetic pairs. As the amount of data decreases, fine-tuning the entire model is no longer necessary. Therefore, we propose the WoRA learning strategy to efficiently update a minimal portion of model parameters. WoRA streamlines the learning process, enabling heightened efficiency in extracting knowledge from fewer, yet potent, data instances. Extensive experimentation validates the efficacy of pretraining, where our model achieves advanced and efficient retrieval performance on challenging real-world benchmarks. Notably, on the CUHK-PEDES dataset, we achieve a competitive mAP of 67.02% while reducing model training time by 19.82%.
KW - Data-centric Learning
KW - Low-Rank Adaptation
KW - Text-based Person Search
KW - Visual-language Pre-training
UR - http://www.scopus.com/inward/record.url?scp=105005157515&partnerID=8YFLogxK
U2 - 10.1145/3696410.3714788
DO - 10.1145/3696410.3714788
M3 - Conference contribution
AN - SCOPUS:105005157515
T3 - WWW 2025 - Proceedings of the ACM Web Conference
SP - 2341
EP - 2351
BT - WWW 2025 - Proceedings of the ACM Web Conference
PB - Association for Computing Machinery, Inc
Y2 - 28 April 2025 through 2 May 2025
ER -