TY - GEN
T1 - STEP
T2 - 11th International Conference on Advanced Cloud and Big Data, CBD 2023
AU - Cao, Wenqiang
AU - Li, Qing
AU - Zhang, Siying
AU - Xu, Rixin
AU - Li, Youqi
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - In recent years, semantic text embeddings have played an increasingly important role in natural language processing (NLP) and have shown great potential in real-life applications such as search and recommendation systems. Consequently, models for generating semantic text embeddings have been studied extensively. State-of-the-art solutions have evolved from traditional methods (e.g., Word2Vec and GloVe) to deep neural network based approaches (e.g., LSTM, Transformer, and pre-trained models such as BERT and RoBERTa); moreover, frameworks like Sentence Transformers have lowered the bar for training semantic text representation models with customized models and datasets. In this paper, we investigate several well-trained models listed on the Massive Text Embedding Benchmark (MTEB) on the Hugging Face website. Inspired by the extensive use of prompt engineering in large language models such as Llama and GPT-3, we propose STEP, a novel method that uses prompts to improve the performance of text embeddings on downstream tasks; it is applicable to almost any pre-trained language model for text embeddings and requires no modification to the base model structure. In our experiments, we applied STEP to five pre-trained models chosen from MTEB and trained and evaluated our approach on two separate datasets. The results indicate that our approach improves performance on tasks related to semantic text similarity.
AB - In recent years, semantic text embeddings have played an increasingly important role in natural language processing (NLP) and have shown great potential in real-life applications such as search and recommendation systems. Consequently, models for generating semantic text embeddings have been studied extensively. State-of-the-art solutions have evolved from traditional methods (e.g., Word2Vec and GloVe) to deep neural network based approaches (e.g., LSTM, Transformer, and pre-trained models such as BERT and RoBERTa); moreover, frameworks like Sentence Transformers have lowered the bar for training semantic text representation models with customized models and datasets. In this paper, we investigate several well-trained models listed on the Massive Text Embedding Benchmark (MTEB) on the Hugging Face website. Inspired by the extensive use of prompt engineering in large language models such as Llama and GPT-3, we propose STEP, a novel method that uses prompts to improve the performance of text embeddings on downstream tasks; it is applicable to almost any pre-trained language model for text embeddings and requires no modification to the base model structure. In our experiments, we applied STEP to five pre-trained models chosen from MTEB and trained and evaluated our approach on two separate datasets. The results indicate that our approach improves performance on tasks related to semantic text similarity.
KW - NLP
KW - embedding
KW - prompt
KW - semantic
UR - http://www.scopus.com/inward/record.url?scp=85193245517&partnerID=8YFLogxK
U2 - 10.1109/CBD63341.2023.00040
DO - 10.1109/CBD63341.2023.00040
M3 - Conference contribution
AN - SCOPUS:85193245517
T3 - Proceedings - 2023 11th International Conference on Advanced Cloud and Big Data, CBD 2023
SP - 180
EP - 185
BT - Proceedings - 2023 11th International Conference on Advanced Cloud and Big Data, CBD 2023
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 18 December 2023 through 19 December 2023
ER -