STEP: Generating Semantic Text Embeddings with Prompt

Wenqiang Cao; Qing Li; Siying Zhang; Rixin Xu; Youqi Li

doi:10.1109/CBD63341.2023.00040

Wenqiang Cao, Qing Li, Siying Zhang, Rixin Xu^*, Youqi Li

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In recent years, semantic text embeddings have played an increasingly important role in natural language processing (NLP) and have shown great potential in real-world applications such as search and recommendation systems. Models for generating semantic text embeddings have therefore been studied extensively. State-of-the-art solutions have evolved from traditional methods (such as Word2Vec and GloVe) to deep neural network based solutions (such as LSTM, Transformer, and pre-trained models like BERT and RoBERTa). In addition, frameworks such as Sentence Transformers have lowered the barrier to training models for semantic text representation with custom models and datasets. In this paper, we investigate several well-trained models listed on the Massive Text Embedding Benchmark (MTEB) on the Hugging Face website. Inspired by the extensive use of prompt engineering in large language models such as Llama and GPT-3, we propose STEP, a novel method that uses prompts to improve the performance of text embeddings on downstream tasks. STEP does not require modifying the structure of the base model, which makes it applicable to almost any pre-trained language model for text embeddings. In our experiments, we applied STEP to five pre-trained models chosen from MTEB and trained and evaluated our approach on two separate datasets. The results indicate that our approach can improve performance on tasks related to semantic textual similarity.
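
The abstract does not spell out how the prompt is applied, so the following is only a minimal sketch of the general idea, assuming the prompt is a text template wrapped around the input before encoding. The names `toy_encoder`, `with_prompt`, and the template string are hypothetical, and the hashed bag-of-words encoder merely stands in for a pre-trained embedding model. The sketch shows that the base encoder stays unmodified while every input passes through the prompt.

```python
import hashlib
import math
from typing import Callable, List

Vector = List[float]
Encoder = Callable[[str], Vector]


def toy_encoder(text: str, dim: int = 64) -> Vector:
    """Hypothetical stand-in for a pre-trained model:
    a hashed bag-of-words vector, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def with_prompt(encoder: Encoder, template: str) -> Encoder:
    """Wrap an encoder so each input is placed into a prompt template first.
    The wrapped encoder itself is left untouched (an assumption about how
    a prompt could be applied, not the paper's confirmed procedure)."""
    def encode(text: str) -> Vector:
        return encoder(template.format(text=text))
    return encode


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical template; the wording is illustrative only.
prompted = with_prompt(toy_encoder, "Represent the meaning of this sentence: {text}")

score = cosine(
    prompted("A man is playing a guitar"),
    prompted("Someone plays the guitar"),
)
print(f"similarity with prompt: {score:.3f}")
```

In practice `toy_encoder` would be replaced by the encode function of a real embedding model; the wrapper needs only a function from text to vector, which is why this kind of approach can sit on top of different base models.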

Original language	English
Title of host publication	Proceedings - 2023 11th International Conference on Advanced Cloud and Big Data, CBD 2023
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	180-185
Number of pages	6
ISBN (Electronic)	9798350345346
DOIs	https://doi.org/10.1109/CBD63341.2023.00040
Publication status	Published - 2023
Event	11th International Conference on Advanced Cloud and Big Data, CBD 2023 - Hainan, China Duration: 18 Dec 2023 → 19 Dec 2023

Publication series

Name	Proceedings - 2023 11th International Conference on Advanced Cloud and Big Data, CBD 2023

Conference

Conference	11th International Conference on Advanced Cloud and Big Data, CBD 2023
Country/Territory	China
City	Hainan
Period	18/12/23 → 19/12/23

Keywords

NLP
embedding
prompt
semantic

Access to Document

10.1109/CBD63341.2023.00040

Cite this

Cao, W., Li, Q., Zhang, S., Xu, R., & Li, Y. (2023). STEP: Generating Semantic Text Embeddings with Prompt. In Proceedings - 2023 11th International Conference on Advanced Cloud and Big Data, CBD 2023 (pp. 180-185). (Proceedings - 2023 11th International Conference on Advanced Cloud and Big Data, CBD 2023). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/CBD63341.2023.00040

@inproceedings{85d3c609d02946d7abf3b2c9e24aafb0,

title = "STEP: Generating Semantic Text Embeddings with Prompt",

keywords = "NLP, embedding, prompt, semantic",

author = "Wenqiang Cao and Qing Li and Siying Zhang and Rixin Xu and Youqi Li",

note = "Publisher Copyright: {\textcopyright} 2023 IEEE.; 11th International Conference on Advanced Cloud and Big Data, CBD 2023 ; Conference date: 18-12-2023 Through 19-12-2023",

year = "2023",

doi = "10.1109/CBD63341.2023.00040",

language = "English",

series = "Proceedings - 2023 11th International Conference on Advanced Cloud and Big Data, CBD 2023",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "180--185",

booktitle = "Proceedings - 2023 11th International Conference on Advanced Cloud and Big Data, CBD 2023",

address = "United States",

}

Cao, W, Li, Q, Zhang, S, Xu, R & Li, Y 2023, STEP: Generating Semantic Text Embeddings with Prompt. in Proceedings - 2023 11th International Conference on Advanced Cloud and Big Data, CBD 2023. Proceedings - 2023 11th International Conference on Advanced Cloud and Big Data, CBD 2023, Institute of Electrical and Electronics Engineers Inc., pp. 180-185, 11th International Conference on Advanced Cloud and Big Data, CBD 2023, Hainan, China, 18/12/23. https://doi.org/10.1109/CBD63341.2023.00040

STEP: Generating Semantic Text Embeddings with Prompt. / Cao, Wenqiang; Li, Qing; Zhang, Siying et al.
Proceedings - 2023 11th International Conference on Advanced Cloud and Big Data, CBD 2023. Institute of Electrical and Electronics Engineers Inc., 2023. p. 180-185 (Proceedings - 2023 11th International Conference on Advanced Cloud and Big Data, CBD 2023).

TY - GEN

T1 - STEP

T2 - 11th International Conference on Advanced Cloud and Big Data, CBD 2023

AU - Cao, Wenqiang

AU - Li, Qing

AU - Zhang, Siying

AU - Xu, Rixin

AU - Li, Youqi

PY - 2023

Y1 - 2023

KW - NLP

KW - embedding

KW - prompt

KW - semantic

UR - http://www.scopus.com/inward/record.url?scp=85193245517&partnerID=8YFLogxK

U2 - 10.1109/CBD63341.2023.00040

DO - 10.1109/CBD63341.2023.00040

M3 - Conference contribution

AN - SCOPUS:85193245517

T3 - Proceedings - 2023 11th International Conference on Advanced Cloud and Big Data, CBD 2023

SP - 180

EP - 185

BT - Proceedings - 2023 11th International Conference on Advanced Cloud and Big Data, CBD 2023

PB - Institute of Electrical and Electronics Engineers Inc.

Y2 - 18 December 2023 through 19 December 2023

ER -

Cao W, Li Q, Zhang S, Xu R, Li Y. STEP: Generating Semantic Text Embeddings with Prompt. In Proceedings - 2023 11th International Conference on Advanced Cloud and Big Data, CBD 2023. Institute of Electrical and Electronics Engineers Inc. 2023. p. 180-185. (Proceedings - 2023 11th International Conference on Advanced Cloud and Big Data, CBD 2023). doi: 10.1109/CBD63341.2023.00040
