A Modified Speaking Rate Estimation Based on Frame-Level LSTM

Yanhong Xiao, Shixuan Du, Xiang Xie, Jing Wang, Qingran Zhan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

Speaking rate has applications in many domains, such as speech recognition, speaker verification, and emotion recognition. It conveys long-term information in speech and changes over time, so it can be treated as a time sequence. This paper proposes a frame-level LSTM method for speaking rate estimation. Instead of taking the whole utterance as a sequence, the frame-level LSTM exploits the sequence information within each segment and yields a more precise segment-level speaking rate estimate. We also evaluate the influence of fixed-length segmentation and voice activity detection (VAD) segmentation on speaking rate estimation. Results show that the proposed frame-level LSTM method yields a high correlation between the estimated speaking rate and the ground truth. It achieves a relative improvement of 13.0% over the state-of-the-art statistical learning method and 16.3% over support vector regression (SVR), evaluated on the same TIMIT corpus.
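
The evaluation setup described in the abstract (fixed-length segmentation, then correlation between estimated and ground-truth speaking rate) can be illustrated with a minimal sketch. This is not the authors' implementation: the unit of rate (phones per second), the rule that a unit belongs to the segment containing its midpoint, the use of Pearson correlation, and all function names are assumptions made for illustration only.

```python
import math


def fixed_length_segments(duration, seg_len):
    """Split [0, duration) into consecutive windows of seg_len seconds.

    The last window may be shorter than seg_len.
    """
    n = math.ceil(duration / seg_len)
    return [(i * seg_len, min((i + 1) * seg_len, duration)) for i in range(n)]


def segment_rates(unit_midpoints, segments):
    """Return units per second for each segment.

    Assumption: a unit (e.g. a phone) counts toward the segment
    that contains its temporal midpoint.
    """
    rates = []
    for start, end in segments:
        count = sum(1 for t in unit_midpoints if start <= t < end)
        rates.append(count / (end - start))
    return rates


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical phone midpoints (seconds) for a 3-second utterance.
midpoints = [0.1, 0.4, 0.7, 1.2, 1.8, 2.1, 2.3, 2.5, 2.9]
segments = fixed_length_segments(duration=3.0, seg_len=1.0)
ground_truth = segment_rates(midpoints, segments)

# Hypothetical model outputs for the same three segments.
estimated = [2.8, 2.3, 3.7]
correlation = pearson(estimated, ground_truth)
```

A VAD-based variant would replace `fixed_length_segments` with boundaries taken from detected speech regions; the rate and correlation steps would stay the same.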

Original language: English
Title of host publication: 2018 14th IEEE International Conference on Signal Processing Proceedings, ICSP 2018
Editors: Yuan Baozong, Ruan Qiuqi, Zhao Yao, An Gaoyun
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 600-603
Number of pages: 4
ISBN (Electronic): 9781538646724
Publication status: Published - 2 Feb 2019
Event: 14th IEEE International Conference on Signal Processing, ICSP 2018 - Beijing, China
Duration: 12 Aug 2018 → 16 Aug 2018

Publication series

Name: International Conference on Signal Processing Proceedings, ICSP
Volume: 2018-August

Conference

Conference: 14th IEEE International Conference on Signal Processing, ICSP 2018
Country/Territory: China
City: Beijing
Period: 12/08/18 → 16/08/18

Keywords

  • Frame-level LSTM
  • Segmentation
  • Speaking rate estimation
