Live Speech Recognition via Earphone Motion Sensors

Yetong Cao, Fan Li*, Huijie Chen, Xiaochen Liu, Shengchun Zhai, Song Yang, Yu Wang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Recent work has shown that motion sensors on smartphones and AR/VR headsets can be exploited for speech eavesdropping because they are sensitive to subtle vibrations. The growing popularity of motion sensors in earphones has driven up their sampling rates, which enables various enhanced features. This paper investigates a new eavesdropping threat posed by earphone motion sensors by developing EarSpy. EarSpy builds on our observation that an earphone's accelerometer can capture the bone conduction vibrations (BCVs) and ear canal dynamic motions (ECDMs) associated with speaking, which allow it to derive unique information about the wearer's speech. Drawing on a study of motion sensor measurements captured from earphones, EarSpy disentangles the wearer's live speech from interference caused by body motion and by vibrations generated when the earphone's speaker plays audio. To enable user-independent attacks, EarSpy introduces a trajectory instability reduction method to calibrate the ECDM waveform and a data augmentation method to enrich the diversity of BCVs. EarSpy also explores effective representations of BCVs and ECDMs, and develops a neural network with character-level and word-level models to perform speech recognition. Extensive experiments with 14 participants demonstrate that EarSpy achieves promising recognition performance on the wearer's speech.

Original language: English
Pages (from-to): 7284-7300
Number of pages: 17
Journal: IEEE Transactions on Mobile Computing
Volume: 23
Issue number: 6
Publication status: Published - 1 Jun 2024

Keywords

  • Earphone
  • motion sensor
  • neural network
  • speech recognition
