Enhancing Bone-Conduction Sensor Signals via Self-Supervised Acoustic Priors and Key-Value Memory

Abstract

We address the challenge of enhancing bone-conduction (BC) sensor signals, which are robust to environmental noise but suffer from a muffled quality due to severe attenuation of high-frequency components. Our model harnesses powerful self-supervised learning (SSL) models to provide generalized and informative acoustic priors. Guided by these priors, our approach effectively reconstructs the missing high-frequency content in the BC signal, significantly improving speech clarity and spectral richness. The method outperforms recent state-of-the-art approaches, particularly in recovering fine-grained spectral details.

Audio Examples

Example 1: Male Speaker 1

Input (BC)
DPT-EGNet Model
Our Model
Ground Truth (AC)

Example 2: Male Speaker 2

Input (BC)
DPT-EGNet Model
Our Model
Ground Truth (AC)

Example 3: Female Speaker 1

Input (BC)
DPT-EGNet Model
Our Model
Ground Truth (AC)

Example 4: Female Speaker 2

Input (BC)
DPT-EGNet Model
Our Model
Ground Truth (AC)

ESMB Dataset Split

We utilized the ESMB dataset, which contains 128 hours of speech from 287 speakers.
As it lacks an official split, we randomly partitioned the speakers to ensure speaker-independent evaluation:

Train: 240 speakers
Dev: 24 speakers
Test: 23 speakers

Download the file lists (.txt) used in our experiments:

⬇️ Train Split · ⬇️ Dev Split · ⬇️ Test Split