We address the challenge of enhancing bone-conduction (BC) sensor signals, which are robust to environmental noise but suffer from a muffled quality due to severe attenuation of high-frequency components. Our model leverages self-supervised learning (SSL) models to provide generalized and informative acoustic priors. Guided by these priors, it effectively reconstructs the missing high-frequency content of the BC signal, significantly improving speech clarity and spectral richness. The method outperforms recent state-of-the-art approaches, particularly in recovering fine-grained spectral details.
We used the ESMB dataset, which contains 128 hours of speech from 287 speakers.
As it lacks an official split, we randomly partitioned the speakers to ensure speaker-independent evaluation: