Audio Demo · Speech Voice Conversion

Electrolaryngeal Speech Voice Conversion with Sequence-to-Sequence Modeling

Abstract

Electrolaryngeal (EL) speech is produced with an electrolarynx, a device that substitutes for vocal fold vibration. Because of limitations in its excitation signal, EL speech is less natural and less intelligible than natural (NL) speech. This work proposes a novel EL speech voice conversion (ELVC) system based on sequence-to-sequence (seq2seq) modeling with text-to-speech (TTS) pretraining. The seq2seq model uses an attention mechanism to learn representations and the source-target alignment jointly, while TTS pretraining enables efficient training with limited data. Experimental results demonstrate notable improvements over a well-known frame-wise ELVC baseline (MT-CLDNN).
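To illustrate how an attention mechanism yields a soft alignment between source and target frames, here is a minimal sketch of a single decoder step. It assumes scaled dot-product attention, which the abstract does not specify; the function names (`softmax`, `attend`) and the toy vectors are hypothetical and are not taken from the authors' system.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """One decoder step of scaled dot-product attention (an assumed variant).

    query: decoder state for the current output frame.
    encoder_states: one vector per encoded source (EL speech) frame.
    Returns the alignment weights over source frames and the context vector.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * h for q, h in zip(query, state)) / scale
        for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


# Toy example: three encoded source frames, one decoder query.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 2.0]
weights, context = attend(query, encoder_states)
```

Each entry of `weights` says how strongly the current output frame attends to a given source frame. Stacking these rows over all decoder steps gives the alignment matrix, so source and target utterances need not have the same length.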

Models

MT-CLDNN: Multi-task CLDNN baseline (frame-wise)
TFS: Seq2seq trained from scratch (no pretraining)
PT: Seq2seq with TTS pretraining (proposed)
EL Speech: Input
NL Speech: Reference

Objective Metrics

EL01 → NL01
Model     Pretrain  MCD (dB) ↓  F0 RMSE ↓  F0 CORR ↑  DDUR ↓  SER (%) ↓
TFS       No        8.86        24.44      0.202      0.156   93.3
PT        Yes       7.10        24.72      0.212      0.167   67.5
MT-CLDNN  -         7.38        24.38      0.167      0.680   76.5
EL01 → NL02
Model     Pretrain  MCD (dB) ↓  F0 RMSE ↓  F0 CORR ↑  DDUR ↓  SER (%) ↓
TFS       No        11.17       34.41      0.365      0.178   99.0
PT        Yes       8.18        33.50      0.458      0.192   75.0
MT-CLDNN  -         7.77        35.58      0.336      0.914   85.0
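The page does not define how these metrics are computed. As a rough guide, the sketch below uses common textbook definitions of mel-cepstral distortion (MCD), F0 RMSE, and F0 correlation. It assumes the reference and converted frames are already time-aligned (in practice, typically via dynamic time warping), that the 0th cepstral coefficient is excluded, and that unvoiced frames are marked with F0 = 0. All function names and toy values are hypothetical and may differ from the authors' evaluation setup.

```python
import math


def mcd_db(ref_frames, conv_frames):
    """Mean mel-cepstral distortion in dB over time-aligned frames.

    Skips the 0th (energy) coefficient, a common convention (assumed here).
    """
    const = 10.0 / math.log(10.0)
    per_frame = []
    for ref, conv in zip(ref_frames, conv_frames):
        sq_err = sum((r - c) ** 2 for r, c in zip(ref[1:], conv[1:]))
        per_frame.append(const * math.sqrt(2.0 * sq_err))
    return sum(per_frame) / len(per_frame)


def voiced_pairs(ref_f0, conv_f0):
    """Keep only frames where both signals are voiced (F0 > 0)."""
    return [(r, c) for r, c in zip(ref_f0, conv_f0) if r > 0 and c > 0]


def f0_rmse(ref_f0, conv_f0):
    """Root-mean-square F0 error over mutually voiced frames."""
    pairs = voiced_pairs(ref_f0, conv_f0)
    return math.sqrt(sum((r - c) ** 2 for r, c in pairs) / len(pairs))


def f0_corr(ref_f0, conv_f0):
    """Pearson correlation of F0 over mutually voiced frames."""
    pairs = voiced_pairs(ref_f0, conv_f0)
    n = len(pairs)
    mean_r = sum(r for r, _ in pairs) / n
    mean_c = sum(c for _, c in pairs) / n
    cov = sum((r - mean_r) * (c - mean_c) for r, c in pairs)
    var_r = sum((r - mean_r) ** 2 for r, _ in pairs)
    var_c = sum((c - mean_c) ** 2 for _, c in pairs)
    return cov / math.sqrt(var_r * var_c)


# Toy data (not from the paper): one cepstral frame and five F0 frames.
toy_mcd = mcd_db([[0.0, 0.0, 0.0]], [[0.0, 1.0, 0.0]])
ref_f0 = [100.0, 0.0, 120.0, 140.0, 160.0]
conv_f0 = [110.0, 105.0, 0.0, 150.0, 170.0]
toy_rmse = f0_rmse(ref_f0, conv_f0)
toy_corr = f0_corr(ref_f0, conv_f0)
```

In the toy F0 example, the converted contour is the reference shifted up by a constant 10 on every mutually voiced frame. The error is therefore nonzero while the correlation is perfect, which is why the tables report the two measures separately.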

Audio Samples

Pair EL01 → NL01
Sample 1: 他捐了很多衣物給災區 (tā juān le hěn duō yī wù gěi zāi qū), "He donated a lot of clothing to the disaster area."
Audio: EL Speech, MT-CLDNN, TFS, PT, NL Speech
Sample 2: 我把不用的家具送人了 (wǒ bǎ bù yòng de jiā jù sòng rén le), "I gave away the furniture I no longer use."
Audio: EL Speech, MT-CLDNN, TFS, PT, NL Speech
Sample 3: 那個牆上掛著一幅油畫 (nà ge qiáng shàng guà zhe yī fú yóu huà), "An oil painting hangs on that wall."
Audio: EL Speech, MT-CLDNN, TFS, PT, NL Speech
Pair EL01 → NL02
Sample 1: 他捐了很多衣物給災區 (tā juān le hěn duō yī wù gěi zāi qū), "He donated a lot of clothing to the disaster area."
Audio: EL Speech, MT-CLDNN, TFS, PT, NL Speech
Sample 2: 我幫他把雞蛋放入冰箱 (wǒ bāng tā bǎ jī dàn fàng rù bīng xiāng), "I helped him put the eggs in the refrigerator."
Audio: EL Speech, MT-CLDNN, TFS, PT, NL Speech
Sample 3: 那個牆上掛著一幅油畫 (nà ge qiáng shàng guà zhe yī fú yóu huà), "An oil painting hangs on that wall."
Audio: EL Speech, MT-CLDNN, TFS, PT, NL Speech