Audio Demo · Speech Voice Conversion

Electrolaryngeal Speech Voice Conversion with Sequence-to-Sequence Modeling

Abstract

Electrolaryngeal (EL) speech is produced using an electrolarynx, a device that substitutes for vocal fold vibrations. Due to limitations in the excitation signal, EL speech is less natural and less intelligible than natural speech (NL speech). This work proposes a novel EL speech voice conversion (ELVC) system based on sequence-to-sequence (seq2seq) modeling with text-to-speech (TTS) pretraining. The seq2seq model employs an attention mechanism to jointly perform representation learning and alignment, while TTS pretraining enables efficient training with limited data. Experimental results demonstrate notable improvements over a well-known frame-wise ELVC baseline (MT-CLDNN), most clearly in the DDUR and SER metrics reported below.
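
To make the alignment idea concrete, here is a minimal NumPy sketch of how attention weights act as a soft alignment between output steps and input frames. This page does not specify which attention variant the model uses, so scaled dot-product attention is an assumption here, and the names `attention_alignment`, `el_frames`, and `decoder_states` are hypothetical.

```python
import numpy as np

def attention_alignment(queries, keys, values):
    """Scaled dot-product attention (an assumed variant, for illustration only).

    queries: (T_out, d) decoder-side states
    keys:    (T_in, d)  encoder-side frame representations
    values:  (T_in, d_v)
    Returns the context vectors (T_out, d_v) and the weights (T_out, T_in).
    """
    scores = queries @ keys.T / np.sqrt(keys.shape[1])
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights

# Random stand-ins for encoded EL frames and decoder states (not real data).
rng = np.random.default_rng(0)
el_frames = rng.normal(size=(6, 4))
decoder_states = rng.normal(size=(3, 4))
context, weights = attention_alignment(decoder_states, el_frames, el_frames)
```

Each row of `weights` sums to 1 and says how strongly one output step attends to each input frame, so input and output sequences of different lengths can be related without a separate frame-wise alignment step.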

System

Models

MT-CLDNN: Multi-task CLDNN baseline (frame-wise)

TFS: Seq2seq trained from scratch (no pretraining)

PT: Seq2seq with TTS pretraining (proposed)

EL Speech — Input

NL Speech — Reference

MT-CLDNN

TFS

PT (Proposed)

Evaluation

Objective Metrics

EL01 → NL01

Model	Pretrain	MCD (dB) ↓	F0 RMSE ↓	F0 CORR ↑	DDUR ↓	SER (%) ↓
TFS	✗	8.86	24.44	0.202	0.156	93.3
PT	✓	7.10	24.72	0.212	0.167	67.5
MT-CLDNN	—	7.38	24.38	0.167	0.680	76.5

EL01 → NL02

Model	Pretrain	MCD (dB) ↓	F0 RMSE ↓	F0 CORR ↑	DDUR ↓	SER (%) ↓
TFS	✗	11.17	34.41	0.365	0.178	99.0
PT	✓	8.18	33.50	0.458	0.192	75.0
MT-CLDNN	—	7.77	35.58	0.336	0.914	85.0
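
For readers unfamiliar with the spectral and F0 columns, the sketch below shows one common way to compute MCD, F0 RMSE, and F0 correlation. This page does not state the exact definitions used, so the following are assumptions: the reference and converted frames are already time-aligned (for example by dynamic time warping), the mel-cepstra exclude the 0th (energy) coefficient, and F0 statistics are taken only over frames voiced in both signals. The function names are hypothetical.

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep, conv_mcep):
    """Mean MCD in dB over time-aligned frames.

    Both arrays have shape (frames, dims) and are assumed to exclude
    the 0th (energy) coefficient.
    """
    diff = ref_mcep - conv_mcep
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

def f0_rmse_and_corr(ref_f0, conv_f0):
    """F0 RMSE and Pearson correlation over frames voiced in both signals.

    Unvoiced frames are assumed to be marked with F0 = 0.
    """
    voiced = (ref_f0 > 0) & (conv_f0 > 0)
    r, c = ref_f0[voiced], conv_f0[voiced]
    rmse = float(np.sqrt(np.mean((r - c) ** 2)))
    corr = float(np.corrcoef(r, c)[0, 1])
    return rmse, corr

# Toy inputs, not data from this page.
mcd = mel_cepstral_distortion(np.zeros((3, 2)), np.ones((3, 2)))
rmse, corr = f0_rmse_and_corr(
    np.array([100.0, 0.0, 120.0, 140.0]),
    np.array([110.0, 90.0, 0.0, 150.0]),
)
```

Lower MCD and F0 RMSE mean the converted speech is closer to the reference, while a higher F0 correlation means the pitch contour follows the reference more closely, which matches the arrows in the table headers.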

Listening Test

Audio Samples

Pair EL01 → NL01

Sample 1 他捐了很多衣物給災區 tā juān le hěn duō yī wù gěi zāi qū

EL Speech

MT-CLDNN

TFS

PT

NL Speech

Sample 2 我把不用的家具送人了 wǒ bǎ bù yòng de jiā jù sòng rén le

EL Speech

MT-CLDNN

TFS

PT

NL Speech

Sample 3 那個牆上掛著一幅油畫 nà ge qiáng shàng guà zhe yī fú yóu huà

EL Speech

MT-CLDNN

TFS

PT

NL Speech

Pair EL01 → NL02

Sample 1 他捐了很多衣物給災區 tā juān le hěn duō yī wù gěi zāi qū

EL Speech

MT-CLDNN

TFS

PT

NL Speech

Sample 2 我幫他把雞蛋放入冰箱 wǒ bāng tā bǎ jī dàn fàng rù bīng xiāng

EL Speech

MT-CLDNN

TFS

PT

NL Speech

Sample 3 那個牆上掛著一幅油畫 nà ge qiáng shàng guà zhe yī fú yóu huà

EL Speech

MT-CLDNN

TFS

PT

NL Speech