Abstract
Electrolaryngeal (EL) speech is produced using an electrolarynx, a device that substitutes for vocal fold vibration. Because of limitations in its excitation signal, EL speech is less natural and less intelligible than natural speech (NL speech). This work proposes a novel EL speech voice conversion (ELVC) system based on sequence-to-sequence (seq2seq) modeling with text-to-speech (TTS) pretraining. The seq2seq model uses an attention mechanism to jointly perform representation learning and alignment, while TTS pretraining enables efficient training with limited data. Experimental results demonstrate notable improvements over a well-known frame-wise ELVC baseline.