Abstract. Speech disentanglement is crucial for tasks such as controllable speech synthesis, voice conversion, and speech emotion conversion. While many self-supervised or weakly supervised speech pre-trained models have shown remarkable performance in speech recognition and speaker identification, their potential for speech disentanglement remains underexplored. Pre-trained models used in speech synthesis typically focus on extracting strong task-specific representations, overlooking their capacity for disentanglement in controllable speech generation. To address this gap, we propose a four-stage decoder that integrates a speech disentanglement module, a progressive generator, an acoustic compensator, and a flow predictor; it leverages the layer-wise task characteristics of pre-trained models and enables controllable emotion and voice conversion with any pre-trained speech encoder. We evaluate this framework with six established speech pre-trained models, and experimental results demonstrate that several of them significantly outperform baseline methods on emotion and voice conversion within our framework. Moreover, our framework serves as a valuable benchmark for evaluating pre-trained models' ability to disentangle emotional state, speaker, and content from speech. Our code is available at https://github.com/wangtianrui/PM-EVC.
Our four-stage emotional voice conversion framework is trained with self-supervised mel-spectrogram reconstruction; at inference, altering the emotional-state and/or speaker representation enables emotion and voice conversion (EVC), emotion conversion (EC), and voice conversion (VC).
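As a minimal sketch of this swap-at-inference idea (every function and variable name below is a hypothetical illustration, not the released PM-EVC API), conversion amounts to keeping the source content representation while substituting the speaker and/or emotion representations extracted from reference utterances:

```python
import numpy as np

# Hypothetical sketch of the four-stage pipeline described above. None of
# these names come from the released code; they only illustrate how swapping
# disentangled representations yields EVC, EC, or VC.
rng = np.random.default_rng(0)

def encode(wav):
    """Stand-in for any pre-trained speech encoder (frame-level features)."""
    return rng.standard_normal((len(wav) // 320, 768))

def disentangle(feats):
    """Stage 1 (toy): split features into frame-level content plus
    utterance-level speaker and emotion vectors."""
    content = feats               # (T, D) frame-level
    speaker = feats.mean(axis=0)  # (D,) utterance-level
    emotion = feats.std(axis=0)   # (D,) utterance-level
    return content, speaker, emotion

def decode(content, speaker, emotion):
    """Stages 2-4 (toy): progressive generator -> acoustic compensator ->
    flow predictor, collapsed into one mel-spectrogram 'reconstruction'."""
    cond = speaker + emotion                            # global conditioning
    coarse = content + cond                             # progressive generation
    return coarse + 0.1 * rng.standard_normal(coarse.shape)  # compensation/flow

src = rng.standard_normal(16000)      # source utterance
spk_ref = rng.standard_normal(16000)  # target-speaker reference
emo_ref = rng.standard_normal(16000)  # target-emotion reference

content, src_spk, src_emo = disentangle(encode(src))
_, tgt_spk, _ = disentangle(encode(spk_ref))
_, _, tgt_emo = disentangle(encode(emo_ref))

mel_evc = decode(content, tgt_spk, tgt_emo)  # EVC: swap speaker and emotion
mel_ec = decode(content, src_spk, tgt_emo)   # EC: swap emotion only
mel_vc = decode(content, tgt_spk, src_emo)   # VC: swap speaker only
print(mel_evc.shape, mel_ec.shape, mel_vc.shape)
```

Note that the content representation fixes the frame count in this sketch, so all three conversions produce outputs of the same length as the source.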
Our framework enables simultaneous conversion of emotion and speaker, with the target emotion and target speaker drawn from different reference utterances.
In all demo tables below, ProgRE and Wav2vec2.0 are evaluated under our framework, while ConsistencyEVC, FACodec, and Wav2vec2.0-EVC are baseline methods.

| Source Speech (Happy) | Target Speaker (Fear) | Target Emotion (Natural) | ProgRE | Wav2vec2.0 | ConsistencyEVC | FACodec | Wav2vec2.0-EVC |
|---|---|---|---|---|---|---|---|
| *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| **Ablation Study 1 (Without AC)** | | | *(audio)* | *(audio)* | | | |
| **Ablation Study 2 (Without Progressive Manner)** | | | *(audio)* | *(audio)* | | | |

Baselines that cannot perform simultaneous emotion and voice conversion are marked "This task is not supported."
Our framework enables stable voice conversion while preserving the original emotional state.
| Source Speech | Target Speaker | ProgRE | Wav2vec2.0 | ConsistencyEVC | FACodec | Wav2vec2.0-EVC |
|---|---|---|---|---|---|---|
| *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| **Ablation Study 1 (Without AC)** | | *(audio)* | *(audio)* | | | |
| **Ablation Study 2 (Without Progressive Manner)** | | *(audio)* | *(audio)* | | | |
Our framework enables stable conversion of the global emotional state while preserving the original speaker identity.
Note that we define emotional state at the utterance level, so conversion does not affect speech duration; it is primarily reflected in stress and pitch variation. A toy sketch of this point follows the table below.
| Source Speech (Happy) | Target Emotion (Natural) | ProgRE | Wav2vec2.0 | ConsistencyEVC | FACodec | Wav2vec2.0-EVC |
|---|---|---|---|---|---|---|
| *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* | *(audio)* |
| **Ablation Study 1 (Without AC)** | | *(audio)* | *(audio)* | | | |
| **Ablation Study 2 (Without Progressive Manner)** | | *(audio)* | *(audio)* | | | |
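To make the utterance-level definition concrete, here is a hedged toy sketch (the mean pooling and broadcast are illustrative assumptions, not the paper's exact layers): because the emotion representation is pooled over time into a single vector and then broadcast back across frames, swapping it cannot change the frame count, and hence the duration, of the converted speech; only frame-wise acoustics such as pitch and stress move.

```python
import numpy as np

# Toy illustration (assumed, not the released implementation): an
# utterance-level emotion vector is mean-pooled over time, so swapping it
# cannot change the frame count (duration) of the output.
T, D = 200, 768  # 200 frames of 768-dim features
feats = np.random.default_rng(1).standard_normal((T, D))

emotion = feats.mean(axis=0)             # (D,) one vector per utterance
conditioned = feats + emotion[None, :]   # broadcast over all T frames

assert conditioned.shape == (T, D)       # frame count (duration) is preserved
```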