A Controllable Emotion Voice Conversion Framework with Pre-trained Speech Representations

    Abstract. Speech disentanglement is crucial for tasks like controllable speech synthesis, voice conversion, and speech emotion conversion. While many self-supervised or weakly supervised pre-trained speech models have shown remarkable performance in speech recognition and speaker identification, their potential for speech disentanglement remains underexplored. Typically, pre-trained speech models used in speech synthesis focus on extracting strong task-specific representations, often overlooking their capacity for speech disentanglement in controllable speech generation. To address this gap, we propose a four-stage decoder model that integrates a speech disentanglement module, a progressive generator, an acoustic compensator, and a flow predictor, leveraging the layer-wise task characteristics of pre-trained models and enabling controllable emotion and voice conversion with any pre-trained speech encoder. We evaluate this framework using six established pre-trained speech models, and experimental results demonstrate that several pre-trained models significantly outperform baseline methods within our framework on emotion and voice conversion. Moreover, our framework serves as a valuable benchmark for evaluating pre-trained models' capabilities in disentangling emotional state, speaker, and content from speech. Our code is available at https://github.com/wangtianrui/PM-EVC.

    Model Overview


    Our four-stage emotional voice conversion framework is trained with self-supervised mel-spectrogram reconstruction. At inference, replacing the emotional-state or speaker representation enables emotion and voice conversion (EVC), emotion conversion (EC), and voice conversion (VC).
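The swap-at-inference idea can be sketched in a few lines. This is an illustrative toy, not the actual PM-EVC API: `disentangle` and `convert` are hypothetical names standing in for the disentanglement module and the downstream decoder stages, and the "representations" here are plain strings rather than learned embeddings.

```python
def disentangle(utterance):
    """Toy stand-in for the disentanglement module: splits an utterance
    into separate content, speaker, and emotion factors."""
    return {"content": utterance["content"],
            "speaker": utterance["speaker"],
            "emotion": utterance["emotion"]}

def convert(source, target_speaker=None, target_emotion=None):
    """Recombine factors from different utterances before decoding."""
    factors = disentangle(source)
    if target_speaker is not None:      # VC: swap the speaker factor
        factors["speaker"] = disentangle(target_speaker)["speaker"]
    if target_emotion is not None:      # EC: swap the emotion factor
        factors["emotion"] = disentangle(target_emotion)["emotion"]
    return factors                      # in PM-EVC, fed to the decoder stages

src = {"content": "hello", "speaker": "A", "emotion": "happy"}
spk = {"content": "x",     "speaker": "B", "emotion": "fear"}
emo = {"content": "y",     "speaker": "C", "emotion": "neutral"}

# EVC: speaker and emotion come from two different reference utterances
out = convert(src, target_speaker=spk, target_emotion=emo)
# out keeps the source content but takes speaker "B" and emotion "neutral"
```

Setting only one of the two targets recovers plain VC or EC, which is why the same trained model supports all three tasks below.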



    Expressive Voice Conversion (Simultaneous Conversion of Emotion and Speaker)

    Our framework enables simultaneous conversion of emotion and speaker, with the emotion and speaker originating from different speech samples.

    Audio samples (inputs: Source Speech (Happy) · Target Speaker (Fear) · Target Emotion (Natural)):
    Pre-trained models under our framework: ProgRE, Wav2vec2.0, FACodec
    Baseline methods: ConsistencyEVC, Wav2vec2.0-EVC (this task is not supported by the baselines)
    Ablation Study 1 (Without AC)

    Ablation Study 2 (Without Progressive Manner)



    Voice Conversion

    Our framework enables stable voice conversion while preserving the original emotional state.

    Audio samples (inputs: Source Speech · Target Speaker):
    Pre-trained models under our framework: ProgRE, Wav2vec2.0, FACodec
    Baseline methods: ConsistencyEVC, Wav2vec2.0-EVC
    Ablation Study 1 (Without AC)

    Ablation Study 2 (Without Progressive Manner)



    Emotion Conversion

    Our framework enables stable, global emotional-state conversion while preserving the original speaker identity.
    Note that we define the emotional state at the utterance level; it does not affect speech duration and is reflected primarily in stress and pitch variation.
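Why an utterance-level emotional state cannot change duration is easy to see: a single emotion vector is broadcast over every frame, so swapping it alters the frame-wise conditioning (hence stress and pitch) but not the number of frames. A minimal sketch, with arbitrary dimensions and random vectors standing in for learned embeddings:

```python
import numpy as np

T, D = 120, 16                      # frames, feature dim (arbitrary toy sizes)
content = np.random.randn(T, D)     # frame-level content features
emo_happy = np.random.randn(D)      # utterance-level emotion embeddings
emo_neutral = np.random.randn(D)

# Broadcasting adds the same (D,) emotion vector to all T frames,
# so the conditioned features keep the source's frame count.
cond_happy = content + emo_happy
cond_neutral = content + emo_neutral

assert cond_happy.shape == cond_neutral.shape == (T, D)  # duration preserved
```

Frame-level emotion representations, by contrast, could in principle require re-alignment; the utterance-level definition sidesteps that entirely.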

    Audio samples (inputs: Source Speech (Happy) · Target Emotion (Natural)):
    Pre-trained models under our framework: ProgRE, Wav2vec2.0, FACodec
    Baseline methods: ConsistencyEVC, Wav2vec2.0-EVC
    Ablation Study 1 (Without AC)

    Ablation Study 2 (Without Progressive Manner)