A Controllable Emotion Voice Conversion Framework with Pre-trained Speech Representations

    Abstract. Speech disentanglement is crucial for tasks like controllable speech synthesis, voice conversion, and speech emotion conversion. While many self-supervised or weakly supervised pre-trained speech models have shown remarkable performance in speech recognition and speaker identification, their potential for speech disentanglement remains underexplored. Typically, pre-trained speech models used in speech synthesis focus on extracting strong task-specific representations, often overlooking their capacity for speech disentanglement in controllable speech generation. To address this gap, we propose a four-stage decoder model that integrates a speech disentanglement module, a progressive generator, an acoustic compensator, and a flow predictor, leveraging the layer-wise task characteristics of pre-trained models and enabling controllable emotion and voice conversion with any pre-trained speech encoder. We evaluate this framework using six established pre-trained speech models, and experimental results demonstrate that several pre-trained models significantly outperform baseline methods within our framework on emotion and voice conversion. Moreover, our framework serves as a valuable benchmark for evaluating pre-trained models' capabilities in disentangling emotional state, speaker, and content from speech. Our code is available at https://github.com/wangtianrui/PM-EVC.

    Model Overview


    Our four-stage emotional voice conversion framework is trained with self-supervised mel-spectrogram reconstruction. At inference, replacing the emotional-state or speaker representation enables emotion and voice conversion (EVC), emotion conversion (EC), and voice conversion (VC).
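The swap-at-inference idea can be sketched in a few lines. This is an illustrative toy, not the actual PM-EVC API: `disentangle` and `convert` are hypothetical names standing in for the disentanglement module and the downstream decoder stages, and the "representations" here are plain strings rather than learned embeddings.

```python
def disentangle(utterance):
    """Toy stand-in for the disentanglement module: splits an utterance
    into separate content, speaker, and emotion factors."""
    return {"content": utterance["content"],
            "speaker": utterance["speaker"],
            "emotion": utterance["emotion"]}

def convert(source, target_speaker=None, target_emotion=None):
    """Recombine factors from different utterances before decoding."""
    factors = disentangle(source)
    if target_speaker is not None:      # VC: swap the speaker factor
        factors["speaker"] = disentangle(target_speaker)["speaker"]
    if target_emotion is not None:      # EC: swap the emotion factor
        factors["emotion"] = disentangle(target_emotion)["emotion"]
    return factors                      # in PM-EVC, fed to the decoder stages

src = {"content": "hello", "speaker": "A", "emotion": "happy"}
spk = {"content": "x",     "speaker": "B", "emotion": "fear"}
emo = {"content": "y",     "speaker": "C", "emotion": "neutral"}

# EVC: speaker and emotion come from two different reference utterances
out = convert(src, target_speaker=spk, target_emotion=emo)
# out keeps the source content but takes speaker "B" and emotion "neutral"
```

Setting only one of the two targets recovers plain VC or EC, which is why the same trained model supports all three tasks below.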



    Expressive Voice Conversion (Simultaneous Conversion of Emotion and Speaker)

    Our framework enables simultaneous conversion of emotion and speaker, with the emotion and speaker originating from different speech samples.

    Audio samples (inputs: Source Speech (Happy) · Target Speaker (Fear) · Target Emotion (Natural)):
    Pre-trained models under our framework: ProgRE, Wav2vec2.0, FACodec
    Baseline methods: ConsistencyEVC, Wav2vec2.0-EVC (this task is not supported by the baselines)
    Ablation Study 1 (Without AC)

    Ablation Study 2 (Without Progressive Manner)



    Voice Conversion

    Our framework enables stable voice conversion while preserving the original emotional state.

    Audio samples (inputs: Source Speech · Target Speaker):
    Pre-trained models under our framework: ProgRE, Wav2vec2.0, FACodec
    Baseline methods: ConsistencyEVC, Wav2vec2.0-EVC
    Ablation Study 1 (Without AC)

    Ablation Study 2 (Without Progressive Manner)



    Emotion Conversion

    Our framework enables stable, global emotional-state conversion while preserving the original speaker identity.
    Note that we define the emotional state at the utterance level; it does not affect speech duration and is reflected primarily in stress and pitch variation.
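Why an utterance-level emotional state cannot change duration is easy to see: a single emotion vector is broadcast over every frame, so swapping it alters the frame-wise conditioning (hence stress and pitch) but not the number of frames. A minimal sketch, with arbitrary dimensions and random vectors standing in for learned embeddings:

```python
import numpy as np

T, D = 120, 16                      # frames, feature dim (arbitrary toy sizes)
content = np.random.randn(T, D)     # frame-level content features
emo_happy = np.random.randn(D)      # utterance-level emotion embeddings
emo_neutral = np.random.randn(D)

# Broadcasting adds the same (D,) emotion vector to all T frames,
# so the conditioned features keep the source's frame count.
cond_happy = content + emo_happy
cond_neutral = content + emo_neutral

assert cond_happy.shape == cond_neutral.shape == (T, D)  # duration preserved
```

Frame-level emotion representations, by contrast, could in principle require re-alignment; the utterance-level definition sidesteps that entirely.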

    Audio samples (inputs: Source Speech (Happy) · Target Emotion (Natural)):
    Pre-trained models under our framework: ProgRE, Wav2vec2.0, FACodec
    Baseline methods: ConsistencyEVC, Wav2vec2.0-EVC
    Ablation Study 1 (Without AC)

    Ablation Study 2 (Without Progressive Manner)