Ablation Study

Due to space limitations, Table III in the paper reports only the last column, Average. To avoid any misunderstanding, we provide the complete results for all four tasks here. Note that the generation module is structured to make better use of the information extracted by the pre-trained disentanglement module, enabling controllable, high-quality speech generation. What we aim to evaluate is the ability to disentangle speech characteristics, and we expect this capability to reside in the pre-trained model itself rather than to be the goal of the generation module.

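| Name | EC ER | EC SSS | EC WER | EC DNS | VC ER | VC SSS | VC WER | VC DNS | EVC ER | EVC SSS | EVC WER | EVC DNS | Reconstruction ER | Reconstruction SSS | Reconstruction WER | Reconstruction DNS | Average ER | Average SSS | Average WER | Average DNS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| whole framework (24-layer ProgRE) | 68.03 | 72.49 | 4.57 | 3.13 | 94.02 | 67.95 | 4.02 | 3.05 | 68.99 | 69.17 | 5.21 | 2.89 | 95.91 | 86.85 | 4.21 | 3.04 | 81.74 | 74.12 | 4.50 | 3.03 |
| - Acoustic Compensator (Comp) | 69.10 | 73.26 | 5.18 | 3.00 | 93.95 | 69.13 | 4.99 | 2.95 | 70.43 | 72.44 | 5.68 | 2.84 | 93.38 | 82.24 | 6.25 | 2.95 | 81.72 | 74.27 | 5.53 | 2.93 |
| - FlowPredictor (Flow) | 69.23 | 72.61 | 2.20 | 2.98 | 93.72 | 70.68 | 2.51 | 2.80 | 69.24 | 72.28 | 2.39 | 2.58 | 96.24 | 87.06 | 2.00 | 3.01 | 82.11 | 75.66 | 2.28 | 2.84 |
| - Flow - Comp | 70.94 | 73.99 | 2.47 | 2.95 | 94.04 | 71.82 | 2.74 | 2.76 | 70.52 | 72.85 | 2.61 | 2.54 | 92.87 | 83.69 | 3.71 | 2.89 | 82.09 | 75.59 | 2.88 | 2.79 |
| - Flow - Comp - Progressive (in generator) | 70.14 | 73.46 | 2.59 | 2.96 | 93.97 | 71.01 | 2.81 | 2.77 | 69.72 | 72.66 | 2.77 | 2.55 | 93.68 | 83.44 | 3.86 | 2.90 | 81.88 | 75.14 | 3.01 | 2.80 |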
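
As a sanity check on how the Average column relates to the per-task columns, the sketch below recomputes it for the whole-framework row. It assumes that Average is the unweighted mean of each metric over the four tasks (EC, VC, EVC, Reconstruction); the text does not state this, but the reported numbers agree with it to two decimals. The metric names ER, SSS, WER, DNS follow the column header as we read it, and `task_average` is a hypothetical helper introduced only for illustration.

```python
# Per-task metrics for the "whole framework (24-layer ProgRE)" row,
# in the order (ER, SSS, WER, DNS), copied from the results table.
whole_framework = {
    "EC": (68.03, 72.49, 4.57, 3.13),
    "VC": (94.02, 67.95, 4.02, 3.05),
    "EVC": (68.99, 69.17, 5.21, 2.89),
    "Reconstruction": (95.91, 86.85, 4.21, 3.04),
}

# Values reported in the Average column for the same row.
reported_average = (81.74, 74.12, 4.50, 3.03)


def task_average(rows):
    """Unweighted mean of each metric across tasks.

    Assumption: this is how the Average column is defined.
    """
    n = len(rows)
    return tuple(sum(column) / n for column in zip(*rows.values()))


computed = task_average(whole_framework)

for name, got, want in zip(("ER", "SSS", "WER", "DNS"), computed, reported_average):
    print(f"{name}: computed {got:.4f}, reported {want:.2f}")
```

The same check can be repeated for any other row by swapping in its per-task values.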