Wavebender GAN

An architecture for phonetically meaningful speech manipulation

Gustavo Teodoro Döhler Beck, Ulme Wennberg, Zofia Malisz, Gustav Eje Henter

Summary

The goal of this work is to develop new speech technology that meets the needs of speech-sciences research. Specifically, our paper presents Wavebender GAN, a deep-learning architecture for manipulating phonetically relevant speech parameters whilst remaining perceptually close to natural speech.

Our example system demonstrated on this webpage was trained on the LJ Speech dataset and uses the HiFi-GAN neural vocoder to produce waveforms, but the proposed method applies to other training data, vocoders, and speech parameters as well.

For more information, please see our paper at ICASSP 2022.

Visual overview

[Figure: Wavebender GAN]

Code

Code will be made available in our GitHub repository shortly.

Copy synthesis

The following audio stimuli illustrate the effects of copy synthesis (i.e., speech reconstruction) using different system components. No speech manipulation is performed at this point. All files are taken directly from the listening test in Sec. 5.3 of the paper.

Type	Recorded natural speech	Wavebender GAN→HiFi-GAN	HiFi-GAN
Sentence 1
Sentence 2
Sentence 3
Sentence 4
Sentence 5
Sentence 6
Sentence 7
Sentence 8
Sentence 9
Sentence 10

Speech manipulation

The following proof-of-concept examples illustrate the effects of using Wavebender GAN to manipulate different core speech parameters. For consistency, all manipulations were performed on the same LJ Speech utterance from the test set, namely LJ026-0014.

Pitch

Wavebender GAN can lower or raise the pitch (f0 contour) of the speech. These examples demonstrate global scaling of the f0 parameter.

-30%	-15%	LJ026-0014	+15%	+30%
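As a rough illustration of what global f0 scaling means at the parameter level, the following Python sketch multiplies every voiced frame of an f0 contour by a constant factor. It is not the authors' code: the function name `scale_f0`, the toy contour, and the convention that unvoiced frames are stored as 0 are all assumptions made for this example. In the actual system, the scaled contour would then be passed through Wavebender GAN and the vocoder.

```python
import numpy as np

def scale_f0(f0, factor):
    """Globally scale an f0 contour (in Hz) by a constant factor.

    Assumption: unvoiced frames are encoded as 0 and are left unchanged.
    """
    f0 = np.asarray(f0, dtype=float)
    return np.where(f0 > 0, f0 * factor, f0)

# Hypothetical five-frame contour; zeros mark unvoiced frames.
contour = np.array([0.0, 200.0, 210.0, 0.0, 190.0])

raised = scale_f0(contour, 1.15)   # corresponds to the "+15%" condition
lowered = scale_f0(contour, 0.70)  # corresponds to the "-30%" condition
```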

Formants

These examples illustrate the effects of locally scaling the first formant, F1, for the last word of the utterance (i.e., “forms”). Since F1 and F2 are strongly correlated, we use the method described in Sec. 4.2 of the paper to predict new F2 values when manipulating F1.

	“Fools” (-30% of F1)	“Forms”	“Frogs” (+30% of F1)
Last word only
Manipulation in context
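Local scaling differs from global scaling only in that the factor is applied to a limited range of frames. The Python sketch below shows this for a hypothetical F1 track; the function name `scale_segment`, the frame indices, and the formant values are illustrative assumptions. The F2 prediction step from Sec. 4.2 of the paper is deliberately not reproduced here.

```python
import numpy as np

def scale_segment(track, start, end, factor):
    """Scale the frames track[start:end] by `factor`; other frames are untouched."""
    out = np.asarray(track, dtype=float).copy()
    out[start:end] *= factor
    return out

# Hypothetical F1 track in Hz; assume the last word spans frames 3 and 4.
f1 = np.array([500.0, 520.0, 480.0, 600.0, 620.0])

f1_lowered = scale_segment(f1, 3, 5, 0.7)  # -30% of F1 on the last word
f1_raised = scale_segment(f1, 3, 5, 1.3)   # +30% of F1 on the last word
```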

Spectral centroid

The following example suggests that global manipulation of spectral centroid (here multiplied by 1.3), although accurate in terms of relative MSE, is not particularly meaningful. The most notable effect is a kind of lisp, wherein [ʃ] becomes [θ]. This is to be expected, since spectral moments acoustically define places of articulation in English fricatives; see, e.g., A. Jongman, R. Wayland, and S. Wong, “Acoustic characteristics of English fricatives,” J. Acoust. Soc. Am., vol. 108, pp. 1252–1263, 2000, among others.
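For readers unfamiliar with the parameter, the spectral centroid of a frame is commonly computed as the magnitude-weighted mean frequency of its spectrum. The Python sketch below uses that common definition; whether the paper weights by magnitude or power, and how it frames and windows the signal, is not stated on this page, so those choices are assumptions. The test tone is synthetic and only serves to show that the centroid of a pure sinusoid sits at its frequency.

```python
import numpy as np

def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency (Hz) of a single frame."""
    magnitude = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    total = magnitude.sum()
    if total == 0:
        return 0.0  # silent frame: centroid is undefined, return 0 by convention
    return float((freqs * magnitude).sum() / total)

sample_rate = 22050           # LJ Speech sampling rate
n = 1024                      # illustrative frame length
tone_hz = 100 * sample_rate / n  # frequency that falls exactly on FFT bin 100
t = np.arange(n) / sample_rate
centroid = spectral_centroid(np.sin(2 * np.pi * tone_hz * t), sample_rate)
```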

Spectral slope

Global scaling of the spectral slope appears to mainly affect signal gain, as shown in the example below (spectral slope multiplied by 0.2). This effect might be a consequence of the significant correlation between speech loudness and spectral slope in human speech production.

Disentanglement

To quantify the extent of speech-parameter disentanglement with Wavebender GAN, this matrix shows the relative MSE for all speech parameters (vertical axis) that results from globally scaling each speech parameter (horizontal axis) by a factor of 1.3, whilst keeping all other speech parameters fixed. As a baseline, the first column shows relative MSE values during copy synthesis (no manipulation).

[Figure: Disentanglement matrix]
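One entry of such a matrix could be computed along the following lines. This Python sketch assumes that relative MSE means the mean squared error between the requested and the measured parameter track, normalised by the mean square of the requested track; the exact definition used in the paper may differ, and the function name and numbers are illustrative only.

```python
import numpy as np

def relative_mse(target, measured):
    """MSE between target and measured tracks, divided by the mean square of the target.

    Assumption: this normalisation is one plausible reading of "relative MSE".
    """
    target = np.asarray(target, dtype=float)
    measured = np.asarray(measured, dtype=float)
    return float(np.mean((measured - target) ** 2) / np.mean(target ** 2))

# Hypothetical requested track and the track re-measured from the synthesised audio.
requested = np.array([100.0, 200.0])
remeasured = np.array([110.0, 220.0])

error = relative_mse(requested, remeasured)
```

Under this reading, a well-disentangled system would yield small values in every row, since each speech parameter measured from the output would stay close to its requested value regardless of which parameter was scaled.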

Citation information

@inproceedings{beck2022wavebender,
  title={Wavebender {GAN}: {A}n architecture for phonetically meaningful speech manipulation},
  author={Döhler Beck, Gustavo Teodoro and Wennberg, Ulme and Malisz, Zofia and Henter, Gustav Eje},
  booktitle={Proc. ICASSP},
  year={2022}
}