
A unified accent estimation method based on multi-task learning for Japanese text-to-speech

Accepted to Interspeech 2022

Authors

  • Byeongseon Park
  • Ryuichi Yamamoto
  • Kentaro Tachibana

Abstract

We propose a unified accent estimation method for Japanese text-to-speech (TTS). Unlike conventional two-stage methods, which separately train two models for predicting accent phrase boundaries and accent nucleus positions, our method merges the two models and jointly optimizes the unified model in a multi-task learning framework. Furthermore, considering the hierarchical linguistic structure of intonation phrases (IPs), accent phrases (APs), and accent nuclei (ANs), we generalize the proposed approach to simultaneously model IP boundaries together with the accent information. Objective evaluation results show that the proposed method achieves an accent estimation accuracy of 80.4%, which is 6.67% higher than that of the conventional two-stage method. When the proposed method is incorporated into a neural TTS framework, the system achieves a mean opinion score of 4.29 for prosody naturalness.

[Figure: model overview]

Demo

TTS setup

  • Acoustic model: FastSpeech2 [1]
  • Vocoder: Parallel WaveGAN [2]

The detailed model structures and training conditions of these two models were the same as those in [3].
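
For context, the sketch below shows how the estimated accent features would flow through this TTS setup. AccentEstimator, FastSpeech2, and ParallelWaveGAN here stand for hypothetical wrapper interfaces we introduce for illustration; they are not the actual APIs of [1]-[3].

    # A minimal sketch of the synthesis pipeline, assuming hypothetical
    # wrapper objects for the accent estimator, FastSpeech2 [1], and
    # Parallel WaveGAN [2]; method names are illustrative only.

    def synthesize(text, accent_estimator, acoustic_model, vocoder):
        """Text -> accent features -> mel-spectrogram -> waveform."""
        # Front end: estimate IP/AP boundaries and accent nuclei from text.
        accent_features = accent_estimator.predict(text)
        # Acoustic model: generate a mel-spectrogram conditioned on the
        # phoneme sequence and the estimated accent features.
        mel = acoustic_model.infer(text, accent_features)
        # Vocoder: convert the mel-spectrogram to a waveform.
        return vocoder.infer(mel)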

Target tasks

[Figure: accent structure]
  • IPs: Intonation phrases
  • APs: Accent phrases
  • ANs: Accent nuclei

Systems used for comparison

Models

Model   | Encoder | Decoder      | Task
(a) [4] | -       | CRF          | AP
(b) [4] | -       | CRF          | AN
(c)     | Bi-LSTM | CRF          | IP
(d)     | Bi-LSTM | CRF          | AP
(e)     | Bi-LSTM | AR           | AN
(f)     | Bi-LSTM | CRF, AR      | AP+AN
(g)     | Bi-LSTM | CRF, CRF, AR | IP+AP+AN
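
As a concrete illustration of the shared-encoder design in the table above, here is a minimal PyTorch sketch in the spirit of model (g): one Bi-LSTM encoder feeding three task-specific heads that are optimized jointly. The paper's CRF and AR (autoregressive) decoders are simplified to per-token linear classifiers, and all names and hyperparameters are our own assumptions rather than the authors' implementation.

    import torch
    import torch.nn as nn

    class MultiTaskAccentEstimator(nn.Module):
        """Sketch of model (g): a shared Bi-LSTM encoder with three task
        heads (IP, AP, AN). The CRF/AR decoders of the paper are
        simplified here to per-token binary classifiers."""

        def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.encoder = nn.LSTM(embed_dim, hidden_dim,
                                   batch_first=True, bidirectional=True)
            self.ip_head = nn.Linear(2 * hidden_dim, 2)  # IP boundary or not
            self.ap_head = nn.Linear(2 * hidden_dim, 2)  # AP boundary or not
            self.an_head = nn.Linear(2 * hidden_dim, 2)  # accent nucleus or not

        def forward(self, tokens):  # tokens: (batch, seq_len) int64
            h, _ = self.encoder(self.embed(tokens))  # (batch, seq_len, 2*hidden)
            return self.ip_head(h), self.ap_head(h), self.an_head(h)

    def multitask_loss(logits, labels):
        """Joint objective: the sum of the three per-task losses."""
        criterion = nn.CrossEntropyLoss()
        ip, ap, an = logits
        return (criterion(ip.transpose(1, 2), labels["ip"]) +
                criterion(ap.transpose(1, 2), labels["ap"]) +
                criterion(an.transpose(1, 2), labels["an"]))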
Systems

System                  | Components
Reference (Recording) ¹ | -
Reference (TTS) ²       | -
(A)                     | (a), (b), (c)
(B)                     | (c), (d), (e)
(C)                     | (c), (f)
(D)                     | (g)
Audio samples (Japanese)
Tags for simplified output

  • + : accent with high pitch
  • - : accent with low pitch
  • / : accent phrase boundary
  • # : intonation phrase boundary
  • _ : missing phrase boundary
  • A blue tag marks a correct prediction (e.g., +: correct accent); a red tag marks an incorrect one (e.g., +: incorrect accent).
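
To make this notation concrete, the short sketch below parses a simplified-output string into its phrase structure; the function name and output layout are our own, chosen for illustration.

    def parse_simplified_output(line):
        """Parse a simplified output such as "ni- chi+ da+ i+ / a- me+ fu+ to+"
        into a list of intonation phrases, each a list of accent phrases,
        each a list of (mora, pitch) pairs with pitch 'H' or 'L'."""
        ips, aps, morae = [], [], []
        for token in line.split():
            if token == "#":           # intonation phrase boundary
                aps.append(morae)
                ips.append(aps)
                aps, morae = [], []
            elif token == "/":         # accent phrase boundary
                aps.append(morae)
                morae = []
            elif token in (".", "_"):  # sentence end / missing-boundary tag
                continue
            else:                      # mora with a pitch tag, e.g. "ni-" or "chi+"
                morae.append((token[:-1], "H" if token.endswith("+") else "L"))
        if morae:
            aps.append(morae)
        if aps:
            ips.append(aps)
        return ips

    # Two accent phrases within one intonation phrase:
    print(parse_simplified_output("ni- chi+ da+ i+ / a- me+ fu+ to+"))
    # [[[('ni', 'L'), ('chi', 'H'), ('da', 'H'), ('i', 'H')],
    #   [('a', 'L'), ('me', 'H'), ('fu', 'H'), ('to', 'H')]]]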

Sample 1: "日大アメフト反則問題。" ("The Nihon University American football foul scandal.")
System | Simplified output
Reference(TTS) ni- chi+ da+ i+ / a- me+ fu+ to+ / ha- N+ so+ ku+ mo+ N- da- i- .
System(A) ni- chi+ da+ i+ _ a+ me- fu- to- / ha- N+ so+ ku+ mo+ N- da- i- .
System(B) ni- chi+ da+ i+ _ a- me- fu- to- / ha- N+ so+ ku+ mo+ N+ da+ i+ .
System(C) ni- chi+ da+ i+ / a- me+ fu+ to+ / ha- N+ so+ ku+ mo+ N- da- i- .
System(D) ni- chi+ da+ i+ / a- me+ fu+ to+ / ha- N+ so+ ku+ mo+ N- da- i- .

Sample 2: "あなたとおはなししている時以外のスケジュールは、内緒ですよ。" ("My schedule outside the times I'm talking with you is a secret.")
System | Simplified output
Reference(Recording) -
Reference(TTS) a- na+ ta- to- / o- ha+ na+ shi+ shi+ te+ i+ ru+ / to- ki+ i+ ga- i- no- / su- ke+ jyuu+ ru- wa- # na- i+ shyo+ de+ su- yo- .
System(A) a- na+ ta- to- / o- ha+ na+ shi+ shi+ te+ i+ ru+ / to+ ki- / i+ ga- i- no- / su- ke+ jyuu+ ru- wa- # na- i+ shyo+ de+ su- yo- .
System(B) a- na+ ta- to- / o- ha+ na+ shi+ shi+ te+ i+ ru+ / to+ ki- i- ga- i- no- / su+ ke- jyuu- ru- wa- # na- i+ shyo+ de- su- yo- .
System(C) a- na+ ta- to- / o- ha+ na+ shi+ shi+ te+ i+ ru+ / to+ ki- / i+ ga- i- no- / su- ke+ jyuu+ ru- wa- # na- i+ shyo+ de+ su+ yo+ .
System(D) a- na+ ta- to- / o- ha+ na+ shi+ shi+ te+ i+ ru+ / to+ ki- / i+ ga- i- no- / su- ke+ jyuu+ ru- wa- # na- i+ shyo+ de+ su+ yo+ .

Sample 3: "後あとトラブルになる可能性が高いですよ。" ("There is a high chance this will lead to trouble later on.")
System | Simplified output
Reference(Recording) -
Reference(TTS) a- to+ a+ to+ / to- ra+ bu- ru- ni- / na+ ru- / ka- noo+ see+ ga+ / ta- ka+ i- de- su- yo- .
System(A) a- to+ a+ to- / to- ra+ bu- ru- ni- / na+ ru- / ka- noo+ see+ ga+ / ta- ka+ i- de- su- yo- .
System(B) a+ to- a- to- _ to- ra- bu- ru- ni- / na+ ru- / ka- noo+ see+ ga+ / ta- ka+ i- de- su- yo- .
System(C) a+ to- a- to- / to- ra+ bu- ru- ni- / na+ ru- / ka- noo+ see+ ga+ / ta- ka+ i- de- su- yo- .
System(D) a+ to- a- to- / to- ra+ bu- ru- ni- / na+ ru- / ka- noo+ see+ ga+ / ta- ka+ i- de- su- yo- .

Sample 4: "「今世紀に入ってから1番楽しみな食事会」だと期待を寄せた。" ("They expressed their anticipation, calling it 'the dinner party they had most looked forward to this century.'")
System | Simplified output
Reference(Recording) -
Reference(TTS) ko- N+ see+ ki- ni- / ha+ i- Q- te- ka- ra- # i- chi+ ba+ N+ / ta- no+ shi+ mi- na- / shyo- ku+ ji+ ka+ i+ da+ to- # ki- ta+ i+ o+ / yo- se+ ta+ .
System(A) ko- N+ see+ ki- ni- / ha+ i- Q- te- ka- ra- # i- chi+ ba+ N+ / ta- no+ shi+ mi- na- / shyo- ku+ ji+ ka+ i- da- to- / ki- ta+ i+ o+ / yo- se+ ta+ .
System(B) ko+ N- see- ki- ni- / ha+ i- Q- te- ka- ra- # i+ chi- ba- N- / ta- no+ shi+ mi+ na+ / shyo- ku+ ji+ ka- i- da- to- / ki- ta+ i+ o- / yo- se+ ta+ .
System(C) ko- N+ see+ ki- ni- / ha+ i- Q- te- ka- ra- # i- chi+ ba+ N+ / ta- no+ shi+ mi- na- / shyo- ku+ ji+ ka- i- da- to- / ki- ta+ i+ o+ / yo- se+ ta+ .
System(D) ko+ N- / see+ ki- ni- / ha+ i- Q- te- ka- ra- # i- chi+ ba+ N+ / ta- no+ shi+ mi- na- / shyo- ku+ ji+ ka- i- da- to- # ki- ta+ i+ o+ / yo- se+ ta+ .

Sample 5: "23日に心臓発作で倒れ、ロス市内の病院に入院していた女優のキャリー・フィッシャーさんが現地時間の27日朝、60歳で死去。" ("Actress Carrie Fisher, who collapsed from a heart attack on the 23rd and had been hospitalized in Los Angeles, died at age 60 on the morning of the 27th, local time.")
System | Simplified output
Reference(Recording) -
Reference(TTS) ni+ jyuu- / sa+ N- ni- chi- ni- # shi- N+ zoo+ ho+ Q- sa- de- / ta- o+ re- # ro- su+ shi+ na- i- no- / byoo- i+ N+ ni+ / nyuu- i+ N+ shi+ te+ i+ ta+ # jyo- yuu+ no+ / kya- rii+ fi+ Q- shyaa- sa- N- ga- # ge- N+ chi+ ji+ ka- N- no- / ni+ jyuu- / shi- chi+ ni+ chi+ / a+ sa- # ro- ku+ jyu+ Q- sa- i- de- / shi+ kyo- .
System(A) ni+ jyuu- / sa+ N- ni- chi- ni- # shi- N+ zoo+ ho+ Q- sa- de- / ta- o+ re- # ro- su+ shi+ na- i- no- / byoo- i+ N+ ni+ / nyuu- i+ N+ shi+ te+ i+ ta+ # jyo- yuu+ no+ # kya- rii+ fi+ Q- shyaa- sa- N- ga- # ge- N+ chi+ ji+ ka- N- no- / ni+ jyuu- / shi- chi+ ni+ chi+ / a+ sa- # ro- ku+ jyu+ Q- sa- i- de- / shi+ kyo- .
System(B) ni+ jyuu- / sa+ N- ni- chi- ni- # shi- N+ zoo+ ho+ Q- sa- de- / ta- o+ re- # ro- su+ shi+ na- i- no- / byoo- i+ N+ ni+ / nyuu- i+ N+ shi+ te+ i+ ta+ # jyo- yuu+ no+ # kya- rii+ fi+ Q- shyaa- sa- N- ga- # ge+ N- chi- ji- ka- N- no- / ni- jyuu+ / shi+ chi- ni- chi- / a- sa+ # ro+ ku- jyu- Q- sa- i- de- / shi- kyo+ .
System(C) ni+ jyuu- / sa+ N- ni- chi- ni- # shi- N+ zoo+ ho+ Q- sa- de- / ta- o+ re- # ro- su+ shi+ na- i- no- / byoo- i+ N+ ni+ / nyuu- i+ N+ shi+ te+ i+ ta+ # jyo- yuu+ no+ # kya- rii+ fi+ Q- shyaa- sa- N- ga- # ge- N+ chi+ ji+ ka- N- no- / ni+ jyuu- / shi- chi+ ni+ chi+ / a+ sa- # ro- ku+ jyu+ Q- sa- i- de- / shi+ kyo- .
System(D) ni+ jyuu- / sa+ N- ni- chi- ni- # shi- N+ zoo+ ho+ Q- sa- de- / ta- o+ re- # ro- su+ shi+ na- i- no- / byoo- i+ N+ ni+ / nyuu- i+ N+ shi+ te+ i+ ta+ # jyo- yuu+ no+ # kya- rii+ fi+ Q- shyaa- sa- N- ga- # ge- N+ chi+ ji+ ka- N- no- / ni+ jyuu- / shi- chi+ ni+ chi+ / a+ sa- # ro- ku+ jyu+ Q- sa- i- de- / shi+ kyo- .


States used for comparison

  • Recording : Recorded speech.
  • X : Speech synthesized with incorrect IP, AP, and AN information.
  • IP : Speech synthesized with correct IP information but incorrect AP and AN.
  • AP : Speech synthesized with correct AP information but incorrect IP and AN.
  • AN : Speech synthesized with correct AN information but incorrect IP and AP.
  • IP+AP : Speech synthesized with correct IP and AP information but incorrect AN.
  • IP+AN : Speech synthesized with correct IP and AN information but incorrect AP.
  • AP+AN : Speech synthesized with correct AP and AN information but incorrect IP.
  • IP+AP+AN : Speech synthesized with correct IP, AP, and AN information.
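
These eight synthesis states are exactly the subsets of {IP, AP, AN} whose labels are taken from the correct annotations, with the remaining labels wrong; a few lines of Python make the enumeration explicit.

    from itertools import combinations

    tasks = ["IP", "AP", "AN"]
    # Every subset of tasks given correct labels; the rest use wrong labels.
    for r in range(len(tasks) + 1):
        for correct in combinations(tasks, r):
            print("+".join(correct) if correct else "X")
    # Prints: X, IP, AP, AN, IP+AP, IP+AN, AP+AN, IP+AP+AN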

Audio samples (Japanese)
Sample 1: "最初にすることは早起き、次にするのが二度寝、その次が後悔。" ("The first thing to do is get up early; the next is to go back to sleep; and what comes after that is regret.")
State | Simplified output
Recording -
X sa- i+ shyo+ ni+ / su- ru+ / ko- to+ wa- # ha- ya+ o- ki- # tsu- gi+ ni- / su- ru+ no- ga- / ni+ do- / ne- # so- no+ / tsu- gi+ ga- / koo+ ka- i- .
IP sa- i+ shyo+ ni+ / su- ru+ / ko- to+ wa- # ha- ya+ o- ki- # tsu- gi+ ni- / su- ru+ no- ga- # ni+ do- / ne- # so- no+ / tsu- gi+ ga- # koo+ ka- i- .
AP sa- i+ shyo+ ni+ / su- ru+ / ko- to+ wa- # ha- ya+ o- ki- # tsu- gi+ ni- / su- ru+ no- ga- / ni+ do- ne- # so- no+ / tsu- gi+ ga- / koo+ ka- i- .
AN sa- i+ shyo+ ni+ / su- ru+ / ko- to+ wa- # ha- ya+ o- ki- # tsu- gi+ ni- / su- ru+ no- ga- / ni- do+ / ne+ # so- no+ / tsu- gi+ ga- / koo+ ka- i- .
IP+AP sa- i+ shyo+ ni+ / su- ru+ / ko- to+ wa- # ha- ya+ o- ki- # tsu- gi+ ni- / su- ru+ no- ga- # ni+ do- ne- # so- no+ / tsu- gi+ ga- # koo+ ka- i- .
IP+AN sa- i+ shyo+ ni+ / su- ru+ / ko- to+ wa- # ha- ya+ o- ki- # tsu- gi+ ni- / su- ru+ no- ga- # ni- do+ / ne+ # so- no+ / tsu- gi+ ga- # koo+ ka- i- .
AP+AN sa- i+ shyo+ ni+ / su- ru+ / ko- to+ wa- / ha- ya+ o- ki- # tsu- gi+ ni- / su- ru+ no- ga- / ni- do+ ne+ # so- no+ / tsu- gi+ ga- / koo+ ka- i- .
IP+AP+AN sa- i+ shyo+ ni+ / su- ru+ / ko- to+ wa- # ha- ya+ o- ki- # tsu- gi+ ni- / su- ru+ no- ga- # ni- do+ ne+ # so- no+ / tsu- gi+ ga- # koo+ ka- i- .

Sample 2: "メキシコ留学中の、エーケービーフォーティーエイト入山杏奈が一時帰国。" ("AKB48's Anna Iriyama, currently studying in Mexico, makes a brief return to Japan.")
State | Simplified output
Recording -
X me- ki+ shi+ ko+ _ ryuu+ ga+ ku+ chyuu+ no+ # ee- kee+ bii+ _ foo- tii- e- i- to- # i- ri+ ya+ ma+ / a+ N- na- ga- # i- chi+ ji- / ki- ko+ ku+ .
IP me- ki+ shi+ ko+ _ ryuu+ ga+ ku+ chyuu+ no+ # ee- kee+ bii+ _ foo- tii- e- i- to- # i- ri+ ya+ ma+ / a+ N- na- ga- / i- chi+ ji- / ki- ko+ ku+ .
AP me- ki+ shi+ ko+ / ryuu+ ga- ku- chyuu- no- # ee- kee+ bii+ / foo+ tii- e- i- to- # i- ri+ ya+ ma+ / a+ N- na- ga- # i- chi+ ji- / ki- ko+ ku+ .
AN me- ki+ shi+ ko+ _ ryuu- ga- ku- chyuu- no- # ee- kee+ bii+ _ foo+ tii+ e+ i+ to+ # i- ri+ ya+ ma+ / a+ N- na- ga- # i- chi+ ji- / ki- ko+ ku+ .
IP+AP me- ki+ shi+ ko+ / ryuu+ ga- ku- chyuu- no- # ee- kee+ bii+ / foo+ tii- e- i- to- # i- ri+ ya+ ma+ / a+ N- na- ga- / i- chi+ ji- / ki- ko+ ku+ .
IP+AN me- ki+ shi+ ko+ _ ryuu- ga- ku- chyuu- no- # ee- kee+ bii+ _ foo+ tii+ e+ i+ to+ # i- ri+ ya+ ma+ / a+ N- na- ga- / i- chi+ ji- / ki- ko+ ku+ .
AP+AN me- ki+ shi+ ko+ / ryuu- ga+ ku+ chyuu+ no+ # ee- kee+ bii+ / foo- tii+ e+ i- to- # i- ri+ ya+ ma+ / a+ N- na- ga- # i- chi+ ji- / ki- ko+ ku+ .
IP+AP+AN me- ki+ shi+ ko+ / ryuu- ga+ ku+ chyuu+ no+ # ee- kee+ bii+ / foo- tii+ e+ i- to- # i- ri+ ya+ ma+ / a+ N- na- ga- / i- chi+ ji- / ki- ko+ ku+ .

Sample 3: "私なんて、若い頃の貯金食いつぶし生活で、40代になってからはあんまり働いてないですもん。" ("As for me, I have been living off the savings from my younger days, and I have hardly worked since turning forty.")
State | Simplified output
Recording -
X wa- ta+ shi+ na+ N- te- # wa- ka+ i- / ko+ ro- no- / chyo- ki+ N+ # ku- i+ tsu+ bu+ shi- / see- ka+ tsu+ de+ # yo- N+ jyuu+ da- i- ni- / na+ Q- te- ka- ra- wa- # a- N+ ma+ ri+ / ha- ta+ ra+ i+ te+ na+ i+ de+ su+ mo+ N+ .
IP wa- ta+ shi+ na+ N- te- # wa- ka+ i- / ko+ ro- no- / chyo- ki+ N+ / ku- i+ tsu+ bu+ shi- / see- ka+ tsu+ de+ # yo- N+ jyuu+ da- i- ni- / na+ Q- te- ka- ra- wa- # a- N+ ma+ ri+ / ha- ta+ ra+ i+ te+ na+ i+ de+ su+ mo+ N+ .
AP wa- ta+ shi+ na+ N- te- # wa- ka+ i- / ko+ ro- no- / chyo- ki+ N+ # ku- i+ tsu+ bu+ shi- see- ka- tsu- de- # yo- N+ jyuu+ da- i- ni- / na+ Q- te- ka- ra- wa- # a- N+ ma+ ri+ / ha- ta+ ra+ i+ te+ na+ i+ de+ su+ mo+ N+ .
AN wa- ta+ shi+ na+ N- te- # wa- ka+ i- / ko+ ro- no- / chyo- ki+ N+ # ku- i+ tsu+ bu+ shi+ / see+ ka- tsu- de- # yo- N+ jyuu+ da- i- ni- / na+ Q- te- ka- ra- wa- # a- N+ ma+ ri+ / ha- ta+ ra+ i+ te+ na+ i- de- su- mo- N- .
IP+AP wa- ta+ shi+ na+ N- te- # wa- ka+ i- / ko+ ro- no- / chyo- ki+ N+ / ku- i+ tsu+ bu+ shi- see- ka- tsu- de- # yo- N+ jyuu+ da- i- ni- / na+ Q- te- ka- ra- wa- # a- N+ ma+ ri+ / ha- ta+ ra+ i+ te+ na+ i+ de+ su+ mo+ N+ .
IP+AN wa- ta+ shi+ na+ N- te- # wa- ka+ i- / ko+ ro- no- / chyo- ki+ N+ / ku- i+ tsu+ bu+ shi+ / see+ ka- tsu- de- # yo- N+ jyuu+ da- i- ni- / na+ Q- te- ka- ra- wa- # a- N+ ma+ ri+ / ha- ta+ ra+ i+ te+ na+ i- de- su- mo- N- .
AP+AN wa- ta+ shi+ na+ N- te- # wa- ka+ i- / ko+ ro- no- / chyo- ki+ N+ # ku- i+ tsu+ bu+ shi+ see+ ka- tsu- de- # yo- N+ jyuu+ da- i- ni- / na+ Q- te- ka- ra- wa- # a- N+ ma+ ri+ / ha- ta+ ra+ i+ te+ na+ i- de- su- mo- N- .
IP+AP+AN wa- ta+ shi+ na+ N- te- # wa- ka+ i- / ko+ ro- no- / chyo- ki+ N+ / ku- i+ tsu+ bu+ shi+ see+ ka- tsu- de- # yo- N+ jyuu+ da- i- ni- / na+ Q- te- ka- ra- wa- # a- N+ ma+ ri+ / ha- ta+ ra+ i+ te+ na+ i- de- su- mo- N- .

References

  • [1]: Y. Ren, C. Hu, X. Tan, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech 2: Fast and high-quality end-to-end text-to-speech," in Proc. ICLR, 2021 (arXiv).
  • [2]: R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in Proc. ICASSP, 2020, pp. 6199–6203 (arXiv).
  • [3]: R. Yamamoto, E. Song, M.-J. Hwang, and J.-M. Kim, “Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators,” in Proc. ICASSP, 2021, pp. 6039–6043 (arXiv).
  • [4]: M. Suzuki, R. Kuroiwa, K. Innami, S. Kobayashi, S. Shimizu, N. Minematsu, and K. Hirose, “Accent sandhi estimation of Tokyo dialect of Japanese using conditional random fields,” IEICE Trans., vol. E100-D, no. 4, pp. 655–661, 2017 (IEICE).

Acknowledgements

This work was supported by Clova Voice, NAVER Corp., Seongnam, Korea. The authors would like to thank Yuma Shirahata and Kosuke Futamata at LINE Corp., Tokyo, Japan, for their support.


  ¹ Recorded speech.
  ² Synthesized speech using human-annotated prosody information.