

Method for Transforming Whisper to Normal Speech with Feature Fusion
Abstract In neural-network-based reconstruction of normal speech from whispered speech, the spectral envelope of the whisper is often used to estimate the fundamental frequency (F0) of the normal speech. Such algorithms fall short in F0 prediction accuracy, the synthesized speech clearly lacks naturalness, and pitch distortion sometimes occurs. This paper proposes an acoustic feature fusion method that predicts the F0 of normal speech frame by frame with a bidirectional long short-term memory (BLSTM) deep network. First, the STRAIGHT model and related code are used to preprocess the whispered and normal speech corpora, extracting the Mel-scale frequency cepstral coefficients (MFCC), prosody, and spectral envelope of the whispered speech, and the F0 and spectral envelope of the normal speech. Then, BLSTM deep networks are used to establish a mapping between the spectral envelopes of whispered and normal speech, and a mapping from the MFCC, prosody, and spectral envelope features of whispered speech to the F0 of normal speech. Finally, the F0 and spectral envelope of the corresponding normal speech are obtained from the MFCC, prosody, and spectral envelope features of the whispered speech, and the normal speech is synthesized with the STRAIGHT model. Experimental results show that, compared with estimating F0 from the spectral envelope alone, the fused prosody and MFCC features are a good complement to the F0 features: pitch distortion is eliminated, and the converted speech is closer to normal speech in prosody.
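The per-frame feature fusion the abstract describes can be sketched in a few lines of NumPy. The dimensions here (13 MFCCs, 3 prosodic features, a 513-bin spectral envelope) are illustrative assumptions, not values from the paper, and the random arrays stand in for features a real STRAIGHT analysis would produce:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100  # number of analysis frames

# Stand-ins for per-frame features extracted from the whispered speech
# (dimensions are illustrative, not the paper's actual configuration).
mfcc = rng.standard_normal((T, 13))       # Mel-scale frequency cepstral coefficients
prosody = rng.standard_normal((T, 3))     # e.g. frame energy and its deltas
envelope = rng.standard_normal((T, 513))  # STRAIGHT-style spectral envelope bins

def zscore(x):
    """Z-score normalize each dimension so no feature stream dominates the fusion."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

# Feature fusion: concatenate the normalized streams frame by frame.
fused = np.concatenate([zscore(mfcc), zscore(prosody), zscore(envelope)], axis=1)
print(fused.shape)  # (100, 529): one 529-dim fused vector per frame
```

Each frame's fused vector then serves as the network input from which the normal-speech F0 for that frame is regressed.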
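The frame-wise F0 mapping is learned with a BLSTM. A minimal NumPy sketch of a bidirectional LSTM forward pass (random, untrained weights; all sizes are illustrative assumptions, not the paper's configuration) shows how forward and backward hidden states are concatenated so each frame sees both past and future context before the per-frame F0 regression:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_pass(x, W, U, b, reverse=False):
    """Run one LSTM direction over x of shape (T, d_in); return hidden states (T, d_h).
    Gate weights are stacked as [input, forget, output, cell] along the first axis."""
    T = x.shape[0]
    d_h = U.shape[1]
    h = np.zeros(d_h)
    c = np.zeros(d_h)
    out = np.zeros((T, d_h))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        z = W @ x[t] + U @ h + b
        i = sigmoid(z[:d_h])             # input gate
        f = sigmoid(z[d_h:2 * d_h])      # forget gate
        o = sigmoid(z[2 * d_h:3 * d_h])  # output gate
        g = np.tanh(z[3 * d_h:])         # candidate cell state
        c = f * c + i * g
        h = o * np.tanh(c)
        out[t] = h
    return out

rng = np.random.default_rng(0)
T, d_in, d_h = 6, 529, 8  # frames, fused-feature dim, hidden size (illustrative)
x = rng.standard_normal((T, d_in))  # fused MFCC + prosody + envelope features

def init(d_in, d_h):
    """Random gate weights W (input), U (recurrent), and bias b for one direction."""
    return (0.1 * rng.standard_normal((4 * d_h, d_in)),
            0.1 * rng.standard_normal((4 * d_h, d_h)),
            np.zeros(4 * d_h))

h_fwd = lstm_pass(x, *init(d_in, d_h))                # left-to-right context
h_bwd = lstm_pass(x, *init(d_in, d_h), reverse=True)  # right-to-left context
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)         # (T, 2*d_h) bidirectional states
f0 = h_bi @ rng.standard_normal(2 * d_h)              # one F0 regression value per frame
print(f0.shape)  # (6,)
```

In practice the weights would be trained on paired whisper/normal-speech frames; the bidirectional concatenation is what lets the F0 prediction at each frame draw on the whole utterance rather than only preceding frames.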
Authors PANG Cong; LIAN Hailun; ZHOU Jian; WANG Huabin; TAO Liang (Key Laboratory of Computational Intelligence and Signal Processing, Ministry of Education, Anhui University, Hefei 230039, China)
Source Journal of Nanjing University of Aeronautics & Astronautics, 2020, No. 5, pp. 777-782 (6 pages); indexed in EI, CAS, CSCD, and the Peking University core journal list
Funding National Natural Science Foundation of China (61301295); Natural Science Foundation of Anhui Province (1708085MF151); Natural Science Foundation of Anhui Higher Education Institutions (KJ2018A0018); Anhui University Research Training Program (J10118520444)
Keywords voice conversion; feature fusion; prosodic model; STRAIGHT model; bi-long short-term memory
Corresponding author ZHOU Jian, male, associate professor; E-mail: jzhou@ahu.edu.cn