Abstract
Currently, in the reconstruction of normal speech from whispered speech based on neural networks, the spectral envelope of the whisper is often used to estimate the F0 characteristics of the normal speech. Such algorithms are limited in the accuracy of F0 prediction, the synthesized speech clearly lacks naturalness, and pitch distortion sometimes occurs. This paper proposes an acoustic feature fusion method that predicts the F0 of normal speech frame by frame using a bidirectional long short-term memory (BLSTM) deep network. First, the STRAIGHT model and related code are used to preprocess the whispered and normal speech corpora, extracting the Mel-scale frequency cepstral coefficients (MFCC), prosody, and spectral envelope of the whispered speech, and the F0 and spectral envelope of the normal speech. Second, BLSTM deep networks are used to establish a mapping between the spectral envelopes of whispered and normal speech, and a mapping from the MFCC, prosody, and spectral-envelope features of whispered speech to the F0 of normal speech. Finally, the F0 and spectral envelope of the corresponding normal speech are obtained from the MFCC, prosody, and spectral-envelope features of the whispered speech, and the normal speech is synthesized with the STRAIGHT model. Experimental results show that, compared with estimating F0 from the spectral envelope alone, the fused features introducing speech prosody and MFCC are a good complement to the F0 feature: they eliminate the pitch-distortion phenomenon, and the converted speech is closer to normal speech in prosody.
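The frame-level feature fusion described in the abstract can be sketched as simple per-frame concatenation of the three whisper feature streams before they are fed to the BLSTM F0 predictor. The sketch below is illustrative only; the function name, feature dimensions, and toy values are assumptions, not the authors' code.

```python
def fuse_frames(mfcc, prosody, envelope):
    """Concatenate per-frame feature vectors into one fused vector per frame.

    mfcc:     list of frames, each a list of MFCC coefficients
    prosody:  list of frames, each a list of prosodic features
    envelope: list of frames, each a list of spectral-envelope bins
    """
    # The three streams must be aligned frame by frame.
    assert len(mfcc) == len(prosody) == len(envelope), "frame counts must match"
    return [m + p + e for m, p, e in zip(mfcc, prosody, envelope)]

# Toy example: 2 frames, with 3 MFCCs, 1 prosodic value,
# and 4 spectral-envelope bins per frame.
mfcc = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
prosody = [[1.0], [1.1]]
envelope = [[9.0, 8.0, 7.0, 6.0], [5.0, 4.0, 3.0, 2.0]]
fused = fuse_frames(mfcc, prosody, envelope)
# Each fused frame now has 3 + 1 + 4 = 8 features.
```

Each fused frame vector would then serve as one time step of the BLSTM input sequence.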
Authors
PANG Cong; LIAN Hailun; ZHOU Jian; WANG Huabin; TAO Liang (Key Laboratory of Computational Intelligence and Signal Processing, Ministry of Education, Anhui University, Hefei 230039, China)
Source
Journal of Nanjing University of Aeronautics & Astronautics
Indexed in: EI, CAS, CSCD, Peking University Core Journal list (北大核心)
2020, No. 5, pp. 777-782 (6 pages)
Funding
National Natural Science Foundation of China (61301295)
Natural Science Foundation of Anhui Province (1708085MF151)
Natural Science Foundation of the Higher Education Institutions of Anhui Province (KJ2018A0018)
Anhui University Scientific Research Training Program (J10118520444)
Author Information
Corresponding author: ZHOU Jian, male, Associate Professor. E-mail: jzhou@ahu.edu.cn.