
Speech emotion recognition algorithm based on self-attention spatio-temporal features

Cited by: 5
Abstract: To address the low recognition rates that result when the key spatio-temporal dependencies in speech emotion recognition (SER) cannot be modeled, a speech emotion recognition algorithm based on self-attention spatio-temporal features is proposed. A bilinear convolutional neural network, a long short-term memory (LSTM) network, and a multi-head attention mechanism are used to automatically learn the best spatio-temporal representation of the speech signal. First, the log-Mel features of the speech signal, together with their first-order and second-order differences, are extracted and combined into a 3D log-Mel feature set that serves as the input to the convolutional neural network. Then, taking both spatial features and temporal dependencies into account, the outputs of bilinear pooling and a bidirectional LSTM network are fused into a spatio-temporal feature representation, and the multi-head attention mechanism is used to capture the most discriminative features. Finally, a softmax function performs classification. Experiments on the IEMOCAP and EMO-DB databases yield recognition rates of 63.12% and 87.09%, respectively, demonstrating the effectiveness of the method.
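Two steps of the pipeline described in the abstract can be made concrete: stacking the log-Mel spectrogram with its first- and second-order differences into the 3D log-Mel input, and weighting time frames with multi-head self-attention. The sketch below is a minimal NumPy illustration under stated assumptions, not the authors' code: the edge-padded finite difference stands in for the regression-based delta used by most speech toolkits, the attention omits the output projection, and all function names are hypothetical.

```python
import numpy as np

def delta(feat):
    """First-order time difference with edge padding so the output keeps
    the input's (frames, mel_bins) shape. A minimal stand-in for the
    regression-based delta computation common in speech toolkits."""
    padded = np.pad(feat, ((1, 0), (0, 0)), mode="edge")
    return np.diff(padded, axis=0)

def make_3d_logmel(log_mel):
    """Stack static log-Mel features with their first- and second-order
    differences into a (frames, mel_bins, 3) tensor: the '3D log-Mel'
    feature set described in the abstract."""
    d1 = delta(log_mel)
    return np.stack([log_mel, d1, delta(d1)], axis=-1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, n_heads):
    """Scaled dot-product self-attention over time frames, split into
    n_heads subspaces and re-concatenated (output projection omitted
    for brevity)."""
    d = x.shape[1]
    dh = d // n_heads
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    heads = []
    for h in range(n_heads):
        s = slice(h * dh, (h + 1) * dh)
        attn = softmax(q[:, s] @ k[:, s].T / np.sqrt(dh), axis=-1)
        heads.append(attn @ v[:, s])
    return np.concatenate(heads, axis=-1)

# Toy run: 100 frames x 40 mel bins of random "log-Mel" energies.
rng = np.random.default_rng(0)
log_mel = rng.standard_normal((100, 40))
feat3d = make_3d_logmel(log_mel)
print(feat3d.shape)  # (100, 40, 3)

# Attention over a 100-frame, 64-dim sequence with 4 heads.
seq = rng.standard_normal((100, 64))
w_q, w_k, w_v = (rng.standard_normal((64, 64)) * 0.1 for _ in range(3))
out = multi_head_self_attention(seq, w_q, w_k, w_v, n_heads=4)
print(out.shape)  # (100, 64)
```

In the paper's actual model the 3D tensor feeds a bilinear CNN and the attention operates on the fused CNN/BiLSTM features; the sketch only shows the shapes involved at the input and attention stages.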
Authors: XU Huanan (徐华南), ZHOU Xiaoyan (周晓彦), JIANG Wan (姜万), LI Dapeng (李大鹏) (School of Electronic and Information Engineering, Nanjing University of Information Science & Technology, Nanjing 210044, Jiangsu, China)
Source: Technical Acoustics (《声学技术》), CSCD, Peking University Core Journal, 2021, No. 6, pp. 807-814 (8 pages)
Funding: National Natural Science Foundation of China (6207022273).
Keywords: speech emotion recognition; 3D log-Mel; bilinear convolutional neural network; long short-term memory; multi-head attention
About the author: XU Huanan (b. 1995), female, from Nanjing, Jiangsu; master's student; research interests: natural language processing and speech emotion recognition. Corresponding author: ZHOU Xiaoyan, E-mail: 18326167806@163.com.

