Abstract
To address the low recognition rate in speech emotion recognition (SER) caused by the failure to model key spatio-temporal dependencies, a speech emotion recognition algorithm based on self-attention spatio-temporal features is proposed. A bilinear convolutional neural network, a long short-term memory network, and a multi-head attention mechanism are used to automatically learn the best spatio-temporal representation of the speech signal. First, the log-Mel features of the speech signal and their first-order and second-order differences are extracted and combined into a 3D log-Mel feature set, which serves as the input to the convolutional neural network (CNN). Then, taking both spatial features and temporal dependencies into account, the outputs of bilinear pooling and of a bidirectional long short-term memory network are fused to obtain a spatio-temporal feature representation, and the multi-head attention mechanism is used to capture highly discriminative features. Finally, the softmax function is used for classification. Experiments on the IEMOCAP and EMO-DB databases yield recognition rates of 63.12% and 87.09%, respectively, demonstrating the effectiveness of the method.
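The first step of the abstract (stacking log-Mel features with their first- and second-order differences into a 3D input) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `delta` and `stack_3d_log_mel`, the regression window `width=2`, the edge padding, and the 300 x 40 input shape are all assumptions made for illustration, and random numbers stand in for a real log-Mel spectrogram.

```python
import numpy as np


def delta(features: np.ndarray, width: int = 2) -> np.ndarray:
    """Regression-based difference along the time axis (axis 0).

    Assumed formula: d_t = sum_n n * (c[t+n] - c[t-n]) / (2 * sum_n n^2),
    with the first and last frames repeated at the edges.
    """
    num_frames = features.shape[0]
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + num_frames]
        behind = padded[width - n : width - n + num_frames]
        out += n * (ahead - behind)
    return out / denom


def stack_3d_log_mel(log_mel: np.ndarray) -> np.ndarray:
    """Stack static, first-order and second-order features as 3 channels."""
    d1 = delta(log_mel)   # first-order difference
    d2 = delta(d1)        # second-order difference
    return np.stack([log_mel, d1, d2], axis=-1)


# Hypothetical input: 300 frames x 40 Mel bands of placeholder values.
rng = np.random.default_rng(0)
log_mel = rng.standard_normal((300, 40))
features = stack_3d_log_mel(log_mel)
print(features.shape)  # (300, 40, 3)
```

The resulting array has one channel per feature type, which is the layout a 2D convolutional front end would typically consume; the bilinear pooling, bidirectional LSTM and multi-head attention stages described above are not covered by this sketch.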
Authors
徐华南
周晓彦
姜万
李大鹏
XU Huanan; ZHOU Xiaoyan; JIANG Wan; LI Dapeng (School of Electronic and Information Engineering, Nanjing University of Information Science & Technology, Nanjing 210044, Jiangsu, China)
Source
《声学技术》
CSCD
Peking University Core Journal (北大核心)
2021, No. 6, pp. 807-814 (8 pages)
Technical Acoustics
Funding
Supported by the National Natural Science Foundation of China (6207022273).
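About the authors
XU Huanan (born 1995), female, from Nanjing, Jiangsu Province; master's student; research interests: natural language processing and speech emotion recognition. Corresponding author: ZHOU Xiaoyan, E-mail: 18326167806@163.com.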