Abstract
To address the differences between the audio and video modalities, and the fact that depression datasets contain few samples while each sample is long, this paper proposes a depression recognition method that combines segmented spatio-temporal feature extraction based on squeeze-and-excitation attention with a multimodal aggregation network. The video and audio are divided into segments, and a two-stream feature extraction network with multi-level attention extracts facial information from the video and speech information from the audio. A multi-scale temporal aggregation module aggregates the segment-level features into complete (whole-sample) features. The segment-level and complete features of the two modalities are then fused across modalities to estimate the depression score. The method was validated on the AVEC2013 and AVEC2014 datasets and achieved excellent results.
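The abstract describes the pipeline only at a high level, so the following is a minimal sketch of three of its ideas (segmenting a long sequence, squeeze-and-excitation channel gating, and multi-scale temporal aggregation), not the authors' network. All function names, the toy weights `w1` and `w2`, the segment length, and the pooling scales are assumptions made for illustration; the two-stream backbones and the cross-modal fusion step are not shown.

```python
import math

def split_into_segments(frames, segment_len):
    """Split a long sequence into consecutive fixed-length segments.
    Assumption: a short tail that does not fill a segment is dropped."""
    n = len(frames) // segment_len
    return [frames[i * segment_len:(i + 1) * segment_len] for i in range(n)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[c] for v in vectors) / n for c in range(len(vectors[0]))]

def se_attention(features, w1, w2):
    """Squeeze-and-excitation gating over channels.
    features: T vectors of C channels; w1: R x C (reduce); w2: C x R (expand)."""
    # Squeeze: average each channel over time.
    squeezed = mean_vector(features)
    # Excitation: linear -> ReLU -> linear -> sigmoid, giving one gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Scale: reweight every time step by the per-channel gates.
    return [[f[c] * gates[c] for c in range(len(gates))] for f in features]

def multi_scale_aggregate(segment_feats, scales=(1, 2)):
    """Average-pool segment features over sliding windows of several sizes,
    average the windows of each scale, and concatenate the per-scale results.
    Assumption: every scale is <= the number of segments."""
    out = []
    for s in scales:
        windows = [mean_vector(segment_feats[i:i + s])
                   for i in range(len(segment_feats) - s + 1)]
        out.extend(mean_vector(windows))
    return out

# Toy input: 10 "frames" with 2 channels each (hypothetical data).
frames = [[float(t), 1.0] for t in range(10)]
segments = split_into_segments(frames, 4)   # 2 segments of 4 frames
w1 = [[0.5, 0.5]]                           # toy reduce weights (R = 1)
w2 = [[1.0], [-1.0]]                        # toy expand weights
segment_feats = [mean_vector(se_attention(seg, w1, w2)) for seg in segments]
video_feat = multi_scale_aggregate(segment_feats)  # 2 scales x 2 channels
```

In a real system the per-segment features would come from learned video and audio networks, and the weights would be trained rather than fixed. The sketch only shows how segment-level features can be gated per channel and then pooled at several temporal scales into one sample-level vector.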
Authors
SHI Shuo (师硕)
DU Chun-hui (杜春辉)
HAN Bao-ming (韩宝明)
GENG Yu-xiao (耿宇霄)
School of Artificial Intelligence, Hebei University of Technology, Tianjin 300401, China
Source
Computer Engineering and Design (《计算机工程与设计》)
Peking University Core Journal (北大核心)
2025, No. 5, pp. 1480-1486 (7 pages)
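Funding
National Natural Science Foundation of China (62276088, 62102129)
Science and Technology Research Fund for Higher Education Institutions of Hebei Province (QN2019207)
Natural Science Foundation of Hebei Province, General Program (F2019202464)
Beijing-Tianjin-Hebei Basic Research Cooperation Special Fund (J230040)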
Keywords
depression recognition
audio-visual
multi-modal
attention mechanism
spatio-temporal network
segment aggregation
multimodal fusion
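About the authors
SHI Shuo (born 1981), female, from Baoding, Hebei; Ph.D., associate professor, CCF member; research interests: affective computing, facial expression recognition, and person re-identification. DU Chun-hui (born 1998), male, from Zhangjiakou, Hebei; master's student; research interest: depression recognition. HAN Bao-ming (born 2000), male, from Shijiazhuang, Hebei; master's student; research interest: depression recognition. GENG Yu-xiao (born 1999), female, from Zhangjiakou, Hebei; master's student; research interest: affective computing. E-mail: shishuo@hebut.edu.cn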