Abstract
Multimodal dialogue systems use Transformers, cross-attention mechanisms, and pre-trained models to fuse text, speech, and video modalities at different granularities and to extract cross-modal features. However, existing research ignores the fact that different modal features differ in their sensitivity to the classification task, which leads to excessive fusion and, in turn, information redundancy. To address the influence of the order in which modalities are fused on classification results, this paper proposes MDM-MSAM (multimodal dialogue model based on modality sensitive attention mechanism). The model consists of three parts: master-slave modality screening, dual-modal cross-modal fusion, and tri-modal cross-modal fusion. It determines the master and slave modalities, extracts cross-dual-modal features, and re-fuses them with the tri-modal fusion features to form modality-sensitive, hierarchical cross-multimodal features. On the MintRec and CMU-MOSI datasets, classification accuracy improves by 3.15% and 3.5%, respectively, over the current best-performing models. MDM-MSAM has been deployed in a flow-engine-based multi-turn dialogue system, where it achieves good results in application.
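The abstract gives only the outline of the three-part design, so the following is a minimal illustrative sketch rather than the MDM-MSAM implementation. It assumes that the master modality has already been chosen (for example by a prior screening step), that cross-attention uses no learned projections, and that the dual-modal and tri-modal features are combined by simple concatenation. The function names `cross_attention`, `mean_pool`, and `hierarchical_fusion` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, context):
    """Scaled dot-product cross-attention without learned projections
    (a simplification): each query vector attends over the context vectors."""
    d = len(query[0])
    out = []
    for q in query:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)])
    return out


def mean_pool(seq):
    """Average a sequence of vectors into a single vector."""
    n = len(seq)
    return [sum(v[j] for v in seq) / n for j in range(len(seq[0]))]


def hierarchical_fusion(features, master):
    """features: dict mapping modality name -> list of d-dimensional vectors.
    master: name of the master modality (assumed to be selected beforehand)."""
    slaves = [name for name in features if name != master]
    # Stage 1 (assumed form): one dual-modal feature per master/slave pair.
    dual = [mean_pool(cross_attention(features[master], features[s])) for s in slaves]
    # Stage 2 (assumed form): the master attends over tokens of all modalities.
    all_tokens = [v for name in features for v in features[name]]
    tri = mean_pool(cross_attention(features[master], all_tokens))
    # Stage 3 (assumed form): concatenate dual-modal and tri-modal features.
    fused = []
    for vec in dual + [tri]:
        fused.extend(vec)
    return fused


# Toy inputs: 2-dimensional token features for three modalities.
features = {
    "text": [[1.0, 0.0], [0.0, 1.0]],
    "audio": [[0.2, 0.1], [0.4, 0.3], [0.0, 0.5]],
    "video": [[0.5, -0.5]],
}
fused = hierarchical_fusion(features, master="text")
print(len(fused))
```

With two slave modalities and 2-dimensional inputs, the fused vector has (2 + 1) × 2 = 6 entries. A real model would add learned query/key/value projections and a classifier head on top of the fused feature.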
Authors
Du Wei (杜维)
Zhu Xiaoying (朱晓瑛)
Xu Fangmin (许方敏)
Zheng Jiansheng (郑建生)
Zhu Fuxi (朱福喜)
Gong Mingmin (龚鸣敏)
Li Ziyu (李紫玉)
Affiliations: School of Information Engineering, Wuhan College, Wuhan 430212, China; School of Cyberspace Security, Beijing University of Posts & Telecommunications, Beijing 100876, China; School of Information and Communication Engineering, Beijing University of Posts & Telecommunications, Beijing 100876, China; School of Electronic Information, Wuhan University, Wuhan 430072, China
Source
Application Research of Computers (《计算机应用研究》)
Peking University Core Journal (北大核心)
2025, No. 9, pp. 2590-2598 (9 pages)
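Funding
National Natural Science Foundation of China (42374013)
Beijing Natural Science Foundation (L234080)
Wuhan College Scientific Research Fund Annual Plan Project (JJA202304)
China University Industry-University-Research Innovation Fund, Tencent Science and Technology Innovation Education Special Project (2022TX007)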
Keywords
multimodal dialogue system
cross-modal features
sensitive differences
modality-sensitive attention mechanism
master-slave modality
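About the authors
Du Wei (born 1984), male, from Wuhan, Hubei; teaching assistant, M.S.; research interest: multimodal learning. Zhu Xiaoying (born 1982), female, from Wuhan, Hubei; Ph.D. candidate; research interest: information security. Xu Fangmin (born 1982), male, from Hunan; associate professor, master's supervisor, Ph.D.; research interest: big data analysis and applications. Zheng Jiansheng (born 1959), male, from Shouning, Fujian; professor, doctoral supervisor; research interest: communication and navigation technology. Zhu Fuxi (born 1960), male, from Wuhan, Hubei; professor, doctoral supervisor; research interests: artificial intelligence and distributed computing. Gong Mingmin (born 1977), female, from Hubei; associate professor, M.S.; research interest: artificial intelligence. Corresponding author: Li Ziyu (born 2002), female, from Suizhou, Hubei; research interest: software engineering (22202170516@whxy.edu).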