Abstract
Most existing cross-modal video retrieval algorithms map heterogeneous data into a common space so that semantically similar items lie close together and dissimilar items lie far apart; that is, they only establish a global similarity relationship between data of different modalities. These methods ignore the rich semantic clues in the data, so the generated features have poor representational power. To solve this problem, we design a cross-modal video retrieval model based on multi-semantic clues. The model uses a multi-head self-attention mechanism to capture the frames that contribute most to the semantics of a video, selectively attending to the important information to obtain a global feature of the data. A bidirectional gated recurrent unit (GRU) captures the interaction features between contexts inside the multi-modal data, and local information in the video and text is mined by jointly encoding the subtle differences between local segments. Together, the global features, context-interaction features, and local features form the multi-semantic clues of the multi-modal data, which better exploit the semantic information and thereby improve retrieval. On this basis, we propose an improved triplet distance metric loss function that adopts a hard-negative mining method based on similarity ranking, improving the learning of cross-modal features. Experiments show that, compared with state-of-the-art methods, the proposed algorithm improves the text-to-video retrieval task by 11.1% on the MSR-VTT dataset, and improves the overall recall of text-to-video retrieval by 5.0% on the MSVD dataset.
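To make the global-feature branch concrete, here is a minimal sketch of multi-head self-attention applied over per-frame features, as the abstract describes. The frame-feature dimension (2048, e.g., from a CNN backbone), the embedding size, the head count, and the mean pooling are illustrative assumptions, not details from the paper.

import torch
import torch.nn as nn

class GlobalVideoEncoder(nn.Module):
    def __init__(self, frame_dim=2048, embed_dim=512, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(frame_dim, embed_dim)  # project per-frame CNN features
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, frames):
        # frames: (batch, num_frames, frame_dim) per-frame features
        x = self.proj(frames)
        # Self-attention lets each frame attend to every other frame, so
        # frames that matter for the clip's semantics get larger weights.
        attended, _ = self.attn(x, x, x)
        # Mean-pool the attended sequence into one global video feature.
        return attended.mean(dim=1)  # (batch, embed_dim)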
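The context-interaction features come from a bidirectional GRU run over each modality's sequence (frames for video, words for text). A minimal sketch follows, assuming mean pooling over the concatenated forward/backward hidden states; the paper's actual aggregation may differ.

import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, input_dim=512, hidden_dim=256):
        super().__init__()
        # Bidirectional GRU: the forward and backward passes expose each
        # position to its full left and right context in the sequence.
        self.bigru = nn.GRU(input_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, seq):
        # seq: (batch, seq_len, input_dim) frame or word embeddings
        states, _ = self.bigru(seq)   # (batch, seq_len, 2 * hidden_dim)
        return states.mean(dim=1)     # pooled context-interaction feature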
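The improved triplet loss mines hard negatives by ranking candidates by similarity. A common batch-wise realization of this idea (hardest-negative mining in the style of VSE++) is sketched below; the margin value, cosine similarity, and in-batch mining are assumptions rather than the authors' exact formulation.

import torch

def hard_negative_triplet_loss(video_emb, text_emb, margin=0.2):
    # Normalize so the dot product is cosine similarity (an assumption).
    v = torch.nn.functional.normalize(video_emb, dim=1)
    t = torch.nn.functional.normalize(text_emb, dim=1)
    sim = v @ t.t()                # (batch, batch) similarity matrix
    pos = sim.diag()               # similarities of ground-truth pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Ranking negatives by similarity and keeping the top one reduces to
    # a max over the non-matching entries of each row/column.
    neg_t = sim.masked_fill(mask, float('-inf')).max(dim=1).values
    neg_v = sim.masked_fill(mask, float('-inf')).max(dim=0).values
    loss_v = torch.clamp(margin + neg_t - pos, min=0)  # video-to-text
    loss_t = torch.clamp(margin + neg_v - pos, min=0)  # text-to-video
    return (loss_v + loss_t).mean()

Taking only the hardest negative per pair, rather than summing over all negatives, concentrates the gradient on the most confusable samples, which is the usual motivation for similarity-ranked mining.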
Authors
DING Luo, LI Yifan, YU Chenglong, LIU Yang, WANG Xuan, QI Shuhan (School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China; School of Digital Media, Shenzhen Institute of Information Technology, Shenzhen 518172, China; Peng Cheng Laboratory, Shenzhen 518055, China)
Source
Journal of Beijing University of Aeronautics and Astronautics (《北京航空航天大学学报》)
Indexed in: EI, CAS, CSCD, Peking University Core Journals
2021, No. 3, pp. 596-604 (9 pages)
Funding
National Natural Science Foundation of China (61902093); Natural Science Foundation of Guangdong Province (2020A1515010652).
Keywords
cross-modal video retrieval
multi-semantic clues
multi-head self-attention mechanism
distance metric loss function
multi-modal
About the Authors
DING Luo, male, M.S. candidate. Research interests: multi-modal retrieval and object detection. LI Yifan, male, Ph.D. candidate. Research interests: visual question answering and object recognition. Corresponding author: QI Shuhan, male, Ph.D., professor, M.S. supervisor. Research interests: computer vision, multimedia information retrieval, and machine game playing. E-mail: shuhanqi@cs.hitsz.edu.cn.