
Cross-modal video retrieval algorithm based on multi-semantic clues
Abstract Most existing cross-modal video retrieval algorithms map heterogeneous data into a common space so that semantically similar items lie close together and semantically dissimilar items lie far apart; that is, they model only the global similarity between data of different modalities. These methods ignore the rich semantic clues in the data, which weakens the representational power of the generated features. To address this problem, a cross-modal video retrieval model based on multi-semantic clues is proposed. A multi-head self-attention mechanism identifies the frames that matter most to the semantics within the video modality and selectively attends to this important information to obtain the global features of the data. A bidirectional gated recurrent unit (GRU) captures the interaction features between contexts within the multimodal data. Local information in the video and text data is mined by jointly encoding the subtle differences between local data. The global features, contextual interaction features, and local features together form the multi-semantic clues of the multimodal data, which expose more of the semantic information in the data and thereby improve retrieval. In addition, an improved triplet distance-metric loss function is proposed; it uses hard negative mining based on similarity ranking to improve the learning of cross-modal features. Experiments on the MSR-VTT dataset show that the proposed algorithm improves the text-to-video retrieval task by 11.1% compared with state-of-the-art methods. Experiments on the MSVD dataset show that it improves total recall on the text-to-video retrieval task by 5.0% compared with state-of-the-art methods.
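The abstract names a triplet loss with similarity-ranked hard negative mining but does not give its formula. The sketch below is only a generic illustration of that idea, not the authors' improved loss: it assumes cosine similarity, a bidirectional hinge (text-to-video and video-to-text), and keeps the `top_k` most similar negatives per anchor. The function names, `margin`, and `top_k` are hypothetical choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hard_negative_triplet_loss(text_embs, video_embs, margin=0.2, top_k=1):
    """Generic max-margin triplet loss with ranked hard negatives.

    text_embs[i] and video_embs[i] are assumed to be a matched pair.
    For every anchor, the non-matching candidates are sorted by
    similarity and only the top_k most similar (hardest) ones are
    penalised. margin and top_k are illustrative values, not values
    reported in the paper.
    """
    n = len(text_embs)
    # Full text-by-video similarity matrix for the batch.
    sim = [[cosine(t, v) for v in video_embs] for t in text_embs]
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Text i against all non-matching videos, hardest first.
        t2v = sorted((sim[i][j] for j in range(n) if j != i), reverse=True)
        # Video i against all non-matching texts, hardest first.
        v2t = sorted((sim[j][i] for j in range(n) if j != i), reverse=True)
        for neg in t2v[:top_k] + v2t[:top_k]:
            # Hinge: penalise negatives that come within `margin` of the positive.
            total += max(0.0, margin + neg - pos)
    return total / n


# Toy batch: two texts and two videos embedded in a 2-D space.
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_videos = [[1.0, 0.0], [0.0, 1.0]]
swapped_videos = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = hard_negative_triplet_loss(aligned_texts, aligned_videos)
loss_swapped = hard_negative_triplet_loss(aligned_texts, swapped_videos)
```

In this toy batch the correctly aligned pairs incur no penalty, while the swapped pairing is penalised in both retrieval directions. Restricting the sum to the highest-ranked negatives concentrates the training signal on the candidates most likely to be confused with the true match.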
Authors DING Luo; LI Yifan; YU Chenglong; LIU Yang; WANG Xuan; QI Shuhan (School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China; School of Digital Media, Shenzhen Institute of Information Technology, Shenzhen 518172, China; Peng Cheng Laboratory, Shenzhen 518055, China)
Source Journal of Beijing University of Aeronautics and Astronautics, 2021, No. 3, pp. 596-604 (9 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding National Natural Science Foundation of China (61902093); Natural Science Foundation of Guangdong Province (2020A1515010652).
Keywords cross-modal video retrieval; multi-semantic clues; multi-head attention mechanism; distance metric loss function; multi-modal
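About the authors DING Luo, male, master's student; research interests: multimodal retrieval and object detection. LI Yifan, male, doctoral student; research interests: visual question answering and object recognition. Corresponding author: QI Shuhan, male, Ph.D., professor, master's supervisor; research interests: computer vision, multimedia information retrieval, and machine game playing. E-mail: shuhanqi@cs.hitsz.edu.cn.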
