Abstract
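Human-object interaction (HOI) detection plays a vital role in understanding complex scenes. Most current methods map parametric interaction queries directly to a set of HOI predictions in a one-stage manner, which leaves the rich interaction structure insufficiently explored and exploited. Multimodal data can supply information along more dimensions and so support a more comprehensive understanding of the interactions between humans and objects. To this end, a Transformer-style HOI detector is designed that retrieves contrastive language-image pre-training (CLIP) knowledge in a query-based manner, then generates interaction proposals, and converts the non-parametric interaction proposals into HOI predictions through a structure-aware network. This paper transfers CLIP knowledge to HOI detection and improves prediction accuracy by additionally encoding the holistic semantic structure and the local spatial structure. Experimental results show that the proposed model reaches an accuracy of 64.83% on the public V-COCO dataset and 28.78% on the HICO-DET dataset, outperforming existing HOI detection algorithms and demonstrating the effectiveness of the method.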
Human-object interaction (HOI) detection plays a crucial role in understanding complex scenes. Recently, contrastive language-image pre-training (CLIP) has shown great potential in providing prior knowledge about interactions to HOI detectors through knowledge extraction. However, such methods typically rely on large-scale training data, and most of them directly map parametric interaction queries to a set of HOI predictions in a one-stage manner. As a result, the rich interaction structure is not sufficiently explored or exploited. Multimodal data allow information to be extracted along more dimensions and offer a more comprehensive understanding of the interactions between humans and objects. In this study, we design a Transformer-style HOI detector. The scheme first retrieves CLIP knowledge in a query-based manner, then generates interaction proposals, and finally converts the non-parametric interaction proposals into HOI predictions through a structure-aware network. The structure-aware network improves prediction accuracy by additionally encoding the holistic semantic structure and the local spatial structure. The accuracy of the model reaches 64.83% on the public V-COCO dataset and 28.78% on HICO-DET. Compared with existing HOI detection algorithms, the proposed algorithm exhibits superior performance, demonstrating its effectiveness.
Authors
CHEN Yan (陈妍)
GAO YongBin (高永彬)
CHEN Yan; GAO YongBin (School of Electronic and Electrical Engineering, Shanghai University of Engineering Sciences, Shanghai 201620, China)
Source
Journal of Beijing University of Chemical Technology (Natural Science Edition) (《北京化工大学学报(自然科学版)》)
Peking University Core Journal list (北大核心)
2025, No. 1, pp. 113-121 (9 pages)
Funding
National Natural Science Foundation of China (61802253)
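Shanghai Local Capacity Building Project (21010501500)
Shanghai "Science and Technology Innovation Action Plan" Social Development Science and Technology Research Project (21DZ1204900)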
Keywords
human-object interaction (HOI) detection
computer vision
deep learning
object detection
visual relationship
About the authors
First author: CHEN Yan, female, born in 1999, master's student. Corresponding author: GAO YongBin, E-mail: gaoyongbin@sues.edu.cn.