Funding: Project (2020A1515010718) supported by the Basic and Applied Basic Research Foundation of Guangdong Province, China.
Abstract: Dense captioning aims to simultaneously localize and describe regions of interest (RoIs) in images in natural language. Specifically, we identify three key problems: 1) dense and highly overlapping RoIs, which make accurate localization of each target region challenging; 2) visually ambiguous target regions that are hard to recognize from appearance alone; 3) an extremely deep image representation, which is of central importance for visual recognition. To tackle these three challenges, we propose a novel end-to-end dense captioning framework consisting of a joint localization module, a contextual reasoning module and a deep convolutional neural network (CNN). We also evaluate five deep CNN structures to explore the benefits of each. Extensive experiments on the Visual Genome (VG) dataset demonstrate the effectiveness of our approach, which compares favorably with state-of-the-art methods.
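The abstract gives no implementation details, but the first challenge it names, dense and highly overlapping RoIs, can be made concrete with a small IoU computation. This is a generic sketch, not the paper's joint localization module; the boxes and function names are illustrative:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap)
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two heavily overlapping region proposals that describe distinct targets:
a = (10, 10, 50, 50)
b = (20, 20, 60, 60)
print(iou(a, b))  # ~0.39: high enough that naive suppression may merge them
```

When many valid target regions overlap at such IoU levels, standard proposal filtering risks discarding genuine regions, which is why localization in dense captioning needs special handling.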
Abstract: To address the low accuracy of crowd counting caused by scale variation in congested scenes, a dense crowd counting model based on a multi-scale attention network (MANet) is proposed. A multi-column model is built to capture multi-scale features and promote the fusion of scale information; a dual attention module captures contextual dependencies to enrich the multi-scale feature maps; and dense connections reuse the multi-scale feature maps to generate high-quality density maps, which are then integrated to obtain the count. In addition, a new loss function is proposed that trains directly on point annotation maps, reducing the extra error introduced by generating density maps with Gaussian filtering. Experimental results on the public crowd datasets ShanghaiTech Part A/B, UCF-CC-50 and UCF-QNRF all reach state-of-the-art levels, showing that the network can effectively handle multi-scale targets in congested scenes and generate high-quality density maps.
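The counting-by-integration step described above (summing a density map to obtain the head count) can be sketched with NumPy. The Gaussian-splat construction below mirrors the conventional ground-truth generation whose error the paper's new point-annotation loss is designed to avoid; all names and parameter values are illustrative, not the paper's implementation:

```python
import numpy as np

def density_from_points(points, shape, sigma=4.0):
    """Render a density map by placing one normalized Gaussian per head annotation."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    density = np.zeros(shape, dtype=np.float64)
    for (px, py) in points:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
        density += g / g.sum()  # normalize so each person contributes unit mass
    return density

points = [(20, 30), (60, 40), (80, 80)]  # hypothetical head annotations
dmap = density_from_points(points, (100, 100))
count = dmap.sum()  # integrating the density map recovers the count
print(round(count))  # 3
```

Because each Gaussian is normalized to unit mass, the integral of the map equals the number of annotated people; a counting network predicts such a map and reports its sum at inference time.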