2 articles found
VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning (cited by: 3)
1
Authors: WEI Tingting, YUAN Weilin, LUO Junren, ZHANG Wanpeng, LU Lina. Journal of Systems Engineering and Electronics (SCIE, EI, CSCD), 2023, No. 1, pp. 9-18 (10 pages)
In the field of satellite imagery, remote sensing image captioning (RSIC) is a hot topic that faces two challenges: overfitting and the difficulty of aligning images with text. To address these issues, this paper proposes a vision-language aligning paradigm for RSIC to jointly represent vision and language. First, a new RSIC dataset, DIOR-Captions, is built by augmenting the object detection in optical remote sensing images (DIOR) dataset with manually annotated Chinese and English content. Second, a Vision-Language aligning model with Cross-modal Attention (VLCA) is presented to generate accurate and abundant bilingual descriptions for remote sensing images. Third, a cross-modal learning network is introduced to address the problem of visual-lingual alignment. Notably, VLCA is also applied to end-to-end Chinese caption generation by using a pre-trained Chinese language model. Experiments are carried out against various baselines to validate VLCA on the proposed dataset. The results demonstrate that the proposed algorithm produces more descriptive and informative captions than existing algorithms.
Keywords: remote sensing image captioning (RSIC); vision-language representation; remote sensing image caption dataset; attention mechanism
A deep dense captioning framework with joint localization and contextual reasoning
2
Authors: KONG Rui, XIE Wei. Journal of Central South University (SCIE, EI, CAS, CSCD), 2021, No. 9, pp. 2801-2813 (13 pages)
Dense captioning aims to simultaneously localize and describe regions-of-interest (RoIs) in images in natural language. Specifically, we identify three key problems: 1) dense and highly overlapping RoIs, which make accurate localization of each target region challenging; 2) visually ambiguous target regions that are hard to recognize by appearance alone; 3) an extremely deep image representation, which is of central importance for visual recognition. To tackle these three challenges, we propose a novel end-to-end dense captioning framework consisting of a joint localization module, a contextual reasoning module, and a deep convolutional neural network (CNN). We also evaluate five deep CNN structures to explore the benefits of each. Extensive experiments on the Visual Genome (VG) dataset demonstrate the effectiveness of our approach, which compares favorably with state-of-the-art methods.
Keywords: dense captioning; joint localization; contextual reasoning; deep convolutional neural network