Abstract
【Purpose/significance】Scientific data, one form in which research results are expressed, is often hidden in academic papers as informal citations. Automatically identifying data citation information in academic papers, and extracting data elements from it, offers a new approach to organizing scientific data elements.【Method/process】To raise the proportion of positive examples and thereby improve the identification of data citation sentences, corpora were built from the full text of bioinformatics papers using three sampling methods for imbalanced corpora: document structure recognition with data augmentation, random undersampling, and feature word filtering. Each corpus was then combined with five text classification models to build a data citation identification pipeline.【Result/conclusion】The study finds that identifying data citation sentences in academic papers is an effective step toward finer-grained organization of data elements; that document structure recognition and imbalanced-corpus sampling effectively improve the identification of data citation sentences; and that BERT-family deep learning models outperform traditional machine learning models in classifying data citation text.【Innovation/limitation】Identifying informal data citation sentences in academic papers brings a new perspective to the organization of data elements and is an efficient way to collect high-value data elements. However, because data citations in papers are non-standardized and sparse, classification precision still has room for improvement.
Authors
刘禹彤
刘茹
杨波
LIU Yutong; LIU Ru; YANG Bo (College of Information Management, Nanjing Agricultural University, Nanjing 210095, China)
Source
《情报科学》
Peking University Core Journal
2025, Issue 3, pp. 146-156 (11 pages)
Information Science
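Funding
National Social Science Fund of China project "Research on the Self-Organization Patterns and Quality Evaluation of Scientific Datasets" (18BTQ077).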
Keywords
data elements
data citation
text classification
deep learning
scientific data management
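About the authors
LIU Yutong (b. 1996), female, from Jilin City, Jilin Province, PhD candidate, whose research focuses on informetrics; LIU Ru (b. 1996), female, from Bengbu, Anhui Province, master's degree, whose research focuses on informetrics; YANG Bo (b. 1981), male, from Baoji, Shaanxi Province, professor and doctoral supervisor, whose research focuses on informetrics and scientific data management.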