Abstract
Better methods for measuring the quality of indexing terms can effectively improve the retrieval efficiency of information retrieval (IR) systems, but the inherent attributes of a term are easily affected by document length, which makes it difficult to measure term quality comprehensively. To address this, this paper starts from the intrinsic discriminability of terms and, drawing on the basic idea of the bag-of-words model, proposes the theory of term discriminative capacity (TDC) together with three different calculation methods. As experimental data, 900 records, each containing 4 bibliographic fields, were collected from three sub-databases of Web of Science in order to carry out large-scale calculation of TDC and to observe how the three algorithms differ in practice. The experimental analysis shows that the best method for calculating term discriminative capacity is TDC-T: the algorithm performs stably in several respects and is not affected by the document frequency (DF) value, so it can serve as a new indicator of term quality, denoted TDC. However, the A&HCI database selected for this study contains relatively few records, which may cause an imbalance in the calculation results for the other two fields.
Authors
王昊
唐慧慧
张海潮
张进
张紫玄
Wang Hao; Tang Huihui; Zhang Haichao; Zhang Jin; Zhang Zixuan (School of Information Management, Nanjing University, Nanjing 210023; Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023; School of Information Studies, University of Wisconsin-Milwaukee, Milwaukee 53201)
Source
《情报学报》
CSSCI
CSCD
PKU Core (北大核心)
2019, No. 10, pp. 1078-1091 (14 pages in total)
Journal of the China Society for Scientific and Technical Information
Funding
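National Natural Science Foundation of China, Young Scientists Fund project "Measurement and Analysis of TSD and TDC for Academic Resources" (71503121)
"Jiangsu Young Social Science Talents" talent development program
"Nanjing University Zhongying Young Scholars" talent development program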
Keywords
indexing term
bag-of-words model
term discriminative capacity
term space density
term quality evaluation
About the authors
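Wang Hao, male, born 1981, Ph.D., doctoral supervisor; main research interests include natural language processing, data mining applications, and ontology learning. Tang Huihui, female, born 1995, master's; main research interests include natural language processing; E-mail: mf1714055@smail.nju.edu.cn. Zhang Haichao, female, born 1995, master's; main research interests include natural language processing. Zhang Jin, male, born 1959, Ph.D., doctoral supervisor; main research interests include information retrieval algorithms and search engine evaluation. Zhang Zixuan, female, born 1994, master's; main research interests include natural language processing.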