Abstract
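Keyphrases provide an overview of a document's content and are used in many data-mining applications. In current keyphrase extraction algorithms, we find that the gap between the lexical level (the words that stand for a meaning) and the conceptual level (the meaning itself) makes extraction inaccurate: words with different surface forms may share the same meaning, while the same word form can carry different meanings in different contexts. To address this problem, this paper proposes a method that uses word senses in place of words and takes the semantic information of candidate keyphrases into account to improve extraction performance. Unlike existing keyphrase extraction methods, this method first applies a word sense disambiguation algorithm to obtain the sense of each candidate from its context; then, in the subsequent term conflation, feature extraction, and evaluation steps, the semantic relatedness between candidate senses is used to improve performance. For evaluation, we compare against the well-known Kea system using a more effective, semantics-based evaluation method. The cross-domain experiments show that extraction performance improves substantially once semantic information is taken into account. In the within-domain experiments, the performance of our algorithm is close to that of Kea++. Our algorithm is not restricted to a particular domain and therefore has better application prospects.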
Keyphrases provide semantic metadata that gives an overview of the content of a document; they are used in many text-mining applications. In the process of keyphrase generation, we notice that the distinction between the lexical level (the term for a meaning) and the conceptual level (the meaning itself) can result in inaccuracy. To solve this problem, this paper proposes a new method that improves automatic keyphrase extraction by using semantic information about candidate keyphrases. In contrast to current methods, our keyphrase extraction method outputs a set of senses instead of a set of terms by applying word sense disambiguation, since each sense has exactly one meaning. Semantic relatedness between the senses of candidate keyphrases is taken into consideration in the stages of term conflation, feature calculation, and evaluation. We evaluate our semantically improved method against the well-known Kea system using a more effective, semantically enhanced evaluation method. The inter-domain experiment shows that the quality of keyphrase extraction can be improved significantly when semantic information is exploited. The intra-domain experiment shows that our method is competitive with the Kea++ algorithm and is not domain-specific.
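The idea of counting senses rather than terms can be illustrated with a small sketch. The abstract does not say which disambiguation algorithm or sense inventory the paper uses, so the code below substitutes a simplified Lesk-style gloss overlap and a hand-made toy inventory; `SENSES`, `disambiguate`, `sense_counts`, and the `window` parameter are all hypothetical names introduced only for illustration.

```python
from collections import Counter

# Hypothetical toy sense inventory: term -> {sense_id: gloss words}.
# A real system would take senses from a thesaurus or lexical database.
SENSES = {
    "bank": {
        "bank.finance": {"money", "loan", "deposit", "account"},
        "bank.river": {"river", "water", "shore", "slope"},
    },
    "shore": {
        "bank.river": {"river", "water", "shore", "sea"},
    },
    "loan": {
        "loan.finance": {"money", "borrow", "interest", "bank"},
    },
}


def disambiguate(term, context):
    """Return the sense whose gloss overlaps most with the context words.

    This is a simplified Lesk-style heuristic, used here as a stand-in
    for whatever disambiguation method the paper actually applies.
    """
    senses = SENSES.get(term)
    if not senses:
        return None
    context_words = set(context)
    # sorted() keeps tie-breaking deterministic (alphabetical by sense id).
    return max(sorted(senses), key=lambda s: len(senses[s] & context_words))


def sense_counts(tokens, window=3):
    """Count candidate senses instead of surface terms.

    Terms that resolve to the same sense id are conflated into one entry.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        sense = disambiguate(tok, context)
        if sense is not None:
            counts[sense] += 1
    return counts


if __name__ == "__main__":
    text = "the river bank and the shore had water".split()
    print(sense_counts(text))
```

In this toy input, "bank" and "shore" resolve to the same sense id and are counted together, which is the kind of conflation a term-based extractor would miss. Feature calculation and relatedness-aware evaluation are not modeled here.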
Source
《计算机科学》
CSCD
PKU Core Journal (北大核心)
2008, Issue 6, pp. 148-151 (4 pages)
Computer Science
Funding
Supported by the National Natural Science Foundation of China (Grant No. 60675015)
Keywords
Keyphrase extraction
Semantic relatedness
Word sense disambiguation
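About the authors
Fang Jun (方俊): doctoral student; research interests include the Semantic Web and data mining.
Guo Lei (郭雷): doctoral supervisor; research interests include neural networks, pattern recognition, and knowledge management.
Wang Xiaodong (王晓东): doctoral student; research interests include the Semantic Web and intelligent retrieval.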