Abstract
This paper proposes a word clustering method based on a large tagged corpus. Motivated by the need to support a group of experts making decisions on a specific problem, it reviews several distribution-based word clustering methods from the Chinese and international literature, then presents the principle and implementation steps of the proposed algorithm. First, a few words are manually selected from a given class, and the corpus is searched for the modifiers of those words to build a modifier vector. Next, for each word, the frequency with which it co-occurs in the corpus with each modifier in the modifier vector is counted to form that word's feature vector. Finally, cluster analysis is applied to the feature vectors. An experiment in support of macroeconomic decision-making shows that the algorithm clusters words effectively.
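The three steps in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the tagged corpus has already been reduced to (modifier, head word) pairs, and it substitutes a simple single-link grouping over cosine similarity for the clustering step, since the abstract does not name the clustering algorithm used. All function names, the threshold, and the toy data are hypothetical.

```python
from collections import Counter
from math import sqrt

def modifier_vector(pairs, seeds):
    """Ordered list of distinct modifiers observed with any seed word."""
    seen = []
    for mod, head in pairs:
        if head in seeds and mod not in seen:
            seen.append(mod)
    return seen

def feature_vectors(pairs, modifiers, words):
    """Map each word to its co-occurrence counts with every modifier."""
    counts = Counter(pairs)  # (modifier, head) -> frequency
    return {w: [counts[(m, w)] for m in modifiers] for w in words}

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster(vectors, threshold=0.5):
    """Assumed clustering step: single-link merging above a similarity threshold."""
    clusters = [[w] for w in vectors]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(cosine(vectors[a], vectors[b]) >= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters

# Toy (modifier, head) pairs, invented purely for illustration.
pairs = [
    ("rising", "inflation"), ("rising", "unemployment"),
    ("high", "inflation"), ("high", "unemployment"),
    ("rising", "prices"), ("high", "prices"), ("rising", "prices"),
    ("green", "tree"), ("tall", "tree"),
]
seeds = {"inflation", "unemployment"}            # step 1: hand-picked class members
modifiers = modifier_vector(pairs, seeds)        # step 1: their modifiers
words = ["inflation", "unemployment", "prices", "tree"]
vectors = feature_vectors(pairs, modifiers, words)  # step 2: co-occurrence counts
clusters = cluster(vectors)                      # step 3: group similar vectors
print(modifiers, vectors, clusters)
```

In this toy run, "prices" shares modifiers with the seed words and is grouped with them, while "tree" has an all-zero feature vector and stays in its own cluster. With a real corpus, the raw counts would likely need normalization and a more robust clustering method.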
Source
《系统仿真学报》
CAS
CSCD
2003, No. 10, pp. 1439-1442 (4 pages)
Journal of System Simulation
Funding
Major Program of the National Natural Science Foundation of China (79990581)
Keywords
semantic
clustering
corpus
n-gram model
semantic similarity
semantic relatedness