Abstract
Word clustering is an important foundational step in automatic language processing. Traditional statistical methods follow a greedy principle and commonly use the corpus likelihood function or perplexity as the evaluation criterion. Their main drawbacks are slow clustering, strong dependence of the result on initial values, and a tendency to get trapped in local optima. To address these problems, this paper proposes a clustering method based on a similarity measure and the covering method. The method has a low computational cost and clusters quickly. Moreover, by relying on the covering principle, it effectively reduces the influence of initial-point selection on the clustering result. Experiments show that the method achieves satisfactory results.
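The abstract does not give the algorithm's details, so the following is only an illustrative sketch of what similarity-based clustering with a covering step might look like, not the authors' method. It assumes each word is represented by a numeric context vector, uses cosine similarity as the similarity measure, and treats a word as "covered" by a center when their similarity reaches a hypothetical threshold. The names `cosine`, `cover_cluster`, and `threshold` are all assumptions introduced for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cover_cluster(vectors, threshold=0.8):
    """Greedy covering sketch (assumed procedure, not taken from the paper).

    vectors: dict mapping word -> context vector (list of floats).
    Repeatedly picks the uncovered word whose similarity neighbourhood
    contains the most uncovered words, emits that neighbourhood as one
    cluster, and removes it from the uncovered set.
    """
    uncovered = set(vectors)
    clusters = []
    while uncovered:
        candidates = sorted(uncovered)  # sorted for deterministic output
        best_cover = []
        for center in candidates:
            # A word always covers itself, so every cover is non-empty.
            cover = [
                w for w in candidates
                if w == center
                or cosine(vectors[center], vectors[w]) >= threshold
            ]
            if len(cover) > len(best_cover):
                best_cover = cover
        clusters.append(best_cover)
        uncovered -= set(best_cover)
    return clusters


if __name__ == "__main__":
    # Toy context vectors, made up purely for demonstration.
    toy = {
        "cat": [1.0, 0.0, 0.1],
        "dog": [0.9, 0.1, 0.0],
        "run": [0.0, 1.0, 0.0],
        "walk": [0.1, 0.9, 0.0],
    }
    for cluster in cover_cluster(toy, threshold=0.8):
        print(cluster)
```

Because each center is chosen by how many words it covers rather than at random, the outcome in this sketch does not depend on a random starting point, which is the kind of reduced sensitivity to initialization the abstract attributes to the covering principle.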
Source
Computer Applications and Software (《计算机应用与软件》)
CSCD
2010, No. 8, pp. 276-278 (3 pages)
Keywords
Word clustering
Likelihood function
Covering method
About the author
Wang Duo (王舵), master's degree. Main research areas: computer-based detection technology and embedded systems.