Abstract
Class-based statistical language models are an important way to address data sparseness in statistical language modeling. Conventional statistical clustering methods are usually based on a greedy principle and commonly use the likelihood function or perplexity of the corpus as the evaluation criterion. Their main drawbacks are that clustering is slow, the initial choices strongly influence the final result, and they often converge to a local optimum, so a global optimum is not guaranteed. To address these problems, this paper presents a novel definition of word similarity, derives from it a definition of word set similarity, and proposes a bottom-up hierarchical clustering algorithm based on similarity. The method not only improves clustering quality but also allows a different similarity definition to be chosen for each cluster-based model, such as predictive clustering, conditional clustering, and combined clustering, thereby improving the effectiveness of the resulting clusters. Experiments show that the similarity-based word clustering algorithm outperforms the conventional greedy clustering method in both speed and performance.
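The abstract does not state the paper's similarity definitions, so the following is only a minimal sketch of the general idea of bottom-up clustering driven by word and word set similarity. Everything specific in it is an assumption made for illustration: the neighbour-count features in `context_vectors`, cosine similarity in `word_similarity`, average pairwise similarity in `set_similarity`, and all function names are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def context_vectors(tokens):
    """Map each word to counts of its left/right neighbours (assumed features)."""
    vectors = {}
    for i, word in enumerate(tokens):
        vec = vectors.setdefault(word, Counter())
        if i > 0:
            vec[("L", tokens[i - 1])] += 1
        if i + 1 < len(tokens):
            vec[("R", tokens[i + 1])] += 1
    return vectors


def word_similarity(u, v):
    """Cosine similarity of two sparse count vectors (assumed definition)."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def set_similarity(a, b, vectors):
    """Average pairwise word similarity between two word sets (assumed definition)."""
    total = sum(word_similarity(vectors[x], vectors[y]) for x in a for y in b)
    return total / (len(a) * len(b))


def cluster(tokens, num_classes):
    """Start with one class per word and merge the most similar pair
    of classes until only num_classes remain."""
    vectors = context_vectors(tokens)
    classes = [frozenset([w]) for w in sorted(vectors)]
    while len(classes) > num_classes:
        a, b = max(
            combinations(classes, 2),
            key=lambda pair: set_similarity(pair[0], pair[1], vectors),
        )
        classes = [c for c in classes if c not in (a, b)] + [a | b]
    return classes
```

In this sketch, swapping `word_similarity` for a different function is the only change needed to try another similarity definition, which loosely mirrors the abstract's point that different cluster-based models can use different definitions. The exhaustive pair search is quadratic per merge and is meant for small toy inputs only.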
Source
《微电子学与计算机》
CSCD
Peking University Core Journal
2005, Issue 8, pp. 93-95 (3 pages)
Microelectronics & Computer
Funding
National Natural Science Foundation of China (69982001)
National "863 Program" of China (2001AA114201)
Keywords
Word similarity, Word clustering, Statistical language model
About the author
Yuan Lichi (袁里驰), male, born 1973, doctoral student. Research interests: natural language processing and network security.