
Word Clustering Based on Similarity (基于相似度的词聚类算法)

Cited by: 4
Abstract: Class-based statistical language models are an important way to address the data sparseness problem of statistical language models. Conventional statistical clustering methods follow a greedy principle and usually take the likelihood function or the perplexity of the corpus as the evaluation criterion. Their main drawbacks are slow clustering, strong dependence of the result on the initial choice, and a tendency to converge to a local optimum, so a global optimum is not guaranteed. This paper presents a novel definition of word similarity, derives from it a definition of word-set similarity, and proposes a bottom-up hierarchical clustering algorithm based on similarity. The method not only improves the clustering result but also allows a different similarity definition to be chosen for each cluster-based model, such as predictive clustering, conditional clustering, and combined clustering, which improves the effectiveness of the resulting clusters. Experiments show that the similarity-based word clustering algorithm outperforms the conventional greedy clustering method in both speed and performance.
Source: Microelectronics & Computer (《微电子学与计算机》), CSCD, Peking University Core Journal, 2005, No. 8, pp. 93-95 (3 pages)
Funding: National Natural Science Foundation of China (69982001); National "863 Program" of China (2001AA114201)
Keywords: word similarity; word clustering; statistical language model
About the author: Yuan Lichi (袁里驰), male, born 1973, PhD candidate. Research interests: natural language processing and network security.
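
The abstract describes a bottom-up hierarchical clustering of words driven by a word similarity and a word-set similarity, but it does not state the paper's actual definitions. The sketch below is therefore only an illustration under assumptions: word similarity is taken to be the cosine similarity of left/right neighbour count vectors, and word-set similarity is taken to be the average pairwise word similarity. The function names (`context_vectors`, `word_similarity`, `set_similarity`, `cluster_words`) are hypothetical and do not come from the paper.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def context_vectors(tokens):
    """Map each word to a Counter of its left and right neighbours."""
    vectors = defaultdict(Counter)
    for left, right in zip(tokens, tokens[1:]):
        vectors[left][("R", right)] += 1   # right-hand neighbour of `left`
        vectors[right][("L", left)] += 1   # left-hand neighbour of `right`
    return vectors


def word_similarity(u, v):
    """Cosine similarity of two sparse count vectors (assumed definition)."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def set_similarity(a, b, vectors):
    """Average pairwise word similarity between two word sets (assumed definition)."""
    total = sum(word_similarity(vectors[x], vectors[y]) for x in a for y in b)
    return total / (len(a) * len(b))


def cluster_words(tokens, num_clusters):
    """Bottom-up clustering: start with one cluster per word and repeatedly
    merge the two most similar clusters until num_clusters remain."""
    vectors = context_vectors(tokens)
    clusters = [frozenset([w]) for w in sorted(vectors)]
    while len(clusters) > num_clusters:
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: set_similarity(pair[0], pair[1], vectors))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters
```

On a toy corpus such as `"the cat sat the dog sat the cat ran the dog ran".split()`, calling `cluster_words(tokens, 3)` groups `cat` with `dog` and `sat` with `ran`, because those words share the same neighbours. The brute-force search over all cluster pairs is chosen for readability only; it recomputes every similarity at each merge and would be far too slow for a realistic vocabulary.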


