A Fast Automatic Word Clustering Algorithm (一种快速词自动聚类算法) Cited by: 3

A NEW ALGORITHM OF WORDS AUTOMATIC CLUSTERING
Abstract: Word clustering is a fundamental step in automatic language processing. Traditional statistical methods follow a greedy principle and usually take the likelihood function or the perplexity of the corpus as their evaluation criterion. Their main drawbacks are slow clustering, a strong dependence of the result on the initial values, and a tendency to fall into local optima. To address these problems, the paper proposes a clustering method based on a similarity measure and a covering method. The method requires little computation and clusters quickly. In addition, the covering principle effectively reduces the influence of the choice of initial points on the clustering. Experiments show that the results are satisfactory.
Source: Computer Applications and Software (《计算机应用与软件》), CSCD, 2010, No. 8, pp. 276-278 (3 pages)
Keywords: word clustering; likelihood function; covering method
About the author: Wang Duo (王舵), master's degree. Main research areas: computer-based detection technology, embedded systems.
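
The abstract names the two ingredients of the method, a similarity measure and a covering step, but gives no formulas. The following is only a minimal sketch of how such a scheme could look, not the paper's algorithm. Everything specific in it is an assumption made for illustration: the co-occurrence context vectors, cosine similarity as the measure, the `threshold` parameter, the rule of choosing as center the uncovered word with the most uncovered neighbours, and the function names `context_vectors`, `cosine` and `cover_cluster`.

```python
import math
from collections import Counter


def context_vectors(tokens, window=1):
    """Count, for each word, the words seen within `window` positions of it."""
    vectors = {}
    for i, word in enumerate(tokens):
        ctx = vectors.setdefault(word, Counter())
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                ctx[tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (assumed measure)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cover_cluster(vectors, threshold=0.5):
    """Greedy covering: each cluster is the set of still-uncovered words
    whose similarity to a chosen center is at least `threshold`."""
    words = sorted(vectors)
    neighbours = {
        w: {x for x in words if cosine(vectors[w], vectors[x]) >= threshold}
        for w in words
    }
    uncovered = set(words)
    clusters = []
    while uncovered:
        # Assumed rule: pick the center that covers the most uncovered words,
        # so the outcome does not depend on an arbitrary starting point.
        center = max(sorted(uncovered),
                     key=lambda w: len(neighbours[w] & uncovered))
        members = (neighbours[center] & uncovered) | {center}
        clusters.append(sorted(members))
        uncovered -= members
    return clusters


if __name__ == "__main__":
    toy = "the cat sat the dog sat the cat ran the dog ran".split()
    for cluster in cover_cluster(context_vectors(toy), threshold=0.9):
        print(cluster)
```

In this sketch the pairwise similarities are computed once, and each covering step only intersects precomputed neighbour sets. No likelihood or perplexity is re-evaluated after every merge, which is the kind of saving the abstract attributes to the similarity-based approach.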

References (4)

  • 1 Agrawal R, Gehrke J, Gunopulos D, et al. Automatic subspace clustering of high dimensional data for data mining applications [C]//Proceedings of the ACM SIGMOD International Conference on Management of Data. Seattle, 1998: 94-105.
  • 2 Li Li, Feng Liu, Wu Chou. An Information Theoretic Approach for Using Word Cluster Information in Natural Language Call Routing [C]//EUROSPEECH-2003, 2003: 2829-2832.
  • 3 Yuan Lichi (袁里驰), Zhong Yixin (钟义信). A similarity-based word clustering algorithm [J]. Microelectronics & Computer (微电子学与计算机), 2005, 22(8): 93-95. Cited by: 4
  • 4 Kong Dechang (孔德昌), Liu Rong (刘蓉). A new probabilistic clustering algorithm [J]. Computer Applications and Software (计算机应用与软件), 2007, 24(11): 180-182. Cited by: 2

Co-citing documents: 4

Citing documents: 3

Second-level citing documents: 8