基于相似度的词聚类算法 (cited by: 4)

Word Clustering Based on Similarity


Abstract: Class-based statistical language models are an important approach to the data sparseness problem in statistical language modeling. Conventional statistical clustering methods are usually based on the greedy principle and are commonly evaluated by the likelihood function or the perplexity of the corpus. Their main drawbacks are that clustering is slow, the initial choices strongly influence the final result, and the search often converges to a local optimum, so a global optimum is not guaranteed. To address these problems, this paper presents a novel definition of word similarity, derives from it a definition of word-set similarity, and proposes a bottom-up hierarchical clustering algorithm based on similarity. The method not only improves clustering quality but also allows a different similarity definition to be chosen for each cluster-based model, such as predictive clustering, conditional clustering, and combined clustering, which makes the resulting clusters more effective in use. Experiments show that the similarity-based word clustering algorithm outperforms the conventional greedy clustering method in both speed and performance.
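The abstract names a bottom-up hierarchical clustering driven by a word similarity and a word-set similarity, but this record does not give the paper's actual definitions. The following is only a minimal sketch under stated assumptions: word similarity is taken to be the cosine of left/right neighbour count vectors, and word-set similarity is taken to be the average pairwise word similarity. Both choices, and all function names, are illustrative stand-ins, not the authors' method.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def context_vectors(tokens):
    """Map each word to a Counter of its immediate left and right neighbours."""
    vectors = {}
    for i, word in enumerate(tokens):
        vec = vectors.setdefault(word, Counter())
        if i > 0:
            vec["L:" + tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            vec["R:" + tokens[i + 1]] += 1
    return vectors


def word_similarity(u, v):
    """ASSUMPTION: cosine similarity of context vectors, not the paper's definition."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def set_similarity(a, b, vectors):
    """ASSUMPTION: average pairwise word similarity between two word sets."""
    total = sum(word_similarity(vectors[x], vectors[y]) for x in a for y in b)
    return total / (len(a) * len(b))


def cluster_words(tokens, num_clusters):
    """Bottom-up clustering: start from singletons, repeatedly merge the most similar pair."""
    vectors = context_vectors(tokens)
    clusters = [frozenset([w]) for w in sorted(vectors)]
    while len(clusters) > num_clusters:
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: set_similarity(pair[0], pair[1], vectors),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


if __name__ == "__main__":
    corpus = "the cat sleeps . the dog sleeps . a cat eats . a dog eats .".split()
    for cluster in cluster_words(corpus, 4):
        print(sorted(cluster))
```

On this toy corpus, words that occur in the same contexts (for example "cat" and "dog") end up in one cluster. Swapping `word_similarity` for another measure is the kind of per-model choice of similarity definition that the abstract describes; the naive pair search here is quadratic per merge and is meant for illustration only.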

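Authors: Yuan Lichi (袁里驰), Zhong Yixin (钟义信)

Affiliation: School of Information Engineering, Beijing University of Posts and Telecommunications

Source: Microelectronics & Computer (《微电子学与计算机》), CSCD, Peking University Core Journal, 2005, No. 8, pp. 93-95 (3 pages)

Funding: National Natural Science Foundation of China (69982001); National "863 Program" (2001AA114201)

Keywords: word similarity, word clustering, statistical language model

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]

About the author: Yuan Lichi (袁里驰), male, born 1973, PhD candidate. Research interests: natural language processing and network security.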


