A Fast Automatic Word Clustering Algorithm (一种快速词自动聚类算法). Cited by: 3

A NEW ALGORITHM OF WORDS AUTOMATIC CLUSTERING

Abstract: Word clustering is an important foundational step in automatic language processing. Traditional statistical methods follow a greedy principle and usually take the likelihood function or the perplexity of the corpus as the evaluation criterion. Their main drawbacks are slow clustering, strong sensitivity of the result to the initial values, and a tendency to become trapped in local optima. To address these problems, the paper proposes a clustering method based on a similarity measure and a covering method. The method requires little computation and clusters quickly. In addition, the covering principle effectively reduces the influence of the choice of initial points on the clustering result. Experiments show that the method performs well.

Authors: Wang Duo (王舵), Qie Jun (郄君), Zhang Juan (张娟), Li Wenbin (李文斌)

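Affiliations: Party School of the Shijiazhuang Municipal Committee of the CPC (中共石家庄市委党校); Hebei Vocational College of Political Science and Law (河北政法职业学院); Shijiazhuang University of Economics (石家庄经济学院)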

Source: Computer Applications and Software (《计算机应用与软件》), CSCD, 2010, No. 8, pp. 276-278 (3 pages)

Keywords: word clustering; likelihood function; covering method

Classification number: TP391.41 [Automation and Computer Technology: Computer Application Technology]

Author biography: Wang Duo, master's degree. Main research areas: computer detection technology, embedded systems.

References (4)

1. Agrawal R, Gehrke J, Gunopulos D, et al. Automatic subspace clustering of high dimensional data for data mining applications[C]//Proceedings of the ACM SIGMOD International Conference on Management of Data. Seattle, 1998: 94-105.
2. Li Li, Feng Liu, Wu Chou. An Information Theoretic Approach for Using Word Cluster Information in Natural Language Call Routing[C]//EUROSPEECH-2003, 2003: 2829-2832.
3. Yuan Lichi, Zhong Yixin. A word clustering algorithm based on similarity[J]. Microelectronics & Computer, 2005, 22(8): 93-95. Cited by: 4
4. Kong Dechang, Liu Rong. A new algorithm for probabilistic clustering[J]. Computer Applications and Software, 2007, 24(11): 180-182. Cited by: 2

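Citing documents (3)

1. Tan Xun, Turgun Ibrahim, Aishan Wumaier, Zhang Weiyu. Uyghur word clustering based on similarity computation[J]. Journal of Xinjiang University (Natural Science Edition), 2012, 29(1): 104-107. Cited by: 2
2. Wang Dongbo, Zhu Danhao. Research on mining lexical category knowledge for a knowledge base of Chinese syntactic function distribution[J]. New Technology of Library and Information Service, 2013(3): 33-37. Cited by: 5
3. Wang Dongbo, Zhu Danhao. Research on mining Chinese lexical category knowledge based on the CABOSFV clustering algorithm[J]. Computer Science, 2013, 40(7): 211-215. Cited by: 1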
