Abstract
This paper applies two statistical measures, the Coupling Degree of Double Characters (CDDC) and the Difference of t-test (DT), to overlapping ambiguity resolution in Chinese word segmentation. First, all possible overlapping ambiguities are identified using the segmentation dictionary; then a simple linear combination of CDDC and DT is used to decide whether each ambiguous position should be segmented. The experimental results show that this combination performs better than the combination of Mutual Information of Double Characters and DT, which previous work had shown to be a very effective method for overlapping ambiguity resolution. The linear combination of CDDC and DT is therefore a simple and effective way to resolve overlapping ambiguities.
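The two steps named in the abstract (dictionary-based detection, then a linear combination of two scores) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the formulas for CDDC or DT, so both are passed in as hypothetical scoring functions, and the names `find_overlapping_ambiguities`, `decide_cuts`, `weight`, and `threshold` are assumptions made for illustration. The convention that a low combined score means "cut here" is also an assumption.

```python
from typing import Callable, List, Set, Tuple

Span = Tuple[int, int]  # half-open [start, end) character offsets
# Hypothetical scorer: (sentence, boundary index k) -> score for the
# boundary between sentence[k - 1] and sentence[k].
ScoreFn = Callable[[str, int], float]


def find_overlapping_ambiguities(
    sentence: str, lexicon: Set[str], max_word_len: int = 4
) -> List[Tuple[Span, Span]]:
    """Return pairs of dictionary-word spans that cross each other."""
    n = len(sentence)
    # Collect every multi-character substring that is a dictionary word.
    spans = [
        (i, j)
        for i in range(n)
        for j in range(i + 2, min(n, i + max_word_len) + 1)
        if sentence[i:j] in lexicon
    ]
    # Spans a and b cross when b starts strictly inside a and ends after it.
    return [(a, b) for a in spans for b in spans if a[0] < b[0] < a[1] < b[1]]


def decide_cuts(
    sentence: str,
    pair: Tuple[Span, Span],
    cddc: ScoreFn,
    dt: ScoreFn,
    weight: float = 1.0,     # assumed mixing weight, not from the paper
    threshold: float = 0.0,  # assumed decision threshold, not from the paper
) -> List[int]:
    """Return the candidate boundaries whose combined score says 'cut'."""
    a, b = pair
    cuts = []
    # The two candidate boundaries are the start of b and the end of a.
    for k in (b[0], a[1]):
        score = cddc(sentence, k) + weight * dt(sentence, k)
        if score < threshold:  # assumption: low cohesion means cut
            cuts.append(k)
    return cuts


# Toy data only: the lexicon and the score tables are made up.
lexicon = {"结合", "合成", "分子"}
sentence = "结合成分子"
pairs = find_overlapping_ambiguities(sentence, lexicon)
toy_cddc = {1: 0.9, 2: 0.2}
toy_dt = {1: 0.5, 2: -0.3}
cuts = decide_cuts(
    sentence, pairs[0], lambda s, k: toy_cddc[k], lambda s, k: toy_dt[k]
)
print(pairs, cuts)
```

In a real system the two scorers would be estimated from corpus statistics, and the weight and threshold would be tuned on held-out data.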
Source
《中文信息学报》
CSCD
Peking University Core Journal
2007, Issue 5, pp. 14-17, 30 (5 pages in total)
Journal of Chinese Information Processing
Funding
National 973 Program of China (2004CB318109)
National Natural Science Foundation of China (60603094)
Keywords
computer application
Chinese information processing
Chinese word segmentation
coupling degree of double characters
difference of t-test
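About the authors
Wang Sili (王思力, b. 1981), male, master's student; main research interests: natural language processing and information retrieval.
Wang Bin (王斌, b. 1972), male, Ph.D., associate researcher; main research interests: information retrieval and natural language processing.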