Abstract
Most current syntactic parsers either ignore punctuation or handle it only in a very simple way, even though punctuation is an important syntactic feature. Based on the structural properties of punctuation, this paper proposes the concept of a separate parsing phrase (单独解析块). Using the characteristic features and positions of punctuation marks within a sentence, it presents a method for identifying separate parsing phrases with the ID3 decision tree algorithm, thereby integrating punctuation into Chinese syntactic parsing. All experimental data, both training and test sets, come from the Penn Chinese Treebank 5.0. In a separate experiment on long Chinese sentences of more than 40 words, parsing precision and recall improved by 1.59% and 0.93% respectively, while the time cost fell by nearly two-thirds. The results show that punctuation is highly beneficial for parsing long Chinese sentences and that system performance improved substantially.
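The abstract names ID3 but does not list the features used, so the following is only a minimal sketch of how an ID3 tree could decide whether a punctuation mark closes a separate parsing phrase. The feature names (`punct`, `left_has_verb`), the toy training rows, and the helper functions are all hypothetical illustrations, not the paper's actual feature set or data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on one categorical feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, features):
    """Build an ID3 tree: always split on the feature with the highest gain."""
    if len(set(labels)) == 1:
        return labels[0]                      # pure node becomes a leaf
    if not features:
        return Counter(labels).most_common(1)[0][0]  # majority vote leaf
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    rest = [f for f in features if f != best]
    tree = {"feature": best, "branches": {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree["branches"][value] = id3(
            [rows[i] for i in idx], [labels[i] for i in idx], rest
        )
    return tree

def classify(tree, row, default=False):
    """Walk the tree; fall back to `default` for unseen feature values."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["feature"]], default)
    return tree

# Hypothetical toy data: one row per punctuation mark in a sentence.
# Label True means "this mark ends a separate parsing phrase".
rows = [
    {"punct": "comma", "left_has_verb": True},
    {"punct": "comma", "left_has_verb": False},
    {"punct": "semicolon", "left_has_verb": True},
    {"punct": "semicolon", "left_has_verb": False},
    {"punct": "enum_comma", "left_has_verb": False},
    {"punct": "enum_comma", "left_has_verb": True},
]
labels = [True, False, True, True, False, False]

tree = id3(rows, labels, ["punct", "left_has_verb"])
```

On this toy data the mark type carries more information than the verb feature, so the root splits on `punct` and only the comma branch needs a second test. A parser could then analyse each identified phrase independently, which is one plausible reading of where the reported time savings on long sentences come from.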
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journal (北大核心)
2007, No. 2, pp. 29-34 (6 pages)
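Funding
Supported by the National High Technology Research and Development Program of China (863 Program), grant 2002AA117010-10
Supported by the Ministry of Education Science and Technology Infrastructure Platform Construction Project under the Tenth Five-Year Plan key research program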
Keywords
computer application
Chinese information processing
syntactic parser
separate parsing phrase
decision tree algorithm (ID3)
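About the author
Mao Qi (毛奇, born 1984), male, master's student. His main research interests are information retrieval and natural language processing.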