Study of Chinese Text Categorization (中文文本分类研究)
Abstract: This study investigates Chinese text categorization with three widely used models: k-nearest neighbor (kNN), support vector machines (SVM), and maximum entropy. Using training data prepared by term selection and by removal of irrelevant documents, classification experiments were carried out for each model under two feature representations: Boolean feature-word values and feature-word frequencies. The results show that, under the same conditions, maximum entropy gives the best classification performance, SVM comes second, and kNN is slightly worse. We also find that introducing word-frequency information changes classifier performance: the accuracy of maximum entropy drops by 1%-2%, kNN improves somewhat, and SVM stays about the same. Setting aside the influence of the particular documents, this suggests that different amounts of lexical information affect different machine learning algorithms in different ways.
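To make the experimental setup concrete, the following is a minimal sketch, not the authors' original code or corpus: it compares kNN, a linear SVM, and a maximum entropy classifier (approximated here by scikit-learn's logistic regression) on Boolean-valued versus term-frequency features, mirroring the two conditions described in the abstract. The toy pre-segmented documents, label names, and parameter settings are illustrative assumptions only.

# Minimal sketch of the two-condition comparison; data and hyperparameters are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Toy whitespace-segmented Chinese documents in two categories (placeholder data).
documents = [
    "球队 赢得 比赛 冠军", "球员 进球 比赛 精彩", "教练 安排 球队 训练", "球迷 观看 比赛 结果",
    "股市 上涨 投资 收益", "银行 利率 市场 波动", "公司 股票 价格 下跌", "投资者 关注 市场 行情",
]
labels = ["sports"] * 4 + ["finance"] * 4

train_docs, test_docs, y_train, y_test = train_test_split(
    documents, labels, test_size=0.25, stratify=labels, random_state=0
)

# Two feature representations: word presence (Boolean) vs. word frequency.
feature_schemes = {
    "Boolean": CountVectorizer(binary=True),
    "term frequency": CountVectorizer(binary=False),
}

# The three classifiers compared in the paper; logistic regression stands in for maximum entropy.
classifiers = {
    "k-NN": KNeighborsClassifier(n_neighbors=3),
    "SVM": LinearSVC(),
    "MaxEnt": LogisticRegression(max_iter=1000),
}

for scheme_name, vectorizer in feature_schemes.items():
    X_train = vectorizer.fit_transform(train_docs)
    X_test = vectorizer.transform(test_docs)
    for clf_name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        accuracy = accuracy_score(y_test, clf.predict(X_test))
        print(f"{clf_name:7s} + {scheme_name:15s} features: accuracy = {accuracy:.2f}")

On a real corpus one would additionally apply feature selection and proper cross-validation, as the experiments in the paper do; switching binary=True/False on the vectorizer reproduces the Boolean versus term-frequency contrast discussed above.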
Source: Journal of Taiyuan University of Technology (《太原理工大学学报》, CAS, Peking University Core Journal), 2006, No. 6, pp. 710-713 (4 pages).
Keywords: text categorization; k-nearest neighbor; support vector machines; maximum entropy
About the authors: HAO Xiaoyan (b. 1970), female, from Ningwu, Shanxi, is a PhD candidate whose research focuses on natural language processing. Tel: 0351-6534397; Email: nameguozw@sina.com.cn. Corresponding author: CHANG Xiaoming (b. 1954), male, professor and doctoral supervisor.