A Marine Literature Classification Method Based on Co-training

Cited by: 1

Abstract: Classifying marine literature with supervised machine learning usually requires a large amount of manual labeling. To address this problem, we apply Co-training, a semi-supervised learning method, to build a marine literature classifier with far less labeling effort. Two different classifiers are trained from two views: one view consists of features from the abstract, and the other consists of features from the title, subject, major, and class code. On this basis, a small initial labeled set is used to obtain useful information from a large set of unlabeled documents, Co-training improves the performance of both classifiers, and a final classification model is trained. Experiments show that even with only 2 labeled samples in the training set, the classification system reaches an F1 value of about 85.88% and an error rate of about 14.35%. These figures are close to those of a supervised classifier (90.20% and 9.13%) trained on more than 1,500 labeled samples. The results indicate that applying Co-training to marine literature classification can greatly reduce manual labeling while maintaining good classification performance, which makes the method well suited to practical applications.
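
The co-training loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the base learner (a small multinomial Naive Bayes), the selection rule (each view adds its most confident pool document per class in every round), the class names, and the toy documents are all assumptions made for the example. View 0 stands in for abstract features and view 1 for title and subject features.

```python
import math
from collections import Counter


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing (assumed base learner)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(docs, labels):
            self.counts[label].update(tokens)
        self.vocab = set().union(*self.counts.values())
        return self

    def predict_proba(self, tokens):
        log_p = {}
        for c in self.classes:
            total = sum(self.counts[c].values()) + len(self.vocab)
            lp = math.log(self.prior[c])
            for t in tokens:
                if t in self.vocab:  # tokens never seen in training are ignored
                    lp += math.log((self.counts[c][t] + 1) / total)
            log_p[c] = lp
        top = max(log_p.values())
        exp = {c: math.exp(v - top) for c, v in log_p.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}


def train_views(labeled):
    """Train one classifier per view on the current labeled set."""
    labels = [y for _, y in labeled]
    return [NaiveBayes().fit([x[v] for x, _ in labeled], labels) for v in (0, 1)]


def co_train(labeled, unlabeled, rounds=10):
    """labeled: list of ((view0, view1), label); unlabeled: list of (view0, view1)."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        picked = {}  # pool index -> pseudo-label
        for v, model in enumerate(train_views(labeled)):
            probs = [model.predict_proba(x[v]) for x in pool]
            for c in model.classes:
                # Most confident pool document for class c according to this view.
                best = max(range(len(pool)), key=lambda i: probs[i][c])
                if max(probs[best], key=probs[best].get) == c:
                    picked.setdefault(best, c)
        if not picked:
            break
        # Pseudo-labeled documents join the training set shared by both views.
        labeled += [(pool[i], c) for i, c in picked.items()]
        pool = [x for i, x in enumerate(pool) if i not in picked]
    return train_views(labeled)


def classify(models, doc):
    """Combine the two views by multiplying their class probabilities."""
    p0, p1 = models[0].predict_proba(doc[0]), models[1].predict_proba(doc[1])
    scores = {c: p0[c] * p1[c] for c in models[0].classes}
    return max(scores, key=scores.get)


# Toy data (hypothetical): one labeled document per class, four unlabeled ones.
labeled = [
    ((["wave", "current", "tide"], ["ocean", "dynamics"]), "physical_oceanography"),
    ((["fish", "plankton", "species"], ["marine", "biology"]), "marine_biology"),
]
unlabeled = [
    (["wave", "tide", "salinity"], ["ocean", "circulation"]),
    (["plankton", "species", "algae"], ["marine", "ecology"]),
    (["salinity", "current", "temperature"], ["circulation", "dynamics"]),
    (["algae", "fish", "coral"], ["ecology", "biology"]),
]
models = co_train(labeled, unlabeled)
print(classify(models, (["temperature", "salinity"], ["circulation"])))
```

In this toy setup the query document shares no words with the two initially labeled documents, so any correct prediction has to come from vocabulary picked up through the pseudo-labeled pool. That is the effect co-training relies on.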

Authors: Xu Jianliang (徐建良), Jiang Yihong (姜亦宏), Zhang Wei (张巍), Wang Qiuhong (王秋红)

Affiliation: Department of Computer Science and Technology, Ocean University of China

Source: Periodical of Ocean University of China (Natural Science Edition) (《中国海洋大学学报（自然科学版）》), 2010, No. 2, pp. 105-110 (6 pages). Indexed in: CAS, CSCD, PKU Core (北大核心).

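Funding: National Natural Science Foundation of China (60602017); Ministry of Education "Program for New Century Excellent Talents" fund (NECT-07-0784); Shandong Province Outstanding Young Scientist Research Award Fund (2008BS01003).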

Keywords: marine literature; text categorization; machine learning; semi-supervised learning; Co-training

Classification code: TP393 [Automation and Computer Technology: Computer Application Technology]

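About the author: Xu Jianliang (1969-), male, professor and doctoral supervisor. Main research interests: computational complexity theory and artificial intelligence. E-mail: cheung.colin@gmail.com. Corresponding author e-mail: ihcil@ouc.edu.cn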
