
A Marine Literature Classification Method Based on Co-training (original title: 一种基于Co-Training的海洋文献分类方法) Cited by: 1
Abstract Classifying marine literature with supervised machine learning typically requires a large amount of manual labeling. To address this problem, we apply Co-training, a semi-supervised learning method, to reduce the labeling effort. We train two different classifiers on two views: one view consists of the feature set of the abstract, and the other consists of the feature sets of the title, subject, major, and class code. Starting from a small labeled set, the method extracts useful information from a large set of unlabeled documents, improves both classifiers through Co-training, and trains the final classification model. Experiments show that with only 2 labeled samples in the training set, the classification system reaches an F1 value of about 85.88% and an error rate of about 14.35%. These figures are close to those of a supervised classifier (90.20% and 9.13%) trained on more than 1,500 labeled samples. The results indicate that applying Co-training to marine literature classification can greatly reduce manual labeling while maintaining good classification performance, which makes the method well suited to practical applications.
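The Co-training loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the base classifier, so a tiny multinomial Naive Bayes is assumed here; the selection rule (each view's classifier labels its most confident unlabeled document per class in every round) follows the general Blum and Mitchell scheme; and the names `NaiveBayes`, `co_train`, `predict`, and the toy documents are all hypothetical. View 0 stands in for abstract tokens and view 1 for title/subject tokens.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial Naive Bayes over token lists (assumed base learner)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = defaultdict(Counter)
        for tokens, y in zip(docs, labels):
            self.counts[y].update(tokens)
        self.vocab = {t for c in self.counts.values() for t in c}
        return self

    def predict_proba(self, tokens):
        n = sum(self.prior.values())
        v = len(self.vocab) + 1  # Laplace smoothing, one extra slot for unseen tokens
        logp = {}
        for y in self.classes:
            total = sum(self.counts[y].values())
            lp = math.log(self.prior[y] / n)
            for t in tokens:
                lp += math.log((self.counts[y][t] + 1) / (total + v))
            logp[y] = lp
        m = max(logp.values())
        z = sum(math.exp(lp - m) for lp in logp.values())
        return {y: math.exp(lp - m) / z for y, lp in logp.items()}


def fit_view(labeled, view):
    """Train one classifier on a single view of the labeled documents."""
    return NaiveBayes().fit([x[view] for x, _ in labeled], [y for _, y in labeled])


def co_train(labeled, unlabeled, rounds=5):
    """labeled: [((view0, view1), label)]; unlabeled: [(view0, view1)]."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        picked = {}  # pool index -> pseudo-label
        for view in (0, 1):
            clf = fit_view(labeled, view)
            best = {}  # label -> (confidence, pool index)
            for i, x in enumerate(pool):
                proba = clf.predict_proba(x[view])
                label = max(proba, key=proba.get)
                if label not in best or proba[label] > best[label][0]:
                    best[label] = (proba[label], i)
            for label, (_, i) in best.items():
                picked.setdefault(i, label)
        # Pseudo-labeled documents join the shared labeled set for both views.
        labeled += [(pool[i], y) for i, y in picked.items()]
        pool = [x for i, x in enumerate(pool) if i not in picked]
    return [fit_view(labeled, v) for v in (0, 1)], labeled


def predict(models, doc):
    """Combine the two view classifiers by multiplying their posteriors."""
    combined = {}
    for view, clf in enumerate(models):
        for label, p in clf.predict_proba(doc[view]).items():
            combined[label] = combined.get(label, 1.0) * p
    return max(combined, key=combined.get)


# Toy data (hypothetical): two labeled seeds, four unlabeled documents.
seed = [
    ((["ocean", "current", "salinity"], ["marine", "hydrology"]), "marine"),
    ((["stock", "market", "price"], ["finance", "economics"]), "other"),
]
unlabeled = [
    (["salinity", "tide", "ocean"], ["coastal", "oceanography"]),
    (["market", "price", "inflation"], ["economics", "policy"]),
    (["tide", "wave", "coastal"], ["oceanography", "marine"]),
    (["inflation", "bank", "stock"], ["policy", "finance"]),
]
models, labeled = co_train(seed, unlabeled)
print(predict(models, (["wave", "tide"], ["oceanography"])))
```

The point of using two views is that a document one classifier cannot judge (for example, an abstract with no known tokens) may be easy for the other, so each classifier supplies training examples the other would not have found on its own.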
Source Periodical of Ocean University of China (《中国海洋大学学报(自然科学版)》), 2010, No. 2, pp. 105-110 (6 pages). Indexed in: CAS, CSCD, Peking University Core (北大核心).
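Funding Supported by the National Natural Science Foundation of China (60602017), the Ministry of Education "Program for New Century Excellent Talents" (NECT-07-0784), and the Shandong Province Outstanding Young Scientist Research Award Fund (2008BS01003).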
Keywords marine literature; text categorization; machine learning; semi-supervised learning; Co-training
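About the author Xu Jianliang (徐建良, born 1969), male, professor and doctoral supervisor. Main research interests: computational complexity theory and artificial intelligence. E-mail: cheung.colin@gmail.com. Corresponding author e-mail: ihcil@ouc.edu.cn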


