Abstract
In recent years, feature selection has been widely used in machine learning. To improve the efficiency of text computation and the performance of data classification, a two-step method is proposed for the feature selection problem. It combines the filter-based CEA algorithm with the wrapper-based Boruta algorithm and introduces a parameter p to control the proportion of shadow features in Boruta, which improves the efficiency of the wrapper stage, reduces the time complexity of the overall algorithm, and screens out a better candidate feature set. Experiments with a random forest classifier on three datasets show that the proposed algorithm outperforms the traditional Boruta and CEA algorithms in average classification error rate, recall, accuracy, and F1 score, and can effectively reduce the number of features in the finally selected feature subset, improving text classification efficiency and predictive performance.
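The shadow-proportion idea described above can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: the function names and defaults are assumptions, and a simple correlation score stands in for the random-forest feature importance that Boruta actually uses. Each round permutes a random fraction p of the real columns to build shadow features, so a smaller p means fewer shadow columns and a cheaper wrapper stage.

```python
import numpy as np

def importance(X, y):
    """Stand-in for random-forest importance: absolute Pearson
    correlation of each column with the label (illustrative only)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    cov = Xc.T @ yc
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    return np.abs(cov / denom)

def boruta_p(X, y, p=0.5, n_iter=30, seed=0):
    """Boruta-style screening with shadow proportion p: each round,
    permute a random fraction p of the columns to create shadows,
    then count how often each real feature beats the best shadow.
    Features that win in most rounds are kept as candidates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = max(1, int(round(p * d)))            # number of shadow columns
    hits = np.zeros(d, dtype=int)
    for _ in range(n_iter):
        cols = rng.choice(d, size=k, replace=False)
        # permuting each chosen column breaks its link to the label
        shadows = rng.permuted(X[:, cols], axis=0)
        imp = importance(np.hstack([X, shadows]), y)
        hits += imp[:d] > imp[d:].max()
    return np.where(hits > n_iter // 2)[0]   # confirmed feature indices

# usage on synthetic data: only column 0 is informative
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200).astype(float)
X = rng.normal(size=(200, 6))
X[:, 0] += 2 * y
selected = boruta_p(X, y, p=0.5)
```

With p = 1.0 this degenerates to the classical Boruta scheme of one shadow per real feature; lowering p shrinks the extended matrix passed to the importance estimator, which is the source of the claimed reduction in wrapper-stage cost.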
Authors
ZHU Haodong
CHANG Zhifang
School of Computer and Communication Engineering, Zhengzhou University of Light Industry, Zhengzhou 450000, China
Source
Journal of Hubei Minzu University (Natural Science Edition)
CAS
2020, No. 3, pp. 349-354 (6 pages)
Funding
Key Scientific Research Project of Higher Education Institutions of Henan Province (19A520009).
Keywords
feature selection
dimension reduction
Boruta
CEA (comprehensive evaluation algorithm)
machine learning
About the authors
First author: ZHU Haodong (1980-), male, Ph.D., professor; research interests: intelligent information processing and intelligent computing.