
Denoising and Adaptive Hybrid Sampling Based on Hierarchical Density Clustering (cited by: 1)
Abstract: Imbalanced data suffer from intra-class imbalance, noise, and generated samples that cover only a small region. To address these problems, an adaptive denoising hybrid sampling algorithm based on hierarchical density clustering (ADHSBHD) is proposed. First, the HDBSCAN clustering algorithm is applied to the minority class and the majority class separately; the intersection of the global outliers and the local outliers is treated as the noise set, and the noise samples are removed before the original dataset is processed further. Second, new samples are synthesized adaptively according to the average distance within each cluster of minority-class samples, using a sampling method with broader coverage. Finally, some majority-class points that contribute little to classification are deleted so that the dataset becomes balanced. ADHSBHD is evaluated on seven real datasets (the English version of the abstract says six), and the results demonstrate its effectiveness.
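The denoising rule and the adaptive oversampling step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: cluster labels are assumed to come from a prior HDBSCAN run (with -1 marking unclustered points), the outlier flags are assumed to be given, and the abstract does not specify how the per-cluster average distance is turned into sample counts or which "broader coverage" sampler is used. Here sparser clusters simply receive proportionally more samples, and plain SMOTE-style interpolation stands in for the paper's sampler. All function names are hypothetical.

```python
import numpy as np


def noise_mask(global_outlier, local_outlier):
    """Flag a point as noise only if it is both a global and a local outlier."""
    return np.asarray(global_outlier, bool) & np.asarray(local_outlier, bool)


def mean_pairwise_distance(points):
    """Average Euclidean distance over all distinct pairs (0.0 for < 2 points)."""
    n = len(points)
    if n < 2:
        return 0.0
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return dist.sum() / (n * (n - 1))  # each pair is counted twice in dist


def allocate_counts(X_min, labels, n_new):
    """Split n_new across clusters in proportion to mean pairwise distance.

    Assumption (not from the paper): sparser clusters get more samples.
    Points labelled -1 (unclustered) are skipped.
    """
    ids = [int(c) for c in np.unique(labels) if c != -1]
    if not ids:
        return {}
    spread = np.array([mean_pairwise_distance(X_min[labels == c]) for c in ids])
    if spread.sum() == 0:
        weights = np.full(len(ids), 1.0 / len(ids))
    else:
        weights = spread / spread.sum()
    counts = np.floor(weights * n_new).astype(int)
    # Hand the rounding remainder to the clusters with the largest weights.
    for i in np.argsort(-weights)[: n_new - counts.sum()]:
        counts[i] += 1
    return {c: int(k) for c, k in zip(ids, counts)}


def synthesize(X_min, labels, n_new, rng):
    """Interpolate between random same-cluster pairs (SMOTE-style stand-in)."""
    new_points = []
    for c, k in allocate_counts(X_min, labels, n_new).items():
        members = X_min[labels == c]
        a = members[rng.integers(len(members), size=k)]
        b = members[rng.integers(len(members), size=k)]
        t = rng.random((k, 1))  # interpolation factor in [0, 1)
        new_points.append(a + t * (b - a))
    if not new_points:
        return np.empty((0, X_min.shape[1]))
    return np.vstack(new_points)


# Toy minority class: one tight cluster and one sparse cluster.
rng = np.random.default_rng(0)
X_min = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [5.0, 5.0], [7.0, 5.0], [5.0, 7.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
counts = allocate_counts(X_min, labels, n_new=21)
synthetic = synthesize(X_min, labels, 21, rng)
```

In this toy setup the sparse cluster receives most of the 21 synthetic points, which is one plausible reading of "adaptive" in the abstract. The final undersampling of low-contribution majority points is left out because the abstract gives no criterion for it.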
Authors: JIANG Xin-Ying; WANG Shu-Fan; YAN Tao (School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science, Shanghai 201620, China)
Source: Computer Systems & Applications (《计算机系统应用》), 2022, No. 10, pp. 206-210 (5 pages)
Keywords: imbalanced data; classification; clustering; hybrid sampling
Corresponding author: JIANG Xin-Ying, E-mail: jxynovelty@163.com
