Statistical Correlation and K-Means Based Distinguishable Gene Subset Selection Algorithms

Cited by: 56
Abstract: Selecting a small set of distinguishable genes that separates cancer patients from healthy people is difficult when a dataset has few samples and tens of thousands of genes. To address this problem, this paper proposes novel hybrid gene selection algorithms based on statistical correlation and K-means. First, the Pearson correlation coefficient and the Wilcoxon rank-sum test are each used to measure the correlation between every gene and the class label; the least relevant genes are filtered out, and about 10 percent of the genes, those most correlated with the label, are kept as the pre-selected gene subset. Next, K-means groups highly correlated genes of the pre-selected subset into the same cluster, an SVM classifier is trained, and the weight of each gene is computed from the classifier's coefficients. One representative gene is then chosen from each cluster: either the gene with the largest weight or, when the roulette wheel strategy is used, the gene that receives the most votes. The representative genes of all clusters form the distinguishable gene subset. To verify the effectiveness of the proposed algorithms, a baseline named Random, which picks each cluster's representative gene at random, is also evaluated. The proposed algorithms are compared with Random, with Guyon's classic gene selection algorithm SVM-RFE, and with SVM-SFS, a gene selection algorithm that uses a sequential forward search strategy. Results averaged over 200 repeated runs on several classic gene expression datasets show that the proposed hybrid algorithms select gene subsets with very good discriminative power, and that classifiers built on these subsets achieve very good classification performance.
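The pipeline described in the abstract (filter, cluster, weight, pick one representative per cluster) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses only the Pearson filter and the largest-weight rule, a Pegasos-style sub-gradient linear SVM without a bias term stands in for whatever SVM solver the paper used, the gene weight is taken to be |w_j|, labels are assumed to be +1/-1, and every function name, parameter, and the toy data generator are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def preselect(X, y, keep_frac=0.1):
    """Filter step: keep the genes with the largest |r| against the labels."""
    n_genes = len(X[0])
    score = [abs(pearson([row[g] for row in X], y)) for g in range(n_genes)]
    n_keep = max(1, round(keep_frac * n_genes))
    return sorted(range(n_genes), key=lambda g: -score[g])[:n_keep]

def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's K-means; returns a cluster index for every point."""
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def svm_weights(X, y, rng, lam=0.01, epochs=100):
    """Pegasos-style sub-gradient linear SVM (assumed stand-in); returns w."""
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(X)), len(X)):  # shuffled pass
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1 - eta * lam) * wj for wj in w]
            if margin < 1:  # hinge loss is active for this sample
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def select_genes(X, y, k, keep_frac=0.1, seed=0):
    """Return one representative gene index per non-empty cluster."""
    rng = random.Random(seed)
    pre = preselect(X, y, keep_frac)
    k = min(k, len(pre))
    profiles = [[row[g] for row in X] for g in pre]  # one point per gene
    assign = kmeans(profiles, k, rng)
    w = svm_weights([[row[g] for g in pre] for row in X], y, rng)
    chosen = []
    for c in range(k):
        members = [j for j, a in enumerate(assign) if a == c]
        if members:  # representative = largest |weight| in the cluster
            chosen.append(pre[max(members, key=lambda j: abs(w[j]))])
    return sorted(chosen)

def toy_data(n_samples=20, n_genes=40, n_informative=4, seed=1):
    """Hypothetical toy data: the first n_informative genes track the label."""
    rng = random.Random(seed)
    y = [1 if i % 2 == 0 else -1 for i in range(n_samples)]
    X = [[(2.0 * y[i] if g < n_informative else 0.0) + rng.gauss(0, 0.5)
          for g in range(n_genes)] for i in range(n_samples)]
    return X, y
```

Calling `select_genes(*toy_data(), k=3, keep_frac=0.25)` returns at most three gene indices, all drawn from the pre-selected subset. Replacing the score in `preselect` with a rank-sum statistic would give the Wilcoxon variant, and replacing the `max` in `select_genes` with a random draw would give a baseline in the spirit of Random.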
Source: Journal of Software (《软件学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2014, Issue 9, pp. 2050-2075 (26 pages)
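Funding: National Natural Science Foundation of China (31372250); Fundamental Research Funds for the Central Universities (GK201102007); Key Science and Technology Program of Shaanxi Province (2013K12-03-24)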
Keywords: distinguishable gene subset selection; Pearson correlation coefficient; Wilcoxon rank-sum test; K-means clustering; statistical correlation; Filter algorithms; Wrapper algorithms
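About the authors: Corresponding author: XIE Juanying (谢娟英), E-mail: xiejuany@snnu.edu.cn, http://www.snnu.edu.cn. XIE Juanying (born 1971), female, from Xi'an, Shaanxi; Ph.D., associate professor, CCF senior member; main research areas: machine learning and data mining. GAO Hongchao (高红超, born 1988), male, master's student; main research area: intelligent information processing. E-mail: 852383636@qq.com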
References: 3

Secondary references: 59

  • 1. Mao Yong, Zhou Xiaobo, Xia Zheng, Yin Zheng, Sun Youxian. Survey of research on feature selection algorithms [J]. Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2007, 20(2): 211-218. Cited by: 95
  • 2. Khan J, Wei J S, Ringner M, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 2001, 7(6): 673-679
  • 3. Jain A K, Duin R P W, Mao Jianchang. Statistical pattern recognition: A review. IEEE Trans Pattern Analysis and Machine Intelligence, 2000, 22(1): 4-37
  • 4. Herrero J, Valencia A, Dopazo J. A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 2001, 17(2): 126-136
  • 5. Loog M, Duin R P W. Multiclass linear dimension reduction by weighted pairwise Fisher criteria. IEEE Trans Pattern Analysis and Machine Intelligence, 2001, 23(7): 762-766
  • 6. Mjolsness E, DeCoste D. Machine learning for science: State of the art and future prospects. Science, 2001, 293(14): 2051-2055
  • 7. Ramaswamy S, Tamayo P, Rifkin R, et al. Multiclass cancer diagnosis using tumor gene expression signatures. PNAS, 2001, 98(26): 15149-15154
  • 8. Xiong Momiao, Fang Xiangzhong, Zhao Jinying. Biomarker identification by feature wrappers. Genome Research (see www.genome.org), 2001, 11: 178-188
  • 9. Dudoit S, Fridlyand J, Speed T P. Comparison of discrimination methods for the classification of tumors using gene expression data. Technical Report #576, University of California, Berkeley, June 2000
  • 10. Guyon I, Weston J, Barnhill S, et al. Gene selection for cancer classification using support vector machines. Machine Learning, 2002, 46(3): 389-422

Bibliographically coupled documents: 113

Co-cited documents: 420

Citing documents: 56

Second-level citing documents: 356
