Effect Heterogeneity Analysis of Multiple Datasets Based on a Hybrid Pairwise Penalty Method
Abstract Big data are usually formed by merging multiple datasets that cover different subjects or come from different sources, so the effect of the same independent variable on the dependent variable may differ across datasets, a phenomenon known as effect heterogeneity. Uncovering latent effect heterogeneity from data has become one of the important goals of big data analysis. Integrative analysis methods based on the fusion penalty and on the pairwise penalty are currently the two mainstream classes of methods for effect heterogeneity analysis, but the former depends heavily on the ordering of the model coefficients, while the latter incurs a high computational cost. To address this, the paper proposes a new integrative analysis method based on a hybrid pairwise penalty. Compared with the fusion penalty-based method, the new method is much less sensitive to the coefficient ordering. Compared with the pairwise penalty-based method, it removes a large number of redundant penalty terms, which reduces the computational cost and improves the accuracy of the results. Extensive simulation studies and an application to identifying pathogenic genes in melanoma demonstrate the advantages of the new method in identifying effect heterogeneity.
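The two baseline penalties contrasted in the abstract can be illustrated with a minimal sketch. It assumes a plain absolute-value (L1) penalty on coefficient differences and made-up coefficient values for one covariate across four datasets; the paper itself may use a different penalty function, and the hybrid pairwise penalty is not sketched here because the abstract does not give its form. The function names are hypothetical.

```python
from itertools import combinations


def fusion_penalty(beta, lam):
    """Fusion-type penalty: differences between neighbours in the given order.

    Uses len(beta) - 1 terms, so the value depends on how beta is ordered.
    (Assumption: L1 penalty on differences, chosen only for illustration.)
    """
    return lam * sum(abs(beta[m + 1] - beta[m]) for m in range(len(beta) - 1))


def pairwise_penalty(beta, lam):
    """Pairwise penalty: differences between every pair of coefficients.

    Uses M * (M - 1) / 2 terms for M datasets, so it is order-free but larger.
    """
    return lam * sum(abs(a - b) for a, b in combinations(beta, 2))


# Hypothetical coefficients of one covariate in M = 4 datasets,
# forming two groups {0, 0} and {1, 1}.
sorted_beta = [0.0, 0.0, 1.0, 1.0]    # datasets ordered so groups are contiguous
shuffled_beta = [0.0, 1.0, 0.0, 1.0]  # same values, different dataset order

fusion_sorted = fusion_penalty(sorted_beta, 1.0)
fusion_shuffled = fusion_penalty(shuffled_beta, 1.0)
pairwise_sorted = pairwise_penalty(sorted_beta, 1.0)
pairwise_shuffled = pairwise_penalty(shuffled_beta, 1.0)

n_fusion_terms = len(sorted_beta) - 1
n_pairwise_terms = len(list(combinations(sorted_beta, 2)))

print(fusion_sorted, fusion_shuffled)       # differs with the ordering
print(pairwise_sorted, pairwise_shuffled)   # identical under reordering
print(n_fusion_terms, n_pairwise_terms)     # number of penalty terms
```

In this toy case the fusion-type value changes when the datasets are reordered, while the pairwise value does not but needs twice as many terms; the gap in term count grows quadratically with the number of datasets.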
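Authors Sun Yifan; Yao Yizhi; Yu Xue
Source Statistical Research (《统计研究》), CSSCI, Peking University Core Journal, 2024, No. 9, pp. 150-160 (11 pages)
Funding Renmin University of China Scientific Research Fund (supported by the Fundamental Research Funds for the Central Universities), project "Methods, Theory, and Applications of Effect Heterogeneity Mining for High-Dimensional Data" (23XNL014).
Keywords Big Data; Effect Heterogeneity; Hybrid Pairwise Penalty; Integrative Analysis
About the authors Sun Yifan is a professor at the Center for Applied Statistics, the School of Statistics, and the Advanced Innovation Center for Future Blockchain and Privacy Computing, Renmin University of China; research interests: high-dimensional data analysis, multi-source heterogeneous data analysis, and distributed optimization algorithms. Yao Yizhi is a master's student at the Center for Applied Statistics and the School of Statistics, Renmin University of China; research interests: mathematical statistics and high-dimensional data analysis. Corresponding author: Yu Xue, doctoral student at the Center for Applied Statistics and the School of Statistics, Renmin University of China; research interests: high-dimensional data analysis, federated learning, and graph learning. Email: xueyu_2019@ruc.edu.cn.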
