Effect Heterogeneity Analysis of Multiple Datasets Based on a Hybrid Pairwise Penalty Method (original title: 基于混合成对惩罚的多个数据集效应异质性分析)

Abstract: Big data are often formed by merging multiple datasets that cover different subjects or come from different sources, so the effect of the same independent variable on the dependent variable may differ across datasets; this is known as effect heterogeneity. Uncovering latent effect heterogeneity has become one of the important goals of big data analysis. Integrative analysis methods based on the fusion penalty and on the pairwise penalty are currently the two mainstream approaches to effect heterogeneity analysis, but the former depends heavily on the ordering of the model coefficients, while the latter is computationally expensive. To address this, this paper proposes a new integrative analysis method based on a hybrid pairwise penalty. Compared with the fusion penalty-based method, the new method is far less sensitive to the coefficient ordering. Compared with the pairwise penalty-based method, it removes a large number of redundant penalty terms, which lowers the computational cost and improves the accuracy of the results. Extensive simulation studies and an application to identifying pathogenic genes in melanoma demonstrate the advantages of the new method in identifying effect heterogeneity.
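
To make the trade-off described in the abstract concrete, the following minimal Python sketch contrasts how many penalty terms a fusion-style penalty and a pairwise penalty place on the coefficients of one variable across M datasets. It is an illustration under stated assumptions, not the authors' implementation: the coefficient values and the function names are hypothetical, an L1 difference penalty stands in for whatever penalty function the paper uses, and the hybrid pairwise penalty itself is not reproduced because this record does not give its form.

```python
from itertools import combinations


def fusion_penalty_pairs(coefs):
    """Pairs penalized by a fusion-style penalty: neighbours in sorted order.

    The pairs depend entirely on the ordering of the (initial) coefficient
    estimates, which is why a poor ordering can mislead the method.
    """
    order = sorted(range(len(coefs)), key=lambda m: coefs[m])
    return [(order[i], order[i + 1]) for i in range(len(order) - 1)]


def pairwise_penalty_pairs(coefs):
    """Pairs penalized by a pairwise penalty: every pair of datasets."""
    return list(combinations(range(len(coefs)), 2))


def l1_penalty(coefs, pairs, lam=1.0):
    """Assumed L1 difference penalty: lam * sum of |beta_a - beta_b|."""
    return lam * sum(abs(coefs[a] - coefs[b]) for a, b in pairs)


# Hypothetical initial estimates of one coefficient in M = 5 datasets.
coefs = [0.9, 0.1, 1.0, 0.0, 0.95]

fused = fusion_penalty_pairs(coefs)
pairwise = pairwise_penalty_pairs(coefs)

print(len(fused), len(pairwise))  # 4 10, i.e. M - 1 versus M * (M - 1) / 2
print(fused)                      # pairs chosen by the sorted order
print(l1_penalty(coefs, fused), l1_penalty(coefs, pairwise))
```

The fusion-style penalty needs only M - 1 terms but inherits any error in the ordering, whereas the pairwise penalty avoids the ordering at the price of M(M - 1)/2 terms per variable; the abstract positions the hybrid pairwise penalty between these two extremes.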

Authors: Sun Yifan; Yao Yizhi; Yu Xue

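Affiliations: Center for Applied Statistics, School of Statistics, and Advanced Innovation Center for Future Blockchain and Privacy Computing, Renmin University of China; Center for Applied Statistics and School of Statistics, Renmin University of China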

Source: Statistical Research (《统计研究》), indexed in CSSCI and the Peking University Core Journals list, 2024, Issue 9, pp. 150-160 (11 pages)

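Funding: Research Funds of Renmin University of China (supported by the Fundamental Research Funds for the Central Universities), project "Methods, Theory, and Applications of Effect Heterogeneity Mining for High-Dimensional Data" (23XNL014).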

Keywords: Big Data; Effect Heterogeneity; Hybrid Pairwise Penalty; Integrative Analysis

Classification number: O212 [Science: Probability Theory and Mathematical Statistics]

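About the authors: Sun Yifan is a professor at the Center for Applied Statistics, the School of Statistics, and the Advanced Innovation Center for Future Blockchain and Privacy Computing, Renmin University of China; research interests: high-dimensional data analysis, multi-source heterogeneous data analysis, and distributed optimization algorithms. Yao Yizhi is a master's student at the Center for Applied Statistics and the School of Statistics, Renmin University of China; research interests: mathematical statistics and high-dimensional data analysis. Corresponding author: Yu Xue is a doctoral student at the Center for Applied Statistics and the School of Statistics, Renmin University of China; research interests: high-dimensional data analysis, federated learning, and graph learning. Email: xueyu_2019@ruc.edu.cn.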

