Abstract
Objective: To investigate lipoprotein and metabolite factors associated with the onset of Alzheimer's disease (AD) using machine learning algorithms.
Methods: A total of 314 subjects whose 2012 diagnosis was cognitively normal (CN) or AD were selected from the ADNI database, and their lipoprotein and metabolite data were collected. Three methods (random forest, Lasso regression, and XGBoost) were used to rank variables by importance and screen them. The variables selected by the three methods were combined with the sex, age, and marital status of the study population to build a random forest model for identifying the important factors affecting AD onset.
Results: The three methods together selected 12 lipoprotein and metabolite variables; with age, sex, and marital status, 15 variables entered the random forest model. The model had an accuracy of 84.13%, a sensitivity of 93.75%, a specificity of 53.33%, a Kappa value of 0.5183, and an AUC (95% CI) of 0.735 (0.600-0.871). The top five variables ranked by Mean Decrease Accuracy and the top five ranked by Mean Decrease Gini both contained the same four variables: the phospholipid to total lipid ratio in large very low-density lipoprotein (L_VLDL_PL_PCT), age (AGE), triglycerides in chylomicrons and extremely large very low-density lipoprotein (XXL_VLDL_TG), and the cholesterol to total lipid ratio in large very low-density lipoprotein (L_VLDL_C_PCT). Their Mean Decrease Accuracy values were 6.68, 6.65, 6.10, and 5.49, respectively, and their Mean Decrease Gini values were 7.35, 11.71, 8.77, and 10.08, respectively.
Conclusion: L_VLDL_PL_PCT, AGE, XXL_VLDL_TG, and L_VLDL_C_PCT are closely related to the onset of AD. The study offers new insights for the diagnosis and prevention of AD. Further study of the relationship between lipoprotein levels and AD onset may help prevent or delay the disease.
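The reported accuracy, sensitivity, specificity, and Kappa value can be cross-checked against a 2×2 confusion matrix. The abstract does not give the test-set counts, so the counts in the sketch below (45, 3, 7, and 8, for a hold-out set of 63 subjects) are an assumption, chosen only because they reproduce the reported figures. Which class the authors treated as positive is also not stated. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute accuracy, sensitivity, specificity and Cohen's kappa
    from the four cells of a 2x2 confusion matrix."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    # Agreement expected by chance: product of the row and column
    # marginals for each class, summed over both classes.
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (accuracy - p_expected) / (1 - p_expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }


# ASSUMED counts (not from the source): one confusion matrix that is
# consistent with the percentages reported in the abstract.
metrics = binary_metrics(tp=45, fn=3, fp=7, tn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these assumed counts the four values round to 0.8413, 0.9375, 0.5333, and 0.5183, which agrees with the abstract. The gap between sensitivity and specificity shows that the model identifies one class far more reliably than the other.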
Authors
WANG Feng-lin (王凤琳)
WANG Ai-min (王爱民)
HUANG Yi-ming (黄一铭)
XU Ya-qi (徐雅琪)
ZHANG Wen-jing (张文婧)
SHI Fu-yan (石福艳)
WANG Su-zhen (王素珍)
School of Public Health, Weifang Medical College, Weifang, Shandong 261053, China
Source
Modern Preventive Medicine (《现代预防医学》)
CAS
PKU Core Journals (北大核心)
2023, Issue 23, pp. 4225-4230 (6 pages)
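Funding
National Natural Science Foundation of China (81803337, 81872719, 82003560)
National Bureau of Statistics research project (2018LY79)
Natural Science Foundation of Shandong Province (ZR2019MH034, ZR2020MH340)
Shandong Province Higher Education Young Innovative Talent Introduction and Cultivation Program (2019-6-156, Lu-Jiao)
Weifang Medical College Doctoral Start-up Fund (2017BSQD51).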
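About the authors
WANG Feng-lin (born 2000), female, master's student; research area: epidemiology and health statistics. Corresponding author: SHI Fu-yan, E-mail: shifuyan@126.com. Corresponding author: WANG Su-zhen, E-mail: wangsz@wfmc.edu.cn.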