Abstract
This study predicts the 5-year survival status of breast cancer patients and analyzes the factors that influence it. First, breast cancer records from 2004 to 2010 were selected from the SEER database, and the selected features were preprocessed. Second, at the data level, SMOTE oversampling was applied to address class imbalance; at the algorithm level, LightGBM, CatBoost, and GBDT were compared on the task of predicting 5-year survival status. Finally, the influencing factors were ranked by importance and interpreted using SHAP values. The prediction model constructed in this paper performs better than a single model, with accuracy, AUC, recall, precision, and F1 score of 0.9060, 0.8443, 0.9837, 0.9160, and 0.9487, respectively. The 5-year survival status of breast cancer is strongly associated with tumor size, the total number of lymph nodes examined, the number of positive lymph nodes, estrogen receptor (ER) status, progesterone receptor (PR) status, and age. The important features selected by the model are consistent with current clinical findings, so the model can provide some technical support for clinical prognosis prediction.
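The abstract names SMOTE oversampling and reports precision, recall, and F1 together. As a rough illustration only, the sketch below shows the core SMOTE idea (a synthetic minority sample is placed on the segment between a minority sample and one of its nearest minority neighbours) and checks that the reported F1 is consistent with the reported precision and recall. It uses only the Python standard library. The function names, the toy points, and the choice of k are assumptions for illustration, not the authors' implementation or data.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Illustrative sketch of the SMOTE idea; requires at least two samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical two-feature minority class (not SEER data).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_oversample(minority, n_new=6, k=2)

# F1 recomputed from the precision and recall reported in the abstract.
recomputed_f1 = f1_score(0.9160, 0.9837)
```

Because each synthetic point is a convex combination of two existing minority samples, it always lies inside the bounding box of the minority class. The recomputed F1 agrees with the reported 0.9487 to within rounding of the published precision and recall.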
Authors
张继婕
覃庆洪
刘雪萍
王康权
魏薇
ZHANG Jijie; QIN Qinghong; LIU Xueping; WANG Kangquan; WEI Wei (College of Science, Guangxi University of Science and Technology, Liuzhou 545006, China; Affiliated Cancer Hospital, Guangxi Medical University, Nanning 530021, China; Medical School, Guangxi University of Science and Technology, Liuzhou 545005, China)
Source
《广西科技大学学报》
2022, No. 1, pp. 101-109 (9 pages)
Journal of Guangxi University of Science and Technology
Funding
Supported by the Guangxi Natural Science Foundation (2019GXNSFAA245067).
Keywords
SEER database
breast cancer
ensemble learning
prognosis prediction
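About the authors
ZHANG Jijie, master's student. Corresponding author: LIU Xueping, PhD, associate researcher; research interest: pharmacology; E-mail: 100000774@gxust.edu.cn.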