Analysis of Lung Cancer Gene Expression Data Using Stacking Ensemble Learning

EIR

Educational Innovation and Research

3066-8298, 3066-828X

Art and Technology

10.61369/EIR.2025100010

Article

https://artdesignp.com/journal/EIR/1/10/10.61369/EIR.2025100010

郝智航, 胡广才, 倪士峰, 王乐, 王欢

2025

Volume 1, Issue 10

2025-12-20

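To address the challenges that high dimensionality, small sample sizes, and label noise in lung cancer gene expression data pose to traditional single models, this study proposes an enhanced Stacking ensemble learning framework that improves classification performance and robustness. Using the GSE252168 dataset, a hybrid feature selection strategy first reduces the number of genes from 30,715 to 1,500. SVM, logistic regression, random forest, and XGBoost are then integrated as heterogeneous base learners. The core innovation is a dual enhancement mechanism: first, the meta-features generated by the base learners are concatenated with the original features to form the meta-layer input; second, at inference time the base-model outputs are adaptively fused using dynamic weights derived from validation-set F1 and AUC. The meta-learner is an L1-regularized logistic regression. To evaluate robustness, 8% label noise is injected during training. Experimental results show that the framework achieves F1 = 0.9162, AUC = 0.9752, and an accuracy of 96.06% on the test set, significantly outperforming the best single model. The study effectively addresses the difficulty of classifying high-dimensional gene expression data and provides reliable technical support for the precise diagnosis of lung cancer.

Keywords: lung cancer data analysis; Stacking ensemble learning; machine learning; dynamic weights; precision diagnosis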
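The abstract names two mechanisms without giving formulas: label-noise injection during training and dynamic, validation-score-based weighting of the base models at inference. The sketch below illustrates one plausible reading using only the Python standard library. It is not the authors' implementation. The weighting rule (mean of validation F1 and AUC, normalized to sum to 1), the function names, and all numeric inputs are assumptions made for illustration.

```python
import random


def inject_label_noise(labels, rate=0.08, seed=0):
    """Flip a fixed fraction of binary labels (0/1), chosen at random."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(rate * len(noisy))
    for i in rng.sample(range(len(noisy)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy


def dynamic_weights(val_scores):
    """Map {model: (F1, AUC)} to weights that sum to 1.

    Assumption: each raw weight is the mean of validation F1 and AUC;
    the source does not state the exact rule.
    """
    raw = {name: (f1 + auc) / 2 for name, (f1, auc) in val_scores.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


def fuse_probabilities(probs, weights):
    """Weighted average of per-model positive-class probabilities."""
    n_samples = len(next(iter(probs.values())))
    return [
        sum(weights[name] * probs[name][i] for name in probs)
        for i in range(n_samples)
    ]


# Hypothetical validation scores (F1, AUC); not values from the paper.
val_scores = {
    "svm": (0.88, 0.95),
    "logreg": (0.86, 0.94),
    "rf": (0.90, 0.96),
    "xgb": (0.91, 0.97),
}
weights = dynamic_weights(val_scores)

# Hypothetical positive-class probabilities for three test samples.
probs = {
    "svm": [0.90, 0.20, 0.60],
    "logreg": [0.80, 0.30, 0.40],
    "rf": [0.95, 0.10, 0.70],
    "xgb": [0.97, 0.05, 0.65],
}
fused = fuse_probabilities(probs, weights)

# Flip 8% of a toy label vector, mirroring the noise rate in the abstract.
clean = [0, 1] * 50
noisy = inject_label_noise(clean, rate=0.08, seed=42)
```

Under this rule, a base model with higher validation F1 and AUC contributes more to the fused probability, and the weights can be recomputed whenever the validation split changes. The first mechanism, concatenating meta-features with the original features, is omitted to keep the sketch dependency-free.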
