Essential proteins are crucial for biological processes and can be identified through both experimental and computational methods. While experimental approaches are highly accurate, they often demand extensive time and resources. To address these challenges, we present a computational ensemble learning framework designed to identify essential proteins more efficiently. Our method begins by using node2vec to transform proteins in the protein–protein interaction (PPI) network into continuous, low-dimensional vectors. We also extract a range of features from protein sequences, including graph-theory-based, information-based, compositional, and physicochemical attributes. Additionally, we apply deep learning techniques to high-dimensional position-specific scoring matrices (PSSMs) to capture evolutionary information. We then combine these features for classification using various machine learning algorithms. To enhance performance, we integrate the outputs of these algorithms through ensemble methods such as voting, weighted averaging, and stacking. This approach effectively addresses data imbalance and improves both robustness and accuracy. Our ensemble learning framework achieves an AUC of 0.960 and an accuracy of 0.9252, outperforming other computational methods. These results demonstrate the effectiveness of our approach in accurately identifying essential proteins and highlight its superior feature extraction capabilities.
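The ensemble step named in this abstract (voting and weighted averaging over the outputs of several classifiers) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the example probabilities, the weights, and the 0.5 threshold are all assumptions made for illustration, and stacking is omitted because it requires a trained meta-learner.

```python
from collections import Counter


def majority_vote(labels):
    """Hard voting: return the label predicted by the most base classifiers."""
    return Counter(labels).most_common(1)[0][0]


def weighted_average(probs, weights):
    """Soft combination: weighted mean of per-model probabilities
    that a protein is essential."""
    if len(probs) != len(weights):
        raise ValueError("probs and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(probs, weights)) / total


def ensemble_label(probs, weights, threshold=0.5):
    """Return 1 (essential) if the weighted probability reaches the threshold.
    The threshold is a hypothetical choice, not taken from the source."""
    return 1 if weighted_average(probs, weights) >= threshold else 0


# Hypothetical outputs of three base classifiers for one protein.
model_probs = [0.9, 0.4, 0.6]      # P(essential) from each model
model_weights = [2.0, 1.0, 1.0]    # assumed weights, e.g. from validation scores
model_labels = [1 if p >= 0.5 else 0 for p in model_probs]

print("hard vote:", majority_vote(model_labels))
print("weighted probability:", weighted_average(model_probs, model_weights))
print("ensemble label:", ensemble_label(model_probs, model_weights))
```

In practice the weights would typically come from each model's validation performance; lowering the threshold is one common way to favor the minority class on imbalanced data.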
Essential proteins are indispensable for cell growth and survival. The study of essential proteins is important for understanding cellular functions and biological mechanisms. Therefore, various computational methods have been proposed to identify essential proteins. Unfortunately, most methods based on network topology consider only the interactions between a protein and its neighboring proteins, and not its interactions with proteins at higher-order distances. In this paper, we propose the DSEP algorithm, in which we integrate network topology properties and subcellular localization information in protein–protein interaction (PPI) networks based on four-order distances, and then use random walks to identify the essential proteins. We also propose a method to calculate the finite-order distance of the network, which can greatly reduce the time complexity of our algorithm. We conducted a comprehensive comparison of the DSEP algorithm with 11 existing classical algorithms for identifying essential proteins, using multiple evaluation methods. The results show that DSEP is superior to these 11 methods.
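The two ingredients mentioned in this abstract, bounded-order distances and random walks on a PPI network, can be sketched generically as follows. This is not the DSEP algorithm itself, whose scoring and use of subcellular localization are not given here. The toy graph, the breadth-first search cutoff, and the uniform-restart walk with a restart probability of 0.15 are all assumptions made for illustration.

```python
from collections import deque


def within_k_hops(adj, source, k):
    """Breadth-first search that stops expanding at depth k.
    Returns {node: hop distance} for all nodes at distance <= k from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond the k-th order
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def random_walk_scores(adj, restart=0.15, iters=100):
    """Power iteration for a random walk with uniform restart
    (PageRank-style) on an undirected graph given as an adjacency dict."""
    nodes = list(adj)
    n = len(nodes)
    score = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: restart / n for u in nodes}
        for u in nodes:
            # An isolated node spreads its mass evenly over all nodes.
            targets = adj[u] if adj[u] else nodes
            share = (1.0 - restart) * score[u] / len(targets)
            for v in targets:
                nxt[v] += share
        score = nxt
    return score


# Hypothetical toy PPI network (undirected, symmetric adjacency lists).
ppi = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}

print(within_k_hops(ppi, "A", 2))
scores = random_walk_scores(ppi)
print(sorted(scores, key=scores.get, reverse=True))
```

Capping the search at a fixed order keeps the per-protein cost proportional to the size of the local neighborhood rather than the whole network, which is the general reason a finite-order distance can reduce running time.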
Funding (ensemble learning framework study): This work was financially supported by the National Key R&D Program of China (Grant No. 2022YFF1202600), the National Natural Science Foundation of China (Grant No. 82301158), the Science and Technology Innovation Action Plan of Shanghai Science and Technology Committee (Grant No. 22015820100), Two-hundred Talent Support (Grant No. 20152224), the Translational Medicine Innovation Project of Shanghai Jiao Tong University School of Medicine (Grant No. TM201915), the Clinical Research Project of Multi-Disciplinary Team, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine (Grant No. 201914), and the China Postdoctoral Science Foundation (Grant No. 2023M742332).
Funding (DSEP study): Project supported by the Gansu Province Industrial Support Plan (Grant No. 2023CYZC-25), the Natural Science Foundation of Gansu Province (Grant No. 23JRRA770), and the National Natural Science Foundation of China (Grant No. 62162040).