Abstract
Predictors of pathogenic amino acid mutations that rely on clinical labels often report inflated performance, owing to factors such as cross-gene label bias and sparse, noisy labels. To address this, a label-free approach is proposed. A pre-trained protein language model is used to compute the amino acid probability distribution at each mutation site in the ClinVar database, and from this distribution the log odds ratio (LOR) of the mutant versus wild-type amino acid probabilities is constructed. A combined global-local Gaussian mixture model is fitted to the LOR to compute, without supervision, the probability of pathogenic effect (PPE) of each mutation, infer its pathogenicity, and provide a measure of predictive uncertainty. Correlation with deep mutational scanning (DMS) experiments is used as an evaluation metric to avoid problems such as label leakage. The evaluation shows that PPE predicts pathogenicity robustly: the mean area under the receiver operating characteristic curve (AUC) is about 0.89 across 2458 proteins, and the mean Spearman correlation coefficient with four DMS experiments is about 0.44. These results outperform most label-dependent computational methods and are comparable to the performance of high-throughput experiments. This study provides a reliable aid for the interpretation of genetic variants, disease research, diagnosis, and clinical treatment.
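The pipeline described in the abstract (LOR from model probabilities, then a Gaussian mixture fitted to the LOR, then a posterior read off as PPE) can be illustrated with a minimal sketch. Everything specific here is an assumption, since the abstract does not give the formulas: the LOR is taken as the difference of the logits of the mutant and wild-type probabilities, the mixture is a single two-component 1D fit (the paper's global-local combination is not reproduced), and the component with the lower mean is treated as the pathogenic one. The function names `log_odds_ratio`, `fit_two_component_gmm`, and `ppe` are hypothetical.

```python
import math
import random


def log_odds_ratio(p_mut, p_wt, eps=1e-9):
    """Assumed LOR: logit(p_mut) - logit(p_wt); the paper's exact form may differ."""
    p_mut = min(max(p_mut, eps), 1.0 - eps)
    p_wt = min(max(p_wt, eps), 1.0 - eps)
    return math.log(p_mut / (1.0 - p_mut)) - math.log(p_wt / (1.0 - p_wt))


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_two_component_gmm(xs, n_iter=200):
    """Plain EM for a two-component 1D Gaussian mixture (needs len(xs) >= 2)."""
    n = len(xs)
    ordered = sorted(xs)
    half = n // 2
    # Initialise the means from the lower and upper halves of the sorted data.
    mu = [sum(ordered[:half]) / half, sum(ordered[half:]) / (n - half)]
    mean_all = sum(xs) / n
    var_all = sum((x - mean_all) ** 2 for x in xs) / n
    sigma = [math.sqrt(max(var_all, 1e-6))] * 2
    weight = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            d = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = max(d[0] + d[1], 1e-300)
            resp.append((d[0] / total, d[1] / total))
        # M-step: re-estimate weights, means, and standard deviations.
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-12
            weight[k] = nk / n
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigma[k] = math.sqrt(max(var, 1e-6))
    return weight, mu, sigma


def ppe(x, weight, mu, sigma):
    """Posterior of the lower-mean component (assumed to be the pathogenic one)."""
    k_path = 0 if mu[0] < mu[1] else 1
    d = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
    return d[k_path] / max(d[0] + d[1], 1e-300)


if __name__ == "__main__":
    # Synthetic LOR values only; these are not data from the paper.
    random.seed(0)
    lors = [random.gauss(-8.0, 1.5) for _ in range(200)]
    lors += [random.gauss(-1.0, 1.0) for _ in range(300)]
    w, m, s = fit_two_component_gmm(lors)
    print(ppe(log_odds_ratio(0.001, 0.8), w, m, s))
```

In this sketch a strongly negative LOR (the model finds the mutant residue far less plausible than the wild type) maps to a PPE near 1. The posterior is a probability rather than a hard call, so a distance-from-0.5 or entropy measure could serve as an uncertainty estimate, though the abstract does not say which measure the authors use.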
Authors
LUO Jiangyi (罗江毅)
YAO Yin (姚音)
School of Life Sciences, Fudan University, Shanghai 200438, China
Source
Henan Science (《河南科学》)
2023, No. 8, pp. 1093-1101 (9 pages)
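About the authors
LUO Jiangyi (b. 1999), male, master's student; research interests: computational biology and its applications. Corresponding author: YAO Yin (b. 1963), female, professor, Ph.D.; research interests: statistical genetics, psychiatric genetics, and computational biology.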