Abstract
DNA N6-methyladenine (6mA) is an important DNA methylation modification that takes part in many biological regulatory processes. This study uses a publicly available mouse dataset. First, the mouse gene sequences (A, T, C, G) are encoded numerically using mathematical descriptors. Next, a chi-square test is applied to the encoded features to select those associated with 6mA sites. Finally, seven machine learning algorithms are used to build classification models, and the predictions are validated with five-fold cross-validation. With the sliding-window encoding and the top 20 features selected as the training features, the random forest model reaches a prediction accuracy of 1 for mouse 6mA sites.
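The first two steps of the abstract (sliding-window encoding, then chi-square feature selection) can be illustrated with a minimal sketch. The abstract does not give the encoding details, so everything here is an assumption made for illustration: the window is taken to count overlapping k-mers, the chi-square score uses the common "observed versus expected class total" form for count features, and the function names, the window width `k=2`, the toy sequences, and the labels are all hypothetical rather than taken from the paper.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def sliding_window_features(seq, k=2):
    """Count every k-mer seen by a width-k window moved one base at a time.

    Windows containing characters outside A/C/G/T are ignored.
    """
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts.get(kmer, 0) for kmer in kmers]

def chi2_scores(X, y):
    """Chi-square statistic of each non-negative count feature against the label."""
    n_features = len(X[0])
    classes = sorted(set(y))
    scores = []
    for j in range(n_features):
        total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            class_prob = y.count(c) / len(y)
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = class_prob * total
            if expected > 0:  # a feature that never occurs contributes 0
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

def top_k_features(scores, k):
    """Indices of the k highest-scoring features, best first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

# Toy sequences and labels, made up for illustration (not the paper's dataset).
sequences = ["GGATCAGATC", "TTGATCAAGA", "CCCTTTGGCC", "TGTGCCTTCG"]
labels = [1, 1, 0, 0]

X = [sliding_window_features(s, k=2) for s in sequences]
selected = top_k_features(chi2_scores(X, labels), k=5)
print(selected)
```

In a full pipeline the selected columns would then be passed to the classifiers and scored with five-fold cross-validation; the study keeps the top 20 features, whereas this toy example keeps 5 because a width-2 window yields only 16 features.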
Authors
冯欣
李英瑞
王苹
董哲原
辛瑞昊
FENG Xin; LI Yingrui; WANG Ping; DONG Zheyuan; XIN Ruihao (School of Mathematics and Science, Jilin Institute of Chemical Technology, Jilin City 132022, China; School of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin City 132022, China)
Source
《吉林化工学院学报》
CAS
2022, No. 11, pp. 14-19 (6 pages)
Journal of Jilin Institute of Chemical Technology
Funding
Jilin Province Higher Education Research Project (JGJX2021D226)
Jilin Province Higher Education Research Project (JGJX2021D213)
Keywords
gene sequence site
feature selection
machine learning
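About the authors
FENG Xin (b. 1989), female, from Jilin City, Jilin Province; associate professor at Jilin Institute of Chemical Technology, PhD; her research focuses on big data analysis and mining and on machine learning. Corresponding author: XIN Ruihao (b. 1989), male, from Meihekou, Jilin; lecturer at Jilin Institute of Chemical Technology, PhD; his research focuses on big data for intelligent manufacturing, cloud computing, and data mining. E-mail: xinruihao@jlict.edu.cn. LI Yingrui, graduate student (class of 2020) at Jilin Institute of Chemical Technology; WANG Ping, graduate student (class of 2021) at Jilin Institute of Chemical Technology; DONG Zheyuan, graduate student (class of 2020) at Jilin Institute of Chemical Technology.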