Protein secondary structure prediction and high-throughput drug screen data mining are two important applications in bioinformatics. The data are represented in sparse feature spaces and can be unrepresentative of future data. The data certainly contain some noise, and the noise may be significant. Supervised learners in this context will display their inherent bias toward certain solutions, generally solutions that fit the training set well. In this paper, we first describe an ensemble approach using subsampling that scales well with dataset size. A sufficient number of ensemble members using subsamples of the data can yield a more accurate classifier than a single classifier using the entire dataset. Experiments on several datasets demonstrate the effectiveness of the approach. We report results from the KDD Cup 2001 drug discovery dataset in which our approach yields a higher weighted accuracy than the winning entry. We then extend our ensemble approach to create an over-generalized classifier for prediction by reducing the individual subsample size. The ensemble strategy using small subsamples has the effect of averaging over a wider range of hypotheses. We show that both protein secondary structure prediction and drug discovery prediction can be improved by the use of over-generalization, specifically through the use of ensembles of small subsamples.
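The subsampling ensemble described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract names no base learner or combination rule, so the nearest-centroid classifier, the majority vote, and all function names here are assumptions chosen to keep the example self-contained.

```python
import random
from collections import Counter


def train_centroid(rows):
    """Base learner (an assumption): one mean feature vector per class."""
    sums, counts = {}, Counter()
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(model, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(model, key=lambda label: sq_dist(model[label]))


def train_ensemble(rows, n_members, subsample_size, seed=0):
    """Train each member on its own random subsample (drawn without replacement)."""
    rng = random.Random(seed)
    return [train_centroid(rng.sample(rows, subsample_size)) for _ in range(n_members)]


def predict_ensemble(ensemble, features):
    """Combine members by majority vote (an assumed combination rule)."""
    votes = Counter(predict_centroid(member, features) for member in ensemble)
    return votes.most_common(1)[0][0]
```

Shrinking `subsample_size` while raising `n_members` corresponds to the small-subsample setting the abstract calls over-generalization: each member sees little data, and the vote averages over many different hypotheses.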
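To meet the demands that different kinds of food place on the functional properties of soybean protein isolate (SPI), this study used infrared spectroscopy to rapidly collect data from 70 groups of SPI samples treated at different pH values and examined how changes in pH affect the structural content of SPI. The infrared spectra were preprocessed with mean centering, multiplicative scatter correction, standard normal variate transformation, and normalization; characteristic bands were extracted using two-dimensional correlation infrared spectroscopy; and prediction models for SPI structure and content under different pH conditions were then built with partial least squares (PLS) and an arithmetic optimization algorithm-random forest (AOA-RF). The results show that, after combined mean centering and multiplicative scatter correction, the relative standard deviations of the α-helix, β-sheet, β-turn, and random coil models were 1.29%, 1.60%, 1.37%, and 7.28%, respectively, making this combination the most effective preprocessing for the spectral data. The best model for predicting α-helix and β-sheet content was AOA-RF (characteristic bands), with calibration-set coefficients of determination of 0.9350 and 0.9266 and prediction-set coefficients of determination of 0.8568 and 0.8701. The best model for predicting β-turn and random coil content was PLS (characteristic bands), with calibration-set coefficients of determination of 0.9154 and 0.8817 and prediction-set coefficients of determination of 0.8913 and 0.7843. These results can provide theoretical support for rapid product quality testing and process control in industrial production.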
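The preprocessing combination the abstract reports as most effective, multiplicative scatter correction followed by mean centering, can be sketched as below. This is an illustrative sketch under assumptions: the abstract does not give the exact formulation, so the code uses the common textbook form of multiplicative scatter correction (regress each spectrum on the mean spectrum, then remove the fitted offset and slope), and the function names are hypothetical.

```python
def mean_spectrum(spectra):
    """Average the spectra wavelength by wavelength."""
    n = len(spectra)
    return [sum(column) / n for column in zip(*spectra)]


def msc(spectra):
    """Multiplicative scatter correction against the mean spectrum.

    Each spectrum x is fitted as x ~ intercept + slope * reference by
    least squares and then corrected to (x - intercept) / slope.
    """
    ref = mean_spectrum(spectra)
    ref_mean = sum(ref) / len(ref)
    ref_var = sum((r - ref_mean) ** 2 for r in ref)
    corrected = []
    for x in spectra:
        x_mean = sum(x) / len(x)
        slope = sum((r - ref_mean) * (v - x_mean) for r, v in zip(ref, x)) / ref_var
        intercept = x_mean - slope * ref_mean
        corrected.append([(v - intercept) / slope for v in x])
    return corrected


def mean_center(spectra):
    """Subtract the mean spectrum so every wavelength has zero mean."""
    ref = mean_spectrum(spectra)
    return [[v - r for v, r in zip(x, ref)] for x in spectra]
```

Applying `mean_center(msc(spectra))` gives the combined treatment; spectra that differ only by an additive offset and a multiplicative scale collapse onto the same corrected curve.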
Funding: This research was partially funded by Tripos Inc., the United States Department of Energy through the Sandia National Laboratories LDRD program and ASCI VIEWS Data Discovery Program contract number DE-AC04-76D000789 and the National Science Foundati