Funding: Project supported by the National Natural Science Foundation of China (Grant No. 60573065), the Natural Science Foundation of Shandong Province, China (Grant No. Y2007G33), and the Key Subject Research Foundation of Shandong Province, China (Grant No. XTD0708).
Abstract: In this paper we apply nonlinear time series analysis to small-time-scale traffic measurement data. A prediction-based method is used to determine the embedding dimension of the traffic data. Based on the reconstructed phase space, the local support vector machine prediction method is used to predict the traffic measurement data, and a BIC-based neighbouring point selection method is used to choose the number of nearest neighbouring points for the local support vector machine regression model. The experimental results show that the local support vector machine prediction method with an optimized number of neighbouring points can effectively predict small-time-scale traffic measurement data and can reproduce the statistical features of real traffic measurements.
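The phase-space reconstruction and neighbour-based local prediction described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `delay_embed` and `local_predict`, the delay `tau`, and the toy series are assumptions made for illustration, and a simple mean over the neighbours' successors stands in for the local support vector regression model. The BIC-based choice of the neighbour count `k` is not shown; `k` is passed in by hand.

```python
import math


def delay_embed(series, dim, tau=1):
    """Build delay vectors [x[t], x[t - tau], ..., x[t - (dim - 1) * tau]].

    Returns a list of (t, vector) pairs so each vector keeps its time index.
    """
    start = (dim - 1) * tau
    return [
        (t, [series[t - j * tau] for j in range(dim)])
        for t in range(start, len(series))
    ]


def local_predict(series, dim, k, tau=1):
    """Predict the next value from the k nearest neighbours of the last delay vector."""
    points = delay_embed(series, dim, tau)
    _, query = points[-1]
    # Only points with a known successor can be neighbours,
    # so the final point (the query itself) is excluded.
    candidates = points[:-1]
    ranked = sorted(candidates, key=lambda p: math.dist(p[1], query))
    neighbours = ranked[:k]
    # Assumption: a local mean replaces the local SVR fit used in the paper.
    return sum(series[t + 1] for t, _ in neighbours) / len(neighbours)


# Toy periodic series (hypothetical data, not traffic measurements).
toy_series = [0, 1, 2, 3] * 5
prediction = local_predict(toy_series, dim=2, k=3)
```

In a fuller version, the local mean would be replaced by a regression model fitted only on the selected neighbours, and `k` would be chosen by a model-selection criterion such as the BIC mentioned in the abstract.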
Funding: Project supported by the National Natural Science Foundation of China (Grant Nos. 10674172 and 10874229).
Abstract: In this paper a new continuous variable called core-ratio is defined to describe the probability for a residue to be in a binding site, replacing the previous binary description of interface residues using 0 and 1. This allows us to use the support vector machine regression method to fit the core-ratio value and predict protein binding sites. We also design a new group of physical and chemical descriptors to characterize the binding sites. The new descriptors are more effective when an averaging procedure is used. Our test shows that much better prediction results can be obtained by the support vector regression (SVR) method than by the support vector classification method.
Funding: Hebei Province Key Research and Development Project (No. 20313701D); Hebei Province Key Research and Development Project (No. 19210404D); Mobile computing and universal equipment for the Beijing Key Laboratory Open Project; The National Social Science Fund of China (17AJL014); Beijing University of Posts and Telecommunications Construction of World-Class Disciplines and Characteristic Development Guidance Special Fund "Cultural Inheritance and Innovation" Project (No. 505019221); National Natural Science Foundation of China (No. U1536112); National Natural Science Foundation of China (No. 81673697); National Natural Science Foundation of China (61872046); The National Social Science Fund Key Project of China (No. 17AJL014); "Blue Fire Project" (Huizhou) University of Technology Joint Innovation Project (CXZJHZ201729); Industry-University Cooperation Cooperative Education Project of the Ministry of Education (No. 201902218004); Industry-University Cooperation Cooperative Education Project of the Ministry of Education (No. 201902024006); Industry-University Cooperation Cooperative Education Project of the Ministry of Education (No. 201901197007); Industry-University Cooperation Collaborative Education Project of the Ministry of Education (No. 201901199005); The Ministry of Education Industry-University Cooperation Collaborative Education Project (No. 201901197001); Shijiazhuang science and technology plan project (236240267A); Hebei Province key research and development plan project (20312701D).
Abstract: The distribution of data has a significant impact on the results of classification. When the distribution of one class is insignificant compared to the distribution of another class, data imbalance occurs. This results in more outliers and noise, so the speed and performance of classification can be greatly affected. Given these problems, this paper starts with the motivation and mathematical representation of classification and puts forward a new classification method based on the relationship between different classification formulations. Combining the vector characteristics of the actual problem with the choice of matrix characteristics, we first analyze the orderly regression and introduce slack variables to solve the constraint problem of the lone point. We then introduce fuzzy factors to solve the problem of the gap between the isolated points on the basis of the support vector machine, and we introduce cost control to solve the problem of sample skew. Finally, based on the bi-boundary support vector machine, a two-step weight setting twin classifier is constructed. This can help to identify multitasks with feature-selected patterns without the need for additional optimizers, which addresses the problem that large-scale classification cannot deal effectively with a very low category distribution gap.
Funding: Supported by the Fundamental Research Funds for the National Major Science and Technology Projects of China (No. 2017ZX05009-005).
Abstract: The application of carbon dioxide (CO₂) in enhanced oil recovery (EOR) has increased significantly, and CO₂ solubility in oil is a key parameter in predicting CO₂ flooding performance. Hydrocarbons are the major constituents of oil, so this work focuses on the solubility of CO₂ in hydrocarbons. However, current experimental measurements are time-consuming, and equations of state can be computationally complex. To address these challenges, we developed an artificial intelligence-based model to predict the solubility of CO₂ in hydrocarbons under varying conditions of temperature, pressure, molecular weight, and density. Using experimental data from previous studies, we trained four machine learning models to predict the solubility: support vector regression (SVR), extreme gradient boosting (XGBoost), random forest (RF), and multilayer perceptron (MLP). Among the four models, the XGBoost model has the best predictive performance, with an R² of 0.9838. Additionally, sensitivity analysis and evaluation of the relative impact of each input parameter indicate that the prediction of CO₂ solubility in hydrocarbons is most sensitive to pressure. Furthermore, our trained model was compared with existing models, demonstrating its higher accuracy and applicability. The developed machine learning-based model provides a more efficient and accurate approach for predicting CO₂ solubility in hydrocarbons, which may contribute to the advancement of CO₂-related applications in the petroleum industry.
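The R² figure quoted in this abstract is the coefficient of determination. As a reminder of what that number measures, here is a minimal sketch of its standard definition, one minus the ratio of residual to total sum of squares. The function name `r_squared` and the toy values are assumptions for illustration; they are not the study's data or code.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted values, not taken from the paper.
measured = [1.0, 2.0, 3.0]
perfect_fit = r_squared(measured, [1.0, 2.0, 3.0])
mean_only_fit = r_squared(measured, [2.0, 2.0, 2.0])
```

A model that reproduces every measurement scores 1, while a model that always predicts the mean scores 0, so a value near 1 means most of the variance in the measurements is explained.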
Funding: Financially supported by the Fundamental Research Funds for the Central Non-profit Research Institution of CAF (CAFBB2017ZB004).
Abstract: Background: The accurate estimation of soil nutrient content is particularly important in view of its impact on plant growth and forest regeneration. In order to investigate soil nutrient content and quality for the natural regeneration of Dacrydium pectinatum communities in China, advanced and accurate estimation methods are needed. Methods: This study used machine learning techniques to create a series of comprehensive and novel models with which to evaluate soil nutrient content. Soil nutrient evaluation methods were built using six support vector machines and four artificial neural networks. Results: The generalized regression neural network model was the best artificial neural network evaluation model, with the smallest root mean square error (5.1), mean error (-0.85), and mean square prediction error (29). The accuracy rate of the combined k-nearest neighbors (k-NN) local support vector machine model (i.e. k-nearest neighbors-support vector machine (KNNSVM)) for soil nutrient evaluation was high compared to the other five partial support vector machine models investigated. The area under curve value of the generalized regression neural network (0.6572) was the highest, and the cross-validation result showed that the generalized regression neural network reached 92.5%. Conclusions: Both the KNNSVM and generalized regression neural network models can be effectively used to evaluate soil nutrient content and quality grades in conjunction with appropriate model variables. By developing a new feasible evaluation method to assess soil nutrient quality for Dacrydium pectinatum, this study provides results that can be used as a reference for the adaptive management of rare and endangered tree species. This study, however, found some uncertainties in data acquisition and model simulations, which will be investigated in upcoming studies.
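Abstract: Objective: To construct clinical risk prediction models for preoperative malnutrition in patients with colorectal cancer using four machine learning algorithms, and to explore their predictive value. Methods: Preoperative data were retrospectively collected from 412 patients with colorectal cancer who attended the Department of Gastrointestinal Surgery, Affiliated Tumor Hospital of Xinjiang Medical University, from January 2023 to May 2024. The patients were randomly divided into a training set (n=288) and a validation set (n=124) at a ratio of 7:3, and univariate analysis and binary logistic regression analysis were used to screen predictors of preoperative malnutrition. Risk prediction models for preoperative malnutrition in patients with colorectal cancer were constructed with four machine learning algorithms: logistic regression (LR), support vector machine (SVM), light gradient boosting machine (LightGBM), and multilayer perceptron (MLP). ROC curves were plotted to evaluate the predictive performance of the four models, and the DeLong test was used to compare the differences in AUC among them. The best-performing model was selected and validated with a calibration curve and a clinical decision curve (DCA curve). Results: (1) The incidence of preoperative malnutrition in patients with colorectal cancer was 33.7%, and age and Braden score were independent risk factors. (2) In the training set, the AUC of the LightGBM model for predicting preoperative malnutrition in patients with colorectal cancer was higher than that of the LR, SVM, and MLP models (0.941 vs. 0.874, 0.830, and 0.831). (3) The ROC curve showed that the AUC of the LightGBM model for predicting preoperative malnutrition in the validation set was 0.926 (95% CI: 0.882 to 0.969); the calibration curve showed good agreement between the malnutrition predicted by the LightGBM model and the malnutrition actually observed; and the DCA curve showed that the LightGBM model provided a significant clinical net benefit over the threshold probability range of 0.16 to 0.79. Conclusion: The clinical prediction model based on the LightGBM algorithm has high value in predicting preoperative malnutrition in patients with colorectal cancer and can provide a reference for clinical staff implementing nutritional management.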
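The 7:3 random split and the AUC metric reported in this abstract can be illustrated with a minimal sketch. It is an assumption-laden illustration, not the study's code: the function names `split_7_3` and `auc`, the seed, and the toy labels and scores are hypothetical, and the abstract does not say how the split sizes were rounded. Truncating 412 × 0.7 happens to reproduce the reported 288 and 124. The AUC is computed from its pairwise definition, the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half.

```python
import random


def split_7_3(n_samples, seed=0):
    """Shuffle sample indices and split them 7:3 into training and validation sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Assumption: the training size is truncated to an integer.
    n_train = int(n_samples * 0.7)
    return indices[:n_train], indices[n_train:]


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    # Each (positive, negative) pair counts 1 if ranked correctly, 0.5 if tied.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


train_idx, val_idx = split_7_3(412)

# Hypothetical labels and predicted risks, not patient data.
toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise form is quadratic in the number of samples, which is fine for a cohort of a few hundred patients; a rank-based computation would be preferable for much larger datasets.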