Funding: the Natural Science Foundation of Shaanxi Province (2005F51).
Abstract: Because most ensemble learning algorithms use a centralized model in which all training instances must be gathered on a single station, it is often impractical to centralize the training data. A distributed ensemble learning algorithm is proposed in which each instance carries two kinds of weight genes, denoting its global distribution and its local distribution. Instead of the repeated-sampling method of standard ensemble learning, non-balanced sampling from each station is used to train that station's set of base classifiers. The concept of the effective nearby region of a local integration classifier is introduced and applied to the dynamic integration of multiple classifiers in a distributed environment. Experiments show that the proposed algorithm effectively reduces the time needed to train the base classifiers while achieving classification performance on par with centralized learning.
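The station-local training and weighted integration described above can be illustrated with a minimal sketch. This is a hypothetical toy, not the paper's method: a nearest-centroid classifier on 1-D features stands in for the base learners, and plain per-station weights stand in for the weight genes and the effective-nearby-region mechanism.

```python
from collections import Counter

def train_local_classifier(data):
    """Train a base classifier on one station's local data.
    Nearest-centroid on 1-D features is a stand-in (hypothetical choice)."""
    groups = {}
    for x, y in data:
        groups.setdefault(y, []).append(x)
    # One centroid per class label
    return {y: sum(xs) / len(xs) for y, xs in groups.items()}

def predict(clf, x):
    # Predict the label whose centroid is nearest to x
    return min(clf, key=lambda y: abs(clf[y] - x))

def distributed_vote(classifiers, weights, x):
    """Combine the stations' base classifiers by weighted vote;
    the weights stand in for the per-station integration weights."""
    tally = Counter()
    for clf, w in zip(classifiers, weights):
        tally[predict(clf, x)] += w
    return tally.most_common(1)[0][0]

# Two stations, each training only on its own (non-balanced) local sample
station1 = [(0.1, 0), (0.2, 0), (9.9, 1)]
station2 = [(0.0, 0), (10.0, 1), (10.1, 1)]
clfs = [train_local_classifier(station1), train_local_classifier(station2)]
label = distributed_vote(clfs, [1.0, 1.0], 0.3)  # both stations vote for class 0
```

In the paper's setting the weights would be derived from the instance weight genes and the query point's effective nearby region rather than fixed per station.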
Abstract: To address the large training-data scale, long training time, and high carbon emissions involved in building classification models from massive data, a two-stage data selection method for low-energy, high-performance classifiers, TSDS (Two-Stage Data Selection), is proposed. First, cluster centers are determined via adjusted cosine similarity, and the sample data are partitioned by divisive hierarchical clustering based on dissimilar points. Second, the clustering results are adaptively sampled according to the data distribution to form a high-quality sub-sample set. Finally, the sub-sample set is used to train the classification model, accelerating training while improving model accuracy. Support vector machine (SVM) and multilayer perceptron (MLP) classifiers were built on six datasets, including Spambase, Bupa, and Phoneme, to validate TSDS. Experimental results show that, at a sample compression ratio of 85.00%, TSDS improves classification accuracy by 3 to 10 percentage points while accelerating training, reducing the energy consumed in training SVM classifiers by 93.76% on average and in training MLP classifiers by 75.41% on average. TSDS thus shortens training time and cuts energy consumption on big-data classification tasks while improving classifier performance, helping to advance the "dual carbon" goals.
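The cluster-then-sample idea behind TSDS can be sketched as follows. This is an assumption-laden toy, not the paper's algorithm: plain 1-D k-means stands in for the adjusted-cosine divisive hierarchical clustering, and size-proportional sampling stands in for the distribution-adaptive sampling.

```python
import random

def two_stage_select(samples, n_clusters, frac, seed=0):
    """Stage 1: cluster the data; Stage 2: sample a fraction from each
    cluster so the subset follows the data distribution (sketch only)."""
    rng = random.Random(seed)

    # Stage 1: crude 1-D k-means (stand-in for divisive hierarchical clustering)
    centers = rng.sample(samples, n_clusters)
    clusters = [[] for _ in centers]
    for _ in range(10):
        clusters = [[] for _ in centers]
        for s in samples:
            nearest = min(range(n_clusters), key=lambda i: abs(s - centers[i]))
            clusters[nearest].append(s)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]

    # Stage 2: sample each cluster in proportion to its size
    subset = []
    for c in clusters:
        k = max(1, round(len(c) * frac)) if c else 0
        subset.extend(rng.sample(c, min(k, len(c))))
    return subset

data = [i / 10 for i in range(10)] + [10 + i / 10 for i in range(10)]
small = two_stage_select(data, n_clusters=2, frac=0.3)  # compressed subset
```

The training step would then fit the SVM or MLP on `small` instead of `data`; the compression ratio reported in the abstract corresponds to `1 - len(small) / len(data)`.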