传统的主曲线算法在小规模数据集上能获得良好的效果,但单节点的计算和存储能力都不能满足海量数据主曲线的提取要求,而算法分布式并行化是目前解决该类问题最有效的途径之一。本文提出基于MapReduce框架的分布式软K段主曲线算法(Distri...传统的主曲线算法在小规模数据集上能获得良好的效果,但单节点的计算和存储能力都不能满足海量数据主曲线的提取要求,而算法分布式并行化是目前解决该类问题最有效的途径之一。本文提出基于MapReduce框架的分布式软K段主曲线算法(Distributed soft k-segments principal curve,DisSKPC)。首先,基于分布式K-Means算法,采用递归粒化方法对数据集进行粒化,以确定粒的大小并保证粒中数据的关联性。然后调用软K段主曲线算法计算每个粒数据的局部主成分线段,并提出用噪声方差来消除在高密集、高曲率的数据区域可能产生的过拟合线段。最后借助哈密顿路径和贪婪算法连接这些局部主成分线段,形成一条通过数据云中间的最佳曲线。实验结果表明,本文所提出的DisSKPC算法具有良好的可行性和扩展性。展开更多
The sharp increase of the amount of Internet Chinese text data has significantly prolonged the processing time of classification on these data.In order to solve this problem,this paper proposes and implements a parall...The sharp increase of the amount of Internet Chinese text data has significantly prolonged the processing time of classification on these data.In order to solve this problem,this paper proposes and implements a parallel naive Bayes algorithm(PNBA)for Chinese text classification based on Spark,a parallel memory computing platform for big data.This algorithm has implemented parallel operation throughout the entire training and prediction process of naive Bayes classifier mainly by adopting the programming model of resilient distributed datasets(RDD).For comparison,a PNBA based on Hadoop is also implemented.The test results show that in the same computing environment and for the same text sets,the Spark PNBA is obviously superior to the Hadoop PNBA in terms of key indicators such as speedup ratio and scalability.Therefore,Spark-based parallel algorithms can better meet the requirement of large-scale Chinese text data mining.展开更多
文摘传统的主曲线算法在小规模数据集上能获得良好的效果,但单节点的计算和存储能力都不能满足海量数据主曲线的提取要求,而算法分布式并行化是目前解决该类问题最有效的途径之一。本文提出基于MapReduce框架的分布式软K段主曲线算法(Distributed soft k-segments principal curve,DisSKPC)。首先,基于分布式K-Means算法,采用递归粒化方法对数据集进行粒化,以确定粒的大小并保证粒中数据的关联性。然后调用软K段主曲线算法计算每个粒数据的局部主成分线段,并提出用噪声方差来消除在高密集、高曲率的数据区域可能产生的过拟合线段。最后借助哈密顿路径和贪婪算法连接这些局部主成分线段,形成一条通过数据云中间的最佳曲线。实验结果表明,本文所提出的DisSKPC算法具有良好的可行性和扩展性。
基金Project(KC18071)supported by the Application Foundation Research Program of Xuzhou,ChinaProjects(2017YFC0804401,2017YFC0804409)supported by the National Key R&D Program of China
文摘The sharp increase of the amount of Internet Chinese text data has significantly prolonged the processing time of classification on these data.In order to solve this problem,this paper proposes and implements a parallel naive Bayes algorithm(PNBA)for Chinese text classification based on Spark,a parallel memory computing platform for big data.This algorithm has implemented parallel operation throughout the entire training and prediction process of naive Bayes classifier mainly by adopting the programming model of resilient distributed datasets(RDD).For comparison,a PNBA based on Hadoop is also implemented.The test results show that in the same computing environment and for the same text sets,the Spark PNBA is obviously superior to the Hadoop PNBA in terms of key indicators such as speedup ratio and scalability.Therefore,Spark-based parallel algorithms can better meet the requirement of large-scale Chinese text data mining.