Abstract
Applying traditional data mining methods to the open high-resolution air pollution reanalysis dataset for China, 2013-2018, runs into large data volumes and low mining efficiency, so a Spark-based K-means clustering method is used instead to study the massive body of air pollutant records. Taking six common air pollutants and five environmental impact factors as examples, a data-dimension model is built over PM_(2.5), PM_(10), SO_(2), NO_(2), CO, O_(3), Temp, and others. For choosing the initial number of clusters K in K-means, the Gap Statistic is compared with the sum of squared errors (SSE) method conventionally used with K-means. On the high-dimensional sample data model, the K value determined by the Gap Statistic is more reasonable and more intuitive than the one determined by SSE.
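The comparison of SSE and the Gap Statistic for choosing K can be illustrated on toy data. The sketch below is not the authors' Spark pipeline: it is a small single-machine example in plain Python, assuming a uniform reference distribution over the bounding box of the data and a farthest-point initialisation for K-means. All function names (`kmeans_sse`, `gap_statistic`, `choose_k`, `make_blobs`) are hypothetical and introduced only for illustration.

```python
import math
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_sse(points, k, rng, iters=50):
    """Lloyd's K-means with farthest-point init (an assumption); returns the SSE."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Next seed: the point farthest from all seeds chosen so far.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[j].append(p)
        # Recompute each center as its cluster mean; keep the old center if empty.
        new_centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def best_sse(points, k, rng, restarts=2):
    return min(kmeans_sse(points, k, rng) for _ in range(restarts))

def gap_statistic(points, k_max, n_refs=8, seed=0):
    """Return a list of (k, sse, gap, s_k) for k = 1..k_max."""
    rng = random.Random(seed)
    lo = [min(d) for d in zip(*points)]
    hi = [max(d) for d in zip(*points)]
    results = []
    for k in range(1, k_max + 1):
        sse = best_sse(points, k, rng)
        ref_logs = []
        for _ in range(n_refs):
            # Reference sample: uniform over the bounding box of the data.
            ref = [tuple(rng.uniform(a, b) for a, b in zip(lo, hi)) for _ in points]
            ref_logs.append(math.log(best_sse(ref, k, rng)))
        mean_ref = sum(ref_logs) / n_refs
        sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / n_refs)
        s_k = sd * math.sqrt(1 + 1 / n_refs)
        results.append((k, sse, mean_ref - math.log(sse), s_k))
    return results

def choose_k(results):
    """Smallest k with gap(k) >= gap(k+1) - s(k+1); falls back to the last k."""
    for (k, _, gap, _), (_, _, gap_next, s_next) in zip(results, results[1:]):
        if gap >= gap_next - s_next:
            return k
    return results[-1][0]

def make_blobs(centers, n_per, sd, seed=42):
    """Toy data: isotropic Gaussian blobs around the given centers."""
    rng = random.Random(seed)
    return [tuple(rng.gauss(c, sd) for c in ctr) for ctr in centers for _ in range(n_per)]
```

Calling `gap_statistic(make_blobs([(0, 0), (10, 0), (5, 9)], 40, 0.5), k_max=5)` and passing the result to `choose_k` gives a single suggested K from an explicit rule, whereas the SSE column only decreases as K grows and has to be read by eye for an elbow. This is one way to read the abstract's description of the Gap Statistic as "more intuitive".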
Authors
黄乐成
陈超
韩存鑫
赵彬
HUANG Lecheng; CHEN Chao; HAN Cunxin; ZHAO Bin (School of Computer Science and Engineering, Sichuan University of Light Chemical Technology, Zigong 643000, Sichuan, China)
Source
《实验室研究与探索》
CAS
PKU Core (Peking University Core Journals)
2022, No. 9, pp. 135-139 (5 pages)
Research and Exploration in Laboratory
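About the author
HUANG Lecheng (b. 1999), male, from Hengyang, Hunan; master's student; research interests: data mining and data visualization. Tel.: 17780426997, E-mail: 2534490581@qq.com.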