A Parallel Spatial Data Analysis Method Based on Apache Beam (original title: 基于Apache Beam的并行化空间数据分析方法)

Abstract: In recent years, distributed data processing frameworks have become a common solution for efficiently processing massive spatial data. However, because frameworks differ in their data organization structures and computing models, algorithms are hard to reuse and costly to migrate between them. Some platform-independent distributed computing frameworks, such as Apache Beam, are available, but they do not support spatial data or spatial operations, so users must write and run their own spatial data processing algorithms on Apache Beam, which is tedious. This paper proposes a parallel spatial analysis method for spatiotemporal big data based on Apache Beam. It extends the Beam model with spatial abstractions, expressing every operation on spatial data as a spatial parallel transform applied to a spatial parallel collection, which shields users from the details of the underlying distributed operations. The extended framework runs on distributed computing engines such as Spark and Flink and supports efficient parallel processing of large-scale spatial data without requiring knowledge of the underlying distributed computing details. It thus provides an effective distributed computing solution for the rapid processing of large-scale spatial data.

Authors: WANG Hancheng (王翰诚); JIANG Liangcun (姜良存); LI Hao (李皓); LIANG Zheheng (梁哲恒); YUE Peng (乐鹏) (School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China; South Digital Technology Co., Ltd., Guangzhou 510665, China)

Institutions: School of Remote Sensing and Information Engineering, Wuhan University; South Digital Technology Co., Ltd.

Source: Journal of Geomatics (《测绘地理信息》), CSCD, 2022, Issue S01, pp. 85-88 (4 pages)

Funding: National Natural Science Foundation of China (42071354, 41901315)

Keywords: spatial data; distributed computing; Beam; Spark; Flink

Classification: P208 [Astronomy and Earth Sciences: Cartography and Geographic Information Engineering]

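About the authors: First author: WANG Hancheng, master's student; main research interests: spatiotemporal big data processing and applications. E-mail: 936075790@qq.com. Corresponding author: JIANG Liangcun; main research interest: spatiotemporal big data processing. E-mail: jiangliangcun@whu.edu.cn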