Abstract
As one of the most basic classification methods in pattern recognition, clustering plays an important role in data analysis across scientific fields. With the emergence of big data, however, clustering analysis faces new challenges, notably computational complexity and computational cost. We study the O(nk) time complexity of the k-means clustering algorithm and target the large number of nearest-neighbor computations performed during its iterations, together with the special characteristics of that setting. Using the KD-tree as an index, we propose an approximate nearest-neighbor algorithm based on a single KD-tree and a cross-search algorithm based on multiple KD-trees. These algorithms reduce the time complexity of k-means clustering to O(n log k), and experiments verify that the multi-tree cross-search algorithm achieves clustering quality comparable to that of the k-means clustering algorithm.
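To make the idea in the abstract concrete, the following is a minimal sketch of a k-means assignment step that queries a KD-tree built over the k centroids instead of scanning all of them. It is an illustration under stated assumptions, not the authors' algorithm: the names `sq_dist`, `build_kdtree`, `nearest`, and `kmeans_kdtree` are hypothetical, the initial centroids are supplied by the caller, and the `exact=False` mode is a simple "never backtrack" approximation that stands in for the paper's single-tree and multi-tree search, whose details the abstract does not give. In low dimensions each query visits roughly O(log k) nodes on average, which is where an O(n log k) cost per iteration would come from.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_kdtree(items, depth=0):
    """Build a KD-tree from (index, point) pairs; a node is (item, axis, left, right)."""
    if not items:
        return None
    axis = depth % len(items[0][1])          # cycle through the coordinates
    items = sorted(items, key=lambda item: item[1][axis])
    mid = len(items) // 2                    # median item becomes the node
    return (items[mid], axis,
            build_kdtree(items[:mid], depth + 1),
            build_kdtree(items[mid + 1:], depth + 1))


def nearest(node, q, best=None, exact=True):
    """Return (squared distance, index) of the closest stored point found for q."""
    if node is None:
        return best
    (idx, p), axis, left, right = node
    d = sq_dist(p, q)
    if best is None or d < best[0]:
        best = (d, idx)
    diff = q[axis] - p[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, q, best, exact)
    # Exact search visits the far side only when the splitting plane is closer
    # than the best match so far. With exact=False (an assumption of this
    # sketch) the far side is always skipped, so the answer may be approximate.
    if exact and diff * diff < best[0]:
        best = nearest(far, q, best, exact)
    return best


def kmeans_kdtree(data, init_centroids, iters=10, exact=True):
    """Lloyd-style k-means whose assignment step queries a KD-tree of centroids."""
    centroids = [tuple(c) for c in init_centroids]
    k, dim = len(centroids), len(data[0])
    labels = []
    for _ in range(iters):
        # Rebuild the index over the k current centroids, then assign each point.
        tree = build_kdtree(list(enumerate(centroids)))
        labels = [nearest(tree, x, exact=exact)[1] for x in data]
        # Update step: each centroid moves to the mean of its assigned points.
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for x, c in zip(data, labels):
            counts[c] += 1
            for j, v in enumerate(x):
                sums[c][j] += v
        # An empty cluster keeps its previous centroid.
        centroids = [tuple(s / counts[c] for s in sums[c]) if counts[c] else centroids[c]
                     for c in range(k)]
    # labels come from the last assignment step, made before the final update.
    return centroids, labels


data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centroids, labels = kmeans_kdtree(data, [data[0], data[3]], iters=5)
# labels == [0, 0, 0, 1, 1, 1]
```

The tree is rebuilt in every iteration because the centroids move. With k much smaller than n, that rebuild is cheap compared with the n nearest-neighbor queries it serves.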
Authors
薛丁文
李建中
XUE Dingwen; LI Jianzhong (Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China)
Source
《智能计算机与应用》
2021, No. 11, pp. 194-197 (4 pages)
Intelligent Computer and Applications
Keywords
clustering analysis
k-means clustering
KD-tree
approximate nearest neighbor
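About the authors
XUE Dingwen (b. 1995), male, Ph.D. candidate; main research interest: clustering analysis of massive data. LI Jianzhong (b. 1950), male, professor and doctoral supervisor; main research interests: massive data computing and wireless sensor networks.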