Many classical clustering algorithms perform well when their prerequisites are met but do not scale to very large data sets (VLDS). This work proposes a novel division and partition (DP) clustering method to address the problem. DP cuts the source data set into data blocks and extracts the eigenvector of each block to form a local feature set. The local feature set is then used in a second round of feature aggregation over the source data to find the global eigenvector. Finally, the data set is assigned according to the global eigenvector by the minimum-distance criterion. The experimental results show that DP is more robust than conventional clustering methods. Its insensitivity to data dimensionality, data distribution, and the number of natural clusters gives it a wide range of applications in clustering VLDS.
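The abstract gives no implementation details, so the following is only a minimal sketch of the divide-then-aggregate idea it describes. It assumes that the "eigenvector" of a block can be stood in for by k-means centroids; the names `dp_cluster`, `block_size`, `local_k`, and `global_k` are hypothetical and do not come from the paper.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns k centroids as lists of floats."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(p)
        for j, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def dp_cluster(data, block_size, local_k, global_k, seed=0):
    """Assumed stand-in for DP: block-wise features, then global features."""
    # 1. Division: cut the source data set into consecutive blocks.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # 2. Local features: a few representatives per block (assumption: centroids).
    local_features = []
    for b, block in enumerate(blocks):
        local_features.extend(kmeans(block, min(local_k, len(block)), seed=seed + b))
    # 3. Second round: aggregate the local feature set into global features.
    global_features = kmeans(local_features, global_k, seed=seed)
    # 4. Assignment: each point goes to its nearest global feature.
    labels = [
        min(range(global_k), key=lambda c: sq_dist(p, global_features[c]))
        for p in data
    ]
    return global_features, labels
```

Only one block has to be clustered at a time, and the second round runs on the much smaller local feature set, which is what lets this kind of scheme handle data that is too large for a single clustering pass.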
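A method is proposed for quickly building a rough data model (RDM) using supervised fuzzy clustering in the input-output product space. The method combines the classification-quality performance index of the RDM with the Gustafson-Kessel (G-K) clustering algorithm, which has favorable properties. By introducing the concept of the inferred membership degree of data in a fuzzy cluster, it gives a way to convert a fuzzy clustering model into a rough data model. The result is an effective algorithm that obtains the RDM by iterating the two necessary-condition equations for minimizing the objective function. It turns the multidimensional search of Kowalczyk's method into a one-dimensional search parameterized by the number of clusters, which greatly reduces optimization time. Compared with traditional rough set theory and Kowalczyk's method, the proposed method summarizes data better and handles noisy data better. Finally, experiments on different data sets demonstrate the effectiveness of the method.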
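For orientation, the standard (unsupervised) Gustafson-Kessel formulation is sketched below. This is the textbook form, not the paper's supervised objective, which the abstract does not state. The notation is an assumption introduced here: $c$ clusters, $N$ data points $x_k \in \mathbb{R}^n$, prototypes $v_i$, memberships $\mu_{ik}$, fuzzifier $m > 1$, and cluster volumes $\rho_i$ (commonly set to 1). In a product-space setting, $x_k$ would be the concatenated input-output vector.

```latex
\begin{align}
J &= \sum_{i=1}^{c}\sum_{k=1}^{N} \mu_{ik}^{m}\, d_{ik}^{2},
  \qquad d_{ik}^{2} = (x_k - v_i)^{\top} A_i\,(x_k - v_i), \\
A_i &= \bigl[\rho_i \det(F_i)\bigr]^{1/n} F_i^{-1},
  \qquad F_i = \frac{\sum_{k=1}^{N} \mu_{ik}^{m}\,(x_k - v_i)(x_k - v_i)^{\top}}
                    {\sum_{k=1}^{N} \mu_{ik}^{m}}, \\
v_i &= \frac{\sum_{k=1}^{N} \mu_{ik}^{m}\, x_k}{\sum_{k=1}^{N} \mu_{ik}^{m}},
  \qquad \mu_{ik} = \Biggl[\sum_{j=1}^{c}
     \Bigl(\frac{d_{ik}}{d_{jk}}\Bigr)^{2/(m-1)}\Biggr]^{-1}.
\end{align}
```

The last line gives the two necessary conditions for a minimum of $J$ in the standard algorithm; alternating between the prototype update and the membership update is the usual iteration scheme.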
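The fast kernel density estimate (FKDE) theorem is proved first: the error between the Gaussian kernel density estimate (KDE) built on a sampled subset and the KDE of the original data set depends on the sample size and the kernel parameter, but not on the total number of samples. The paper then shows that the objective of the graph-based relaxed clustering (GRC) algorithm with a Gaussian kernel can be decomposed into the form "weighted sum of Parzen windows + quadratic entropy", so that GRC can be viewed as a kernel density estimation problem. Based on this KDE approximation strategy, the paper proposes a method for large-scale graph-based relaxed clustering, Scaling up GRC by KDE approximation (SUGRC-KDEA). Compared with previous work, the advantage of this method is that it offers a simpler and easier-to-implement way to apply GRC to large data sets.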
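The claim that a KDE on a sampled subset can approximate the KDE on the full data set can be checked empirically. The sketch below is only an illustration under assumptions made here (one-dimensional data, uniform sampling without replacement, maximum gap over a fixed grid); it is not the paper's theorem or its proof, and the names `gaussian_kde` and `subsample_kde_gap` are hypothetical.

```python
import math
import random


def gaussian_kde(samples, x, h):
    """Gaussian kernel density estimate at scalar x with bandwidth h."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)


def subsample_kde_gap(data, m, h, grid, seed=0):
    """Largest gap on `grid` between the full KDE and the KDE of m sampled points."""
    rng = random.Random(seed)
    subset = rng.sample(data, m)  # uniform sampling without replacement
    return max(
        abs(gaussian_kde(data, x, h) - gaussian_kde(subset, x, h))
        for x in grid
    )
```

Calling `subsample_kde_gap` with a fixed `m` and `h` on data sets of increasing size is one way to observe how the gap behaves as the total sample size grows.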
Funding: Supported by the National Natural Science Foundation of China (60675039), the National High Technology Research and Development Program of China (863 Program) (2006AA04Z217), and the Hundred Talents Program of the Chinese Academy of Sciences
Funding: Projects (60903082, 60975042) supported by the National Natural Science Foundation of China; Project (20070217043) supported by the Research Fund for the Doctoral Program of Higher Education of China