Rapid developments in telecommunications, sensor data, financial applications, and data stream analysis have increased the rate at which data arrive, making data mining a vital process. Data analysis comprises several tasks, of which data stream classification faces more challenges than the other commonly used techniques. Because classification is a continuous process, it requires a design that can adapt the classification model to concept change or to changes in the boundaries between classes. We therefore design a novel fuzzy classifier, THRFuzzy, to classify newly arriving data streams. Rough set theory, together with a tangential holoentropy function, is used to design the dynamic classification model. The approach uses kernel fuzzy c-means (FCM) clustering to generate the rules and the tangential holoentropy function to update the membership function. THRFuzzy is evaluated on three datasets (skin segmentation, localization, and breast cancer) in terms of accuracy and time, and is compared with the HRFuzzy and adaptive k-NN classifiers. The experimental results indicate that THRFuzzy achieves higher accuracy in less time than the existing classifiers.
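To make the membership idea in the abstract above concrete, the following is a minimal sketch of the membership computation in standard fuzzy c-means. It is an illustration under stated assumptions, not the THRFuzzy method: it uses plain Euclidean distance rather than a kernel, it does not include the tangential holoentropy update, and the function name `fcm_memberships` and the fuzzifier default `m=2.0` are chosen here for illustration only.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership of one point in each cluster.

    Assumption: Euclidean distance and fuzzifier m > 1. This is the
    textbook FCM formula, not the kernel/holoentropy variant of the paper.
    """
    dists = [math.dist(point, c) for c in centers]

    # A point lying exactly on one or more centers belongs fully to them.
    zero = [i for i, d in enumerate(dists) if d == 0.0]
    if zero:
        return [1.0 / len(zero) if i in zero else 0.0
                for i in range(len(centers))]

    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1))
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
            for d_i in dists]


# A point twice as close to the first center gets the larger membership.
memberships = fcm_memberships((1.0, 0.0), [(0.0, 0.0), (3.0, 0.0)])
```

The memberships always sum to 1, so each can be read as the degree to which the point belongs to a cluster; a streaming classifier would recompute or adjust them as new instances arrive.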
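With the sharp growth in data volume, machine learning methods have gradually shifted from the traditional static learning paradigm to online learning for streaming data. A capricious data stream is one in which data instances arrive one at a time while the feature space may change arbitrarily: old features may vanish at any time, and new features may appear at any time. In environmental monitoring, for example, adding a sensor or the sudden failure of an existing sensor changes the feature space of the stream arbitrarily. In addition, most existing online learning methods for data streams assume that the true labels of all data instances are available. In real applications, however, labels are mostly sparse because manual annotation is expensive. To address online learning on capricious data streams with sparse labels, this work proposes PAACDS (Passive Aggressive Active Learning for Capricious Data Streams), an online learning algorithm based on passive-aggressive learning, together with its variant PAACDS-I. First, an online active learning method selects valuable data instances so that a strong predictive model can be built with minimal supervision. Then, once the queried labels of the selected instances are obtained, the online passive-aggressive update rule is combined with the margin-maximization principle to update a dynamic classifier built on the shared and newly added feature spaces of the capricious data stream. Finally, the proposed algorithms are compared with state-of-the-art methods on 12 datasets; extensive experimental comparison and analysis confirm their effectiveness on capricious data streams with sparse labels.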
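The two steps described above (query a label only when it seems valuable, then apply a passive-aggressive update) can be sketched as follows. This is a simplified illustration, not the PAACDS algorithm itself: it uses the standard PA-I update, a deterministic margin threshold as the query rule, and a dictionary of weights so that features may appear or disappear between instances. The names `margin`, `should_query`, `pa_update`, and the parameters `threshold` and `C` are assumptions made for this sketch.

```python
def margin(w, x):
    """Signed score w . x; features missing from w contribute zero."""
    return sum(w.get(f, 0.0) * v for f, v in x.items())


def should_query(w, x, threshold=0.5):
    """Ask for a label only when the model is uncertain (small |margin|).

    Assumption: a fixed threshold; randomized query rules are also common.
    """
    return abs(margin(w, x)) < threshold


def pa_update(w, x, y, C=1.0):
    """Standard PA-I update for a label y in {-1, +1}; returns the hinge loss."""
    loss = max(0.0, 1.0 - y * margin(w, x))
    norm_sq = sum(v * v for v in x.values())
    if loss > 0.0 and norm_sq > 0.0:
        tau = min(C, loss / norm_sq)          # step size capped by C
        for f, v in x.items():
            w[f] = w.get(f, 0.0) + tau * y * v  # new features enter w here
    return loss


# Stream of (features, label) pairs whose feature space changes over time.
stream = [({"a": 1.0}, 1), ({"a": 1.0}, 1), ({"b": 2.0}, -1)]
w, queried = {}, 0
for x, y in stream:
    if should_query(w, x):   # the label is only consulted when queried
        queried += 1
        pa_update(w, x, y)
```

In this toy stream the second instance is not queried because the model is already confident about feature `a`, while the third instance introduces an unseen feature `b` and therefore triggers a query.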
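Traditional data stream clustering methods lack online dimensionality reduction for high-dimensional data, which limits their clustering performance. To address this, a data stream clustering method based on scalable subspace learning is proposed (Scalable Subspace Learning for Clustering Data Streams, S2LCStream). First, scalable subspace learning establishes a projection relationship between historical data and newly arrived data, and the new data are projected into the subspace spanned by the historical data to obtain their cluster assignments in real time. Second, to keep the cluster assignments accurate over time, the continuously arriving stream is checked for consistency of its data distribution in order to capture concept drift, and a backtracking mechanism adjusts the cluster assignments to adapt to the changing distribution. Finally, tests on multiple real-world datasets verify the effectiveness of the method on high-dimensional data streams. The method handles concept drift in data streams efficiently while maintaining high clustering performance.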
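The projection step described above can be illustrated with a small sketch: build an orthonormal basis for the span of some historical vectors, express a new point in that basis, and assign it to the nearest centroid in the reduced coordinates. This is only a generic linear-algebra illustration under stated assumptions; it is not the S2LCStream subspace learning procedure, it omits drift detection and backtracking, and the helper names `orthonormal_basis`, `project`, and `assign` are hypothetical.

```python
import math


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        r = list(v)
        for b in basis:
            coef = sum(ri * bi for ri, bi in zip(r, b))
            r = [ri - coef * bi for ri, bi in zip(r, b)]
        norm = math.sqrt(sum(ri * ri for ri in r))
        if norm > tol:  # skip vectors already inside the current span
            basis.append([ri / norm for ri in r])
    return basis


def project(x, basis):
    """Coordinates of x in the subspace spanned by the orthonormal basis."""
    return [sum(xi * bi for xi, bi in zip(x, b)) for b in basis]


def assign(x, basis, centroids):
    """Index of the nearest centroid, measured in subspace coordinates."""
    z = project(x, basis)
    return min(range(len(centroids)), key=lambda k: math.dist(z, centroids[k]))


# Historical data spans a 2-D subspace of a 3-D feature space.
history = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormal_basis(history)
centroids = [[0.0, 0.0], [3.0, 3.0]]       # centroids in subspace coordinates
label = assign([3.0, 4.0, 5.0], basis, centroids)
```

Working in the reduced coordinates keeps the per-instance cost proportional to the subspace dimension rather than to the full feature dimension, which is the practical motivation for projecting before clustering.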
Funding: Supported by proposal No. OSD/BCUD/392/197, Board of Colleges and University Development, Savitribai Phule Pune University, Pune.