Efficient Algorithm for Solving Strict Pattern Matching Under Nonoverlapping Condition

Cited by: 6
Abstract: Nonoverlapping sequence pattern mining is a form of gap-constrained sequence pattern mining. Compared with similar mining methods, it more readily discovers valuable frequent patterns. Its core problem is to calculate the support (the number of occurrences) of a given pattern in a sequence and thereby determine whether the pattern is frequent. Calculating the support is, in essence, pattern matching under the nonoverlapping condition. Existing approaches iteratively search for a nonoverlapping occurrence and then prune useless nodes to calculate the support, with time complexity O(m×m×n×W), where m, n, and W are the pattern length, the sequence length, and the maximum gap, respectively. To speed up pattern matching under the nonoverlapping condition, and thus reduce the time of nonoverlapping sequence pattern mining, this study proposes an efficient algorithm. The algorithm converts the pattern matching problem into a NetTree, starts from the minimum root node of the NetTree, and uses a backtracking strategy that iteratively searches for the leftmost child to find the minimum nonoverlapping occurrence. After this occurrence is pruned from the NetTree, the problem can be solved without any further search for, or pruning of, invalid nodes. The study proves the completeness of the algorithm and reduces the time complexity to O(m×n×W). It further shows that three similar solving strategies exist: iteratively finding the leftmost parent starting from the leftmost leaf, the rightmost child starting from the rightmost root, and the rightmost parent starting from the rightmost leaf. Experimental results verify the performance of the algorithm; in particular, mining algorithms that adopt this method in sequence pattern mining can reduce mining time.
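The leftmost-child search with backtracking described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `nonoverlapping_occurrences`, the parameters `min_gap` and `max_gap`, and the `unavailable` table are hypothetical names chosen for this illustration. The sketch assumes that every gap in the pattern shares the same bounds, that two occurrences are nonoverlapping when they never use the same sequence position for the same pattern index, and that (pattern index, position) pairs can stand in for NetTree nodes without building the tree explicitly. It also marks dead-end nodes as unavailable so that each node is expanded at most once, which is an assumption of this sketch rather than a detail taken from the abstract.

```python
def nonoverlapping_occurrences(seq, pattern, min_gap, max_gap):
    """Greedy leftmost search for nonoverlapping occurrences (illustrative sketch).

    Positions are 0-based. A gap is the number of characters skipped between
    two consecutive matched positions and must lie in [min_gap, max_gap].
    """
    m, n = len(pattern), len(seq)
    if m == 0:
        return []
    # unavailable[j][i] is True once position i can no longer serve pattern
    # index j, either because an occurrence used it or because it is a dead end.
    unavailable = [[False] * n for _ in range(m)]
    occurrences = []

    def extend(level, pos, path):
        # Try to complete an occurrence whose character `level` sits at `pos`.
        path.append(pos)
        if level == m - 1:
            return True
        first = pos + min_gap + 1
        last = min(pos + max_gap + 1, n - 1)
        # Children are tried from left to right (leftmost child first).
        for child in range(first, last + 1):
            if seq[child] != pattern[level + 1] or unavailable[level + 1][child]:
                continue
            if extend(level + 1, child, path):
                return True
            # Backtrack: this child cannot reach the last level now, and
            # availability only shrinks, so it never will.
            unavailable[level + 1][child] = True
        path.pop()
        return False

    # Roots are visited from the smallest position upward.
    for root in range(n):
        if seq[root] != pattern[0]:
            continue
        path = []
        if extend(0, root, path):
            # "Prune" the occurrence: block each position at its own level only.
            for level, pos in enumerate(path):
                unavailable[level][pos] = True
            occurrences.append(tuple(path))
    return occurrences


if __name__ == "__main__":
    print(nonoverlapping_occurrences("abbxc", "abc", 0, 1))
```

In this sketch the support of the pattern is `len(nonoverlapping_occurrences(...))`. Because each (pattern index, position) pair is expanded at most once and scans at most `max_gap - min_gap + 1` children, the work done is in line with the O(m×n×W) bound quoted in the abstract, although the sketch itself comes with no formal proof.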
Authors: WU You-Xi; LIU Xi; YAN Wen-Jie; GUO Lei; WU Xin-Dong (School of Artificial Intelligence, Hebei University of Technology, Tianjin 300401, China; Hebei Key Laboratory of Big Data Computing, Tianjin 300401, China; State Key Laboratory of Reliability and Intelligence of Electrical Equipment, Hebei University of Technology, Tianjin 300401, China; School of Electrical Engineering, Hebei University of Technology, Tianjin 300401, China; Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education, Hefei 230009, China; Mininglamp Academy of Sciences, Mininglamp Technology, Beijing 100084, China)
Source: Journal of Software (《软件学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2021, Issue 11, pp. 3331-3350 (20 pages)
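Funding: National Key Research and Development Program of China (2016YFB1000901); National Natural Science Foundation of China (61976240, 61702157, 917446209); Hebei Province Innovation Capability Cultivation Funding Project (CXZZSS2019023).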
Keywords: pattern matching; sequence pattern mining; nonoverlapping condition; NetTree; backtracking strategy
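About the authors: WU You-Xi (corresponding author; born 1974), male, Ph.D., professor, doctoral supervisor, CCF senior member; main research areas: data mining and machine learning; E-mail: wuc567@163.com. LIU Xi (born 1994), female, master's degree; main research area: pattern matching. YAN Wen-Jie (born 1983), male, Ph.D., associate professor, CCF professional member; main research area: machine learning. GUO Lei (born 1968), male, Ph.D., professor, doctoral supervisor; main research areas: pattern recognition and artificial neural networks. WU Xin-Dong (born 1963), male, Ph.D., professor, doctoral supervisor; main research areas: data mining, knowledge-based systems, and Web information exploration.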