摘要
序列模式挖掘是从序列数据中发现用户感兴趣的模式。对比模式挖掘是其中的一类挖掘方法,其特点是在两类或多类别的序列库中找到特征信息,在实际的生活和生产中应用十分广泛。随着数据规模的不断增加,算法的挖掘效率显得尤为重要,但是当前对比模式挖掘仍存在挖掘速度太慢的问题。为了快速挖掘满足密度约束和间隙约束的对比模式,文中提出了一种近似求解算法ADMD(Approximately Distinguishing Patterns Mining Based on Density Constraint),该算法在模式的挖掘过程中允许存在小部分的模式丢失,从而换取挖掘速度的大幅提升。该算法采用网树的特殊结构来计算模式的支持数;采用模式拼接的方式来生成候选模式;采用预判式剪枝策略对模式进行剪枝,以避免大量冗余模式的生成。但由于在剪枝过程中可能会剪掉一部分非冗余模式,造成挖掘结果并非完备,因此该算法是一种近似求解算法。在ADMD算法的基础上,通过在剪枝策略中设定参数k的方式来得到ADMD-k算法,该算法可以通过设定k的取值来调整剪枝程度,从而在挖掘效率和准确率方面取得平衡。最后在真实的蛋白质数据集上将所提算法与其他算法从挖掘的对比模式数量和挖掘速度方面进行对比实验。实验结果表明,在k=1.5的情况下,所提算法仅用不到原来13%的时间,就可以挖掘到99%以上的模式,具有近似度高、速度快的特点。
Sequential patterns mining is to find interest patterns from sequential data.Distinguishing patterns mining is one of the mining methods,which is characterized by finding feature information in two or more categories of sequence databases.It is widely used in real life and production.With the increasing size of data,the efficiency of algorithm mi-ning is particularly important.However,the mining speed of distinguishing patterns mining is too slow at present.In order to quickly mine the distinguishing patterns that satisfy density constraint and gap constraint,this paper proposed an approximate solution algorithm ADMD(Approximately Distinguishing Patterns Mining Based on Density Constraint).This algorithm allows a small number of patterns to be lost in the process of patterns mining in exchange for a large increase in mining speed.In this algorithm,the support of the pattern is calculated by the special structure of the Net tree,the candidate patterns are generated by patterns growth approach,and the patterns are pruned by the prejudgment pruning strategy to avoid the generation of a large number of redundant patterns.However,some non-redundant patterns may be pruned in the pruning process,resulting in incomplete mining results,so the algorithm is an approximate algorithm.Based on ADMD,the ADMD-k algorithm was proposed by setting the parameter k in the pruning strategy.The algorithm can adjust the pruning degree by setting k,to achieve a balance between mining efficiency and accuracy.Finally,in real protein datasets,the number of mining patterns and mining speed are compared with other algorithms.The experimental results verify that when k is 1.5,the proposed algorithm costs no more than 13%of the time,but can find up more than 99%of patterns.Therefore,the proposed algorithm is very effective with high approximation rate and high speed.
作者
柴欣
高一寒
武优西
刘靖宇
CHAI Xin;GAO Yi-han;WU You-xi;LIU Jing-yu(School of Artificial Intelligence,Hebei University of Technology,Tianjin 300401,China;Hebei Province Key Laboratory of Big Data Calculation,Tianjin 300401,China)
出处
《计算机科学》
CSCD
北大核心
2019年第12期26-30,共5页
Computer Science
基金
国家自然科学基金项目(61702157,61571180)资助
关键词
对比模式
挖掘速度
网树
密度约束
剪枝策略
Distinguishing patterns
Mining speed
Net tree
Density constraint
Pruning strategy