Abstract
Existing algorithms for mining global sequential patterns in a distributed environment often generate too many candidate sequences, which increases network communication cost. To address this problem, a fast algorithm for mining global sequential patterns in a distributed environment, DMGSP (distributed mining of global sequential patterns), is proposed. DMGSP compresses the local sequential patterns found at each site into a lexicographic sequence tree, which avoids transmitting repeated sequence prefixes. Using rules for merging node sequences in the trees, together with item-extension and sequence-extension pruning, it prunes infrequent sequences. This effectively reduces the number of candidate sequences and the volume of network traffic, so global sequential patterns are generated quickly. Algorithm analysis and experimental results indicate that DMGSP performs well on large data sets and mines global sequential patterns effectively.
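The prefix-sharing idea in the abstract can be illustrated with a small sketch. This is not the DMGSP algorithm itself, whose details the abstract does not give. It is a minimal prefix tree under simplifying assumptions: each pattern is a flat tuple of single items (no itemsets), each site reports a support count per local pattern, and global support is taken as the sum of the reported local supports. All names (`Node`, `build_tree`, `merge`, `frequent`, `count_nodes`) and the sample data are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a prefix tree; the path from the root spells a sequence."""
    support: int = 0  # 0 means the path is only a prefix, not a reported pattern
    children: dict = field(default_factory=dict)


def build_tree(patterns):
    """Store {sequence tuple: support} so that shared prefixes share nodes."""
    root = Node()
    for seq, support in patterns.items():
        node = root
        for item in seq:
            node = node.children.setdefault(item, Node())
        node.support = support
    return root


def merge(a, b):
    """Merge tree b into tree a, summing supports of identical sequences."""
    a.support += b.support
    for item, child in b.children.items():
        merge(a.children.setdefault(item, Node()), child)
    return a


def frequent(node, min_support, prefix=()):
    """Yield (sequence, support) pairs whose merged support reaches min_support."""
    if prefix and node.support >= min_support:
        yield prefix, node.support
    for item in sorted(node.children):
        yield from frequent(node.children[item], min_support, prefix + (item,))


def count_nodes(node):
    """Count the non-root nodes, i.e. the items needed to send the tree."""
    return sum(1 + count_nodes(child) for child in node.children.values())


# Hypothetical local patterns reported by two sites (assumed data).
site1 = {("a",): 5, ("a", "b"): 4, ("a", "b", "c"): 3, ("a", "c"): 2}
site2 = {("a",): 4, ("a", "b"): 3, ("b",): 2}

items_sent_flat = sum(len(seq) for seq in site1)  # every pattern sent in full
items_sent_tree = count_nodes(build_tree(site1))  # one item per tree node
merged = merge(build_tree(site1), build_tree(site2))
result = list(frequent(merged, min_support=5))
```

In this toy data, site 1 would send 8 items as separate patterns but only 4 as a tree, because the prefix `a` and the prefix `a, b` are stored once. Summing reported supports undercounts any pattern that a site did not report as locally frequent, so a real distributed algorithm needs an additional step to handle such patterns.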
Source
Journal of Southeast University (Natural Science Edition)
EI
CAS
CSCD
PKU Core
2007, No. 4, pp. 574-579 (6 pages)
Funding
National Natural Science Foundation of China (70472033)
Jiangsu Province "Qinglan Project" Foundation
Keywords
data mining
distributed system
global sequential pattern
lexicographic sequence tree