摘要
应用于数据流环境的数据挖掘算法应首要考虑算法的时空复杂性,而要实现消耗巨大计算资源的容错模式挖掘则更要专注于算法的效率.容错模式挖掘是为了从被噪声干扰的真实世界数据中获取允许一定程度错配的、更加泛化的有用知识.提出一种新的单遍历、高压缩的容错前缀树形概要结构DSFT-tree(Data Stream Fault-Tolerant Frequent PatternTree),用来捕捉最近到达的数据流中的数据元素,并且能够高效移除过期数据,实现最大限度地降低计算资源消耗.利用滑动窗指针和位向量表达法实现容错树形概要结构的高效重构,并进一步基于滑动窗口技术实现了数据流环境下的容错频繁项挖掘.实验采用IBM数据发生器产生事务数据,在合理时间内最终挖掘频繁项的数量为FP-stream算法的1.5倍.
Complexity of data mining algorithm over data stream is the most important and it should be more focused on algorithm efficiency because of the great consumption of algorithm resources.Fault-tolerant frequent pattern mining is more generalized and suitable for extracting interesting knowledge from real-world data stream polluted by noise.An algorithm,called data stream fault-tolerant frequent pattern tree(DSFT-tree),was proposed.It could achieve a frequency-descending and highly compact prefix-tree structure with a single-pass to capture fault-tolerant frequent itemsets in recent sliding window.To completely and efficiently perform the tree restructuring operation,an efficient mechanism based on sliding window pointer and bit-vector representation were utilized to restructure the tree.The efficient reconstruction mechanism greatly reduced the consumption of calculation resources and achieved fault-tolerant frequent itemsets mining.Experimental transaction database was generated by IBM synthetic data generator.The number of frequent itemsets extracted by DSFT-tree is 1.5 fold greater than that extracted by FP-stream.Experimental results show that developed algorithm is an efficient fault-tolerant synopsis.
出处
《北京航空航天大学学报》
EI
CAS
CSCD
北大核心
2011年第5期564-568,共5页
Journal of Beijing University of Aeronautics and Astronautics
基金
国家自然科学基金资助项目(61073041)
关键词
数据流
概要结构
容错模式
前缀树
data stream
synopsis
fault-tolerant frequent pattern
prefix-tree