Abstract
Building classification models from patterns extracted from large-scale data is an important research problem in pattern mining. One feasible approach is to build a Bayesian classification model from a set of patterns. However, most existing pattern-based Bayesian classifiers target static data sets and generally cannot adapt to data streams, which are unbounded and change rapidly. To address this, a Bayesian classification model based on pattern discovery over data streams, named PBDS (Pattern-based Bayesian classifier for Data Stream), is proposed. PBDS uses a partially-lazy learning strategy: for each instance to be classified, it builds a local classification model on a continuously updated set of frequent itemsets. To speed up stream processing, a structurally simpler hybrid tree structure is proposed, together with a pattern extraction mechanism constrained to given items that reduces the generation of candidate itemsets. Where pattern extraction from the stream is incomplete, a smoothing technique handles the items that were not extracted. Extensive experiments on real-world and synthetic data streams show that PBDS achieves higher classification accuracy than other state-of-the-art data stream classifiers.
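The lazy, smoothed Bayesian estimate described in the abstract can be illustrated with a small sketch. This is not the PBDS algorithm: it swaps frequent itemsets and the hybrid tree for plain single-item counts over a fixed-size sliding window, and it uses Laplace smoothing as a stand-in for the smoothing technique, whose exact form the abstract does not give. The class name `WindowedLazyBayes` and the parameters `window_size` and `alpha` are hypothetical.

```python
from collections import Counter, deque
import math


class WindowedLazyBayes:
    """Illustrative only: single-item counts over a sliding window, not PBDS."""

    def __init__(self, window_size=1000, alpha=1.0):
        self.window = deque()          # (items, label) pairs currently in the window
        self.window_size = window_size
        self.alpha = alpha             # Laplace smoothing constant (assumed)
        self.class_counts = Counter()  # label -> instances in the window
        self.item_counts = Counter()   # (item, label) -> instances containing item

    def update(self, items, label):
        """Add one labeled instance and evict the oldest if the window is full."""
        items = frozenset(items)
        self.window.append((items, label))
        self.class_counts[label] += 1
        for item in items:
            self.item_counts[(item, label)] += 1
        if len(self.window) > self.window_size:
            old_items, old_label = self.window.popleft()
            self.class_counts[old_label] -= 1
            for item in old_items:
                self.item_counts[(item, old_label)] -= 1

    def predict(self, items):
        """Score each class lazily, using only the items of the query instance."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_c in self.class_counts.items():
            if n_c == 0:
                continue  # class has been fully evicted from the window
            score = math.log(n_c / total)
            for item in items:
                n_ic = self.item_counts[(item, label)]
                # Smoothing keeps an unseen item from zeroing the product.
                score += math.log((n_ic + self.alpha) / (n_c + 2 * self.alpha))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Calling `update` for each arriving labeled instance and `predict` for each unlabeled one mimics the stream setting; a fuller version would replace the single-item counts with supports of frequent itemsets contained in the query instance.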
Source
Computer Science (《计算机科学》)
CSCD
PKU Core Journal (北大核心)
2017, No. 7, pp. 167-174, 202 (9 pages)
Funding
National Natural Science Foundation of China (61672086)
Beijing Natural Science Foundation (4142042)
Keywords
Data stream
Frequent pattern
Bayesian
Partially-lazy learning