Abstract
Real-world applications such as network monitoring, online comments, and microblogs generate large volumes of text data streams. These streams are incompletely labeled and subject to frequent concept drift, which challenges existing data stream classification methods. This paper therefore proposes a self-adaptive classification algorithm for incompletely labeled text data streams. The algorithm takes a labeled data chunk as the starting chunk. For each unlabeled data chunk, it first extracts the feature set between the labeled and unlabeled data chunks, then uses the similarity of features across the two chunks to detect concept drift, and finally computes the polarity of the features in the unlabeled data to predict the instances. Experimental results show that the algorithm improves classification accuracy, especially when label information is scarce and concept drift is frequent.
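The abstract does not give the similarity measure, the drift threshold, or the polarity formula, so the following Python sketch is only an illustration of the general pipeline under stated assumptions: chunks are lists of whitespace-tokenized strings, chunk similarity is the cosine similarity of term-frequency vectors, drift is flagged when similarity falls below a hypothetical threshold of 0.5, and polarity is a simple signed count taken from the labeled chunk. All function names are hypothetical and none of this is the authors' actual method.

```python
from collections import Counter
from math import sqrt


def term_frequencies(chunk):
    """Count how often each token occurs across all documents in a chunk."""
    counts = Counter()
    for doc in chunk:
        counts.update(doc.lower().split())
    return counts


def chunk_similarity(chunk_a, chunk_b):
    """Cosine similarity of the two chunks' term-frequency vectors (assumed measure)."""
    tf_a, tf_b = term_frequencies(chunk_a), term_frequencies(chunk_b)
    shared = set(tf_a) & set(tf_b)  # features present in both chunks
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = sqrt(sum(v * v for v in tf_a.values()))
    norm_b = sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drift_detected(labeled_chunk, unlabeled_chunk, threshold=0.5):
    """Flag concept drift when similarity drops below an assumed threshold."""
    return chunk_similarity(labeled_chunk, unlabeled_chunk) < threshold


def feature_polarity(labeled_chunk, labels):
    """Signed count per token: +1 for each positive document, -1 for each negative one."""
    polarity = Counter()
    for doc, label in zip(labeled_chunk, labels):
        for token in doc.lower().split():
            polarity[token] += label  # labels are assumed to be +1 or -1
    return polarity


def predict(doc, polarity):
    """Predict +1 or -1 from the summed polarity of the document's tokens."""
    score = sum(polarity.get(token, 0) for token in doc.lower().split())
    return 1 if score >= 0 else -1


labeled = ["good great movie", "bad awful movie"]
labels = [1, -1]
similar_chunk = ["great movie", "awful movie"]
drifted_chunk = ["stock market crash", "interest rates rise"]

polarity = feature_polarity(labeled, labels)
no_drift = drift_detected(labeled, similar_chunk)
has_drift = drift_detected(labeled, drifted_chunk)
```

In this simplified version the polarity comes from the labeled chunk only, whereas the abstract describes computing polarity for features of the unlabeled data; a fuller version would also need a rule for updating the model once drift is flagged, which the abstract does not specify.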
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal (北大核心)
2016, No. 12, pp. 179-182, 194 (5 pages)
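Funding
Supported by the Innovative Research Team program of the Ministry of Education (IRT13059), the National Natural Science Foundation of China (61305063, 61273292), and the Doctoral Program Foundation (20130111110011).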
Keywords
Incompletely labeled data, Self-adaptation, Data stream, Concept drift