Abstract
Social network platforms generate massive short text streams that are fast-moving and high-volume, exhibit concept drift, consist of very short texts, and lack class labels for most instances. To address this, a semi-supervised short text stream classification algorithm based on vector representation and label propagation is proposed, which can effectively classify data sets containing only a small amount of labeled data. In addition, to adapt to concept drift, a cluster-based concept drift detection algorithm is proposed. Experiments on real short text streams show that, compared with classical semi-supervised classification algorithms and semi-supervised data stream classification algorithms, the proposed algorithm not only improves classification accuracy and the macro-average, but also adapts quickly to concept drift in the data stream.
Authors
王海燕
胡学钢
李培培
WANG Haiyan; HU Xuegang; LI Peipei (School of Computer and Information, Hefei University of Technology, Hefei 230601; Anhui Province Key Laboratory of Industry Safety and Emergency Technology, Hefei University of Technology, Hefei 230009)
Source
《模式识别与人工智能》
EI
CSCD
Peking University Core Journals (北大核心)
2018, No. 7, pp. 634-642 (9 pages)
Pattern Recognition and Artificial Intelligence
Funding
National Key Research and Development Program of China (No. 2016YFC0801406)
National Natural Science Foundation of China (Nos. 61503112, 61673152)
Keywords
Short Text Stream
Semi-supervised Classification
Label Propagation
Concept Drift