Abstract
Text classification that combines a small set of labeled documents with a large pool of unlabeled documents has become an important research direction. After analyzing the EM and Co-training algorithms, this paper proposes a new co-operative training algorithm. The algorithm uses two classifiers, Bayes (NB) and TFIDF, which are trained co-operatively and incrementally on a small number of labeled documents together with a large number of unlabeled documents. Experimental results show that the co-operative training algorithm achieves higher accuracy and a lower average error rate than EM and Co-training.
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
PKU Core Journal (北大核心)
2004, Issue 12, pp. 2243-2246 (4 pages)
Funding
Supported by the National Natural Science Foundation of China (60272051)
Keywords
text classification
semi-supervised algorithm
Co-training algorithm
EM algorithm
incremental co-operative training