Abstract
This paper examines how splogs (spam blogs) and normal blogs differ in their statistical features, analyzes several statistical features that are effective for blog classification, and proposes a splog filtering method based on the statistical features of blog pages. Experiments on the Blog06 data set show that the method reaches a filtering accuracy of 97%, about 7% higher than a filtering method based on term-frequency features. Its accuracy stays around 95% across training sets of different sizes, indicating better generalization ability.
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journals (北大核心)
2008, Issue 6, pp. 86-91 (6 pages)
Funding
National 973 Program of China (2004CB318109)
National 863 Program of China (2007AA01Z441)
Keywords
computer application
Chinese information processing
content analysis
splog filtering
statistical feature
term frequency feature
generalization ability