
Splog Filtering Based on Statistical Features (基于统计特征的垃圾博客过滤). Cited by: 6

Splog Filtering Based on Content Analysis
Abstract: This paper analyzes a range of statistical features that are effective for blog classification, starting from the differences in statistical characteristics between splogs and normal blogs, and proposes a filtering method based on the statistical features of blog pages. Experiments on the Blog06 data set show that the method reaches a filtering accuracy of 97%, about 7% higher than a filtering method based on term frequency features. Its accuracy stays around 95% across training sets of different sizes, indicating better generalization ability.
Source: Journal of Chinese Information Processing (《中文信息学报》), CSCD, PKU Core, 2008, No. 6, pp. 86-91 (6 pages)
Funding: National 973 Program project (2004CB318109); National 863 Program project (2007AA01Z441)
Keywords: computer application; Chinese information processing; content analysis; splog filtering; statistical feature; term frequency feature; generalization ability
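
To make the idea of page-level statistical features concrete, the following is a minimal sketch of how a few such features might be computed from a blog page. The abstract does not list the paper's actual features, so the four used here (word count, average word length, fraction of words inside anchor text, and compression ratio) and the function name `page_features` are illustrative assumptions, not the authors' method. The resulting values could then be passed as a feature vector to any standard classifier.

```python
import re
import zlib


def page_features(html: str) -> dict:
    """Compute a few illustrative statistical features of a blog page.

    The feature set is hypothetical; it is not taken from the paper.
    """
    # Text inside <a>...</a> elements (anchor text).
    anchors = re.findall(r"<a\b[^>]*>(.*?)</a>", html, flags=re.I | re.S)
    anchor_text = " ".join(re.sub(r"<[^>]+>", " ", a) for a in anchors)

    # Visible text approximated by stripping all tags.
    text = re.sub(r"<[^>]+>", " ", html)
    words = re.findall(r"\w+", text)
    anchor_words = re.findall(r"\w+", anchor_text)

    raw = text.encode("utf-8")
    n = len(words)
    return {
        "word_count": n,
        "avg_word_length": sum(len(w) for w in words) / n if n else 0.0,
        "anchor_word_fraction": len(anchor_words) / n if n else 0.0,
        # Highly repetitive pages compress well, giving a larger ratio.
        "compression_ratio": len(raw) / len(zlib.compress(raw)) if raw else 0.0,
    }


if __name__ == "__main__":
    sample = '<p>hello world</p><a href="x">buy now</a>'
    print(page_features(sample))
```

Unlike term frequency features, a vector like this has a small, fixed dimension regardless of vocabulary, which is one plausible reason such features might hold up better across training sets of different sizes.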

