Dynamic Splog Filtering Algorithm Based on Combined Features (基于组合特征的动态垃圾博客过滤算法)
Cited by: 2
Abstract: Splog (spam blog) filtering has become an active international research area in recent years. Most existing filtering algorithms classify on word frequency features alone, which are redundant and weakly correlated. To address this, the paper proposes a dynamic splog filtering algorithm based on combined features (CFDSD). The algorithm uses author attributes and self-similarity features to reduce feature redundancy and improve correlation, and applies Bayesian classification to optimize word frequency feature classification. Experiments show that the algorithm adapts to the way blogs are dynamically updated over time, improves filtering efficiency, and reduces the time needed to filter splogs.
Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2012, No. 5, pp. 177-179, 212 (4 pages)
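Funding: National Natural Science Foundation of China (60603047); Scientific Research Foundation for Returned Overseas Chinese Scholars, Ministry of Education; Liaoning Province Science and Technology Plan Project (2008216014); Scientific Research Fund for Higher Education Institutions, Liaoning Provincial Department of Education (L2010229); Dalian Outstanding Young Scientific and Technological Talent Fund (2008J23JH026)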
Keywords: splog filtering; term frequency features; self-similarity features; combined features; Bayesian classification
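
The abstract says that CFDSD applies Bayesian classification to word frequency features and adapts as blogs change over time, but this record gives no further detail. The sketch below is only an illustration of that general idea, not the paper's algorithm: a multinomial naive Bayes classifier over word counts with add-one smoothing, where new labeled posts can be folded in incrementally. The class name `NaiveBayesTextClassifier`, the labels `"splog"` and `"blog"`, and the toy training texts are all assumptions made for this example. The author attributes and self-similarity features mentioned in the abstract are not modeled here.

```python
import math
from collections import Counter


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over word counts with add-one smoothing.

    Hypothetical illustration only; not the CFDSD algorithm from the paper.
    """

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocabulary = set()      # all words seen during training

    def update(self, text, label):
        """Fold one labeled document into the model (can be called at any time)."""
        words = text.lower().split()
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def log_scores(self, text):
        """Return the unnormalized log posterior of each label for the text."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Log prior: fraction of training documents with this label.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in words:
                # Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        """Return the label with the highest log posterior."""
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy data, invented for this example.
clf = NaiveBayesTextClassifier()
clf.update("cheap pills buy now cheap offer", "splog")
clf.update("buy cheap watches click here now", "splog")
clf.update("today i visited the museum with friends", "blog")
clf.update("my notes on the conference talk", "blog")

print(clf.predict("buy cheap pills"))
print(clf.predict("notes on the museum visit"))
```

Because `update` only increments counts, the model can absorb newly labeled posts without retraining from scratch, which is one simple way a filter could track content that changes over time. Whether the paper handles dynamic updating this way is not stated in this record.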