Splog Filtering Based on Statistical Features (cited by: 6)

Splog Filtering Based on Content Analysis

Abstract: This paper analyzes a range of statistical features that are effective for blog classification, drawing on the differences in statistical features between splogs (spam blogs) and normal blogs, and proposes a splog filtering approach based on the statistical features of blog pages. Experiments on the Blog06 data set show that the approach reaches a filtering accuracy of 97%, about 7% higher than a term-frequency-based filtering method. Across training sets of different sizes, its accuracy stays at around 95%, indicating better generalization ability.

Authors: LIU Wei (刘玮), LIAO Xiangwen (廖祥文), XU Hongbo (许洪波), WANG Lihong (王丽宏)

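Affiliations: Research Center for Information Intelligence and Information Security, Institute of Computing Technology, Chinese Academy of Sciences; National Computer Network and Information Security Administration Center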

Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2008, No. 6, pp. 86-91 (6 pages)

Funding: National 973 Program project (2004CB318109); National 863 Program project (2007AA01Z441)

Keywords: computer application; Chinese information processing; content analysis; splog filtering; statistical feature; term frequency feature; generalization ability

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
