A Hierarchical Web Page Text Filtering System

A Hierarchical Approach to Filter Web Pages


Abstract: This paper designs and implements an efficient, high-performance web page text filtering system that uses a hierarchical filtering strategy with two layers: real-time filtering and offline analysis. The real-time filtering module is built on the IP Queue mechanism in Linux and applies an efficient filtering strategy that preserves both real-time responsiveness and accuracy. The offline analysis module examines the web page text that the system backs up after protocol reassembly. Through web page preprocessing, illegal-keyword extraction, and feature selection, it implements a text filtering method based on a bigram model. Within a word-distance window of a given size, the method uses bigrams that contain an illegal keyword as features, which addresses the data sparseness caused by using bigrams while retaining their strong class-discriminating power. Experiments show that the system achieves high efficiency and accuracy, and that the bigram-based filtering method used for offline analysis reaches a precision of 96.98%, a recall of 85.75%, and an F1 of 91.02%.
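The windowed bigram features described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the default window size, the choice to keep word order within a pair, and the sample tokens are all assumptions made for illustration, and the input is assumed to be already segmented into words. The second helper only checks that the reported F1 follows from the reported precision and recall.

```python
from collections import Counter


def keyword_bigrams(tokens, keywords, window=3):
    """Count ordered word pairs that contain a keyword and are at most
    `window` tokens apart (hypothetical helper, not from the paper)."""
    features = Counter()
    for i, left in enumerate(tokens):
        # Look ahead at most `window` positions from token i.
        last = min(i + window, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            right = tokens[j]
            # Keep the pair only if at least one word is a flagged keyword.
            if left in keywords or right in keywords:
                features[(left, right)] += 1
    return features


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative tokens; real input would come from a word segmenter.
    sample = ["buy", "cheap", "pills", "online", "now"]
    print(keyword_bigrams(sample, {"pills"}, window=2))
    print(round(f1_score(0.9698, 0.8575), 4))
```

Restricting pairs to those that contain a keyword keeps the feature space far smaller than the set of all bigrams, which is the sparseness argument the abstract makes; the window size trades coverage against noise and would need tuning on real data.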

Authors: Zhou Ju (周聚), Li Peifeng (李培峰), Zhu Qiaoming (朱巧明)

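Affiliations: School of Computer Science and Technology, Soochow University; Jiangsu Provincial Key Laboratory of Computer Information Processing Technology, Soochow University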

Source: Computer & Digital Engineering (《计算机与数字工程》), 2010, No. 8, pp. 18-21 (4 pages)

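Funding: Supported by the National Natural Science Foundation of China (Nos. 90920004, 60970056, 60873150), the Natural Science Foundation of Jiangsu Province (No. BK2008160), and the Major Basic Research Project of Natural Science for Universities of Jiangsu Province (No. 08KJA520002)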

Keywords: hierarchical filtering; text filtering; bigram; extraction window

Classification: TP393 [Automation and Computer Technology: Computer Application Technology]




