
A Hierarchical Approach to Filter Web Pages (一个基于分层的网页文本过滤系统)
Abstract This paper designs and implements an efficient, high-performance web page text filtering system. The system uses a hierarchical filtering strategy with two layers: real-time filtering and offline analysis. The real-time filtering module is built on the IP Queue mechanism in Linux and applies an efficient filtering strategy that preserves real-time behavior while maintaining filtering accuracy. The offline analysis module examines the web page text that the filtering system backs up after protocol restoration. Through web page preprocessing, illegal keyword extraction, and feature selection, it implements a text filtering method based on a bigram model. Within a word-distance window of a given size, the method uses bigrams (two-word strings) that contain an illegal keyword as features, which addresses the data sparsity caused by using bigrams while retaining their strong class-discriminating ability. Experiments show that the system achieves high efficiency and accuracy, and that the bigram-based filtering method used for offline analysis performs well, with precision, recall, and F1 of 96.98%, 85.75%, and 91.02%, respectively.
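The windowed bigram features described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact pairing rule, so this is an assumption: every ordered pair of words whose positions are at most `window` apart is kept if at least one of the two words is an illegal keyword. The function name `keyword_bigrams`, the sample tokens, and the keyword set are hypothetical, and the input is assumed to be already word-segmented.

```python
from collections import Counter
from typing import List, Set


def keyword_bigrams(tokens: List[str], keywords: Set[str], window: int = 2) -> Counter:
    """Count ordered word pairs (left, right) that contain at least one
    keyword and whose positions are at most `window` apart.

    Illustrative assumption, not the paper's confirmed rule.
    """
    feats: Counter = Counter()
    n = len(tokens)
    for i in range(n):
        # Pair token i with each later token inside the distance window.
        for j in range(i + 1, min(i + window, n - 1) + 1):
            if tokens[i] in keywords or tokens[j] in keywords:
                feats[(tokens[i], tokens[j])] += 1
    return feats


if __name__ == "__main__":
    # Hypothetical, already-segmented sample text and keyword list.
    sample = ["buy", "cheap", "pills", "online", "now"]
    for pair, count in sorted(keyword_bigrams(sample, {"pills"}, window=2).items()):
        print(pair, count)
```

Restricting pairs to those containing a keyword keeps the feature space far smaller than all bigrams, which is how the sparsity problem mentioned in the abstract is reduced; widening `window` trades a larger feature set for more context around each keyword.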
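Source: Computer & Digital Engineering (《计算机与数字工程》), 2010, No. 8, pp. 18-21 (4 pages)
Funding: Supported by the National Natural Science Foundation of China (Nos. 90920004, 60970056, 60873150), the Natural Science Foundation of Jiangsu Province (No. BK2008160), and the Major Basic Research Project in Natural Science of Jiangsu Higher Education Institutions (No. 08KJA520002).
Keywords: hierarchical filtering; text filtering; bigram; extraction window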

References (5)

  • 1 Sebastiani F. Machine learning in automated text categorization[J]. ACM Computing Surveys, 2002, 34(1): 1-47.
  • 2 Salton G, McGill M J. Introduction to modern information retrieval[R]. McGraw Hill Book Company, 1983.
  • 3 Yang Y, Liu X. A re-examination of text categorization methods[C]//Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999: 42-49.
  • 4 James M. Libipq[EB/OL]. http://www.cs.princeton.edu/-nakao/libipq.htm.
  • 5 Fan Xinghua, Sun Maosong. A high performance two-class Chinese text categorization method[J]. Chinese Journal of Computers, 2006, 29(1): 124-131.
