Abstract
This paper presents the design and implementation of an efficient, high-performance web text filtering system that uses a hierarchical filtering strategy consisting of real-time filtering and offline analysis. The real-time filtering module is built on the IP Queue mechanism in Linux and uses an efficient filtering strategy that keeps filtering both real-time and accurate. The offline analysis module works on the web page text that the system backs up after protocol restoration. Through web page preprocessing, illegal keyword extraction, and feature selection, it implements a bigram-based text filtering method: within a word-distance window of a given size, bigrams that contain an illegal keyword are used as features. This addresses the data sparsity caused by using bigrams while retaining their strong class-discriminating power as features. Experiments show that the system achieves high efficiency and accuracy, and that the bigram-based method used for offline analysis performs well, with precision, recall, and F1 of 96.98%, 85.75%, and 91.02%, respectively.
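The windowed bigram features described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the page text has already been segmented into a list of tokens, that the illegal keywords are supplied as a set, and that the "word-distance window" means the two words of a pair lie at most `window` positions apart. The function name `keyword_bigrams`, the parameter names, and the sample tokens are all hypothetical.

```python
from typing import List, Set, Tuple


def keyword_bigrams(
    tokens: List[str], keywords: Set[str], window: int = 3
) -> List[Tuple[str, str]]:
    """Collect ordered word pairs (left, right) in which at least one word is
    a keyword and the two words are at most `window` positions apart.

    Assumption: this is one plausible reading of "bigrams containing an
    illegal keyword within a word-distance window", not the paper's exact rule.
    """
    features: List[Tuple[str, str]] = []
    last = len(tokens) - 1
    for i, left in enumerate(tokens):
        # Look ahead at most `window` positions from the left-hand word.
        for j in range(i + 1, min(i + window, last) + 1):
            right = tokens[j]
            # Keep only pairs anchored on a keyword; this is what limits
            # the feature space compared with taking every bigram.
            if left in keywords or right in keywords:
                features.append((left, right))
    return features


# Hypothetical, already-segmented sample text and keyword list.
sample_tokens = ["buy", "cheap", "pills", "online", "now"]
sample_keywords = {"pills"}
pairs = keyword_bigrams(sample_tokens, sample_keywords, window=2)
```

Restricting pairs to those that contain a keyword keeps the number of distinct features small, which is the intuition behind the abstract's claim about reducing data sparsity; a real system would then apply feature selection and a classifier on top of these pairs.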
Source
Computer & Digital Engineering (《计算机与数字工程》)
2010, No. 8, pp. 18-21 (4 pages)
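Funding
National Natural Science Foundation of China (Grant Nos. 90920004, 60970056, 60873150)
Natural Science Foundation of Jiangsu Province (Grant No. BK2008160)
Major Basic Research Project in Natural Science for Universities in Jiangsu Province (Grant No. 08KJA520002)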
Keywords
hierarchical filtering
text filtering
bigram
extraction window