Abstract
The Internet carries a large volume of harmful content, and removing as much of it as possible is important for a cleaner network environment. After analyzing a large amount of such content, this paper identifies its key text features. It proposes, for the first time, combining the text features of harmful web pages with the web crawler of a search engine to actively seek out harmful web pages and websites, and reporting the results to users in graded levels so that those pages and sites can be dealt with. Experimental results show that the proposed algorithm detects harmful web pages effectively and copes well with the anti-keyword-filtering strategies used by harmful websites.
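The approach described above can be pictured with a minimal sketch. It is not the paper's algorithm: the abstract does not give the features, weights, or grading rule, so the feature terms, the weights in `FEATURE_WEIGHTS`, the thresholds in `grade`, and the in-memory `pages` graph are all hypothetical. The sketch assumes a best-first crawl in which links found on higher-scoring pages are visited first, and it assumes that anti-keyword-filtering takes the form of separators inserted between characters.

```python
import heapq
import re

# Hypothetical feature terms and weights (placeholders, not from the paper).
FEATURE_WEIGHTS = {"badword": 3.0, "illegal": 2.0}


def normalize(text):
    """Lowercase the text and drop punctuation, whitespace and underscores,
    so that a term split up as 'b-a-d w.o.r.d' becomes 'badword'.
    Joining words this way can create false matches across word boundaries;
    that trade-off is accepted in this sketch."""
    return re.sub(r"[\W_]+", "", text.lower())


def score(text):
    """Weighted count of feature terms found in the normalized text."""
    flat = normalize(text)
    return sum(weight * flat.count(term)
               for term, weight in FEATURE_WEIGHTS.items())


def grade(s):
    """Map a score to a report level (thresholds are assumptions)."""
    if s >= 6:
        return "high"
    if s >= 2:
        return "medium"
    if s > 0:
        return "low"
    return "clean"


def crawl(pages, seeds, limit=100):
    """Best-first crawl over an in-memory graph.

    pages maps url -> (text, [linked urls]) and stands in for real fetching.
    Links inherit the score of the page they were found on as their priority.
    Returns url -> (score, grade).
    """
    frontier = [(0.0, url) for url in seeds]  # (negated priority, url)
    heapq.heapify(frontier)
    seen = set(seeds)
    report = {}
    while frontier and len(report) < limit:
        _, url = heapq.heappop(frontier)
        if url not in pages:
            continue  # dangling link: nothing to fetch
        text, links = pages[url]
        s = score(text)
        report[url] = (s, grade(s))
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-s, link))
    return report
```

Calling `crawl(pages, seeds)` on a small dictionary of pages returns one `(score, grade)` pair per visited URL, which corresponds loosely to the graded report the abstract mentions. A real crawler would also need HTTP fetching, link extraction, and politeness rules, all of which are left out here.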
Source
Journal of Zhengzhou University (Natural Science Edition) 《郑州大学学报(理学版)》
CAS
PKU Core Journal (北大核心)
2010, No. 2, pp. 26-30 (5 pages)
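Funding
National Natural Science Foundation of China
Grant Nos. 60973120, 60903073
National High Technology Research and Development Program of China (863 Program)
Grant No. 2007AA01Z440
Sichuan Province Science and Technology Key Project
Grant No. 2008GZ0009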
Keywords
topic-focused Web crawler
unhealthy Webpage
text feature