Abstract
Illegal Web resources, pornography sites in particular, have become a major challenge as Internet applications spread. Pornography sites differ markedly from ordinary sites in page content, site structure, and visitor population, and these differences lead users to behave differently on the two kinds of site. With the help of a popular commercial search engine in China, we collected large-scale user access logs and, by mining the behavior recorded in them, verified that user behavior on pornography sites is clearly distinct from behavior on ordinary sites. Based on these differences, we designed a set of user behavior features and combined them with machine learning algorithms to build a behavior-based method for identifying pornography sites. Experimental results show that the method identifies pornography sites among Web sites with reasonable accuracy and efficiency.
Source
《计算机研究与发展》
EI
CSCD
PKU Core (Peking University Core Journals)
2013, No. 2, pp. 430-436 (7 pages)
Journal of Computer Research and Development
Funding
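National High Technology Research and Development Program of China (863 Program) (2011AA01A205)
National Natural Science Foundation of China (60903107, 61073071)
Specialized Research Fund for the Doctoral Program of Higher Education (20090002120005)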
Keywords
pornography site
illegal Web resources
user behavior analysis
search engine
Web browsing