In the global information era,people acquire more and more information from the Internet,but the quality of the search results is degraded strongly because of the presence of web spam.Web spam is one of the serious pr...In the global information era,people acquire more and more information from the Internet,but the quality of the search results is degraded strongly because of the presence of web spam.Web spam is one of the serious problems for search engines,and many methods have been proposed for spam detection.We exploit the content features of non-spam in contrast to those of spam.The content features for non-spam pages always possess lots of statistical regularities; but those for spam pages possess very few statistical regularities,because spam pages are made randomly in order to increase the page rank.In this paper,we summarize the regularities distributions of content features for non-spam pages,and propose the calculating probability formulae of the entropy and independent n-grams respectively.Furthermore,we put forward the calculation formulae of multi features correlation.Among them,the notable content features may be used as auxiliary information for spam detection.展开更多
基金supported by the National Science Foundation of China(No.61170145,61373081)the Specialized Research Fund for the Doctoral Program of Higher Education of China(No.20113704110001)+1 种基金the Technology and Development Project of Shandong(No.2013GGX10125)the Taishan Scholar Project of Shandong,China
文摘In the global information era,people acquire more and more information from the Internet,but the quality of the search results is degraded strongly because of the presence of web spam.Web spam is one of the serious problems for search engines,and many methods have been proposed for spam detection.We exploit the content features of non-spam in contrast to those of spam.The content features for non-spam pages always possess lots of statistical regularities; but those for spam pages possess very few statistical regularities,because spam pages are made randomly in order to increase the page rank.In this paper,we summarize the regularities distributions of content features for non-spam pages,and propose the calculating probability formulae of the entropy and independent n-grams respectively.Furthermore,we put forward the calculation formulae of multi features correlation.Among them,the notable content features may be used as auxiliary information for spam detection.