
Topic-Based Web Page Noise Removal Mechanism (cited by: 8)

Web pages noise removal based on focused topics
Abstract: Because they have no topic to work from, traditional web page noise removal algorithms rely on heuristic rules to decide which content is useful and which is noise. In a focused crawling environment, however, the topic is explicit, so different methods can be used to detect noise. This paper proposes a topic-based web page noise removal algorithm: it constructs a variant of the page's DOM (Document Object Model) tree, called the content block tree, and uses a trained classifier to identify the noise blocks. Experimental results show that the method achieves a noise removal precision of 87%, compared with only 42% for the previous method.
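Source: Computer Engineering and Design (《计算机工程与设计》), CSCD, Peking University Core Journal, 2008, No. 8, pp. 2072-2074, 2084 (4 pages)
Funding: National Natural Science Foundation of China (60373099); Fund of the Ministry of Education Key Laboratory of "Symbolic Computation and Knowledge Engineering" (93K-17)
Keywords: web pages; noise removal; information extraction; preprocessing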
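
The pipeline summarized in the abstract (build a block-level variant of the DOM tree, then let a classifier label each block as content or noise) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record does not say how the content block tree is built or which classifier is trained. The names `Block`, `BlockTreeBuilder`, `topic_score`, and `remove_noise`, the `BLOCK_TAGS` set, and the 0.1 threshold are all assumptions made for illustration, and a simple topic-keyword ratio stands in for the trained classifier.

```python
import re
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Assumed set of tags that open a new content block (not taken from the paper).
BLOCK_TAGS = {"body", "div", "table", "tr", "td", "ul", "ol", "li", "p"}
SKIP_TAGS = {"script", "style"}  # their contents are never treated as page text


@dataclass
class Block:
    tag: str
    text: list = field(default_factory=list)      # text owned directly by this block
    children: list = field(default_factory=list)  # nested content blocks


class BlockTreeBuilder(HTMLParser):
    """Builds a simplified block tree: inline tags are folded into their parent block."""

    def __init__(self):
        super().__init__()
        self.root = Block("root")
        self.stack = [self.root]
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            block = Block(tag)
            self.stack[-1].children.append(block)
            self.stack.append(block)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            # Close back to the nearest matching open block; tolerates sloppy HTML.
            for i in range(len(self.stack) - 1, 0, -1):
                if self.stack[i].tag == tag:
                    del self.stack[i:]
                    break

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.stack[-1].text.append(data.strip())


def topic_score(text, topic_terms):
    """Stand-in for the trained classifier: fraction of tokens that are topic terms."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in topic_terms for t in tokens) / len(tokens)


def keep_blocks(block, topic_terms, threshold):
    """Depth-first walk that returns the text of blocks judged to be on-topic."""
    kept = []
    own_text = " ".join(block.text)
    if own_text and topic_score(own_text, topic_terms) >= threshold:
        kept.append(own_text)
    for child in block.children:
        kept.extend(keep_blocks(child, topic_terms, threshold))
    return kept


def remove_noise(html, topic_terms, threshold=0.1):
    builder = BlockTreeBuilder()
    builder.feed(html)
    builder.close()
    return keep_blocks(builder.root, set(topic_terms), threshold)


SAMPLE_PAGE = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/login">Login</a></div>
<div id="main"><p>A focused crawler ranks pages by topic relevance before the crawler fetches them.</p></div>
<div id="footer">Copyright 2008 Example Site</div>
<script>var x = "crawler crawler";</script>
</body></html>
"""

if __name__ == "__main__":
    for text in remove_noise(SAMPLE_PAGE, {"focused", "crawler", "topic", "relevance"}):
        print(text)
```

On the sample page, the navigation and footer blocks contain no topic terms and are dropped, while the main paragraph is kept. In a real system the keyword ratio would be replaced by a classifier trained on the crawl topic, which is the role the abstract assigns to its classifier.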


