基于局部最优标签树的网页净化方法

An Approach to Purify Web Pages Based on the Local Optimal DOM Tree
Abstract: News web pages contain many paragraph tags, and the main-content region of a page holds more of them than the noise regions elsewhere on the page. Based on this observation, a purification approach built on a local optimal DOM tree search algorithm is proposed. Among sibling nodes, the algorithm finds the container node with the most paragraph tags and eliminates the other container nodes; the resulting DOM tree is the tree of the purified Web page. Experiments show that the approach is simple to implement and purifies pages effectively, especially text-centric news pages.
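The search described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the sets `PARAGRAPH_TAGS` and `CONTAINER_TAGS`, the function names, and the sample page are all assumptions made here, and the input is assumed to be well-formed markup so that Python's standard `xml.etree.ElementTree` can parse it.

```python
import xml.etree.ElementTree as ET

# Assumed tag sets; the abstract does not list which tags count.
PARAGRAPH_TAGS = {"p", "br"}
CONTAINER_TAGS = {"div", "table", "tr", "td", "ul", "section", "article"}


def count_paragraph_tags(node):
    """Count paragraph tags in the subtree rooted at node (node included)."""
    return sum(1 for el in node.iter() if el.tag in PARAGRAPH_TAGS)


def purify(node):
    """Keep only the sibling container with the most paragraph tags, recursively."""
    containers = [child for child in node if child.tag in CONTAINER_TAGS]
    if not containers:
        return
    best = max(containers, key=count_paragraph_tags)
    if count_paragraph_tags(best) == 0:
        return  # no container at this level holds paragraph tags; stop pruning
    for child in containers:
        if child is not best:
            node.remove(child)  # eliminate the other container nodes
    purify(best)  # repeat the local search one level down


# Hypothetical sample page: the main block has three paragraph tags,
# the navigation and footer blocks have one each.
PAGE = """<html><body>
<div id="nav"><a href="/">Home</a><p>Menu</p></div>
<div id="main"><p>First.</p><p>Second.</p><p>Third.</p></div>
<div id="footer"><p>Copyright</p></div>
</body></html>"""

root = ET.fromstring(PAGE)
purify(root.find("body"))
remaining = [div.get("id") for div in root.iter("div")]
print(remaining)  # ['main']
```

The choice is greedy and local: each level keeps a single winner without looking ahead, which matches the "local optimal" wording but means a page whose main text is split across sibling containers would lose part of it under this sketch. Real-world HTML is often not well-formed, so a tolerant parser would be needed in practice.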
Source: Science Technology and Engineering (《科学技术与工程》), PKU Core Journal, 2012, No. 35, pp. 9556-9561 (6 pages)
Funding: Supported by research projects of Chongqing University of Education (重庆第二师范学院), grants KY201176C and KY201175C
Keywords: Web page purification; information extraction; HTML tags; local optimal; Web page noise

