
A Position Information-Based Web Page Segmentation Method (cited by: 3)
Abstract: This paper proposes and implements a page segmentation method for HTML documents, aimed at effectively extracting the body text of news web pages for data mining. The basic idea is to simulate part of a web browser's rendering work in order to recover the display position of each tag of the HTML document in the browser window, and then to segment the page according to these positions so that information in important regions can be extracted. In experiments, the method was used to extract the news body text from pages of more than ten well-known news sites, including Sina, NetEase, and TOM News. The accuracy was about 88.5%, which indicates that the method is effective and feasible.
Source: Computer Applications and Software (《计算机应用与软件》), CSCD, 2009, No. 7, pp. 155-159 (5 pages)
Keywords: page segmentation; HTML document; web browser; information extraction
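
The idea summarized in the abstract (estimate where each tag would be displayed, then segment by position) can be illustrated with a minimal sketch. This is not the authors' implementation: the block-tag set, the fixed `CHARS_PER_LINE` and `LINE_HEIGHT` values, the purely vertical stacking of blocks, and the "largest directly contained text" rule used to pick the main region are all assumptions made for illustration, and the names `Box`, `BoxBuilder`, `layout`, and `find_main_block` are hypothetical.

```python
import math
from html.parser import HTMLParser

# Assumed simplifications: a small set of block tags, fixed text metrics,
# and blocks stacked top to bottom (no floats, tables, or CSS).
BLOCK_TAGS = {"body", "div", "p", "table", "tr", "td", "ul", "li", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}
CHARS_PER_LINE = 40   # assumed characters per rendered line
LINE_HEIGHT = 20      # assumed line height in pixels


class Box:
    """A block element or a text run with an estimated vertical position."""

    def __init__(self, tag, parent=None, text=""):
        self.tag = tag
        self.parent = parent
        self.text = text
        self.children = []
        self.top = 0
        self.height = 0


class BoxBuilder(HTMLParser):
    """Builds a tree of block boxes and text boxes in document order."""

    def __init__(self):
        super().__init__()
        self.root = Box("#root")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            box = Box(tag, self.current)
            self.current.children.append(box)
            self.current = box

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            # Close the nearest open block with this tag; ignore stray end tags.
            node = self.current
            while node is not self.root and node.tag != tag:
                node = node.parent
            if node is not self.root:
                self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.current.children.append(Box("#text", self.current, text))


def layout(box, top=0):
    """Assign top and height to every box by stacking children vertically."""
    box.top = top
    if box.tag == "#text":
        box.height = math.ceil(len(box.text) / CHARS_PER_LINE) * LINE_HEIGHT
    else:
        y = top
        for child in box.children:
            layout(child, y)
            y += child.height
        box.height = y - top


def iter_blocks(box):
    """Yield every block box below `box`, in document order."""
    for child in box.children:
        if child.tag != "#text":
            yield child
            yield from iter_blocks(child)


def direct_text_height(box):
    """Height taken up by text that sits directly inside this block."""
    return sum(c.height for c in box.children if c.tag == "#text")


def find_main_block(html):
    """Return the block with the most directly contained text, or None."""
    builder = BoxBuilder()
    builder.feed(html)
    builder.close()
    layout(builder.root)
    return max(iter_blocks(builder.root), key=direct_text_height, default=None)
```

Calling `find_main_block(page)` on a page string returns a `Box` whose `top` and `height` give the estimated vertical extent of the chosen region. A fuller treatment would also need horizontal positions, table layout, and style information, none of which this sketch models.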

References (12)

  • 1. Vadrevu S, Gelgi F. Information Extraction from Web Pages Using Presentation Regularities and Domain Knowledge. World Wide Web, 2007, 10: 157.
  • 2. Arasu A, Garcia-Molina H. Extracting Structured Data from Web Pages. Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, 2003.
  • 3. Deng Cai, Shipeng Yu, Ji-Rong Wen, et al. VIPS: a Vision-based Page Segmentation Algorithm. http://research.microsoft.com/~jrwen/jrwen_files/publications/VIPS_Technical%20Report.PDF, 2003.
  • 4. Kovacevic M, Diligenti M, Gori M, et al. Recognition of Common Areas in a Web Page Using Visual Information: a possible application in a page classification. Second IEEE International Conference on Data Mining (ICDM'02), 2002: 250.
  • 5. Peifeng Xiang, Xin Yang, Yuanchun Shi. Effective Page Segmentation Combining Pattern Analysis and Visual Separators for Browsing on Small Screens. Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI'06), 2006: 831.
  • 6. Jinlin Chen, Baoyao Zhou, Jin Shi, et al. Function-Based Object Model Towards Website Adaptation. Proceedings of the 10th International Conference on World Wide Web, 2001: 587.
  • 7. Yu Manquan, Chen Tierui, Xu Hongbo. Research and design of a block-based web page information parser [J]. Journal of Computer Applications (计算机应用), 2005, 25(4): 974-976. Cited by: 55
  • 8. Zhu Jingnan, Zhao Mingsheng. Determining the geometric information of regions in a web page layout [J]. Computer Engineering (计算机工程), 2004, 30(10): 45-48. Cited by: 4
  • 9. Layout Engine Technical Documentation. http://www.mozilla.org/newlayout/doc/.
  • 10. Gao Bo. Embedded browser development. http://jserv.sayya.org/netbit/.

Second-level references (15)

  • 1. HTML 4.0 Specification. W3C Recommendation, 1998-04-24.
  • 2. Document Object Model (DOM) Level 2 HTML Specification (Version 1.0). W3C Working Draft, 2000-11-13.
  • 3. EMBLEY DW, JIANG YS, NG YK. Record-Boundary Discovery in Web Documents [A]. SIGMOD'99 Proceedings [C]. 1999.
  • 4. EMBLEY DW, LI X. Record Location and Reconfiguration in Unstructured Multiple-Record Web Documents [A]. WebDB'00 Proceedings [C]. 2000.
  • 5. LIM SJ, NG YK. Extracting Structures of HTML Documents Using a High-Level Stack Machine [M]. Information Networking in Asia, Gordon and Breach Science Publishers, Newark, New Jersey, 2001.
  • 6. LIM SJ, NG YK, YANG XC. Integrating HTML Tables Using Semantic Hierarchies and Meta-Data Sets [A]. International Database Engineering and Applications Symposium (IDEAS'02) [C]. Edmonton, Canada, 2002.
  • 7. LIM SJ, NG YK. A Heuristic Approach for Converting HTML Documents to XML Documents [A]. Proceedings of the Sixth International Conference on Rules and Objects in Databases (DOOD 2000) [C]. London, England, 2000. 1182-1196.
  • 8. LIN SH, HO JM. Discovering Informative Content Blocks from Web Documents [A]. KDD 2002 [C]. 2002. 588-593.
  • 9. YU SP, CAI D, WEN JR, et al. Improving Pseudo-Relevance Feedback in Web Information Retrieval Using Web Page Segmentation [EB/OL]. http://research.microsoft.com/research/pubs/view.aspx?type=Technical%20Report&id=632, 2002-12.
  • 10. WEN JR, SONG RH, CAI D, et al. Microsoft Research Asia at The Web Track of TREC 2003 [A]. The Twelfth Text Retrieval Conference (TREC'12) [C]. 2003.

Co-citing documents (55)

Co-cited documents (27)

  • 1. Yu Manquan, Chen Tierui, Xu Hongbo. Research and design of a block-based web page information parser [J]. Journal of Computer Applications (计算机应用), 2005, 25(4): 974-976. Cited by: 55
  • 2. Wang Fang, Yu Hao, Tan Hongye, Zhao Tiejun. A related-link extraction method based on link blocks [J]. Computer Engineering and Applications (计算机工程与应用), 2006, 42(31): 110-113. Cited by: 2
  • 3. Hu Guoping, Zhang Wei, Wang Renhua. Precise extraction of news web page body text based on two-layer decision [J]. Journal of Chinese Information Processing (中文信息学报), 2006, 20(6): 1-9. Cited by: 16
  • 4. Liu Qian, Jiao Hui, Jia Huibo. Research on the development status and construction methods of information extraction technology [J]. Application Research of Computers (计算机应用研究), 2007, 24(7): 6-9. Cited by: 41
  • 5. Baumgartner R, Flesca S, Gottlob G. Visual web information extraction with Lixto [C]//Proc. of the Intl. Conf. on Very Large Data Bases (VLDB'01), 2001: 119-128.
  • 6. Zhai Y, Liu B. Extracting Web Data Using Instance-Based Learning [C]//Proc. of the 6th Intl. Conf. on Web Information Systems Engineering (WISE'05), 2005: 318-331.
  • 7. Gupta S, Kaiser G, Neistadt D, et al. DOM-based Content Extraction of HTML Documents [C]//Proceedings of the 12th International World Wide Web Conference, 2003.
  • 8. Cai D, Yu S, Wen J R, et al. VIPS: A vision-based page segmentation algorithm [R]. Microsoft Technical Report, MSR-TR-2003-79. 2003: 10.
  • 9. Cai D, Yu S, Wen J R, et al. VIPS: Improving Pseudo-Relevance Feedback in Web Information Retrieval Using Web Page Segmentation [C]//Proceedings of the 12th International Conference on World Wide Web, 2003.
  • 10. Abel O, Li Longzhuang, Liu Yonghuai. Visual Segmentation-Based Data Record Extraction from Web Documents [C]//Proceedings of IEEE International Conference on Information Reuse and Integration, 2007: 502-507.

Citing documents (3)

Second-level citing documents (4)
