Study of Web page content extraction based on layout similarity
Cited by: 10
Abstract: A sound Web content extraction technique removes redundant, duplicated, and useless information from massive volumes of Internet data and yields data of greater practical value. Observation of Web pages shows that pages under the same Web site are highly similar in content layout and style structure. Based on this, the paper proposes and implements a content extraction method based on layout similarity: the main content is extracted by comparing the similarity of node data in the DOM trees of pages that belong to the same topic of the same site. The paper also reports exploratory research and implementation work on related problems. Experiments show that the method is conceptually simple, practical, and broadly applicable, and that it achieves high accuracy while providing support for a wide range of Internet content analysis applications.
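The abstract describes the idea only at a high level, so the following is a minimal sketch of it rather than the paper's algorithm. It assumes that exact matching of (DOM path, text) pairs between two pages of the same site can stand in for the paper's similarity measure: text that recurs at the same path in a sibling page is treated as template material, and the rest is kept as content. The names `PathTextCollector`, `collect`, and `extract_content`, and the two sample pages, are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}


class PathTextCollector(HTMLParser):
    """Collect (DOM path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root down to the current node
        self.nodes = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not SKIP_TAGS.intersection(self.stack):
            self.nodes.append(("/".join(self.stack), text))


def collect(html):
    parser = PathTextCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def extract_content(target_html, sibling_html):
    """Keep text of the target page that does not recur, at the same DOM
    path with the same text, in a sibling page from the same site.
    Assumption: exact matching approximates the paper's similarity test."""
    shared = set(collect(sibling_html))
    return [text for path, text in collect(target_html)
            if (path, text) not in shared]


# Two hypothetical pages that share a navigation bar and a footer.
TEMPLATE = ("<html><body><div><a>Home</a><a>News</a></div>"
            "<div><h1>{title}</h1><p>{body}</p></div>"
            "<div><span>Copyright 2015</span></div></body></html>")
page_a = TEMPLATE.format(title="Title A", body="Body of article A.")
page_b = TEMPLATE.format(title="Title B", body="Body of article B.")

print(extract_content(page_a, page_b))
```

Exact matching is the simplest possible comparison. It would misclassify template text that varies slightly between pages, such as dates or view counters, which is where a graded similarity measure over node data becomes necessary.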
Source: Application Research of Computers (CSCD; Peking University Core Journal), 2015, No. 9, pp. 2581-2586 (6 pages)
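Funding: National Natural Science Foundation of China, General Program (61375039); National Natural Science Foundation of China, Young Scientists Fund (61005029); Special Fund for Key Cultivation Directions under the "One-Three-Five" Plan of the Computer Network Information Center, Chinese Academy of Sciences (CNIC_PY_1402)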
Keywords: layout similarity; Web page content extraction; information retrieval