A Webpage Content Block Detection Method Based on Layout Features and Language Features

Cited by: 8
Abstract: This paper analyzes the different types of features of webpage content blocks and presents a method for detecting the main content blocks of a webpage that uses layout features and language features jointly. The method effectively resolves the trade-off in earlier models, in which generality across different types of webpages and high accuracy could not be achieved together. It represents a webpage as a vision-block tree, builds a separate classifier for the layout features and for the language features of the content blocks, and then combines the two classifiers to classify the blocks. The experimental results show that, with the recall of non-noise (content) blocks held above 90%, the combined classifier reaches an accuracy of 85%, which is 5 percentage points higher than the classifier using only layout features and 15 percentage points higher than the classifier using only language features. Results on five websites also show that the combined classifier performs stably across sites, which indicates good generality.
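The abstract says that two independent classifiers, one for layout features and one for language features, are combined, but it does not say how. The sketch below illustrates one possible combination, a weighted average of the two classifiers' scores followed by a threshold. The `Block` type, the score fields, the weight, the threshold, and the example values are all assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A leaf block of a (hypothetical) vision-block tree with two scores."""
    name: str
    layout_score: float    # assumed output of the layout-feature classifier, in [0, 1]
    language_score: float  # assumed output of the language-feature classifier, in [0, 1]


def combine_scores(layout_score: float, language_score: float,
                   layout_weight: float = 0.5) -> float:
    """Weighted average of the two scores (an assumed strategy, not the paper's)."""
    return layout_weight * layout_score + (1.0 - layout_weight) * language_score


def detect_content_blocks(blocks: List[Block], threshold: float = 0.5,
                          layout_weight: float = 0.5) -> List[str]:
    """Return the names of blocks whose combined score reaches the threshold."""
    return [
        b.name for b in blocks
        if combine_scores(b.layout_score, b.language_score, layout_weight) >= threshold
    ]


# Made-up scores for three blocks of one page.
blocks = [
    Block("navigation bar", layout_score=0.10, language_score=0.30),
    Block("article body", layout_score=0.90, language_score=0.80),
    Block("footer links", layout_score=0.20, language_score=0.40),
]
print(detect_content_blocks(blocks))
```

In a sketch like this, lowering `threshold` keeps more blocks, which tends to raise the recall of content blocks at the cost of admitting more noise blocks. That is the kind of trade-off behind reporting accuracy at a fixed recall level, as the abstract does.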
Source: Journal of Chinese Information Processing (《中文信息学报》; indexed in CSCD and the Peking University Core Journals list), 2008, No. 1, pp. 15-21 (7 pages)
Funding: National Natural Science Foundation of China (60673042); Beijing Natural Science Foundation (4052027, 4073043)
Keywords: computer application; Chinese information processing; webpage cleaning; content block detection; webpage segmentation; layout feature; language feature
References (11)

  • 1. Rupesh R. Mehta, Harish Karnick, and Pabitra Mitra. Semantic Structure Analysis of Web Documents. Digital Document Processing[M], Springer 2007.
  • 2. Deng Cai, Shipeng Yu, Ji-Rong Wen, and Wei-Ying Ma. VIPS: A Vision based Page Segmentation Algorithm[R]. MSR-TR-2003-79. 2003.
  • 3. Lan Yi, Bing Liu, Xiaoli Li. Eliminating Noisy Information in Web Pages for Data Mining[A]. The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining[C]. 2003.
  • 4. Ziv Bar-Yossef, Sridhar Rajagopalan. Template Detection via Data Mining and its Applications[A]. The Eleventh International World Wide Web Conference[C]. 2002.
  • 5. Suhit Gupta, Gail Kaiser, David Neistadt, Peter Grimm. DOM-based Content Extraction of HTML Documents[A]. The Twelfth International World Wide Web Conference[C]. 2003.
  • 6. Deepayan Chakrabarti, Ravi Kumar, Kunal Punera. Page-level Template Detection via Isotonic Smoothing[A]. The 16th International World Wide Web Conference[C]. 2007.
  • 7. Sandip Debnath, Prasenjit Mitra, C. Lee Giles. Automatic Extraction of Informative Blocks from Webpages[A]. 2005 ACM Symposium on Applied Computing[C]. 2005.
  • 8. Ruihua Song, Haifeng Liu, Ji-Rong Wen, Wei-Ying Ma. Learning Block Importance Models for Web Pages[A]. 13th International WWW Conference[C]. 2005.
  • 9. Shian-Hua Lin, Jan-Ming Ho. Discovering Informative Content Blocks from Web Document[A]. The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining[C]. 2002.
  • 10. Rupesh R. Mehta, Pabitra Mitra, Harish Karnick. Extracting Semantic Structure of Web Documents Using Content and Visual Information[A]. 13th International WWW Conference[C]. 2005.

Co-cited documents: 78

Citing documents: 8

Second-level citing documents: 24
