A Method of Webpage Content Extraction based on Point Density
(original title: 一种基于标点密度的网页正文提取方法; cited by: 2)
Abstract: This paper proposes a DOM-tree-based method for extracting the main content of web pages. The method adapts the existing framework for DOM-based content extraction via text density. Based on observations of classical Chinese translation websites, it replaces text density with punctuation density. On a test set of 50 randomly selected classical Chinese translation web pages, the proposed method achieves better precision, recall, and F-measure than the text-density method it builds on.
Authors: Yang Qin (杨钦), Yang Muyun (杨沐昀)
Source: Intelligent Computer and Applications (《智能计算机与应用》), 2015, No. 4, pp. 42-44, 47 (4 pages)
Keywords: DOM; Point Density (punctuation density); Text Density; Content Extraction
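
The abstract's core idea, scoring DOM nodes by punctuation rather than by raw text length, can be illustrated with a minimal sketch. This is not the authors' implementation: the record gives no formulas, so the density definition (punctuation marks in a subtree divided by the number of tags in that subtree), the punctuation set, and the names `Node`, `TreeBuilder`, `punctuation_density`, and `densest_node` are all assumptions made for illustration. The sketch also simply returns the single densest node and does not reproduce any thresholding or smoothing the paper may use.

```python
from html.parser import HTMLParser

# Assumed punctuation set (full-width and ASCII); the paper's set is not given here.
PUNCTUATION = set(",。!?;:、,.!?;:")
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
SKIPPED_TAGS = {"script", "style"}


class Node:
    """A bare-bones DOM node: tag name, parent, children, direct text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []  # text fragments directly under this node


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never contain children
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIPPED_TAGS:
            self.current.text.append(data)


def stats(node):
    """Return (punctuation_count, tag_count) for the subtree rooted at node."""
    punct = sum(ch in PUNCTUATION for frag in node.text for ch in frag)
    tags = 1  # count the node itself, which also avoids division by zero
    for child in node.children:
        child_punct, child_tags = stats(child)
        punct += child_punct
        tags += child_tags
    return punct, tags


def punctuation_density(node):
    """Assumed definition: punctuation marks per tag in the subtree."""
    punct, tags = stats(node)
    return punct / tags


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def densest_node(html):
    """Parse html and return the node with the highest punctuation density."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return max(iter_nodes(builder.root), key=punctuation_density)


# Hypothetical page: a link-only navigation bar plus one paragraph of body text.
PAGE = """
<html><body>
<div id="nav"><a href="/">首页</a><a href="/list">目录</a><a href="/about">关于</a></div>
<div id="main"><p>学而时习之,不亦说乎?有朋自远方来,不亦乐乎?</p></div>
</body></html>
"""

best = densest_node(PAGE)
print(best.tag, "".join(best.text))
```

Navigation links carry text but almost no punctuation, so under this assumed definition they score zero while running prose scores high, which is the intuition the abstract attributes to classical Chinese translation pages. The sketch recomputes subtree statistics for every node, which is quadratic in the worst case; caching the counts per node would be the obvious refinement.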

References (15)

  • [1] PUNERA K, GIBSON D, TOMKINS A. The volume and evolution of Web Page Templates[C]//Special interest tracks and posters of the 14th international conference on World Wide Web, Chiba: ACM, 2005: 830-839.
  • [2] RAHMAN A F R, ALAM H, HARTONO R. Content extraction from html documents[C]//1st Int. Workshop on Web Document Analysis (WDA2001), Seattle: [s.n.], 2001: 1-4.
  • [3] FINN A, KUSHMERICK N, SMYTH B. Fact or fiction: Content classification for digital libraries[C]//DELOS Workshops, Dublin: Citeseer, 2001: 1-6.
  • [4] PINTO D, BRANSTEIN M, COLEMAN R, et al. QuASM: A system for question answering using semi-structured data[C]//Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries, New York: ACM, 2002: 46-55.
  • [5] DEBNATH S, MITRA P, GILES C L. Automatic extraction of informative blocks from webpages[C]//Proceedings of the ACM SAC, Santa Fe: ACM, 2005: 1722-1726.
  • [6] GUPTA S, KAISER G, STOLFO S. Extracting context to improve accuracy for HTML content extraction[C]//Special interest tracks and posters of the 14th international conference on World Wide Web, Chiba: ACM, 2005: 1114-1115.
  • [7] GOTTRON T. Combining content extraction heuristics: the CombinE system[C]//Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services, Linz: ACM, 2008: 591-595.
  • [8] MANTRATZIS C, ORGUN M, CASSIDY S. Separating XHTML content from navigation clutter using DOM-structure block analysis[C]//Hypertext '05 Proceedings of the Sixteenth ACM Conference on Hypertext & Hypermedia, New York: ACM, 2005: 145-147.
  • [9] GOTTRON T. Content code blurring: A new approach to content extraction[C]//Proceedings of the 2008 19th International Conference on Database and Expert Systems Application, [S.l.]: IEEE Computer Society, 2008: 29-33.
  • [10] WENINGER T, HSU W H, HAN J. CETR: content extraction via tag ratios[C]//Proceedings of the 19th international conference on World wide web, Raleigh: ACM, 2010: 971-980.

Co-cited documents: 12

Citing documents: 2

Second-level citing documents: 5
