
Automated Web Page Content Extraction Method Based on Document Object Model (cited by: 3)
Abstract: Web page content extraction has significant engineering and practical value in information retrieval, text analysis, and the processing of network resource data. To address the difficulty caused by the large amount of irrelevant content on web pages and the heterogeneity of web page structures, this paper proposes an automated web page content extraction method based on the Document Object Model (DOM). First, useless nodes are filtered out of the DOM generated from the original page, and the model is then compressed, which facilitates subsequent analysis and processing. Next, the web page content is identified with a content extraction method based on text-link density. Finally, noise hyperlinks within the content are identified by node entropy and removed. Experimental results show that, compared with traditional web page content extraction methods, the proposed method clearly improves accuracy and F1 score, with only a slight decline in recall.
Source: Journal of Shanghai Jiaotong University (《上海交通大学学报》), indexed in EI, CAS, CSCD, PKU Core; 2018, No. 10, pp. 1363-1369 (7 pages)
Funding: National Natural Science Foundation of China (61373030)
Keywords: document object model (DOM); content extraction of web pages; text density; node entropy
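The filtering and text-link density steps named in the abstract can be illustrated with a rough Python sketch. The abstract does not give the paper's formulas, so everything here is an assumption made for illustration: the list of filtered tags (`SKIP_TAGS`), the scoring function `score`, and the names `Node`, `DensityParser`, and `extract_main_content` are all hypothetical. The DOM compression and node-entropy steps are left out because the abstract does not define them.

```python
from html.parser import HTMLParser

# Tags filtered out as non-content (an assumed list, not the paper's).
SKIP_TAGS = {"script", "style", "noscript"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "area", "base",
             "col", "embed", "source", "track", "wbr"}


class Node:
    """A DOM-like node carrying subtree text statistics."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.text_len = 0   # non-whitespace characters in the subtree
        self.link_len = 0   # of those, characters inside <a> elements
        self.texts = []     # text chunks of the subtree, in document order


class DensityParser(HTMLParser):
    """Builds per-node text and link-text totals while parsing."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]
        self.nodes = [self.root]
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        node = Node(tag, attrs)
        self.nodes.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        # Pop back to the nearest matching open element; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.skip_depth:
            return
        n = len("".join(data.split()))  # count non-whitespace characters
        if n == 0:
            return
        in_link = any(node.tag == "a" for node in self.stack)
        # Credit the text to every open ancestor so totals cover subtrees.
        for node in self.stack:
            node.text_len += n
            node.texts.append(data.strip())
            if in_link:
                node.link_len += n


def score(node):
    """Assumed stand-in score: non-link text weighted by (1 - link density)."""
    if node.text_len == 0:
        return 0.0
    link_density = node.link_len / node.text_len
    return (node.text_len - node.link_len) * (1.0 - link_density)


def extract_main_content(html):
    """Return the node with the highest assumed text-link density score."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    return max(parser.nodes, key=score)
```

Calling `extract_main_content(page_source)` returns the highest-scoring node, and `" ".join(node.texts)` gives its text. The score favors subtrees with a lot of plain text and little link text, so navigation bars and link-only footers score near zero while a container of article paragraphs scores higher than the page as a whole.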
