
Text Extraction Based on Text Block Density with Tag Path and Other Features (基于文本块密度与标签路径等特征的正文提取)

Cited by: 1
Abstract: Web pages contain not only main content but also noise such as navigation, advertisements, and disclaimers. To address this problem and improve extraction accuracy, this paper proposes a content extraction method that combines text block density with tag path and other features, drawing on the strengths of both text-block and tag-path extraction. The method first locates the main content region using the text block density feature, then uses tag path and other features to remove noise nodes within that region, and finally extracts the content of the remaining text nodes in the block. This addresses two problems: noise inside the main text block is hard to filter, and tag path features tend to wrongly extract long text that lies outside the main content. The method requires no training or manual processing. Experiments on a dataset of news pages randomly selected from well-known websites indicate that the method applies well across different data sources and that its extraction precision is better than that of methods such as CETR and CETD in most cases.
Authors: Yang Xian (杨贤); Tang Chao-lan (唐超兰); Li Hang (李航) (School of Art and Design, Guangdong University of Technology, Guangzhou 510090, China; School of Computers, Guangdong University of Technology, Guangzhou 510006, China)
Source: Journal of Guangdong University of Technology (《广东工业大学学报》), CAS, 2018, No. 2, pp. 51-56 (6 pages)
Funding: Guangdong Province and Ministry Industry-University-Research Special Fund, Enterprise Innovation Platform Project (2013B090800042)
Keywords: content extraction; text block; tag path; text density
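
The three steps named in the abstract (locate the region by text block density, drop noise nodes by tag path features, extract the remaining text) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record does not give their formulas, so the density score (non-link characters per tag), the tag path feature (average text length per tag path, excluding paths through `a`), the `min_ratio` threshold, and all names such as `extract_main_text` and `SAMPLE_HTML` are assumptions made for illustration only.

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "link", "meta"}   # tags with no end tag
SKIP = {"script", "style"}                            # text never shown to readers
BLOCK = {"body", "div", "article", "section", "td"}   # candidate text blocks (assumed)


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children, self.text = tag, parent, [], ""
        if parent is not None:
            parent.children.append(self)

    def walk(self):
        """Yield this node and all descendants in document order."""
        yield self
        for child in self.children:
            yield from child.walk()

    def path(self):
        """Tag path from the synthetic root, e.g. 'root/html/body/div/p'."""
        node, parts = self, []
        while node is not None:
            parts.append(node.tag)
            node = node.parent
        return "/".join(reversed(parts))


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = self.cur = Node("root")

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        node = self.cur
        while node is not None and node.tag != tag:  # tolerate unclosed tags
            node = node.parent
        if node is not None and node.parent is not None:
            self.cur = node.parent

    def handle_data(self, data):
        data = data.strip()
        if data and self.cur.tag not in SKIP:
            self.cur.text = (self.cur.text + " " + data).strip()


def text_chars(node):
    return sum(len(n.text) for n in node.walk())


def link_chars(node):
    if node.tag == "a":
        return text_chars(node)
    return sum(link_chars(child) for child in node.children)


def density(node):
    """Assumed block score: non-link characters per tag in the subtree."""
    tags = sum(1 for _ in node.walk())
    return (text_chars(node) - link_chars(node)) / tags


def extract_main_text(html, min_ratio=20.0):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    # Step 1: choose the candidate block with the highest density score.
    blocks = [n for n in builder.root.walk() if n.tag in BLOCK]
    if not blocks:
        return []
    region = max(blocks, key=density)
    # Step 2: inside that region only, score each tag path by average text length.
    by_path = defaultdict(list)
    for n in region.walk():
        if n.text:
            by_path[n.path()].append(n)
    keep = {
        path for path, nodes in by_path.items()
        if sum(len(n.text) for n in nodes) / len(nodes) >= min_ratio
        and "a" not in path.split("/")
    }
    # Step 3: emit the text of the surviving nodes in document order.
    return [n.text for n in region.walk() if n.text and n.path() in keep]


SAMPLE_HTML = """<html><body>
<div id="nav"><a href="/">Home</a><a href="/news">News</a><a href="/about">About</a></div>
<div id="main">
<p>First paragraph of the article body, long enough to count as content.</p>
<div class="share"><a href="#">Share</a><span>Ad</span></div>
<p>Second paragraph of the article body, also reasonably long text.</p>
</div>
<div id="footer"><span>Disclaimer</span></div>
</body></html>"""

if __name__ == "__main__":
    for paragraph in extract_main_text(SAMPLE_HTML):
        print(paragraph)
```

In this sketch the tag path scoring runs only inside the selected region, which mirrors the abstract's point that tag path features alone can pick up long text outside the main content. Inline markup inside paragraphs (for example `b` or `em`) would be scored as separate short paths and dropped here; a fuller treatment would need to merge inline nodes into their parent.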
