Content Extraction Method Combining Web Page Structure and Text Feature (结合网页结构与文本特征的正文提取方法) Cited by: 15
Abstract: A Web page contains both its main content and information unrelated to that content, and the unrelated information degrades the classification, storage, and retrieval of Web pages. To reduce this effect, this paper proposes a content extraction method, aimed at topic-related Web pages, that combines the structural features and the text features of a page. Regular expressions first remove irrelevant elements from the page, which completes an initial filtering pass. The page is then segmented linearly into blocks according to its structural features, and each block is classified as a link block or a text block according to its text features. Because noise blocks tend to appear consecutively, runs of noise blocks are used to locate the main-content region, which yields the main text of the page. Experimental results show that the method extracts the main content of Web pages quickly and accurately.
Source: Computer Engineering (《计算机工程》), CAS, CSCD, 2013, No. 12, pp. 200-203, 210 (5 pages in total)
Funding: National Natural Science Foundation of China (71102065)
Keywords: content extraction; Web page denoising; Web page segmentation; topic crawling; information retrieval; Web mining
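
The pipeline outlined in the abstract (regex filtering, linear segmentation, link-block versus text-block classification, and locating the body between runs of noise blocks) can be illustrated with a minimal sketch. This is not the authors' implementation: the noise-tag list, the block-level tag list, the link-ratio test, the parameters `max_link_ratio`, `min_text_len`, and `max_gap`, and the function names are all assumptions chosen for illustration, since the record does not give the paper's actual features or thresholds.

```python
import re

# Elements whose entire content is treated as noise (assumed list, not the paper's).
NOISE_ELEMENTS = re.compile(
    r"<(script|style|noscript|iframe)\b[^>]*>.*?</\1\s*>|<!--.*?-->",
    re.IGNORECASE | re.DOTALL,
)
# Block-level tags used as cut points for linear segmentation (assumed list).
BLOCK_BREAK = re.compile(
    r"</?(?:div|p|td|tr|table|ul|ol|li|h[1-6]|br)\b[^>]*>", re.IGNORECASE
)
ANCHOR = re.compile(r"<a\b[^>]*>(.*?)</a\s*>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def visible_text(fragment):
    """Strip remaining tags and collapse whitespace."""
    return re.sub(r"\s+", " ", TAG.sub("", fragment)).strip()


def split_blocks(html):
    """Filter noise elements, then cut the page into (text, link_ratio) blocks."""
    cleaned = NOISE_ELEMENTS.sub("", html)
    blocks = []
    for fragment in BLOCK_BREAK.split(cleaned):
        text = visible_text(fragment)
        if not text:
            continue
        link_chars = sum(len(visible_text(a)) for a in ANCHOR.findall(fragment))
        blocks.append((text, link_chars / len(text)))
    return blocks


def extract_main_text(html, max_link_ratio=0.5, min_text_len=20, max_gap=1):
    """Return the contiguous region of text blocks holding the most characters.

    A block counts as a text block when its share of anchor text is low and
    it is long enough; everything else is treated as a noise (link) block.
    More than `max_gap` consecutive noise blocks closes the current region.
    """
    best, current, gap = [], [], 0
    for text, ratio in split_blocks(html):
        if ratio <= max_link_ratio and len(text) >= min_text_len:
            current.append(text)
            gap = 0
        else:
            gap += 1
            if gap > max_gap:  # a run of noise blocks ends the region
                if sum(map(len, current)) > sum(map(len, best)):
                    best = current
                current = []
    if sum(map(len, current)) > sum(map(len, best)):
        best = current
    return "\n".join(best)
```

Calling `extract_main_text(html)` on a page string returns the selected text blocks joined by newlines. Isolated noise blocks inside the selected region are skipped rather than kept, and regular expressions are used only because the abstract names them; a real system would likely need a proper HTML parser for malformed markup.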

References (11; 10 listed)

  • 1 Gibson D, Punera K, Tomkins A. The Volume and Evolution of Web Page Templates[C]//Proc. of the 14th International Conference on World Wide Web. New York, USA: ACM Press, 2005.
  • 2 Rahman A, Alam H, Hartono R. Content Extraction from HTML Documents[C]//Proc. of the 1st International Workshop on Web Document Analysis. New York, USA: ACM Press, 2001.
  • 3 Wang Jiying, Lochovsky F H. Data-rich Section Extraction from HTML Pages[C]//Proc. of the 3rd International Conference on Web Information Systems Engineering. Washington D.C., USA: IEEE Computer Society, 2002.
  • 4 Ou Jianwen, Dong Shoubin, Cai Bin. A Method for Extracting Topic Information from Template-based Web Pages (模板化网页主题信息的提取方法)[J]. Journal of Tsinghua University (Science and Technology), 2005, 45(S1): 1743-1747.
  • 5 Sun Fei, Song Dandan, Liao Lejian. DOM Based Content Extraction via Text Density[C]//Proc. of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, USA: ACM Press, 2011.
  • 6 Weninger T, Hsu W H, Han J. CETR: Content Extraction via Tag Ratios[C]//Proc. of the 19th International Conference on World Wide Web. New York, USA: ACM Press, 2010.
  • 7 Abdul P, Qureshi R, Memon N. Hybrid Model of Content Extraction[J]. Journal of Computer and System Sciences, 2012, 78(4): 1248-1257.
  • 8 Cai Deng, Yu Shipeng, Wen Jirong, et al. VIPS: A Vision Based Page Segmentation Algorithm[EB/OL]. (2003-10-20). http://research.microsoft.com/apps/pubs/default.aspx?id=70027.
  • 9 Song Mingqiu, Wu Xintao. Content Extraction from Web Pages Based on Chinese Punctuation Number[C]//Proc. of International Conference on Wireless Communications, Networking and Mobile Computing. [S.l.]: IEEE Press, 2007.
  • 10 Zhang Zhigang, Chen Jing, Li Xiaoming. A Method for Cleaning HTML Web Pages (一种HTML网页净化方法)[J]. Journal of the China Society for Scientific and Technical Information (情报学报), 2004, 23(4): 387-393.

