
A Main-Text Extraction Method Based on Webpage Clustering (基于网页聚类的正文信息提取方法) Cited by: 6

Web Information Extraction Based on Webpage Clustering
Abstract Accurately extracting the main text of Web pages has important applications in many areas of Web mining research. Current approaches rely mainly on webpage segmentation and density statistics, but they may fail when the main text of a page contains only a small number of characters. Case analysis shows that most pages within a website are generated from the same content template. This paper therefore proposes a main-text extraction method based on webpage clustering. The method consists of two components: first, clustering webpages by their structural features; second, generating main-text position features for each set of similar webpages. The method can extract the main text from many types of webpages. Experiments on webpages from 5 websites show that the method is feasible and effective.
Source Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2018, No. 1, pp. 111-115 (5 pages)
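Funding Supported by the National Natural Science Foundation of China (61402111) and the Fujian Province Science and Technology Platform Construction Project (2014m005).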
Keywords webpage clustering; text block; node density
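
The abstract names the first component of the method (clustering webpages by structural features) without giving its details. The following is only a minimal sketch of that idea under stated assumptions, not the authors' algorithm: each page is represented by its set of root-to-node tag paths, similarity is the Jaccard index of those sets, and pages are grouped by a greedy single-pass rule. The names `tag_paths`, `jaccard`, `cluster_pages`, the 0.8 threshold, and the sample pages are all hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths seen in one page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            self.paths.add("/".join(self.stack + [tag]))
            return
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Structural feature of a page (assumed here): its set of tag paths."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.8):
    """Greedy single-pass clustering: a page joins the first cluster whose
    representative (its first member) is at least `threshold` similar."""
    clusters = []  # list of (representative paths, [page ids])
    for page_id, html in pages.items():
        paths = tag_paths(html)
        for representative, members in clusters:
            if jaccard(representative, paths) >= threshold:
                members.append(page_id)
                break
        else:
            clusters.append((paths, [page_id]))
    return [members for _, members in clusters]


# Hypothetical pages: two article pages from one template and one list page.
sample_pages = {
    "article-1": "<html><body><div><h1>T</h1><p>x</p></div></body></html>",
    "article-2": "<html><body><div><h1>U</h1><p>y</p><p>z</p></div></body></html>",
    "list": "<html><body><ul><li><a href='#'>1</a></li></ul></body></html>",
}
print(cluster_pages(sample_pages))
```

In this sketch the two article pages share every tag path and fall into one cluster, while the list page shares only 2 of 8 paths with them and forms its own. The second component described in the abstract (main-text position features per cluster) is not covered here.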
