Automatic Web Information Extraction Based on Page Clustering (cited by: 1)
Abstract: Dynamic Web pages are numerous, carry high-value data, and have highly templated structures. To exploit these characteristics, this paper presents an automatic Web information extraction system based on page clustering. Building on DOM-based extraction, the system uses page clustering to find clusters of highly similar pages, and introduces column similarity and global self-similarity measures to improve the accuracy of the clustering results. The extraction template uses optional nodes to correct and adjust the template, which improves the identification of content nodes. Experimental results show that the method automatically locates and extracts the main information of pages and achieves high precision and recall.
Source: Microcomputer & Its Applications (《微型机与应用》), 2011, No. 4, pp. 71-74 (4 pages)
Funding: Guangdong Provincial Science and Technology Plan Project (2009B070300052)
Keywords: Web information extraction; page clustering; wrapper generation
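
The abstract describes grouping template-generated pages into high-similarity clusters before a wrapper is built. The sketch below illustrates that general idea only. It is not the paper's method: the column similarity and global self-similarity measures are not defined in this record, so the sketch substitutes a plain tag-sequence comparison using Python's `difflib.SequenceMatcher`. The function names, the greedy clustering strategy, and the 0.8 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collects opening tag names in document order as a crude structure signature."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.tags


def structural_similarity(html_a, html_b):
    """Similarity in [0, 1] between two pages' tag sequences.

    Assumption: difflib's ratio stands in for the paper's similarity measures,
    which this record does not specify.
    """
    return SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b)).ratio()


def cluster_pages(pages, threshold=0.8):
    """Greedy single-pass clustering (hypothetical strategy).

    A page joins the first cluster whose representative (its first member)
    is at least `threshold` similar; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of indices into `pages`.
    """
    clusters = []
    for i, page in enumerate(pages):
        for cluster in clusters:
            if structural_similarity(pages[cluster[0]], page) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


if __name__ == "__main__":
    product_1 = "<html><body><div><h1>A</h1><p>Price 1</p></div></body></html>"
    product_2 = "<html><body><div><h1>B</h1><p>Price 2</p></div></body></html>"
    listing = "<html><body><ul><li>x</li><li>y</li><li>z</li></ul></body></html>"
    print(cluster_pages([product_1, product_2, listing]))  # [[0, 1], [2]]
```

The two product pages share an identical tag sequence and fall into one cluster, while the listing page has a different structure and forms its own. A real system would likely compare DOM trees rather than flat tag sequences and would choose the threshold empirically.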