Automatic Web information extraction based on page clustering (基于网页聚类的Web信息自动抽取)

Cited by: 1


Abstract: Dynamic Web pages are numerous, carry high-value data, and have a highly template-driven structure. To exploit these characteristics, this paper designs an automatic Web information extraction system based on page clustering. Building on DOM-based extraction, the system uses page clustering to find clusters of highly similar pages, and it introduces a column similarity measure and a global self-similarity measure to improve the accuracy of the clustering results. The extraction template uses optional nodes to correct and adjust the template, which improves the identification of content nodes. Experimental results show that the method automatically locates and extracts the main information of pages and achieves high precision and recall.
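The clustering step described in the abstract can be illustrated with a minimal sketch. The abstract does not define the column similarity or global self-similarity measures, so this example substitutes a plain Jaccard similarity over sets of DOM tag paths; that substitution, the 0.8 threshold, and the names `tag_paths`, `jaccard`, and `cluster_pages` are all assumptions made for illustration, not the authors' method.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths (e.g. 'html/body/div') in the page."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed stand-in for the paper's measures)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.8):
    """Greedy single-pass clustering: a page joins the first cluster whose
    representative (its first member) is at least `threshold` similar."""
    clusters = []  # list of (representative_paths, member_indices)
    for index, html in enumerate(pages):
        paths = tag_paths(html)
        for representative, members in clusters:
            if jaccard(representative, paths) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((paths, [index]))
    return [members for _, members in clusters]
```

Pages generated from the same template share most of their tag paths and end up in one cluster, from which an extraction template could then be derived; pages with a different layout start a new cluster.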

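Authors: Qiu Taofen (邱韬奋), Yang Tianqi (杨天奇), Zeng Hongbo (曾洪波)

Affiliation: Department of Computer Science, College of Information Science and Technology, Jinan University (暨南大学)

Source: Microcomputer & Its Applications (《微型机与应用》), 2011, No. 4, pp. 71-74 (4 pages)

Funding: Guangdong Provincial Science and Technology Plan Project (2009B070300052)

Keywords: Web information extraction; page clustering; wrapper generation

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]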


