Web Extraction Method Based on DOM Tree and Domain Ontology (基于DOM树与领域本体的Web抽取方法)

Cited by: 6

Abstract: To address the automatic extraction of data regions and data records from heterogeneous Deep Web result pages, this paper proposes a Web extraction method based on the DOM tree and a domain ontology. The method labels the nodes of the DOM tree using data content features and a domain ontology library, locates the data region according to the regularities with which result pages present their data, and then applies an improved simple tree matching algorithm to identify data regions and data records. Experimental results show that the F-measure of this method for locating data regions and data records is 2.93% to 6.67% higher than that of traditional extraction methods.
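The abstract relies on simple tree matching to compare candidate record subtrees. The paper's improved variant is not described here, so the following is only a minimal sketch of the classic, unimproved Simple Tree Matching algorithm. The `Node` type, the function names, and the normalization by average tree size are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical DOM-like node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Classic Simple Tree Matching: number of node pairs in a maximum
    order-preserving matching between the trees rooted at a and b."""
    # Roots with different tags cannot be matched at all.
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best matching between the first i subtrees of a
    # and the first j subtrees of b (same shape as an LCS table).
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    # +1 accounts for the matched pair of roots.
    return dp[m][n] + 1


def tree_size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(tree_size(c) for c in t.children)


def normalized_similarity(a: Node, b: Node) -> float:
    """Matching size divided by the average size of the two trees (assumed
    normalization); 1.0 means the trees are structurally identical."""
    return 2 * simple_tree_matching(a, b) / (tree_size(a) + tree_size(b))


if __name__ == "__main__":
    rec1 = Node("li", [Node("a"), Node("span")])
    rec2 = Node("li", [Node("a"), Node("span"), Node("span")])
    print(simple_tree_matching(rec1, rec2))
    print(round(normalized_similarity(rec1, rec2), 3))
```

In a record-extraction setting, sibling subtrees whose normalized similarity exceeds some threshold would typically be grouped as candidate data records; the threshold and the grouping rule are left open here because the source does not specify them.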

Authors: Guo Jianbing (郭建兵), Cui Zhiming (崔志明), Chen Ming (陈明), Zhao Pengpeng (赵朋朋)

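Affiliations: Institute of Intelligent Information Processing and Application, Soochow University (苏州大学智能信息处理及应用研究所); Suzhou Pudaxin Information Technology Co., Ltd. (苏州普达新信息技术有限公司)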

Source: Computer Engineering (《计算机工程》), indexed in CAS and CSCD, 2012, No. 5, pp. 56-58 (3 pages)

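Funding: National Natural Science Foundation of China (60970015, 61003054); Jiangsu Province Enterprise Doctoral Innovation Fund (BK2009563); Natural Science Research Fund for Universities of Jiangsu Province (10KJB520018); Suzhou Special Fund for Technological Innovation in Science and Technology Enterprises (SG201043)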

Keywords: automatic extraction; DOM tree; domain ontology; data area positioning; simple tree matching

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]
