An Improved Method of Data Cross-Locating Based on Web Information Extraction

Abstract: To address the low extraction accuracy, long running time, and poor robustness of wrappers when extracting data from Web sites, an improved data cross-locating method is proposed, based on internal features and bottom-up induction. The method establishes two coordinate systems, one based on the text features of elements and one based on their attribute features, and cross-validates the coordinate values from the two systems to obtain the metadata to be extracted. Experimental results show that, compared with the absolute path method, the relative path method, the absolute feature path method, the relative feature path method, and the cross-locating method, the proposed method improves precision by 31.1% at the cost of a 2.2% drop in recall. Relative to the cross-locating method, it also reduces extraction time by 17.9 seconds.
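
The cross-validation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the two coordinate systems are constructed, so the sketch assumes that an "attribute feature" is a plain attribute match and a "text feature" is a regular-expression match on an element's own text, and it keeps only the elements located by both. All names here (ElementCollector, locate_by_attribute, locate_by_text, cross_locate, and the sample HTML) are hypothetical.

```python
from html.parser import HTMLParser
import re

# Elements that never have a closing tag; they must not stay on the open stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ElementCollector(HTMLParser):
    """Collect tag, attributes, and own text for every element, in document order."""

    def __init__(self):
        super().__init__()
        self.elements = []  # one dict per element encountered
        self._stack = []    # indices of currently open elements

    def handle_starttag(self, tag, attrs):
        self.elements.append({"tag": tag, "attrs": dict(attrs), "text": ""})
        if tag not in VOID_TAGS:
            self._stack.append(len(self.elements) - 1)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag (tolerates sloppy HTML).
        for pos in range(len(self._stack) - 1, -1, -1):
            if self.elements[self._stack[pos]]["tag"] == tag:
                del self._stack[pos:]
                break

    def handle_data(self, data):
        # Attribute text to the innermost open element only.
        if self._stack:
            self.elements[self._stack[-1]]["text"] += data.strip()


def locate_by_attribute(elements, name, value):
    """Assumed 'attribute coordinate': indices of elements whose attribute matches."""
    return {i for i, e in enumerate(elements) if e["attrs"].get(name) == value}


def locate_by_text(elements, pattern):
    """Assumed 'text coordinate': indices of elements whose own text fits the pattern."""
    rx = re.compile(pattern)
    return {i for i, e in enumerate(elements) if rx.fullmatch(e["text"])}


def cross_locate(html, attr_name, attr_value, text_pattern):
    """Keep only elements that both locators agree on, in document order."""
    collector = ElementCollector()
    collector.feed(html)
    collector.close()
    hits = (locate_by_attribute(collector.elements, attr_name, attr_value)
            & locate_by_text(collector.elements, text_pattern))
    return [collector.elements[i]["text"] for i in sorted(hits)]


SAMPLE = """
<div class="item"><span class="price">19.90</span><span class="price">sold out</span></div>
<div class="ad"><span class="note">29.90</span></div>
"""

prices = cross_locate(SAMPLE, "class", "price", r"\d+\.\d{2}")
print(prices)
```

In the sample, the attribute locator alone would also return the "sold out" element, and the text locator alone would also return the price inside the advertisement block. Intersecting the two discards both false hits, which is consistent with the trade-off the abstract reports: higher precision, with a possible loss of recall whenever a genuine item fails either check.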

Authors: Dong Wei, Ni Hong, Deng Haojiang, Liu Xue

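Affiliations: National Network New Media Engineering Research Center, Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences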

Source: Network New Media Technology, 2015, No. 4, pp. 28-34, 40 (8 pages in total)

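Funding: Strategic Priority Research Program project "Development of an Intelligent TV Platform and Service Support Environment" (XDA06040501); National Key Technology R&D Program project "Application Demonstration of a New Business Model for a TV Commerce Complex" (2012BAH73F02)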

Keywords: Web information extraction; cross-locating; wrapper; internal characteristic; DOM tree

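Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]; TP393.09 [Automation and Computer Technology: Computer Application Technology]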
