An Improved Method of Data Cross-Locating Based on Web Information Extraction
Abstract: Wrappers that extract data from Web sites often suffer from low extraction precision, long running time, and poor robustness. To address these problems, this paper proposes an improved data cross-locating method that is based on internal features and built by bottom-up induction. The method establishes two coordinate systems, one based on the text features of elements and one based on the attribute features of elements, and cross-validates the coordinate values from the two systems to obtain the metadata to be extracted. Experimental results show that, compared with the absolute path method, the relative path method, the absolute feature path method, the relative feature path method, and the cross-locating method, the proposed method improves precision by 31.1% at the cost of a slight 2.2% drop in recall; compared with the cross-locating method, it also shortens extraction time by 17.9 seconds.
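Source: Network New Media Technology (《网络新媒体技术》), 2015, No. 4, pp. 28-34, 40 (8 pages in total)
Funding: Strategic Priority Research Program project "Development of a Smart TV Platform and Service Support Environment" (XDA06040501); National Key Technology R&D Program project "Application Demonstration of a New Business Model for TV Commerce Complexes" (2012BAH73F02)
Keywords: Web information extraction; cross-locating; wrapper; internal characteristic; DOM tree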
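The abstract describes locating target elements in two feature-based coordinate systems and cross-validating the results, but it does not specify how the coordinates are computed. The following is only a minimal sketch of the general idea, under the assumption that a "coordinate" can be represented as an index-qualified DOM path and that cross-validation can be approximated by intersecting the two candidate sets. The page fragment, the function names (`walk`, `locate_by_text`, `locate_by_attr`, `cross_locate`), the regular expression, and the `class="price"` attribute are all hypothetical and are not taken from the paper.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html><body>
  <div class="nav"><span class="price">Deals from $5</span></div>
  <div class="item">
    <span class="title">Example Phone</span>
    <span class="price">$199.00</span>
    <span class="shipping">$20.00</span>
  </div>
</body></html>
"""


def walk(elem, path):
    """Yield (path, element) pairs; each step carries a per-tag sibling index."""
    yield path, elem
    counts = {}
    for child in elem:
        counts[child.tag] = counts.get(child.tag, 0) + 1
        yield from walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")


def locate_by_text(root, pattern):
    """Candidate paths whose own text fully matches a text-feature pattern."""
    return {
        path
        for path, elem in walk(root, f"/{root.tag}")
        if elem.text and re.fullmatch(pattern, elem.text.strip())
    }


def locate_by_attr(root, name, value):
    """Candidate paths whose attribute `name` equals `value`."""
    return {
        path
        for path, elem in walk(root, f"/{root.tag}")
        if elem.get(name) == value
    }


def cross_locate(root, pattern, name, value):
    """Keep only the paths confirmed by both the text and the attribute view."""
    return sorted(locate_by_text(root, pattern) & locate_by_attr(root, name, value))


root = ET.fromstring(PAGE.strip())
print(cross_locate(root, r"\$\d+\.\d{2}", "class", "price"))
```

In this toy page, the text pattern alone also matches the shipping cost, and the attribute alone also matches the navigation banner; only the intersection isolates the item price. This is meant to show why combining two independent feature views can raise precision, not to reproduce the paper's algorithm or its reported figures.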
