
Approach for Web Data Extraction Based on XPath Comparison (cited by: 4)
Abstract: This paper studies how to extract data from a Web page that contains several data blocks. Comparing the XPaths of the individual data blocks shows that they are very similar. Based on this observation, an XPath-comparison-based Extraction Rules Generation algorithm (XERG) is proposed for generating data block extraction rules. Once the block extraction rules are available, the information inside each block can be extracted with relative XPath expressions or regular expressions. Experimental results show that the method obtains the data blocks accurately and extracts the information inside them correctly.
Source: Journal of Zhengzhou University: Natural Science Edition (《郑州大学学报(理学版)》), CAS, 2007, No. 2, pp. 161-166 (6 pages)
Funding: National Natural Science Foundation of China, grant nos. 90412015 and 60603022
Keywords: Web data extraction; XPath comparison; XERG; regular expression
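
The idea summarized in the abstract can be illustrated with a small sketch. This is not the paper's XERG algorithm, whose details are not given in this record. It only assumes that two sample data blocks have absolute XPaths that differ in a positional index, and it generalizes them by dropping that index. The sample page, the function names `generalize` and `extract`, and the `Price:` pattern are all hypothetical. Python's `xml.etree.ElementTree` supports only a subset of XPath, so the rule is applied relative to the root element.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical well-formed sample page with three data blocks (the <li> items).
PAGE = """<html><body>
<div>header</div>
<div><ul>
<li><b>Book A</b> Price: 12.50</li>
<li><b>Book B</b> Price: 8.00</li>
<li><b>Book C</b> Price: 30.25</li>
</ul></div>
</body></html>"""

# One XPath step: a tag name with an optional positional index, e.g. "div[2]".
STEP = re.compile(r"^([^\[\]]+)(?:\[(\d+)\])?$")


def generalize(xpath_a, xpath_b):
    """Merge two absolute XPaths into one rule by dropping indices that differ."""
    steps_a = xpath_a.strip("/").split("/")
    steps_b = xpath_b.strip("/").split("/")
    if len(steps_a) != len(steps_b):
        raise ValueError("paths have different depths")
    rule = []
    for a, b in zip(steps_a, steps_b):
        tag_a, idx_a = STEP.match(a).groups()
        tag_b, idx_b = STEP.match(b).groups()
        if tag_a != tag_b:
            raise ValueError("paths diverge at %s vs %s" % (a, b))
        # Keep the step when both samples agree, otherwise keep only the tag.
        rule.append(a if idx_a == idx_b else tag_a)
    return "/" + "/".join(rule)


def extract(page, rule):
    """Apply a block rule, then read fields inside each block."""
    root = ET.fromstring(page)
    # ElementTree cannot evaluate absolute paths, so skip the root step.
    relative = "./" + "/".join(rule.strip("/").split("/")[1:])
    records = []
    for block in root.findall(relative):
        title = block.find("b").text              # relative XPath inside the block
        text = "".join(block.itertext())
        price = re.search(r"Price:\s*([\d.]+)", text).group(1)  # regular expression
        records.append((title, price))
    return records


if __name__ == "__main__":
    rule = generalize("/html/body/div[2]/ul/li[1]", "/html/body/div[2]/ul/li[3]")
    print(rule)
    print(extract(PAGE, rule))
```

Here the two sample paths differ only in the index of the last step, so the generalized rule selects every `li` under the second `div`. The title is read with a relative path and the price with a regular expression, mirroring the two in-block options named in the abstract.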

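References (7)

  • 1. Wang Yu, Wang Guangming. A study of the current state of comparison shopping [J]. Computer Era, 2005(8): 1-2. Cited by: 5
  • 2. Zhang Huiying, Qu Zhuwei. An interactive Web data extraction method based on subtree matching [J]. Computer Engineering, 2006, 32(9): 78-80. Cited by: 8
  • 3. Baumgartner R, Flesca S, Gottlob G. Visual Web information extraction with Lixto [C] // Proceedings of the Very Large Data Bases (VLDB), Roma, Italy, 2001: 119-128.
  • 4. W3C. http://www.w3.org/DOM/.
  • 5. Dayoo Book City (大洋书城): http://bookcity.dayoo.com.
  • 6. Lawrence S, Giles C L. Searching the World Wide Web [J]. Science Magazine, 1998, 280: 98-100.
  • 7. Jingcai Online Bookstore (精彩网上书城): http://www.exvv.com.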

Secondary references (7)

  • 1. Arasu A, Garcia-Molina H. Extracting Structured Data from Web Pages [C]. ACM SIGMOD'03, 2003: 337-348.
  • 2. Valiente G. An Efficient Bottom-up Distance Between Trees [C]. Proc. of the 8th International Symposium on String Processing and Information Retrieval, Santiago, Chile, 2001: 212-219.
  • 3. Ribeiro-Neto B, Alberto H F, da Silva L A S. Top-down Extraction of Semi-structured Data [Z]. IEEE Computer Society, 1999: 176-184.
  • 4. Selkow S M. The Tree-to-tree Editing Problem [J]. Information Processing Letters, 1977, 6(6): 184-186.
  • 5. http://www.marketingman.net/wmtheo/zh211.htm. Website rating: what is rated, and how? A comparative study of the rating methods of the major US rating sites. 2005.1
  • 6. Line Eikvil. Information Extraction from World Wide Web: A Survey (199). Survey Report, 1999.
  • 7. Alexa rankings, a battlefield without gunsmoke. http://www.cfan.com.cn/pages/20050301/890.htm, 2005.2.

Co-citing literature (11)

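Co-cited literature (18)

  • 1. Hu Dongdong, Meng Xiaofeng. An automatic Web data extraction method based on tree structure [J]. Journal of Computer Research and Development, 2004, 41(10): 1607-1613. Cited by: 21
  • 2. Yu Manquan, Chen Tierui, Xu Hongbo. Research and design of a block-based Web page information parser [J]. Journal of Computer Applications, 2005, 25(4): 974-976. Cited by: 55
  • 3. Yang Jingwei, Yang Wenzhu, Gao Yue. Construction and implementation of DOM-based Web information extraction rules [J]. Journal of Hebei University (Natural Science Edition), 2007, 27(2): 209-212. Cited by: 5
  • 4. Deng Jianshuang, Zheng Qilun, Peng Hong, Lin Xudong. Web page information extraction based on keyword clustering and node distance [J]. Computer Science, 2007, 34(4): 213-216. Cited by: 8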
  • 5. Crescenzi V, Mecca G, Merialdo P. Roadrunner: towards automatic data extraction from large Web sites [C] // Proceedings of the 27th International Conference on Very Large Data Bases. San Francisco: Morgan Kaufmann, 2001: 109-118.
  • 6. Yang Jaeyoung, Tae-Hyung Kim, Joongmin Choi. An interface agent for wrapper-based information extraction [J]. Lecture Notes in Computer Science, 2005(3371): 291-302.
  • 7. Meng X F, Hu D D, Li C. Schema guided wrapper maintenance for Web-data extraction [C] // Proceedings of ACM WIDM'2003. New York: ACM Press, 2003: 1-8.
  • 8. Arasu A, Garcia-Molina H. Extracting structured data from Web pages [C] // Proceedings of the 2003 ACM SIGMOD international conference on Management of data. New York: ACM Press, 2003: 337-348.
  • 9. Zhu Jun, Nie Zaiqing, Wen Jirong, et al. Simultaneous record detection and attribute labeling in web data extraction [C] // Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. New York: ACM Press, 2006: 494-503.
  • 10. Wang J, Lochovsky F. Data-rich section extraction from HTML pages [C] // Proceedings of the 3rd International Conference on Web Information Systems Engineering. Washington: IEEE Computer Society, 2002: 313-322.

Citing literature (4)

Secondary citing literature (4)
