Data Extraction Based on Vision and Tag Path (original title: 基于视觉信息和标签路径的数据抽取)
Abstract: This paper studies the extraction of semi-structured data from Deep Web query result pages by combining a page's visual information with its DOM tree structure. The data region is located by the ratio of a visual block's area to the area of the whole page. A set of nodes containing data records is then identified from visual features, such as the fact that data records are adjacent to one another, and noise nodes are removed by comparing the similarity of the nodes' DOM tree structures. The data items of each record are aligned according to their xpath attributes. A template is generated for the whole extraction process, which greatly improves extraction efficiency. Data extraction experiments on eight Deep Web sites show that the method is effective.
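Source: Periodical of Ocean University of China (Natural Science Edition, 《中国海洋大学学报(自然科学版)》), 2015, No. 5, pp. 114-119 (6 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
Funding: Supported by the Natural Science Foundation of Shandong Province (ZR2012FM016).
Keywords: Deep Web data extraction; visual feature; tag path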
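The two DOM-based steps named in the abstract, removing noise nodes by structural similarity and aligning data items by tag path, can be illustrated with a minimal sketch. This is not the authors' implementation. The sample markup, the use of Jaccard similarity over tag-path sets, the 0.3 threshold, and the helper names `tag_paths`, `jaccard`, `filter_noise`, and `align` are all assumptions made for illustration. The visual steps (area ratio, adjacency) are omitted because they need a rendered layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical result list: three data records plus one noise node.
PAGE = """
<ul>
  <li><a>Book A</a><span>10.00</span></li>
  <li><a>Book B</a></li>
  <li><a>Book C</a><span>12.50</span></li>
  <li><em>Sponsored links</em></li>
</ul>
"""


def tag_paths(node, prefix=""):
    """Yield (tag path, text) for each element under node that has text."""
    path = f"{prefix}/{node.tag}"
    text = (node.text or "").strip()
    if text:
        yield path, text
    for child in node:
        yield from tag_paths(child, path)


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed stand-in for DOM similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def filter_noise(nodes, threshold=0.3):
    """Keep nodes whose mean similarity to the other candidates is high enough."""
    path_sets = [{p for p, _ in tag_paths(n)} for n in nodes]
    kept = []
    for i, node in enumerate(nodes):
        sims = [jaccard(path_sets[i], s)
                for j, s in enumerate(path_sets) if j != i]
        if sims and sum(sims) / len(sims) >= threshold:
            kept.append(node)
    return kept


def align(records):
    """Align data items into columns keyed by tag path; missing items are None.

    Simplification: if a path repeats inside one record, the last text wins.
    """
    columns, rows = [], []
    for rec in records:
        row = dict(tag_paths(rec))
        rows.append(row)
        for path in row:
            if path not in columns:
                columns.append(path)
    return columns, [[row.get(p) for p in columns] for row in rows]


root = ET.fromstring(PAGE.strip())
records = filter_noise(list(root))
columns, table = align(records)
print(columns)
for line in table:
    print(line)
```

In this sketch the "Sponsored links" item shares no tag path with the other candidates, so it is dropped, and the record without a price still lines up with the others because the alignment is keyed on the path rather than on item position.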