
Deep Web Query Page Content Extraction Using Vision and Tag Information (cited by: 1)

Combining vision information and tag information to extract Deep Web result pages content
Abstract: Extracting content from Deep Web pages is challenging because of the intricate structure of such pages. This paper proposes DVS, an approach that combines the vision information and the tag information of Deep Web result pages to extract their content structure. DVS works in two steps. First, it analyzes the Cascading Style Sheet (CSS) information and the DOM tree of a page to obtain its vision and tag information, and builds an initial visual tree. Second, it applies the Path Shingle (PS) tree path similarity algorithm, taking both tag and vision information into account, to compute the similarity of the blocks in the visual tree, and clusters the blocks accordingly to produce the final visual tree, i.e., the content structure of the page. The main features of DVS are that it extracts the content structure from both vision and tag information, and that it represents vision information as a tree, so that analyzing vision information becomes analyzing a "vision attribute" tree. Experiments on the UIUC TEL data set show that DVS achieves higher precision than the WTS and VIPS algorithms.
Authors: FENG Yong (冯永), TANG Li (唐黎)
Source: Journal of Chongqing University (Natural Science Edition), 2012, No. 6, pp. 117-124 (8 pages). Indexed in: EI, CAS, CSCD, Peking University Core Journals.
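Funding: National Natural Science Foundation of China (61103114); Chongqing Higher Education Teaching Reform Research Key Project (112023); Fundamental Research Funds for the Central Universities (CDJXS11181164); "211 Project" Phase III Construction Project (S-10218).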
Keywords: deep web; content extraction; DOM tree; cascading style sheet (CSS); visual tree
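
The second step described in the abstract (comparing blocks of the visual tree by path similarity and clustering them) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Node` class, the encoding of visual attributes as CSS-like strings, the Jaccard measure over path shingles, the shingle length `k`, the greedy clustering, and the `threshold` value are all assumptions made here for illustration, since the abstract does not give these details.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical visual-tree node: an HTML tag plus CSS-derived visual attributes."""
    tag: str
    style: tuple = ()  # assumed encoding, e.g. ("font-size:14px",)
    children: list = field(default_factory=list)


def paths(node, prefix=()):
    """Yield every root-to-node path; each step combines tag and visual information."""
    step = (node.tag,) + tuple(sorted(node.style))
    current = prefix + (step,)
    yield current
    for child in node.children:
        yield from paths(child, current)


def shingles(node, k=2):
    """Collect the set of length-k windows (shingles) over all paths of a block."""
    result = set()
    for path in paths(node):
        if len(path) < k:
            result.add(path)  # paths shorter than k are kept whole
        else:
            for i in range(len(path) - k + 1):
                result.add(path[i:i + k])
    return result


def similarity(a, b, k=2):
    """Jaccard similarity of two blocks' shingle sets (an assumed measure)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cluster(blocks, threshold=0.8, k=2):
    """Greedy clustering: a block joins the first cluster whose first member is similar enough."""
    clusters = []
    for block in blocks:
        for group in clusters:
            if similarity(group[0], block, k) >= threshold:
                group.append(block)
                break
        else:
            clusters.append([block])
    return clusters


def record(font):
    """Build a toy result-record block with a link and a grey caption."""
    return Node("div", ("display:block",), [
        Node("a", (f"font-size:{font}",)),
        Node("span", ("color:gray",)),
    ])


r1, r2 = record("14px"), record("14px")
ad = Node("table", ("display:table",), [Node("tr", (), [Node("td")])])
groups = cluster([r1, ad, r2])  # the two records group together; the table stays apart
```

Because each path step carries both the tag and the visual attributes, two blocks with identical tags but different styling score lower than two blocks that match on both, which is the point of combining the two kinds of information.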

