Content extraction from Deep Web query result pages using vision and tag information (cited by: 1)

Combining vision information and tag information to extract Deep Web result pages content

Abstract: Extracting content from Deep Web pages is challenging because of the intricate structure underlying such pages. This paper proposes DVS, a method that combines a page's vision information and tag information to extract its content structure. DVS works in two steps. First, it analyzes the page's Cascading Style Sheet (CSS) style information and DOM tree to obtain the vision information and tag information, producing an initial visual tree of the Deep Web result page. Second, it applies a tree path similarity algorithm, Path Shingle (PS), which considers both the tag information and the vision information, to compute the similarity of the blocks in the visual tree; the blocks are then clustered by similarity to yield the final visual tree, i.e., the content structure of the page. DVS has two distinguishing features: it extracts the content structure from both vision information and tag information, and it stores the vision information as a tree, which turns the analysis of vision information into the analysis of a "vision-attribute" tree. Experiments on UIUC's TEL dataset compare DVS with the WTS algorithm and the VIPS algorithm; the results show that DVS achieves higher precision than both.
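The second step described in the abstract (path-based similarity followed by clustering) can be illustrated with a minimal Python sketch. The abstract does not give the details of the Path Shingle computation, so everything below is an assumption made for illustration: the `Node` class, the way a tag and its CSS attributes are merged into one label, the shingle width `w`, the use of Jaccard similarity, the greedy clustering rule, and the `threshold` value are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical visual-tree node: an HTML tag plus CSS-derived attributes."""
    tag: str
    style: tuple = ()                      # e.g. (("font-size", "14px"),)
    children: list = field(default_factory=list)

    def label(self):
        # Merge tag information and vision information into one path label.
        visual = ";".join(f"{k}={v}" for k, v in sorted(self.style))
        return f"{self.tag}[{visual}]"


def root_to_leaf_paths(node, prefix=()):
    """Return every root-to-leaf label path of the subtree rooted at `node`."""
    path = prefix + (node.label(),)
    if not node.children:
        return [path]
    paths = []
    for child in node.children:
        paths.extend(root_to_leaf_paths(child, path))
    return paths


def path_shingles(node, w=2):
    """Set of all runs of `w` consecutive root-to-leaf paths (assumed shingling)."""
    paths = root_to_leaf_paths(node)
    if len(paths) < w:
        return {tuple(paths)}
    return {tuple(paths[i:i + w]) for i in range(len(paths) - w + 1)}


def similarity(a, b, w=2):
    """Jaccard similarity of two blocks' path-shingle sets, in [0, 1]."""
    sa, sb = path_shingles(a, w), path_shingles(b, w)
    return len(sa & sb) / len(sa | sb)


def cluster_blocks(blocks, threshold=0.5, w=2):
    """Greedy clustering: a block joins the first cluster whose first member
    is at least `threshold` similar; otherwise it starts a new cluster."""
    clusters = []
    for block in blocks:
        for cluster in clusters:
            if similarity(cluster[0], block, w) >= threshold:
                cluster.append(block)
                break
        else:
            clusters.append([block])
    return clusters
```

Under these assumptions, two result records that share the same tags and visual attributes produce identical shingle sets and land in one cluster, while a block with a different layout (for example an advertisement) starts its own cluster. Changing a visual attribute such as `font-size` changes the path labels, so blocks with identical tags but different rendering are kept apart.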

Authors: Feng Yong (冯永), Tang Li (唐黎)

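Affiliations: College of Computer Science, Chongqing University; Key Laboratory of Dependable Service Computing in Cyber Physical Society (Ministry of Education), Chongqing University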

Source: Journal of Chongqing University (Natural Science Edition), 2012, Issue 6, pp. 117-124 (8 pages). Indexed in: EI, CAS, CSCD, Peking University Core Journals

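Funding: National Natural Science Foundation of China (61103114); Chongqing Higher Education Teaching Reform Research Key Project (112023); Fundamental Research Funds for the Central Universities (CDJXS11181164); "211 Project" Phase III Construction Project (S-10218)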

Keywords: deep web; content extraction; DOM tree; cascading style sheet; visual tree

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]


