Deep Web Data Region Identification Based on Similar URL (基于相似URL的深层网数据区域识别)

Abstract: Noise information in Deep Web query result pages interferes with the identification of the data region. To address this problem, this paper proposes a method for automatically identifying the data region in Deep Web query result pages. The method uses the continuous repetitive structure of the pages and similar URLs to divide each page into semantic blocks, and then identifies the data region according to the URL similarity between different page blocks. Experimental results show that the method improves the recall rate and accuracy of data region identification.
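The abstract does not say how URL similarity is measured, so the following is only a minimal sketch under stated assumptions, not the authors' algorithm. It assumes that the page has already been split into blocks, that two URLs count as similar when they share host, path, and query parameter names, and that the data region is the block with the largest group of such URLs. The names `url_pattern`, `largest_similar_group`, and `pick_data_region`, and the example URLs, are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl


def url_pattern(url):
    """Reduce a URL to (host, path, sorted query parameter names).

    Assumption: URLs that differ only in query values are "similar".
    """
    parts = urlsplit(url)
    names = tuple(sorted(
        name for name, _ in parse_qsl(parts.query, keep_blank_values=True)
    ))
    return (parts.netloc.lower(), parts.path, names)


def largest_similar_group(urls):
    """Size of the largest group of URLs that share one pattern."""
    if not urls:
        return 0
    return Counter(url_pattern(u) for u in urls).most_common(1)[0][1]


def pick_data_region(blocks):
    """Return the name of the block with the largest similar-URL group."""
    return max(blocks, key=lambda name: largest_similar_group(blocks[name]))


# Hypothetical blocks: each maps a block name to the links found inside it.
blocks = {
    "nav": [
        "http://example.com/",
        "http://example.com/about",
        "http://example.com/help",
    ],
    "results": [
        "http://example.com/item?id=1",
        "http://example.com/item?id=2",
        "http://example.com/item?id=3",
    ],
    "ads": ["http://ads.example.net/click?c=9"],
}
print(pick_data_region(blocks))  # prints "results"
```

In this toy input the navigation links all have different paths, while the three result links differ only in the value of `id`, so the `results` block is selected. A real page would also need the block segmentation step, which this sketch leaves out.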

Authors: Kong Yanyan (孔燕燕), Shi Huaji (施化吉)

Affiliation: School of Computer Science and Communication Engineering, Jiangsu University

Source: Computer Engineering (《计算机工程》), indexed in CAS and CSCD, 2012, No. 2, pp. 48-50 (3 pages)

Funding: National Natural Science Foundation of China (Grant No. 61003288)

Keywords: Deep Web; repetitive structure; similar URL; semantic block; data region

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
