Deep Web Data Region Identification Based on Similar URL (cited by: 1)

Abstract: Noise information in Deep Web query result pages interferes with the identification of the data region. To address this problem, this paper proposes a method for automatically identifying the data region in Deep Web query result list pages. The method uses the continuous repetitive structure of the page and similar URLs to divide the page into different semantic blocks, and identifies the block containing the data region according to the URL similarity between different page blocks. Experimental results show that the approach improves the recall rate and accuracy of data region identification.
Source: Computer Engineering (《计算机工程》), CAS, CSCD, 2012, No. 2, pp. 48-50 (3 pages)
Funding: National Natural Science Foundation of China (61003288)
Keywords: Deep Web; repetitive structure; similar URL; semantic block; data region
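
The abstract describes choosing the data region by the similarity of the URLs found in candidate page blocks, but this record does not give the similarity measure or the selection rule. The following is only a minimal sketch of that general idea under stated assumptions: the helper names (`url_signature`, `block_url_similarity`, `pick_data_region`), the signature-based similarity, and the `min_links` threshold are all hypothetical and are not taken from the paper. It assumes the page has already been split into blocks, each represented by the list of link URLs it contains.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qsl


def url_signature(url):
    """Reduce a URL to a coarse pattern (assumed similarity key).

    The pattern is (lowercased host, path with digit runs masked as '#',
    sorted query parameter names). Two URLs count as "similar" here when
    their patterns are equal.
    """
    parts = urlsplit(url)
    masked_path = re.sub(r"\d+", "#", parts.path)
    query_keys = tuple(
        sorted(key for key, _ in parse_qsl(parts.query, keep_blank_values=True))
    )
    return (parts.netloc.lower(), masked_path, query_keys)


def block_url_similarity(urls):
    """Return the share of URLs in a block that follow its dominant pattern."""
    if not urls:
        return 0.0
    counts = Counter(url_signature(u) for u in urls)
    dominant_count = counts.most_common(1)[0][1]
    return dominant_count / len(urls)


def pick_data_region(blocks, min_links=3):
    """Pick the block whose links are most alike (hypothetical rule).

    blocks maps a block name to the list of URLs inside that block.
    Blocks with fewer than min_links links are skipped. Ties on
    similarity are broken in favor of the block with more links.
    Returns the chosen block name, or None if no block qualifies.
    """
    candidates = [
        (block_url_similarity(urls), len(urls), name)
        for name, urls in blocks.items()
        if len(urls) >= min_links
    ]
    if not candidates:
        return None
    return max(candidates)[2]
```

In this sketch a navigation bar tends to score low because its links point to unrelated paths, while a list of result records tends to score high because the record links share one URL template. A real system would also need the block segmentation step based on repetitive structure, which is not shown here.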