
Automatic Web Information Extraction Based on Repetitive Pattern (基于重复模式的自动Web信息抽取)
Cited by: 8
Abstract: Many online shopping sites exist on the Web, and extracting the product information from their pages can provide value-added services for e-commerce and Web querying. This paper proposes an automatic Web information extraction method for such sites. The method obtains the topic content of a single Web page by detecting repetitive patterns in the page and analyzing the characteristics of the topic content, and it requires no human intervention during extraction. The method was tested on 10 online shopping sites, and the experimental results show that it is effective.
Source: Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2008, No. 22, pp. 73-76 (4 pages)
Funding: National Natural Science Foundation of China (60673043); National Social Science Foundation of China (07BYY051)
Keywords: Web information extraction; DOM tree; repetitive pattern
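
The abstract describes the method only at a high level: find repetitive patterns in the page's DOM tree and use them to locate the topic content. The sketch below is one possible reading of that idea, not the authors' algorithm. It assumes well-nested HTML, treats sibling subtrees with identical tag structure as one "repetitive pattern", and picks the repeated group that carries the most text as a stand-in for the paper's topic-content analysis. All names (`Node`, `TreeBuilder`, `signature`, `find_repeated_region`, `record_fields`) and the `min_repeats` threshold are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, assumed sufficient here).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree; assumes well-nested HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def signature(node):
    """Tag structure of a subtree, ignoring text and attributes."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def text_length(node):
    return len(node.text) + sum(text_length(c) for c in node.children)


def find_repeated_region(html, min_repeats=3):
    """Returns the sibling subtrees of the repeated group with the most text."""
    builder = TreeBuilder()
    builder.feed(html)
    best = None  # (score, records)
    stack = [builder.root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        counts = Counter(signature(c) for c in node.children)
        if not counts:
            continue
        sig, n = counts.most_common(1)[0]
        if n < min_repeats:
            continue
        records = [c for c in node.children if signature(c) == sig]
        score = sum(text_length(r) for r in records)
        if best is None or score > best[0]:
            best = (score, records)
    return [] if best is None else best[1]


def record_fields(node):
    """Collects the non-empty text fragments of one record."""
    fields = [node.text] if node.text else []
    for child in node.children:
        fields.extend(record_fields(child))
    return fields
```

Called on a page whose product list is a run of identically structured `<li>` elements, `find_repeated_region` would return those elements, and `record_fields` would flatten each one into its text fragments. Real shopping pages rarely repeat with exactly identical structure, so a practical system would need an approximate tree or string similarity measure in place of the exact `signature` match used here.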











