
Unregulated Table Structure Web Information Extraction Based on Finite State Automata
Abstract: A large amount of Web information is presented in unregulated (irregular) table structures, which current Web information extraction must be able to handle. Building on existing methods, this paper gives an algorithm that inductively learns sets of contextual rules between adjacent attributes, and proposes the concepts of an attribute transducer and a finite state automaton wrapper, both with the Web page as their granularity. Finally, it presents algorithms that use the finite state automaton wrapper to extract Web information with unregulated table structures.
Source: Engineering Journal of Wuhan University (《武汉大学学报(工学版)》), 2005, No. 6, pp. 128-132 (5 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
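Funding: National Natural Science Foundation of China (No. 60273072); National High Technology Research and Development Program of China (863 Program) (No. 2002AA423450).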
Keywords: information extraction; contextual rule set; finite state automata (FSA); automaton wrapper
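
The abstract names the approach without showing it, so the following is only a minimal sketch of the general idea of a finite state automaton wrapper, not the paper's algorithm. It assumes that each state is an attribute, that each transition between two adjacent attributes carries a contextual rule (here a hand-written regular expression rather than a learned one), and that a missing attribute is handled by a transition that skips it. The attribute names, the `RULES` table, and the `extract` function are all hypothetical.

```python
import re

# Hypothetical contextual rules: (previous attribute, next attribute) -> pattern.
# Each pattern captures the value of the next attribute in group 1.
PRICE = re.compile(r"<td>Price</td>\s*<td>(.*?)</td>")
RULES = {
    ("START", "title"): re.compile(r"<td>Title</td>\s*<td>(.*?)</td>"),
    ("title", "author"): re.compile(r"<td>Author</td>\s*<td>(.*?)</td>"),
    ("title", "price"): PRICE,   # transition used when "author" is missing
    ("author", "price"): PRICE,
}


def extract(page, rules=RULES, start="START"):
    """Walk the automaton over one page and collect attribute values."""
    state, pos, record = start, 0, {}
    while True:
        best = None
        # Among the transitions leaving the current state, take the one
        # whose contextual rule matches earliest after the cursor.
        for (src, dst), pattern in rules.items():
            if src != state:
                continue
            match = pattern.search(page, pos)
            if match and (best is None or match.start() < best[1].start()):
                best = (dst, match)
        if best is None:
            break  # no outgoing transition fires: extraction ends
        state, match = best
        record[state] = match.group(1).strip()
        pos = match.end()
    return record


full_row = ("<tr><td>Title</td><td>FSA Wrappers</td></tr>"
            "<tr><td>Author</td><td>Li</td></tr>"
            "<tr><td>Price</td><td>12</td></tr>")
no_author = ("<tr><td>Title</td><td>FSA Wrappers</td></tr>"
             "<tr><td>Price</td><td>12</td></tr>")

print(extract(full_row))
print(extract(no_author))
```

In this sketch the second page lacks the author row, so the automaton moves directly from `title` to `price` instead of failing, which is the kind of irregularity a fixed row-by-row table template would not tolerate.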


