Web Information Extraction Based on Hidden Markov Model (original title: 基于隐马尔可夫模型的Web信息抽取)

Cited by: 6
Abstract: To address the "missing item" and "unordered item" problems in Web information extraction, this paper proposes a Web information extraction method based on the Hidden Markov Model (HMM). A Web document is parsed into an extended DOM tree. Each information item to be extracted is mapped to a state, and the path of that item in the extended DOM tree is mapped to a vocabulary symbol (an observation). An induction algorithm is then used to construct the HMM. Experimental results show that the method achieves better extraction performance.
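Authors: Liu Yaqing (刘亚清), Chen Rong (陈荣)
Source: Computer Engineering (《计算机工程》), 2009, No. 18, pp. 25-27 (3 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China (60775028); Major Project funded by the Dalian Science and Technology Bureau (2007A14GX042)
Keywords: information extraction; Hidden Markov Model (HMM); extended DOM tree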
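
The mapping described in the abstract (information items as hidden states, DOM-tree paths as observation symbols) can be illustrated with a small sketch. The abstract does not give the details of the extended DOM tree or of the induction algorithm, so this example substitutes two standard techniques as assumptions: parameter estimation by add-one smoothed counting over labeled sequences, and Viterbi decoding. The item names (`title`, `price`, `description`), the paths, and the function names `train_hmm` and `viterbi` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter, defaultdict


def train_hmm(sequences, alpha=1.0):
    """Estimate HMM log-probabilities by smoothed counting (an assumed
    stand-in for the paper's induction algorithm).

    sequences: list of pages, each a list of (dom_path, item_label) pairs.
    """
    states = sorted({s for seq in sequences for _, s in seq})
    vocab = sorted({p for seq in sequences for p, _ in seq})
    start = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for seq in sequences:
        start[seq[0][1]] += 1
        for path, state in seq:
            emit[state][path] += 1
        for (_, a), (_, b) in zip(seq, seq[1:]):
            trans[a][b] += 1

    def log_dist(counts, support):
        # Add-alpha smoothing keeps unseen transitions possible, which is
        # what lets the model tolerate missing or reordered items.
        total = sum(counts.values()) + alpha * len(support)
        return {k: math.log((counts[k] + alpha) / total) for k in support}

    return {
        "states": states,
        "start": log_dist(start, states),
        "trans": {s: log_dist(trans[s], states) for s in states},
        "emit": {s: log_dist(emit[s], vocab) for s in states},
        "unk": math.log(1e-6),  # assumed floor for paths never seen in training
    }


def viterbi(model, paths):
    """Return the most likely item label for each observed DOM path."""
    if not paths:
        return []
    states = model["states"]

    def e(state, path):
        return model["emit"][state].get(path, model["unk"])

    score = {s: model["start"][s] + e(s, paths[0]) for s in states}
    back = []
    for path in paths[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + model["trans"][r][s])
            score[s] = prev[best] + model["trans"][best][s] + e(s, path)
            ptr[s] = best
        back.append(ptr)
    labels = [max(states, key=lambda s: score[s])]
    for ptr in reversed(back):
        labels.append(ptr[labels[-1]])
    return labels[::-1]


# Hypothetical training pages: one complete, one reordered, one with a missing item.
H1 = "/html/body/div/h1"
PRICE = "/html/body/div/span[@class='price']"
P = "/html/body/div/p"
TRAIN = [
    [(H1, "title"), (PRICE, "price"), (P, "description")],
    [(PRICE, "price"), (H1, "title"), (P, "description")],
    [(H1, "title"), (P, "description")],
]

model = train_hmm(TRAIN)
print(viterbi(model, [H1, P]))  # page with the price item missing
print(viterbi(model, [P, H1]))  # page with items in an unseen order
```

Because every transition keeps some smoothed probability mass, a page that omits an item or presents items in an order not seen during training can still be labeled from the emission evidence carried by the paths. How the paper's own model achieves this is not stated in the abstract.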
References (5)

  • 1. Laender A, Ribeiro-Neto B, Silva A, et al. A Brief Survey of Web Data Extraction Tools[J]. ACM SIGMOD Record, 2002, 31(2): 84-93.
  • 2. Hammer J, McHugh J, Garcia-Molina H. Semi-structured Data: The TSIMMIS Experience[C]//Proceedings of the 1st East-European Symposium on Advances in Databases and Information Systems. St. Petersburg, Russia: [s. n.], 1997.
  • 3. Crescenzi V, Mecca G, Merialdo P. RoadRunner: Towards Automatic Data Extraction from Large Web Sites[C]//Proceedings of 27th Int'l Conference on Very Large Databases. San Francisco, USA: [s. n.], 2001.
  • 4. Muslea I, Minton S, Knoblock C. Hierarchical Wrapper Induction for Semi-structured Information Sources[J]. Autonomous Agents and Multi-Agent Systems, 2001, 4(1/2): 93-114.
  • 5. Soderland S. Learning Information Extraction Rules for Semistructured and Free Text[J]. Machine Learning, 1999, 34(1-3): 233-272.

Co-cited documents: 75

Citing documents: 6

Second-level citing documents: 25