
Research on Web Information Extraction Model Based on XML and DOM Technologies (cited by: 1)
Abstract: XML technology is applied to search engines, and a Web information extraction model based on XML and DOM technologies is proposed. The model's four stages (data acquisition, page optimization, extraction rule generation, and information extraction) are analyzed in detail. The application of web crawlers, NekoHTML, Xerces-J, JTree, XPath, and XSLT to Web information extraction is discussed, and semi-automated Web information extraction is realized.
Source: Journal of Dalian Jiaotong University (《大连交通大学学报》), CAS, 2013, No. 3, pp. 96-99, 118 (5 pages in total)
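Funding: Open Fund of the State Key Laboratory of Software Engineering, Wuhan University (SKLSE2012-9-27); Sichuan Province Key Laboratory Fund (GK201202); Fund of the Guangxi Key Laboratory of Hybrid Computation and IC Design Analysis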
Keywords: information extraction; XML technology; DOM technology; Web page
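
As a rough illustration of the last two stages named in the abstract (extraction rule generation and information extraction), the following minimal sketch applies an XPath rule to a DOM tree using only the JDK's built-in XML APIs. It is not the paper's implementation: the sample page, the class name `XPathExtractionSketch`, and the rule `//div[@class='item']/h2` are hypothetical, and the page is assumed to have already been cleaned into well-formed XHTML (the step the abstract assigns to NekoHTML). Crawling and XSLT output are left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathExtractionSketch {

    // Hypothetical page, assumed to be already cleaned into well-formed XHTML.
    static final String CLEANED_PAGE =
        "<html><body>"
        + "<div class=\"item\"><h2>First title</h2><span class=\"price\">10</span></div>"
        + "<div class=\"item\"><h2>Second title</h2><span class=\"price\">20</span></div>"
        + "</body></html>";

    // Parses the page into a DOM tree and returns the text of every node
    // selected by the given XPath extraction rule.
    static List<String> extract(String xml, String rule) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document dom = builder.parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(rule, dom, XPathConstants.NODESET);

        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical extraction rule: the heading of every "item" block.
        String rule = "//div[@class='item']/h2";
        for (String title : extract(CLEANED_PAGE, rule)) {
            System.out.println(title);
        }
    }
}
```

In a semi-automated setting like the one the abstract describes, the rule string would presumably come from a user selecting a node in a tree view of the DOM rather than being written by hand; that part is not shown here.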












