
D-EEM: A DOM-Tree Based Entity Extraction Mechanism for Deep Web (cited by: 16)
Abstract: As the number of Web databases grows, accessing the Deep Web is becoming the main way to acquire information. The large volume of unstructured content, the heterogeneous results, and the dynamic data in the Deep Web pose new challenges for entity extraction, so effectively extracting the entities contained in Deep Web result pages is an important problem. Based on an analysis of the characteristics of result pages, this paper presents a DOM-tree based entity extraction mechanism for the Deep Web (D-EEM). D-EEM is modeled as three levels: the expression level, the extraction level, and the collection level. Region location and semantic annotation are its core components and the focus of the paper. D-EEM applies a DOM-tree based automatic entity extraction strategy that uses both the textual content and the hierarchical structure of the DOM tree to determine data regions and entity regions, which improves extraction accuracy. In addition, a semantic annotation method based on context distance and co-occurrence count is proposed to merge the extraction results from different data sources. Experiments verify the feasibility and effectiveness of the key techniques used in D-EEM; compared with other entity extraction strategies, D-EEM shows certain advantages in extraction efficiency and accuracy.
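Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed by EI, CSCD, and the Peking University Core Journals list; 2010, No. 5, pp. 858-865 (8 pages)
Funding: National Natural Science Foundation of China (60673139, 60973021); National High Technology Research and Development Program of China ("863" Program) (2008AA01Z146); Fundamental Research Funds for the Central Universities (NO90304005)
Keywords: entity extraction; DOM-tree; Deep Web; data region location; entity region location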
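The abstract describes locating data regions and entity regions from the textual content and hierarchical structure of the DOM tree. This record does not give the D-EEM algorithm itself, so the following Python sketch only illustrates the general idea with a simple stand-in heuristic: the data region is taken to be the node whose children most often repeat the same shallow structure while carrying the most text, and those repeated children are treated as entity regions. The names `Node`, `TreeBuilder`, `signature`, `locate_data_region`, and `SAMPLE_HTML` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from html.parser import HTMLParser


class Node:
    """A minimal DOM node: tag name, parent, children, and direct text chunks."""
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], []


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree with the standard-library HTML parser."""
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in self.VOID:          # void elements never hold children
            self.cur = node

    def handle_endtag(self, tag):
        n = self.cur                      # close the nearest open element with this tag
        while n is not None and n.tag != tag:
            n = n.parent
        if n is not None and n.parent is not None:
            self.cur = n.parent

    def handle_data(self, data):
        if data.strip():
            self.cur.text.append(data.strip())


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def signature(node):
    """Shallow structural shape: the node's tag plus the tags of its children."""
    return (node.tag, tuple(c.tag for c in node.children))


def texts(node):
    """All text chunks in the subtree, in document order for simple markup."""
    out = list(node.text)
    for c in node.children:
        out.extend(texts(c))
    return out


def walk(node):
    yield node
    for c in node.children:
        yield from walk(c)


def locate_data_region(root):
    """Assumed heuristic: pick the node whose repeated children hold the most text."""
    best_score, best_node, best_records = 0, None, []
    for node in walk(root):
        groups = Counter(signature(c) for c in node.children)
        if not groups:
            continue
        sig, count = groups.most_common(1)[0]
        if count < 2:                     # a data region needs repeated records
            continue
        records = [c for c in node.children if signature(c) == sig]
        score = sum(len(t) for r in records for t in texts(r))
        if score > best_score:
            best_score, best_node, best_records = score, node, records
    return best_node, best_records


SAMPLE_HTML = """
<html><body>
<div id="nav"><a href="/">Home</a></div>
<ul>
  <li><b>Book A</b><span>10.00</span></li>
  <li><b>Book B</b><span>12.50</span></li>
  <li><b>Book C</b><span>8.75</span></li>
</ul>
</body></html>
"""

region, entities = locate_data_region(parse(SAMPLE_HTML))
for entity in entities:
    print(region.tag, texts(entity))
```

On the hypothetical sample page, the `ul` element is selected as the data region and each `li` as an entity region. Real result pages have irregular records, so a practical system would need a more tolerant similarity measure than exact signature equality.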

