Abstract
The development of Web technology has led to a surge in Web data, of which Deep Web data accounts for the major share. Entity recognition is the primary prerequisite for research on key Deep Web technologies such as pattern recognition and data integration. To improve the efficiency and accuracy of entity recognition, a template-based information extraction method for Deep Web entity recognition is proposed. The method has three processing stages. The most critical is the template training stage, which builds extraction rules based on the DOM tree; the rules are obtained through two phases, structure analysis and semantic analysis. The method also includes two auxiliary stages: data preparation and entity information extraction. Experimental results show that the proposed method improves the accuracy of entity recognition while also achieving good information extraction efficiency.
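As a rough illustration of the template idea described in the abstract, the sketch below learns DOM-tree paths from one hand-labeled sample page and reuses them on a page with the same layout. It is not the authors' algorithm: the function names (`tag_paths`, `train_template`, `extract_entity`), the sample pages, and the field labels are all hypothetical, the input is assumed to be well-formed markup, and the semantic analysis phase is replaced here by manually supplied labels.

```python
import xml.etree.ElementTree as ET


def tag_paths(root):
    """Yield (path, text) for every element that carries non-empty text.

    A path is the chain of tags from the root, with a per-tag sibling index,
    e.g. "div/ul[1]/li[2]".
    """
    def walk(node, path):
        text = (node.text or "").strip()
        if text:
            yield path, text
        counts = {}
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            yield from walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    yield from walk(root, root.tag)


def train_template(sample_html, labeled_values):
    """Template training (structure only): map each field to the DOM path
    where its labeled value appears in the sample page."""
    root = ET.fromstring(sample_html)
    value_to_field = {value: field for field, value in labeled_values.items()}
    return {
        value_to_field[text]: path
        for path, text in tag_paths(root)
        if text in value_to_field
    }


def extract_entity(page_html, template):
    """Entity information extraction: read the text found at each template path."""
    by_path = dict(tag_paths(ET.fromstring(page_html)))
    return {field: by_path.get(path) for field, path in template.items()}


# Hypothetical pages that share one layout (assumed, not taken from the paper).
SAMPLE = "<div><h1>Deep Web Mining</h1><ul><li>Zhang San</li><li>2016</li></ul></div>"
NEW_PAGE = "<div><h1>Entity Resolution</h1><ul><li>Li Si</li><li>2017</li></ul></div>"

template = train_template(
    SAMPLE, {"title": "Deep Web Mining", "author": "Zhang San", "year": "2016"}
)
record = extract_entity(NEW_PAGE, template)
print(template)
print(record)
```

Positional paths like these break as soon as a result page adds or drops a sibling element, which is presumably one reason a semantic analysis phase is needed alongside structure analysis.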
Source
《辽宁大学学报(自然科学版)》
CAS
2017, No. 2, pp. 97-104 (8 pages)
Journal of Liaoning University: Natural Sciences Edition
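Funding
Liaoning Province Doctoral Research Startup Fund (201601099)
Liaoning Province Social Science Planning Project (L14DGL049)
2016 Provincial Undergraduate Teaching Reform Project (General Program)
Liaoning Province Archival Science and Technology Project (L-2016-8-7)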