Abstract
The Web holds a large amount of valuable data, and Web information extraction techniques have been studied extensively over the past decade or more. However, most existing studies and systems concentrate on the data extraction stage, that is, on processing Web pages that have already been acquired, and they ignore or simplify the automated page navigation and data integration that a complete Web information extraction process requires. To overcome these shortcomings, this paper proposes a three-stage Web information extraction model comprising automated navigation, data extraction, and data integration. Based on this model, it further proposes an automated navigation model and designs and implements a rule language for automated Web page navigation. The paper also proposes an extraction-transformation-integration (ETI) model for Web data, together with a flexible data integration and workflow control rule language that can maintain the complex relationships among data records spanning multiple pages and provide flexible workflow control. Results from extraction examples show that the proposed rule languages and the implemented system can effectively carry out the full Web information extraction process, from page navigation through data extraction to integration.
Source
《计算机科学与探索》
CSCD
2014, Issue 9, pp. 1049-1066 (18 pages)
Journal of Frontiers of Computer Science and Technology
Funding
National Natural Science Foundation of China
Jiangsu Province Science and Technology Support Program
Keywords
Web information extraction
automated Web navigation
data integration
workflow control
rule language