Abstract
To address the acquisition of chemical substance information from multi-source websites and the updating and querying of the corresponding database, web crawler technology and wrapper methods are used to extract data. Using a user-defined XML file, methods for task partitioning, dynamic update checking, and failure retry are proposed, enabling continuous, real-time extraction of chemical substance information from dynamic information source websites, together with exception handling and monitoring. The extracted data are preprocessed with regular expressions and a sorting algorithm and used to build a comprehensive and accurate chemical environmental safety database, which in turn supports updating and querying the original data. The approach ensures a certain degree of reliability, availability, extensibility, and maintainability.
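The combination of task partitioning and failure retry named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual design, so everything here is an assumption for illustration only: the function names `partition`, `run_with_retry`, and `extract_all`, the fixed task size, the attempt limit, and the `fetch` callable that stands in for the crawler and wrapper are all hypothetical.

```python
import time
from typing import Callable, List, Optional, Tuple

Task = List[str]


def partition(ids: List[str], size: int) -> List[Task]:
    """Split a list of substance identifiers into tasks of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def run_with_retry(
    task: Task,
    fetch: Callable[[Task], List[str]],
    max_attempts: int = 3,
    delay: float = 0.0,
) -> Optional[List[str]]:
    """Call fetch(task), retrying on any exception.

    Returns the fetched records, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(task)
        except Exception:
            # Hypothetical policy: wait a fixed delay before the next attempt.
            if attempt < max_attempts:
                time.sleep(delay)
    return None


def extract_all(
    ids: List[str],
    fetch: Callable[[Task], List[str]],
    task_size: int = 2,
    max_attempts: int = 3,
) -> Tuple[List[str], List[Task]]:
    """Process every task independently and report the tasks that never succeeded."""
    results: List[str] = []
    failed: List[Task] = []
    for task in partition(ids, task_size):
        records = run_with_retry(task, fetch, max_attempts)
        if records is None:
            failed.append(task)  # kept for monitoring or a later re-run
        else:
            results.extend(records)
    return results, failed
```

Calling `extract_all(ids, fetch)` with any function that maps a list of identifiers to a list of records returns the collected records together with the tasks that exhausted their retries. Because each task is retried on its own, a failure on one source page does not force the whole extraction run to restart.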
Source
Computer Engineering and Design (《计算机工程与设计》)
CSCD
PKU Core Journal (北大核心)
2012, No. 8, pp. 3040-3046 (7 pages)
Funding
Special Scientific Research Fund for Public Welfare Industries (Environmental Protection) (Grant No. 200909086)
Keywords
web information extraction
task partitioning
retry mechanism
continuous extraction
data preprocessing