Abstract
This paper introduces DAAS, an experimental system that automatically harvests an institution's scholarly output from online literature databases and ingests the metadata into a local DSpace platform. The system implements a semi-automatic workflow for populating institutional repositories (IRs), consisting of information filtering, metadata extraction, copyright verification, metadata mapping, and data archiving. Building on core Nutch components, the paper describes in detail how DAAS parses URLs and uses rule-based filters, configured for each journal database, to extract bibliographic information from unstructured Web pages. Machine-learning algorithms are identified as the focus of future work.
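The rule-based extraction step can be illustrated with a minimal sketch. It is not the DAAS implementation, which builds on Nutch and whose actual rules are not given in this abstract. The source key `example-journal-db`, the regular expressions, the sample HTML, and the choice of Dublin Core style field names are all assumptions made for illustration.

```python
import re

# Hypothetical per-database rule sets: each maps a metadata field to a regex
# with exactly one capture group. Keys and patterns are illustrative only.
RULES = {
    "example-journal-db": {
        "dc.title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
        "dc.contributor.author": re.compile(
            r'<span class="author">(.*?)</span>', re.S
        ),
        "dc.date.issued": re.compile(r"Published:\s*(\d{4})"),
    }
}


def extract_metadata(html, source):
    """Apply the rule set for `source`; return only the fields that matched."""
    record = {}
    for field, pattern in RULES[source].items():
        values = [value.strip() for value in pattern.findall(html)]
        if values:
            record[field] = values
    return record


# Made-up sample page, used only to exercise the rules above.
sample = (
    "<h1>Sample Article Title</h1>"
    '<span class="author">Author One</span>'
    '<span class="author">Author Two</span>'
    "<p>Published: 2010</p>"
)
record = extract_metadata(sample, "example-journal-db")
```

Keeping one rule set per source database mirrors the idea of configuring a filter for each journal database: supporting a new source means adding a new entry to `RULES` rather than changing the extraction function.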
Source
《现代图书情报技术》
CSSCI
Peking University Core Journal
2010, No. 12, pp. 76-80 (5 pages)
New Technology of Library and Information Service
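Funding
One of the research outputs of the Beijing Institute of Technology Basic Research Fund project "Research on Institutional Repository Construction" (project no. 20061442003)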