Abstract
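Within the framework of integrated digital library systems, this paper proposes a design for a Nutch-based service system for harvesting topic-specific web resources. The design introduces an information filtering module, a Chinese word segmentation module based on a specialized dictionary for the computer and communications domain, a GUI module for customizing information, and a dictionary and keyword management module, among others, to keep the harvested resources on topic and to make the system manageable and easy to use. The paper examines in depth the related techniques of text parsing and filtering, plugin development, and hierarchical automatic clustering of search results. A Web service-based interface integrates the system into the resource layer of the digital library.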
This paper proposes the design of a Nutch-based website harvesting and service system for a special field under the framework of digital library systems integration. It introduces an information filtering module, a dictionary-based Chinese analyzer module, a GUI information module, a topic-knowledge-based information processing module, and Web service-based search service modules to improve the function and performance of the system. It focuses on text parsing filters, plugin development, and the application of hierarchical automatic clustering to search results. Finally, integration with other subsystems in the digital library is realized through the Web service interface, which can provide comprehensive and professional services.
Source
《现代图书情报技术》
CSSCI
Peking University Core Journal
2010, Issue 3, pp. 19-26 (8 pages)
New Technology of Library and Information Service