
基于Nutch的专题网页资源采集服务系统的设计与实现 (cited by: 3)

Research and Implementation of Nutch-based Website Harvest and Service System in Special Field
Abstract: Within the framework of integrated digital library applications, this paper proposes a design for a Nutch-based topical web resource harvesting and service system. The design introduces an information filtering module, a Chinese word segmentation module based on a specialized dictionary for the computer and communications field, a GUI information customization module, and a dictionary and keyword management module. Together these keep the harvested resources on topic and make the system manageable and easy to use. The paper examines in depth text parsing and filtering, plugin development, and hierarchical automatic clustering of search results. Finally, a Web service-based interface integrates the system into the resource layer of the digital library, where it can provide comprehensive, specialized services alongside the other subsystems.
Source: 《现代图书情报技术》 (New Technology of Library and Information Service), indexed in CSSCI and the Peking University Core Journals list, 2010, No. 3, pp. 19-26 (8 pages)
Keywords: Nutch; website harvest; Chinese analyzer plugin; Web service; integration services
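The abstract mentions a dictionary-based Chinese word segmentation module and an information filtering module but does not say how either works. The sketch below is only an illustration of the general idea, assuming forward maximum matching for segmentation and a simple count of dictionary hits for topical filtering. The function names `segment` and `is_on_topic`, the `min_hits` threshold, and the sample dictionary are all hypothetical and are not taken from the paper or from Nutch.

```python
def segment(text, dictionary, max_len=None):
    """Split text with forward maximum matching (an assumed technique).

    At each position the longest dictionary word is taken; if none
    matches, a single character is emitted as its own token.
    """
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def is_on_topic(text, dictionary, min_hits=2):
    """Hypothetical filter: keep a page if enough tokens are domain terms."""
    hits = sum(1 for token in segment(text, dictionary) if token in dictionary)
    return hits >= min_hits


# Hypothetical domain dictionary for the computer and communications field.
DOMAIN_TERMS = {"计算机", "通信", "网络", "协议", "计算机网络"}

tokens = segment("计算机网络通信协议", DOMAIN_TERMS)
on_topic = is_on_topic("计算机网络通信协议", DOMAIN_TERMS)
off_topic = is_on_topic("今天天气好", DOMAIN_TERMS)
```

Here `tokens` is `["计算机网络", "通信", "协议"]`: the longer term `计算机网络` wins over `计算机` because the longest match is tried first. A production segmenter would need to handle ambiguity and out-of-dictionary words, which this sketch ignores.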


