Abstract
To address the shortcomings of information resource harvesting systems in logistics information platforms, this paper proposes a targeted web-page resource harvesting system based on Nutch, and examines its key modules in detail: Chinese word segmentation, topic relevance analysis, result ranking, and main-text parsing. Finally, experiments were carried out under certain conditions and the results were analyzed.
Source
Logistics Technology (《物流技术》)
PKU Core Journal (北大核心)
2012, Issue 7, pp. 367-371 (5 pages)
Funding
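National Natural Science Foundation of China project "Research on Resource Optimization and Scheduling Methods for Logistics Systems Based on Cloud Computing and the Internet of Things" (B12A200050)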
Keywords
Nutch
topic crawler
main-text extraction
targeted harvesting
Chinese word segmentation