Abstract
The sheer volume of resources in online literature knowledge bases, together with their coarse-grained classification, easily leads to cognitive disorientation and knowledge overload when learners search for and read papers. This paper proposes a MapReduce-based mechanism for knowledge clustering and statistics. First, a MapReduce-based co-occurrence matrix construction algorithm, MR-CoMatrix, is presented. Second, the co-occurrence matrix is combined with a similarity coefficient to build a similarity matrix. Third, the similarity matrix is standardized with Z-scores. Finally, the similarity matrix is clustered with Ward's method (minimum sum of squared deviations), producing a tree-shaped knowledge clustering dendrogram. Building on the clustering result, a MapReduce-based literature statistics algorithm, MR-Statistics, computes statistics over the knowledge attributes of each cluster. Experimental results show that applying MR-CoMatrix and MR-Statistics to an online literature knowledge base achieves satisfactory clustering accuracy and computational efficiency, delivering fine-grained knowledge clustering and multi-dimensional statistics while reducing time overhead.
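The first three steps of the pipeline can be illustrated with a minimal single-process sketch. It is not the paper's MR-CoMatrix implementation: the map and reduce phases are plain Python functions rather than Hadoop jobs, the toy keyword sets in DOCS are hypothetical, and the Ochiai coefficient is an assumed choice, since the abstract does not say which similarity coefficient is used.

```python
from collections import Counter
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

# Hypothetical toy input: each document is reduced to its set of keywords.
DOCS = [
    {"hadoop", "mapreduce", "clustering"},
    {"mapreduce", "clustering"},
    {"hadoop", "mapreduce"},
    {"ontology", "clustering"},
]

def map_phase(doc):
    """Emit ((a, b), 1) per unordered keyword pair; (a, a) counts occurrences."""
    terms = sorted(doc)
    for t in terms:
        yield (t, t), 1
    for a, b in combinations(terms, 2):
        yield (a, b), 1

def reduce_phase(pairs):
    """Sum the emitted counts per key, as a reducer would after the shuffle."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts

def cooccurrence(docs):
    """Run the map phase over all documents and reduce the emitted pairs."""
    return reduce_phase(kv for doc in docs for kv in map_phase(doc))

def ochiai_matrix(counts):
    """Build a symmetric similarity matrix (assumed coefficient: Ochiai)."""
    terms = sorted({a for a, _ in counts} | {b for _, b in counts})
    sim = []
    for a in terms:
        row = []
        for b in terms:
            key = (a, b) if a <= b else (b, a)
            row.append(counts[key] / sqrt(counts[(a, a)] * counts[(b, b)]))
        sim.append(row)
    return terms, sim

def zscore_columns(matrix):
    """Standardize each column to zero mean and unit population std."""
    out_cols = []
    for col in zip(*matrix):
        mu, sigma = mean(col), pstdev(col)
        out_cols.append([(x - mu) / sigma if sigma else 0.0 for x in col])
    return [list(row) for row in zip(*out_cols)]

counts = cooccurrence(DOCS)
terms, sim = ochiai_matrix(counts)
z = zscore_columns(sim)
```

The standardized rows of z could then be passed to a hierarchical clustering routine that supports Ward linkage, for example scipy.cluster.hierarchy.linkage with method="ward", to obtain the dendrogram described in the abstract.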
Source
《电子与信息学报》
EI
CSCD
Peking University Core Journals (北大核心)
2016, No. 1, pp. 202-208 (7 pages)
Journal of Electronics & Information Technology
Funding
National Natural Science Foundation of China (61202004, 61472192)
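Special Research Project on Rapid Sharing of Scientific Papers in the Network Era, Science and Technology Development Center of the Ministry of Education (2013116)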
Natural Science Research Program of Jiangsu Higher Education Institutions (14KJB520014)