
An Improved KNN Classification Algorithm under the Parallel MapReduce Model (cited by: 2)

A Modified Bi-Measurement Central Index KNN Classification Algorithm Based on MapReduce
Abstract: The big data era has transformed how data is processed, and handling big-data problems with the Hadoop distributed programming framework is an active research topic in the field. Cluster-based cloud computing offsets the heavy computation and long running time of traditional non-distributed algorithms, while the sheer volume of unstructured data makes the data harder to exploit. To address classification in massive-data mining, this paper proposes a bi-measurement central index KNN classification algorithm, aimed at large datasets whose class regions cross or overlap heavily. The algorithm first determines center points for the training set, then computes the Euclidean distance between each sample to be classified and these center points to select the three most similar classes. Using cosine distance as the second measure, it then finds the K nearest neighbors through index selection, and the KNN computation is parallelized with the MapReduce programming framework. Comparative experiments on datasets from the UCI repository show that the parallel improved algorithm slightly improves accuracy while greatly improving computational efficiency.
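The two-measure procedure in the abstract can be sketched roughly as follows. This is a minimal single-process illustration, not the paper's implementation: it assumes each center point is the per-class mean vector, it emulates the map and reduce steps with plain Python functions instead of Hadoop, and it omits the index structure the abstract mentions. All function names and the toy data in `PARTITIONS` are hypothetical.

```python
import heapq
import math
from collections import Counter

# Toy training data split into two "partitions" (made-up values, for illustration only).
PARTITIONS = [
    [((1.0, 0.1), "a"), ((0.1, 1.0), "b"), ((1.0, 1.0), "c"), ((10.0, 10.0), "d"),
     ((1.2, 0.0), "a")],
    [((0.9, 0.2), "a"), ((0.0, 1.2), "b"), ((0.2, 0.9), "b"), ((1.1, 0.9), "c"),
     ((9.0, 11.0), "d")],
]

def class_centers(train):
    """Assumed definition of a center point: the mean vector of each class."""
    sums, counts = {}, Counter()
    for vec, label in train:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] += 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0  # assumption: a zero vector is treated as maximally dissimilar
    return 1.0 - dot / (na * nb)

def map_phase(partition, query, candidates, k):
    """Mapper stand-in: local k nearest points by cosine distance, candidate classes only."""
    scored = [(cosine_distance(query, vec), label)
              for vec, label in partition if label in candidates]
    return heapq.nsmallest(k, scored)

def reduce_phase(local_results, k):
    """Reducer stand-in: merge local lists, keep the global k nearest, majority vote."""
    merged = (item for part in local_results for item in part)
    nearest = heapq.nsmallest(k, merged)
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def classify(partitions, query, k=3, n_classes=3):
    """Measure 1: Euclidean distance to centers picks n_classes candidate classes.
    Measure 2: cosine distance picks the k nearest neighbors inside those classes."""
    train = [row for part in partitions for row in part]
    centers = class_centers(train)
    ranked = sorted(centers, key=lambda c: euclidean(query, centers[c]))
    candidates = set(ranked[:n_classes])
    local = [map_phase(part, query, candidates, k) for part in partitions]
    return reduce_phase(local, k)
```

Calling `classify(PARTITIONS, (1.0, 0.05))` runs both stages on the toy data. Restricting the cosine search to three candidate classes is what cuts the number of distance computations; each mapper only needs to return its local top K because the global K nearest neighbors are always contained in the union of the local top-K lists.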
Source: Journal of Air Force Engineering University (Natural Science Edition), CSCD, Peking University Core Journal, 2017, No. 1, pp. 92-98 (7 pages)
Funding: Key Project of the Natural Science Foundation, Shaanxi Provincial Science and Technology Plan (2012JZ8005)
Keywords: big data; Hadoop; data mining; bi-measurement central index; MapReduce
