Abstract
The big data era has transformed how data is processed, and handling big data with the Hadoop distributed programming framework is one of the active research topics in the field. Cluster-based cloud computing compensates for the heavy computation and long running time of traditional non-distributed algorithms, while the sheer volume of unstructured data makes the data harder to exploit. To address the classification problem in massive-data mining, this paper proposes a bi-measurement central index KNN classification algorithm. The algorithm targets big data in which class domains cross or overlap heavily. It first determines the center points of the training set, then computes the Euclidean distance between the data to be classified and those center points to identify the 3 most similar classes. Next, with cosine distance as the metric, it finds the K nearest neighbors through index selection, and the KNN computation is parallelized with the MapReduce programming framework. Finally, the algorithm is compared and verified on the UCI database. The results show that the proposed parallel algorithm slightly improves accuracy while greatly improving computational efficiency.
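The two-metric procedure described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: it omits the MapReduce parallelization, it assumes that a "center point" is the per-class mean of the training vectors, and it reduces the "index selection" step to filtering the training set by the candidate classes. All function and parameter names (`class_centers`, `classify`, `n_classes`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def class_centers(train):
    """Assumption: the center of a class is the mean of its training vectors."""
    groups = defaultdict(list)
    for features, label in train:
        groups[label].append(features)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }


def classify(train, query, k=3, n_classes=3):
    """Two-stage KNN: centers by Euclidean distance, neighbors by cosine distance."""
    centers = class_centers(train)
    # Stage 1: keep the n_classes classes whose centers are closest to the query.
    nearest = sorted(centers, key=lambda c: euclidean(query, centers[c]))[:n_classes]
    # Stage 2: rank only the training points of those classes by cosine distance.
    candidates = [
        (cosine_distance(query, features), label)
        for features, label in train
        if label in nearest
    ]
    candidates.sort(key=lambda pair: pair[0])
    # Majority vote among the k closest candidates.
    votes = Counter(label for _, label in candidates[:k])
    return votes.most_common(1)[0][0]


# Toy training set with four classes (illustrative data, not from the paper).
train = [
    ((1.0, 0.0), "a"), ((2.0, 0.1), "a"), ((1.5, 0.2), "a"),
    ((0.0, 1.0), "b"), ((0.1, 2.0), "b"), ((0.2, 1.5), "b"),
    ((-1.0, 0.0), "c"), ((-2.0, -0.1), "c"),
    ((10.0, 10.0), "d"), ((11.0, 10.0), "d"),
]
```

Calling `classify(train, (1.2, 0.1))` on this toy set restricts the search to the three classes with the closest centers before any cosine distances are computed, which is where the reduction in neighbor comparisons comes from in this sketch.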
Source
《空军工程大学学报(自然科学版)》
CSCD
Peking University Core Journal
2017, No. 1, pp. 92-98 (7 pages)
Journal of Air Force Engineering University (Natural Science Edition)
Funding
Key Project of the Natural Science Foundation, Shaanxi Provincial Science and Technology Plan (2012JZ8005)