Abstract
Data reduction, which spans data compression, data adjustment, and feature extraction, is an important topic in data mining. Existing data reduction methods focus mainly on reducing features or dimensions, whereas methods that reduce the number of samples are usually developed for specific data sets and lack generality. Based on general characteristics of how data are distributed within a data set, a new measure based on the opening angle is defined. The measure captures the essential difference between the distributions of core objects and boundary objects, and enables data compression centered on the core objects of a data set. Data reduction and testing were carried out on 20 typical sample sets with different characteristics from the UCI public test platform. The results show that the reduction effectively extracts the core objects of a data set; clustering the data sets before and after reduction with the classical K-means algorithm shows that clustering accuracy on the reduced data sets is markedly higher than on the original data sets.
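The abstract does not give the definition of the opening-angle measure, so the following is only a minimal sketch of the general idea under stated assumptions, not the paper's method. It assumes that a core object is surrounded by neighbours in all directions while a boundary object sees its neighbours on one side only, and it scores each sample by the length of the mean unit vector to its k nearest neighbours. The names `direction_spread` and `reduce_to_core` and the parameters `k` and `keep` are hypothetical.

```python
import numpy as np

def direction_spread(X, k=8):
    """Length of the mean unit vector from each sample to its k nearest neighbours.

    Assumption (not from the paper): values near 0 mean the neighbours surround
    the sample (core-like); values near 1 mean they lie on one side (boundary-like).
    """
    X = np.asarray(X, dtype=float)
    diff = X[None, :, :] - X[:, None, :]          # diff[i, j] = X[j] - X[i]
    dist = np.linalg.norm(diff, axis=2)
    np.fill_diagonal(dist, np.inf)                # a sample is not its own neighbour
    nn = np.argsort(dist, axis=1)[:, :k]          # indices of the k nearest neighbours
    rows = np.arange(len(X))[:, None]
    vecs = diff[rows, nn]                         # shape (n, k, d)
    norms = dist[rows, nn][..., None]             # shape (n, k, 1)
    units = vecs / np.maximum(norms, 1e-12)       # guard against duplicate samples
    return np.linalg.norm(units.mean(axis=1), axis=1)

def reduce_to_core(X, k=8, keep=0.5):
    """Return sorted indices of the `keep` fraction of samples with the lowest score."""
    score = direction_spread(X, k)
    n_keep = max(1, int(round(keep * len(X))))
    return np.sort(np.argsort(score)[:n_keep])
```

The retained subset `X[reduce_to_core(X)]` could then be passed to any K-means implementation to compare clustering on the original and reduced data, in the spirit of the experiment the abstract describes. The pairwise distance matrix makes this sketch quadratic in the number of samples, which is acceptable only for small data sets.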
Source
Transducer and Microsystem Technologies (《传感器与微系统》)
CSCD
2016, No. 4, pp. 25-28, 31 (5 pages)
Funding
National Natural Science Foundation of China (Grant No. 61174014)
Keywords
data reduction
direction angle
cluster analysis