Abstract
Data completeness is an important dimension of data usability. Because of problems in data acquisition and related processes, real-world datasets are often incomplete. Existing clustering algorithms usually handle incomplete data by ignoring or imputing the missing values, but when the data are missing not at random (MNAR), these strategies can severely degrade clustering accuracy. Under MNAR, the missing pattern is related to the values of the missing attributes, so a measure of missing-pattern similarity is added to the similarity measure for incomplete objects. Two PCM (Possibilistic c-means) fuzzy clustering algorithms that incorporate the missing pattern are proposed: PatDistPCM, which minimizes the sum of missing-pattern distances, and PatCluPCM, which is based on clustering the missing patterns. Experiments on two public datasets show that PatDistPCM and PatCluPCM effectively improve the accuracy (precision and recall) of the clustering results when the data are missing not at random.
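The idea of adding missing-pattern similarity to the similarity measure for incomplete objects can be illustrated with a minimal sketch. It is not the authors' PatDistPCM or PatCluPCM: the abstract does not give their formulas, so the Hamming distance between patterns, the rescaled partial distance, the weight `lam`, and all function names below are assumptions chosen for illustration. Only `pcm_typicality` follows the standard possibilistic c-means membership formula.

```python
import numpy as np

def missing_pattern(x):
    """Binary mask: 1 where an attribute is missing (NaN), 0 otherwise."""
    return np.isnan(x).astype(int)

def partial_sq_distance(x, y):
    """Squared Euclidean distance over attributes observed in both objects,
    rescaled by (total attributes / shared observed attributes).
    Assumption: a partial-distance strategy, not taken from the paper."""
    shared = ~np.isnan(x) & ~np.isnan(y)
    if not shared.any():
        return float("inf")  # no attribute in common to compare
    diff = x[shared] - y[shared]
    return float(diff @ diff) * x.size / shared.sum()

def pattern_distance(x, y):
    """Hamming distance between the two missing patterns (assumed measure)."""
    return int(np.sum(missing_pattern(x) != missing_pattern(y)))

def combined_distance(x, y, lam=1.0):
    """Value distance plus a weighted missing-pattern distance.
    `lam` is a hypothetical trade-off weight."""
    return partial_sq_distance(x, y) + lam * pattern_distance(x, y)

def pcm_typicality(d2, eta, m=2.0):
    """Standard PCM typicality of an object at squared distance d2
    from a cluster prototype, with bandwidth eta and fuzzifier m."""
    return 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1.0)))

a = np.array([1.0, np.nan, 3.0])
b = np.array([1.0, np.nan, 3.0])   # same values, same missing pattern as a
c = np.array([1.0, 2.0, np.nan])   # agrees with a on the shared attribute, different pattern

print(combined_distance(a, b))  # identical pattern adds no penalty
print(combined_distance(a, c))  # differing pattern increases the distance
```

In this sketch, two objects that agree on every shared attribute are still pushed apart when their missing patterns differ, which is the intuition behind using the pattern as extra information under non-random missingness.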
Authors
郑奇斌
刁兴春
曹建军
ZHENG Qi-bin; DIAO Xing-chun; CAO Jian-jun (College of Command Information System, PLA University of Science and Technology, Nanjing 210007; Nanjing Telecommunication Technology Institute, Nanjing 210007, China)
Source
《计算机科学》
CSCD
Peking University Core Journals (北大核心)
2017, No. 12, pp. 58-63 (6 pages)
Computer Science
Funding
Supported by the National Natural Science Foundation of China (61371196)