Abstract
Classification is an important data mining problem that has been widely studied and applied, but previous work has focused mainly on classification over certain (deterministic) data. With the widespread use of data collection tools such as sensors, probabilistic data are now common, so studying classification on such data is necessary. This paper first proposes a new probabilistic data model that accounts for both the randomness of the probability distribution and the similarity over independent intervals. Second, it defines a new discernible distance to measure the distance between tuples of such probabilistic data. Finally, it proposes a rule-based classification algorithm for probabilistic data and, on top of this basic algorithm, introduces a variable-precision classification algorithm that reduces sensitivity to noise or perturbation and improves classification accuracy. Experimental results verify the effectiveness of the proposed algorithm.
Source
《计算机科学与探索》
CSCD
2013, Issue 7, pp. 639-648 (10 pages)
Journal of Frontiers of Computer Science and Technology
Funding
Fundamental Research Funds for the Central Universities
Research Funds of Renmin University of China, No. 12XNLF07
Keywords
classification
randomness
probabilistic data
discernible distance