Abstract
Mixed data, containing both categorical and numerical attributes, are common in real-world and experimental data sets. Before such data can be mined or analyzed, they typically must be processed (transformed/embedded/represented/encoded) into high-quality numerical data. The conditional probability transformation method, which is premised on the attribute conditional independence assumption, performs well in most cases but is unsatisfactory on data sets with strong attribute associations. Inspired by the one-dependence value difference metric, we apply the idea of relaxing attribute conditional independence to the conditional probability transformation method. In addition, we use attribute weighting to improve the quality of the encoded data. Combining these ideas, we propose an Attribute Weighted One Dependence Conditional Probability Encoding method for categorical encoding on mixed data. Extensive experimental results demonstrate that our method significantly improves the quality of data encoding and thereby enhances the performance of subsequent data analysis algorithms.
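As a rough illustration of the baseline the abstract refers to, the sketch below shows one common form of conditional probability encoding, in which each categorical value v is replaced by the vector of class probabilities P(c | v) estimated from frequency counts. It is not the proposed method: it includes neither the attribute weighting nor the one-dependence relaxation, and the function name, the toy data, and the choice of plain frequency estimates without smoothing are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def conditional_probability_encode(values, labels):
    """Replace each categorical value v with [P(c | v) for each class c].

    Hypothetical helper for illustration; probabilities are plain
    frequency estimates with no smoothing (an assumption of this sketch).
    """
    classes = sorted(set(labels))          # fixed class order for the output vectors
    value_counts = Counter(values)         # count(v)
    joint_counts = defaultdict(int)        # count(v, c)
    for v, c in zip(values, labels):
        joint_counts[(v, c)] += 1

    # Lookup table: value -> vector of estimated P(c | v)
    table = {
        v: [joint_counts[(v, c)] / value_counts[v] for c in classes]
        for v in value_counts
    }
    return [table[v] for v in values], classes


# Toy categorical attribute with a binary class label (made-up data)
colors = ["red", "red", "blue", "blue", "blue", "green"]
labels = ["yes", "no", "yes", "yes", "no", "no"]

encoded, classes = conditional_probability_encode(colors, labels)
for color, row in zip(colors, encoded):
    print(color, dict(zip(classes, row)))
```

Because every attribute is encoded on its own from P(c | v), this baseline ignores interactions between attributes, which is the limitation on strongly associated data sets that the abstract describes.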
Source
Operations Research and Fuzziology (《运筹与模糊学》)
2023, Issue 1, pp. 74-87 (14 pages)