Abstract
Many machine learning algorithms, such as decision trees and Bayesian networks, require discrete variables. When a variable is continuous, it must be discretized first. Discretization directly affects the algorithm's results and is therefore of great significance to the whole model. This paper presents two discretization methods. The first is an improved K-means (K-means clustering) discretization algorithm, which determines the optimal number of clusters and performs discretization without supervision. The second is ChiMerge, a traditional supervised discretization algorithm. Both methods are used to discretize the data set; a Bayesian network is then built from each discretized data set and used for predictive analysis, so that the two discretization results can be compared. Experiments show that ChiMerge achieves a better discretization effect than the improved K-means algorithm, but its processing efficiency is significantly lower.
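The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the improved K-means chooses the number of clusters, so `k` is passed in by hand here, and the initialisation from evenly spaced order statistics is an assumption made to keep the example deterministic. The names `kmeans_1d`, `discretize`, and `chi2_adjacent` are hypothetical. The last function computes only the standard chi-square statistic that ChiMerge evaluates for a pair of adjacent intervals, not the full merging loop.

```python
from collections import Counter


def kmeans_1d(values, k, iters=100):
    """Plain Lloyd's algorithm on one continuous variable; returns sorted centers."""
    xs = sorted(values)
    n = len(xs)
    # Assumption: deterministic start from evenly spaced order statistics.
    centers = [xs[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            groups[nearest].append(x)
        # Keep the old center if a cluster happens to be empty.
        updated = [sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


def discretize(values, centers):
    """Replace each value with the index of its nearest center (its bin label)."""
    return [min(range(len(centers)), key=lambda j: abs(x - centers[j])) for x in values]


def chi2_adjacent(labels_a, labels_b):
    """Chi-square statistic over the class labels of two adjacent, non-empty intervals."""
    rows = [Counter(labels_a), Counter(labels_b)]
    classes = sorted(set(labels_a) | set(labels_b))
    total = len(labels_a) + len(labels_b)
    chi2 = 0.0
    for row in rows:
        row_sum = sum(row.values())
        for c in classes:
            expected = row_sum * (rows[0][c] + rows[1][c]) / total
            chi2 += (row[c] - expected) ** 2 / expected
    return chi2


values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
centers = kmeans_1d(values, k=3)
bins = discretize(values, centers)
```

In this sketch the unsupervised route never looks at class labels, whereas `chi2_adjacent` needs them. A low statistic means the two intervals have similar class distributions, which is the condition under which ChiMerge merges them.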
Authors
李浩
魏明
LI Hao; WEI Ming (Wuhan Research Institute of Posts and Telecommunications, Wuhan 430070, China; Wuhan Fiberhome Technology Service Co., Ltd., Wuhan 430074, China)
Source
《信息技术》
2020, No. 11, pp. 121-124, 131 (5 pages in total)
Information Technology