Traditional model selection criteria balance fitting error against model complexity. Before they can be used, assumptions must be made about the distribution of the response or the noise, and these assumptions may be misspecified. In this article, we propose a new model selection criterion that requires fewer assumptions: based on the assumption that the noise term in the model is independent of the explanatory variables, it minimizes the association strength between the regression residuals and the response. The Maximal Information Coefficient (MIC), a recently proposed dependence measure, captures a wide range of associations and gives almost the same score to different types of relationships with equal noise, so MIC is used to measure the association strength. Furthermore, the partial maximal information coefficient (PMIC) is introduced to capture the association between two variables after removing the effect of a third, controlling random variable. In addition, a definition of a general partial relationship is given.
It is common for datasets to contain both categorical and continuous variables. However, many feature screening methods designed for high-dimensional classification assume that the variables are continuous, which limits their applicability in this mixed setting. To address this issue, we propose a model-free feature screening approach for ultra-high-dimensional multi-class classification that can handle both categorical and continuous variables. The proposed method uses the Maximal Information Coefficient to assess the predictive power of the variables. Under certain regularity conditions, we prove that the screening procedure possesses the sure screening and ranking consistency properties. We conduct simulation studies and real data analyses to demonstrate its finite-sample performance. In summary, the proposed method offers a way to screen features effectively in ultra-high-dimensional datasets with a mixture of categorical and continuous covariates.
In the era of big data, correlation analysis is significant because it can quickly detect correlations between factors, and it has therefore received much attention. Owing to its generality and equitability, the maximal information coefficient (MIC) is a hotspot in correlation analysis research. However, if the original approximate algorithm for MIC is applied directly to mining correlations in big data, the computation time is very long. The theoretical time complexity of the original approximate algorithm is therefore analyzed in depth; it is n^2.4 when the parameters take their default values. Experiments show that the large number of candidate partitions for random relationships results in long computation times. The analysis prepares the ground for the next step of designing new fast algorithms.
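For reference, MIC as it is usually defined in the original proposal of the measure can be written as below. This is a sketch for orientation, not taken from the abstract: the symbols D (a sample of n points), I^* (the largest mutual information achievable by any grid with x columns and y rows), and B(n) (the grid-size bound, with a commonly cited default of n^{0.6}) are assumptions introduced here.

```latex
\mathrm{MIC}(D) \;=\; \max_{x\,y \,<\, B(n)} \frac{I^{*}(D, x, y)}{\log \min\{x, y\}},
\qquad B(n) = n^{0.6} \text{ (assumed default)}
```

The search over all grid resolutions with xy < B(n) is the part that the approximate algorithm has to explore, which is why the number of candidate partitions matters for running time.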
Identifying important influencing factors is a key issue in railway accident analysis. In this paper, we employ the maximal information coefficient (MIC), a good measure of dependence for two-variable relationships that can capture a wide range of associations, to design a complex network model for railway accident analysis. Nodes denote factors of railway accidents, and an edge is generated between two factors whose MIC value is greater than or equal to the dependent criterion. The variation of the network structure is studied: as the dependent criterion increases, the network becomes an approximately scale-free network. Moreover, using the proposed network, important influencing factors are identified. We find that the annual track density-gross tonnage factor is an important factor, being a cut vertex when the dependent criterion equals 0.3. The network also shows that railway development is unbalanced across states, which is consistent with the facts.
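The network construction described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `build_network` and `cut_vertices` are hypothetical, the score matrix is made-up data standing in for precomputed pairwise MIC values, and cut vertices are found with a standard depth-first low-link search.

```python
from typing import Dict, List, Optional, Set


def build_network(scores: List[List[float]], threshold: float) -> Dict[int, Set[int]]:
    """Link factors i and j when their pairwise score is >= threshold."""
    n = len(scores)
    adj: Dict[int, Set[int]] = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if scores[i][j] >= threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def cut_vertices(adj: Dict[int, Set[int]]) -> Set[int]:
    """Return the articulation points of an undirected simple graph."""
    disc: Dict[int, int] = {}  # discovery time of each visited node
    low: Dict[int, int] = {}   # earliest discovery time reachable from the subtree
    found: Set[int] = set()
    clock = [0]

    def dfs(u: int, parent: Optional[int]) -> None:
        disc[u] = low[u] = clock[0]
        clock[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge to an already visited node.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # Non-root u separates v's subtree if that subtree cannot reach above u.
                if parent is not None and low[v] >= disc[u]:
                    found.add(u)
        # A root is a cut vertex only if it has more than one DFS child.
        if parent is None and children > 1:
            found.add(u)

    for node in adj:
        if node not in disc:
            dfs(node, None)
    return found


# Hypothetical symmetric matrix of pairwise dependence scores for five factors.
scores = [
    [1.0, 0.5, 0.4, 0.1, 0.1],
    [0.5, 1.0, 0.35, 0.1, 0.1],
    [0.4, 0.35, 1.0, 0.3, 0.1],
    [0.1, 0.1, 0.3, 1.0, 0.6],
    [0.1, 0.1, 0.1, 0.6, 1.0],
]

for criterion in (0.3, 0.4):
    network = build_network(scores, criterion)
    print(criterion, sorted(cut_vertices(network)))
```

Raising the threshold removes weaker edges, so the set of cut vertices can change with the dependent criterion, which is the kind of structural variation the abstract studies.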
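The maximal information coefficient (MIC) can effectively measure linear and nonlinear relationships between variables, as well as non-functional dependencies. Based on MIC theory, this paper first proposes a metric for evaluating the correlation between features and between each feature and the class label, and then proposes an approximate Markov blanket feature selection method based on the new metric to remove redundant features. On this basis, a two-stage feature selection method based on feature ranking and the approximate Markov blanket is proposed, which analyzes the relevance and the redundancy of features separately and selects an effective feature subset. Comparative experiments on several public datasets from UCI and ASU show that the proposed method is overall superior to the fast correlation-based filter (FCBF) method and also has advantages over the ReliefF, FAST, Lasso, and RFS methods.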
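At present, there is very little research on feature extraction techniques for partial discharge signals under DC voltage. Characteristic scatter plots, which describe the correlation between successive discharges, are a commonly used statistical analysis method, but so far they have been used only for qualitative analysis of discharge phenomena. Advanced nonlinear correlation analysis tools, namely mutual information, the maximal information coefficient (MIC), and maximal information-based non-parametric exploration (MINE) statistics, are introduced to extract quantitative features from such scatter plots. MIC and MINE, which are based on mutual information, have important properties such as generality, equitability, and symmetry. In total, 36 correlation feature parameters were extracted, which together with 22 traditional statistical operators form the feature fingerprint. The maximum-relevance minimum-redundancy (mRMR) algorithm is then used to select the optimal feature fingerprint space, with MIC used for optimization. Four typical insulation defect models were built from XLPE single-core cable (an internal air gap in the insulation, a surface scratch on the main insulation, burr corona at the high-voltage end, and creeping discharge on the semiconducting layer), and the proposed method was applied to the test data. An optimal feature fingerprint containing 48 parameters was finally determined, and pattern recognition with machine learning methods such as artificial neural networks achieves an average recognition accuracy of 91%. The results show that the nonlinear scatter-plot features extracted by the proposed method can effectively reflect the discharge pattern.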
Funding (MIC-based model selection study): Partly supported by the National Basic Research Program of China (973 Program; 2011CB707802, 2013CB910200) and the National Science Foundation of China (11201466).
Funding (MIC time-complexity study): Supported by the China Postdoctoral Science Foundation (2019M650981) and the Shandong Provincial Natural Science Foundation, China (ZR2018MG003).
Funding (railway accident network study): Supported by the Fundamental Research Funds for the Central Universities under Grant No. 2016YJS087, the National Natural Science Foundation of China under Grant No. U1434209, and the Research Foundation of the State Key Laboratory of Railway Traffic Control and Safety, Beijing Jiaotong University, under Grant No. RCS2016ZJ001.