Abstract: Datasets commonly contain both categorical and continuous variables, yet many feature screening methods designed for high-dimensional classification assume that all variables are continuous, which limits their applicability to such mixed data. To address this issue, we propose a model-free feature screening approach for ultra-high-dimensional multi-class classification that handles both categorical and continuous variables. The method uses the Maximal Information Coefficient to assess the predictive power of each variable. Under certain regularity conditions, we prove that the screening procedure possesses the sure screening and ranking consistency properties. Simulation studies and real data analysis examples demonstrate its finite-sample performance. In summary, the proposed method offers an effective way to screen features in ultra-high-dimensional datasets with a mixture of categorical and continuous covariates.
Funding: Supported by the China Postdoctoral Science Foundation (2019M650981) and the Shandong Provincial Natural Science Foundation, China (ZR2018MG003).
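As a rough illustration of MIC-based screening on mixed covariates, the sketch below scores each feature against the class label and keeps the top `d`. It is a minimal sketch under stated assumptions, not the authors' procedure: the score is a simplified stand-in for MIC that uses equal-frequency bins instead of the optimized grid search, and the names `mic_like_score` and `screen` are hypothetical.

```python
import math
import random
from collections import Counter


def equal_freq_bins(values, k):
    """Assign each value to one of k roughly equal-frequency bins by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


def mutual_info(a, b):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * math.log(c * n / (pa[u] * pb[v]))
               for (u, v), c in pab.items())


def mic_like_score(feature, labels, categorical=False, alpha=0.6):
    """Simplified MIC-style score (assumption: NOT the original MIC algorithm)."""
    n, r = len(labels), len(set(labels))
    if r < 2:
        return 0.0
    if categorical:
        levels = len(set(feature))
        if levels < 2:
            return 0.0
        # Categorical feature: use its own levels as the grid.
        return mutual_info(feature, labels) / math.log(min(levels, r))
    # Continuous feature: try k bins while k * r stays within n^alpha.
    max_k = max(2, int(n ** alpha) // r)
    return max(
        mutual_info(equal_freq_bins(feature, k), labels) / math.log(min(k, r))
        for k in range(2, max_k + 1)
    )


def screen(columns, labels, categorical_flags, d):
    """Return the indices of the d highest-scoring features and all scores."""
    scores = [mic_like_score(col, labels, flag)
              for col, flag in zip(columns, categorical_flags)]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:d], scores


if __name__ == "__main__":
    rng = random.Random(0)
    y = [rng.randrange(3) for _ in range(300)]
    cols = [
        [3 * c + rng.gauss(0, 1) for c in y],                        # continuous, informative
        [c if rng.random() < 0.9 else rng.randrange(3) for c in y],  # categorical, informative
    ] + [[rng.gauss(0, 1) for _ in y] for _ in range(20)]            # pure noise
    flags = [False, True] + [False] * 20
    top, _ = screen(cols, y, flags, d=2)
    print(top)
```

Rank-based binning splits tied values arbitrarily, which is acceptable for a sketch but would need care on heavily tied data.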
Abstract: In the era of big data, correlation analysis is important because it can quickly detect correlations between factors, and it has therefore received much attention. Because of its generality and equitability, the maximal information coefficient (MIC) is a research hotspot in correlation analysis. However, when the original approximation algorithm for MIC is applied directly to mining correlations in big data, the computation time is very long. The theoretical time complexity of the original approximation algorithm is therefore analyzed in depth; it is O(n^2.4) with the default parameters. Experiments show that the large number of candidate partitions for random relationships causes the long computation time. This analysis prepares the ground for designing new fast algorithms.
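For context, the standard definition of MIC, which the abstract does not restate, can be written as follows. Here $D$ is a sample of $n$ points, $I^*(D, x, y)$ is the largest mutual information achievable by any $x$-by-$y$ grid on $D$, and $B(n)$ caps the grid size; the default exponent $\alpha = 0.6$ is assumed to be what "default parameters" refers to.

```latex
\mathrm{MIC}(D) \;=\; \max_{x y < B(n)} \frac{I^{*}(D, x, y)}{\log \min(x, y)},
\qquad B(n) = n^{\alpha}, \quad \alpha = 0.6 .
```

Note that $n^{2.4} = \left(n^{0.6}\right)^{4} = B(n)^{4}$, so the reported exponent is consistent with a cost that grows as the fourth power of the grid-size cap. That reading is an inference from the arithmetic, not a derivation given in the abstract.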
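Abstract: Existing methods for detecting relationships between variables, such as the Maximal Information Coefficient (MIC), are mostly applied to pairs of variables and rarely to three or more variables. This paper therefore proposes a new method, Maximal Information Entropy (MIE), for detecting relationships among multiple variables. For k variables, a maximal information matrix is first constructed from the MIC values of every pair of variables, and the maximal information entropy is then computed from this matrix to measure the degree of correlation among the variables. Simulation results show that MIE can detect one-dimensional manifold dependence among three variables, and experiments on real datasets confirm its practicality.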
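The first step described above, building a matrix of pairwise MIC values for k variables, can be sketched as follows. This is a minimal illustration under assumptions: `grid_mic` is a hypothetical, simplified stand-in for MIC that only tries equal-frequency grids, the unit diagonal is a convention chosen here, and the entropy step is omitted because the abstract does not give its formula.

```python
import math
from collections import Counter


def equal_freq_bins(values, k):
    """Assign each value to one of k roughly equal-frequency bins by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


def mutual_info(a, b):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * math.log(c * n / (pa[u] * pb[v]))
               for (u, v), c in pab.items())


def grid_mic(x, y, alpha=0.6):
    """Largest normalized MI over equal-frequency grids with kx * ky <= n^alpha.

    Assumption: a simplification of MIC, which optimizes grid boundaries too.
    """
    budget = max(4, int(len(x) ** alpha))
    best = 0.0
    for kx in range(2, budget // 2 + 1):
        xb = equal_freq_bins(x, kx)
        for ky in range(2, budget // kx + 1):
            yb = equal_freq_bins(y, ky)
            best = max(best, mutual_info(xb, yb) / math.log(min(kx, ky)))
    return best


def max_info_matrix(variables):
    """Symmetric k-by-k matrix of pairwise scores, with ones on the diagonal."""
    k = len(variables)
    m = [[1.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            m[i][j] = m[j][i] = grid_mic(variables[i], variables[j])
    return m
```

Calling `max_info_matrix([x, y, z])` on three equal-length lists returns a 3-by-3 nested list that a downstream entropy or eigenvalue step could consume.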
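Abstract: Correlation analysis is becoming increasingly important because it can quickly uncover latent relationships in data. In practice, one often needs to analyze the strength of correlation among multiple variables. To this end, this paper proposes a measure of correlation among multiple variables, the Multi-variable Maximal Mutual Information Coefficient (Mv_MMIC), which can detect a wide range of relationships among multiple variables, including linear and nonlinear functional relationships and even all functional relationships. The maximal information coefficient (MIC) is first used to build a MIC matrix. Then, based on eigendecomposition, the eigenvalues of this matrix are used to construct the multi-variable measure, extending MIC from a two-dimensional criterion for pairs of random variables to a multidimensional criterion for multiple variables. Finally, experiments show that Mv_MMIC retains the generality and equitability of MIC and has theoretical and practical value.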
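Abstract: To forecast short-term electricity prices in power markets accurately, a new short-term price forecasting method is proposed that combines maximal information coefficient (MIC) correlation analysis with an improved multi-hierarchy gated long short-term memory network (MHG-LSTM). The method first performs MIC correlation analysis between the candidate series and the price series to be forecast, then screens the candidate series on that basis and synthesizes the neural network input series via wavelet transform, which increases the density of price-relevant information in the input. Second, the conventional LSTM is modified: the MHG-LSTM model replaces the conventional single-level gating structure with two-level forget and input gates, improving the network's ability to select and extract features of high-frequency price series. Simulation experiments on the PJM day-ahead price dataset show a forecasting error of only 4.506%, an improvement in accuracy over existing forecasting methods. The method also generalizes well, can be applied to short-term price forecasting in power markets, and can provide a strong basis for decisions by market participants and regulators.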
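Abstract: To address the high time complexity and rapidly growing computation time of the maximal information coefficient (MIC) approximation algorithm on large-scale data, a parallel computing maximal information coefficient (PCMIC) method is proposed. In the Spark and Spark-MPI (message passing interface) computing frameworks, PCMIC is used to compute 14 typical relationships in parallel at different data sizes and noise levels. In addition, two representative relationships are selected to test the performance of PCMIC in both frameworks with different numbers of nodes. The results show that, in both frameworks, PCMIC retains the generality and equitability of the original MIC approximation algorithm and scales well. As the data size and the number of nodes increase, the running time of PCMIC in both frameworks grows markedly more slowly than that of the MIC approximation algorithm, and its parallel speedup and efficiency are slightly better under Spark-MPI than under Spark. Spark can support the scheduling of MPI tasks, which lays a theoretical and practical foundation for studying the integration of different parallel computing frameworks.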