The Neyman-Pearson (NP) criterion is one of the most important approaches in hypothesis testing. It is also a criterion for classification. This paper addresses the problem of bounding the estimation error of NP classification in terms of Rademacher averages. We investigate the behavior of the global and local Rademacher averages, present new NP classification error bounds based on the localized averages, and indicate how the estimation error can be estimated without a priori knowledge of the class at hand.
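To make the central quantity concrete, here is a minimal sketch of a Monte Carlo estimate of an empirical (global) Rademacher average for a finite function class. It assumes the common definition $\hat{R}_n(F) = E_\sigma \sup_{f \in F} \frac{1}{n}\sum_i \sigma_i f(x_i)$ without an absolute value; the paper may use a different normalization. The function name `empirical_rademacher_average` and its parameters are illustrative and are not taken from the paper.

```python
import random


def empirical_rademacher_average(predictions, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_f (1/n) * sum_i sigma_i * f(x_i).

    predictions[j][i] holds f_j(x_i) for the j-th function of a finite class
    evaluated on the i-th sample point (an assumed input layout).
    """
    rng = random.Random(seed)
    n = len(predictions[0])
    total = 0.0
    for _ in range(n_draws):
        # Draw one vector of independent, symmetric +/-1 signs.
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        # Supremum over the finite class of the sign-weighted sample mean.
        total += max(
            sum(s * v for s, v in zip(sigma, row)) / n for row in predictions
        )
    return total / n_draws
```

For a class that contains both a function and its negation, the supremum equals the absolute value of the sign-weighted mean, so the estimate is non-negative. A localized average would restrict the supremum to functions that satisfy an additional constraint (for example, small empirical error), which this sketch does not attempt.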
In this paper, we explore two conjectures about Rademacher sequences. Let $(\varepsilon_i)$ be a Rademacher sequence, i.e., a sequence of independent $\{-1,1\}$-valued symmetric random variables. Set $S_n = a_1\varepsilon_1 + \dots + a_n\varepsilon_n$ for $a = (a_1, \dots, a_n) \in \mathbb{R}^n$. The first conjecture says that $P(|S_n| \le \|a\|) \ge 1/2$ for all $a \in \mathbb{R}^n$ and $n \in \mathbb{N}$. The second conjecture says that $P(|S_n| \ge \|a\|) \ge 7/32$ for all $a \in \mathbb{R}^n$ and $n \in \mathbb{N}$. Regarding the first conjecture, we present several new equivalent formulations. These include a topological view, a combinatorial version, and a strengthened version of the conjecture. Regarding the second conjecture, we prove that it holds true when $n < 7$.
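For small $n$, both probabilities can be computed exactly by enumerating all $2^n$ sign patterns. The sketch below does this with non-strict comparisons on both sides, taking $\|a\|$ to be the Euclidean norm (an assumption; the abstract does not spell out the norm). The function name `sign_sum_probabilities` and the floating-point tolerance `tol` are illustrative choices, not part of the paper.

```python
from itertools import product
from math import sqrt


def sign_sum_probabilities(a, tol=1e-12):
    """Return (P(|S_n| <= ||a||), P(|S_n| >= ||a||)) for S_n = sum_i a_i * eps_i.

    Every one of the 2^n sign patterns is equally likely, so each probability
    is a count divided by 2^n. `tol` absorbs floating-point rounding when
    |S_n| lies exactly on the boundary ||a||.
    """
    norm = sqrt(sum(x * x for x in a))
    total = 2 ** len(a)
    at_most = 0
    at_least = 0
    for signs in product((-1, 1), repeat=len(a)):
        s = abs(sum(x * e for x, e in zip(a, signs)))
        if s <= norm + tol:
            at_most += 1
        if s >= norm - tol:
            at_least += 1
    return at_most / total, at_least / total
```

For $a = (1, 1)$ the sums are $2, 0, 0, -2$ and $\|a\| = \sqrt{2}$, so the first probability is exactly $1/2$; this boundary case is why the non-strict forms are the ones checked here. For $a = (1, 1, 1)$ the second probability is $2/8 = 0.25$, which is above $7/32$. Enumeration costs $2^n$ evaluations, so it is only practical for small $n$.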
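To improve the performance of frequent itemset mining, this paper proposes the Frequent Itemsets Mining Approximate Algorithm based on Progressive Sampling (FIMAA-PS). The algorithm uses progressive sampling to draw samples from the dataset, automatically sets the sample size for the next mining round based on the output for the current sample, and uses Rademacher averages to derive a theoretical upper bound on the frequency deviation of the output, which yields the termination condition. Finally, it checks the termination condition with a single fast scan of the sample and outputs the mining results. Experimental results show that, compared with traditional exact mining algorithms and approximate mining algorithms that use static sampling, FIMAA-PS has significant advantages in output accuracy and running time.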
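The area under the ROC curve (AUC) is one of the standard evaluation metrics for problems such as class imbalance and bipartite ranking. This paper focuses on semi-supervised AUC optimization. Most existing methods are limited to semi-supervised AUC optimization with a single model, and few address how to combine multiple models through ensemble techniques. In view of this limitation, this paper studies semi-supervised AUC optimization based on model ensembles. Specifically, it proposes a semi-supervised AUC optimization algorithm based on Boosting, together with an acceleration strategy based on weight decoupling that reduces the algorithm's time and space complexity. On the optimization side, theoretical analysis shows that the proposed algorithm converges at an exponential rate as weak classifiers are added. On the generalization side, the paper constructs an upper bound on the generalization error of the proposed algorithm and proves that increasing the number of weak classifiers improves training-set performance without introducing a significant risk of overfitting. Finally, the algorithm is evaluated on 16 benchmark datasets; the experimental results show that it outperforms the competing methods in most cases at the 0.05 significance level and yields an average performance improvement of 0.9% to 11.28%.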
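As a reminder of what the metric measures, the AUC equals the fraction of (positive, negative) pairs that a scoring function ranks correctly, with ties counted as one half. The sketch below computes that pairwise form directly. It illustrates only the metric, not the paper's semi-supervised Boosting algorithm, and the function name `pairwise_auc` is illustrative.

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A pair counts 1 when the positive example scores higher than the
    negative one, 0.5 when the two scores tie, and 0 otherwise.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

With positive scores `[0.9, 0.8]` and negative scores `[0.1, 0.85]`, three of the four pairs are ordered correctly, giving an AUC of 0.75. The double loop is quadratic in the number of examples; sorting-based implementations are preferable for large datasets.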
Funding: Supported by grants from the Natural Sciences Research Council of China (60473034) and the Natural Sciences Research Council of Zhejiang Province (Y604003).
Funding: Research supported in part by the NSF of China under Grant Nos. 10801004 and 10871015, and in part by the Startup Grant for Doctoral Research of Beijing University of Technology.
Funding: Supported by the National Natural Science Foundation of China (Nos. 11771309, 11871184), the China Scholarship Council (No. 201809945013), and the Natural Sciences and Engineering Research Council of Canada (No. 4394-2018).