Abstract
The choice of training sample set strongly affects the classification accuracy and generalization ability of a neural network, and likewise the "bias-variance" dilemma in regression analysis. Classical simple sampling theory is hard to apply in practice: relationships among the data are complicated by noise and by the limits of domain knowledge, and the influence of outliers in particular cannot be ignored. When learning from a finite sample set, the best attainable result therefore depends not only on the algorithm but also on how the sample set is selected. Starting from the mathematical theory of learning, this paper first explains why a method for selecting the training sample set is necessary. It then proposes a martingale requirement on sample selection and a definition of outliers in the training sample set. Finally, for unsupervised learning on a finite sample set drawn from a mixture density distribution with an unknown number of classes, it proposes a clustering and outlier detection algorithm. Experiments on simulated data indicate that the algorithm is feasible and effective, and that it can detect samples produced by different mechanisms, namely outliers.
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journal (北大核心)
2006, No. 18, pp. 47-49 (3 pages)
Funding
Supported by the Anhui Province Research Funding Program for Young Teachers in Higher Education Institutions (No. 2004jq103)
Keywords
neural networks, regression analysis, martingale, outliers, unsupervised learning