Abstract
High-dimensional data analysis is a core task in machine learning and data mining. Dimensionality reduction algorithms reduce the number of dimensions by finding an optimal subspace for data representation. This lowers computational cost and improves the performance of subsequent classification or clustering algorithms, which makes dimensionality reduction an effective tool for high-dimensional data analysis. However, theoretical guidance for high-dimensional data analysis is still lacking. This paper reviews statistical and geometrical properties of high-dimensional data spaces, gives intuitive explanations of the "concentration of measure" phenomenon from several perspectives, and discusses how the choice of metric can improve the performance of classical distance-based machine learning algorithms on high-dimensional data. Experimental results show that fractional distance metrics can significantly improve the performance of K-nearest neighbor and K-means.
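The idea of a fractional distance and of distance concentration can be illustrated with a small sketch. The code below is not taken from the paper. It assumes the usual Minkowski form with an exponent 0 < p < 1 for the "fractional" case, and it uses the relative contrast (d_max - d_min) / d_min on uniformly sampled points as one common way to observe concentration. The function names `minkowski_distance`, `relative_contrast`, and `sample_uniform` are hypothetical and chosen only for illustration.

```python
import random


def minkowski_distance(x, y, p):
    """L_p dissimilarity between two equal-length vectors.

    For 0 < p < 1 this is the "fractional" distance (assumed form; it does
    not satisfy the triangle inequality, so it is not a true metric).
    """
    if p <= 0:
        raise ValueError("p must be positive")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def relative_contrast(points, query, p):
    """(d_max - d_min) / d_min over the distances from query to points."""
    dists = [minkowski_distance(query, pt, p) for pt in points]
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


def sample_uniform(n, dim, rng):
    """Draw n points uniformly from the unit cube [0, 1]^dim."""
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for dim in (2, 20, 200):
        points = sample_uniform(200, dim, rng)
        query = [rng.random() for _ in range(dim)]
        for p in (0.5, 1.0, 2.0):
            contrast = relative_contrast(points, query, p)
            print(f"dim={dim:4d}  p={p:3.1f}  relative contrast={contrast:.3f}")
```

On uniform synthetic data like this, the relative contrast typically shrinks as the dimension grows, which is one way to see why nearest and farthest neighbors become hard to tell apart. The exact numbers depend on the sample and say nothing about the datasets used in the paper.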
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal (北大核心)
2014, No. 3, pp. 212-217 (6 pages)
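Funding
Fundamental Research Funds for the Central Universities (2012211020209)
Guangdong Province-Ministry Industry-University-Research Cooperation Special Project (2011B090400477)
Zhuhai Industry-University-Research Cooperation Special Fund (2011A050101005, 2012D0501990016)
Zhuhai Key Laboratory Science and Technology Research Project (2012D0501990026)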
Keywords
High-dimensional data
Curse of dimensionality
Concentration of measure