Properties of High-dimensional Data Space and Metric Choice (cited by: 8)
Abstract High-dimensional data analysis is a core task in machine learning and data mining. By finding an optimal subspace for data representation, dimensionality reduction algorithms reduce the number of dimensions, lowering computational cost while also improving the performance of subsequent classification or clustering algorithms, which makes them an effective tool for high-dimensional data analysis. However, theoretical guidance for high-dimensional data analysis is currently lacking. This paper reviews the statistical and geometrical properties of high-dimensional data spaces, gives intuitive explanations of the "concentration of measure" phenomenon from several perspectives, and discusses how metric choice can improve the performance of classical distance-based machine learning algorithms on high-dimensional data. Experiments show that fractional distance metrics can significantly improve the performance of K-nearest neighbor and K-means.
Source Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2014, No. 3, pp. 212-217 (6 pages)
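Funding Supported by the Fundamental Research Funds for the Central Universities (2012211020209), the Guangdong Province Ministry-Province Industry-University-Research Cooperation Special Project (2011B090400477), the Zhuhai Industry-University-Research Cooperation Special Fund (2011A050101005, 2012D0501990016), and the Zhuhai Key Laboratory Science and Technology Project (2012D0501990026).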
Keywords High-dimensional data; Curse of dimensionality; Concentration of measure
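
The "concentration of measure" effect and the fractional distance mentioned in the abstract can be illustrated with a small numerical sketch. It is not the paper's experimental setup: the uniform random data, the relative-contrast measure (D_max - D_min) / D_min, and the function names `minkowski_distance` and `relative_contrast` are all assumptions made for illustration. The fractional distance is taken here to be the Minkowski form with exponent 0 < p < 1.

```python
import numpy as np


def minkowski_distance(x, y, p):
    """L_p distance between two vectors.

    For 0 < p < 1 this is the "fractional" distance; it does not satisfy
    the triangle inequality, so it is not a metric in the strict sense.
    """
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)


def relative_contrast(n_points, dim, p, seed=0):
    """(D_max - D_min) / D_min for one random query against random data.

    Data and query are drawn uniformly from the unit cube (an assumption
    of this sketch). Values near 0 mean that the nearest and the farthest
    point are almost equally far away from the query.
    """
    rng = np.random.default_rng(seed)
    data = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.sum(np.abs(data - query) ** p, axis=1) ** (1.0 / p)
    return (dists.max() - dists.min()) / dists.min()


if __name__ == "__main__":
    # Compare a fractional exponent with the Manhattan and Euclidean cases
    # as the dimension grows.
    for dim in (2, 20, 200, 2000):
        row = ", ".join(
            f"p={p}: {relative_contrast(500, dim, p):.3f}"
            for p in (0.5, 1.0, 2.0)
        )
        print(f"dim={dim:5d}  {row}")
```

Running the script prints the relative contrast for each dimension and exponent. Under these assumptions the contrast shrinks as the dimension grows, which is one way to see why nearest-neighbor queries become less meaningful in high dimensions and why the choice of exponent is worth examining.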