Error Evaluation and Emendation of Binned Kernel Density Estimators (分箱核密度估计的误差及其修正)

Cited by: 1

Abstract: The computational complexity of kernel density estimation (KDE) makes it difficult to apply to density function construction on large datasets. Approximating the kernel estimator by binning is an effective way to reduce that complexity. This paper proposes a method for correcting the error of the simple binned kernel estimator: the center of gravity of the data in each bin, rather than the bin center, is taken as the representative point of the data, so that the local distribution of the data is reflected more accurately. It is proved that the approximation accuracy of the method is O(δ⁴) (relative to the bandwidth), which matches the level of the linear binned kernel estimator. Experiments on synthetic and real-world datasets show that the revised simple binned estimator has good time efficiency and computational accuracy, and that it can be applied to cluster analysis of large datasets.
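The abstract describes the revised strategy only in words, so the following is a minimal sketch of the idea rather than the authors' implementation. It assumes one-dimensional data, a Gaussian kernel, and equal-width bins anchored at the data minimum; the names `exact_kde`, `binned_kde`, `h` (bandwidth), and `delta` (bin width) are illustrative choices, not taken from the paper. The sketch does not attempt to verify the O(δ⁴) result.

```python
import numpy as np


def gaussian_kernel(u):
    """Standard Gaussian kernel (an assumed choice; the abstract names no kernel)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)


def exact_kde(x_eval, data, h):
    """Ordinary KDE: one kernel per data point, O(len(x_eval) * len(data))."""
    u = (x_eval[:, None] - data[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (data.size * h)


def binned_kde(x_eval, data, h, delta, use_centroid=True):
    """Simple binned KDE: one weighted kernel per occupied bin.

    use_centroid=False places each bin's kernel at the bin center
    (classic simple binning); use_centroid=True places it at the mean
    of the points in the bin (the center of gravity described in the
    abstract).
    """
    lo = data.min()
    idx = np.floor((data - lo) / delta).astype(int)  # bin index of each point
    counts = np.bincount(idx)                         # number of points per bin
    occupied = counts > 0
    if use_centroid:
        sums = np.bincount(idx, weights=data)         # per-bin sum of values
        reps = sums[occupied] / counts[occupied]      # per-bin center of gravity
    else:
        reps = lo + (np.nonzero(occupied)[0] + 0.5) * delta  # bin midpoints
    u = (x_eval[:, None] - reps[None, :]) / h
    weighted = gaussian_kernel(u) * counts[occupied]
    return weighted.sum(axis=1) / (data.size * h)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.normal(size=5000)
    grid = np.linspace(-4.0, 4.0, 201)
    reference = exact_kde(grid, sample, h=0.3)
    for flag in (False, True):
        approx = binned_kde(grid, sample, h=0.3, delta=0.1, use_centroid=flag)
        print("centroid" if flag else "center", np.max(np.abs(approx - reference)))
```

Both variants evaluate one kernel per occupied bin instead of one per data point, which is where the saving on large datasets comes from. The centroid variant needs only one extra per-bin sum, so its construction cost is essentially the same as that of classic simple binning.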

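Authors: Li Cunhua (李存华), Ji Zhaohui (纪兆辉), Hu Yun (胡云)

Affiliation: School of Computer Engineering, Huaihai Institute of Technology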

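Source: Journal of Data Acquisition and Processing (《数据采集与处理》), CSCD, Peking University Core Journal, 2009, No. 2, pp. 212-217 (6 pages)

Funding: Natural Science Foundation of Jiangsu Province (BK20082140); Natural Science Foundation of the Jiangsu Provincial Department of Education (06KJB520005); Jiangsu Province "Six Talent Peaks" project (06-E-028)

Keywords: kernel density estimation; binning rule; error estimation

Classification: TP391 [Automation and Computer Technology - Computer Application Technology]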


