Abstract
The computational cost of kernel density estimation (KDE) makes it hard to apply to density function construction on large datasets, and approximating the kernel estimate by binning is an effective way to reduce that cost. This paper proposes a method that corrects the error of simple binned kernel estimation: the center of gravity of the data in each bin replaces the bin center as the representative point, which reflects the local distribution of the data more accurately. It is proved that the approximation error of the method is O(δ^4) (relative to the bandwidth), matching the level of linear binned kernel estimation. Experiments on synthetic and real-world datasets show that the revised simple binning method has good time efficiency and accuracy and can be applied to cluster analysis of large datasets.
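The idea of representing each bin by the center of gravity of its data, rather than by the bin center, can be illustrated with a minimal one-dimensional sketch. This is not the authors' implementation: the Gaussian kernel, the equal-width bins, the function names (`gaussian_kde_direct`, `bin_data`, `gaussian_kde_binned`) and all parameter values are assumptions chosen for illustration, and the sketch does not attempt to reproduce the paper's error analysis.

```python
import numpy as np


def gaussian_kde_direct(x_eval, data, h):
    """Ordinary Gaussian KDE; cost grows with len(x_eval) * len(data)."""
    u = (x_eval[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))


def bin_data(data, lo, hi, n_bins):
    """Return per-bin counts, geometric bin centers, and per-bin centers of gravity."""
    edges = np.linspace(lo, hi, n_bins + 1)
    # Index of the bin containing each point; the right edge goes to the last bin.
    idx = np.clip(np.searchsorted(edges, data, side="right") - 1, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    sums = np.bincount(idx, weights=data, minlength=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Empty bins fall back to the geometric center; their weight is zero anyway.
    gravity = np.where(counts > 0, sums / np.maximum(counts, 1), centers)
    return counts, centers, gravity


def gaussian_kde_binned(x_eval, points, counts, h):
    """Binned KDE: one kernel per representative point, weighted by the bin count."""
    u = (x_eval[:, None] - points[None, :]) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return (k * counts[None, :]).sum(axis=1) / (counts.sum() * h)


# Illustrative data and parameters (assumed, not taken from the paper).
rng = np.random.default_rng(0)
data = rng.normal(size=5000)
h = 0.3                      # bandwidth
x = np.linspace(-3.0, 3.0, 200)

counts, centers, gravity = bin_data(data, data.min(), data.max(), 40)
exact = gaussian_kde_direct(x, data, h)
est_center = gaussian_kde_binned(x, centers, counts, h)    # simple binning
est_gravity = gaussian_kde_binned(x, gravity, counts, h)   # center-of-gravity variant

print("max abs deviation, bin centers:      ", np.max(np.abs(est_center - exact)))
print("max abs deviation, centers of gravity:", np.max(np.abs(est_gravity - exact)))
```

Both binned variants evaluate only one kernel per bin instead of one per data point, which is where the time saving comes from. The center-of-gravity variant additionally preserves the first moment of the data within each bin, since the count-weighted sum of the representative points equals the sum of the original data.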
Source
Journal of Data Acquisition and Processing (《数据采集与处理》)
CSCD
Peking University Core Journals (北大核心)
2009, No. 2, pp. 212-217 (6 pages)
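Funding
Supported by the Natural Science Foundation of Jiangsu Province (BK20082140)
Supported by the Natural Science Foundation of the Jiangsu Provincial Department of Education (06KJB520005)
Supported by the Jiangsu Province "Six Talent Peaks" Project (06-E-028)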
Keywords
kernel density estimation
binning rule
error estimation