Funding: This work was supported by the National Basic Research 973 Program of China under Grant No. 2012CB316201, the National Natural Science Foundation of China under Grant Nos. 61033007 and 61472070, and the Fundamental Research Funds for the Central Universities of China under Grant No. N120816001.
Abstract: The distance-based outlier is a widely used definition of an outlier: a point is classified as an outlier on the basis of the distances to its nearest neighbors. To solve the problem of outlier computation in distributed environments, this paper proposes DBOZ, a distributed algorithm for distance-based outlier detection using a Z-curve hierarchical tree (ZH-tree). First, we propose a new index, the ZH-tree, to manage data effectively in a distributed environment. The ZH-tree has two desirable properties: a clustering property that helps search for the neighbors of a point, and a hierarchical structure that supports space pruning. We also design a bottom-up approach to build the ZH-tree in parallel, whose time complexity is linear in the number of dimensions and the size of the dataset. Second, DBOZ is proposed to compute outliers in distributed environments. It consists of two stages. 1) To avoid calculating the exact nearest neighbors of all the points, we design a greedy method and a new ZH-tree based k-nearest neighbor search algorithm (ZHkNN for short) to obtain a threshold LW. 2) We propose a filter-and-refine approach, which first filters out the unpromising points using LW and then outputs the final outliers by refining the remaining points. Finally, the efficiency and effectiveness of the ZH-tree and DBOZ are verified through a series of experiments.
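The two-stage idea in the abstract above can be illustrated with a small single-machine sketch. It is not the ZH-tree, ZHkNN, or DBOZ itself; the abstract does not give those details. The sketch assumes non-negative integer coordinates, assumes the outlier score is the sum of distances to the k nearest neighbors, and uses a plain sorted Z-order (Morton) list in place of a tree. The names `z_value`, `knn_weight`, and `top_n_outliers` are hypothetical. Only the name LW is taken from the abstract, and here it is simply the smallest exact score among n greedily chosen seed points.

```python
import heapq
import math


def z_value(coords, bits=8):
    """Morton (Z-order) code: interleave the low `bits` bits of each coordinate."""
    code = 0
    dims = len(coords)
    for bit in range(bits):
        for d, c in enumerate(coords):
            code |= ((c >> bit) & 1) << (bit * dims + d)
    return code


def knn_weight(points, i, k):
    """Exact score (assumed): sum of distances from points[i] to its k nearest neighbors."""
    dists = [math.dist(points[i], q) for j, q in enumerate(points) if j != i]
    return sum(heapq.nsmallest(k, dists))


def top_n_outliers(points, k, n, bits=8):
    """Filter-and-refine top-n outliers; returns indices, highest score first."""
    assert len(points) > k and 0 < n <= len(points)
    order = sorted(range(len(points)), key=lambda i: z_value(points[i], bits))

    # Cheap upper bound on each score: the distances to k neighbors along the
    # Z-order. Any k other points give a sum >= the true k-nearest sum.
    upper = {}
    for pos, i in enumerate(order):
        lo = max(0, min(pos - k // 2, len(order) - k - 1))
        window = [j for j in order[lo:lo + k + 1] if j != i]
        upper[i] = sum(math.dist(points[i], points[j]) for j in window)

    # Greedy seed: score the n most promising points exactly; the smallest of
    # their scores is a threshold LW that every final outlier must reach.
    seeds = heapq.nlargest(n, upper, key=upper.get)
    exact = {i: knn_weight(points, i, k) for i in seeds}
    lw = min(exact.values())

    # Filter: skip points whose upper bound is below LW. Refine: score the rest.
    for i in upper:
        if i not in exact and upper[i] >= lw:
            exact[i] = knn_weight(points, i, k)

    ranked = heapq.nlargest(n, ((w, i) for i, w in exact.items()))
    return [i for _, i in ranked]
```

For example, with a small cluster near the origin plus the points (200, 200) and (100, 0), `top_n_outliers(points, k=2, n=2)` scores only those two points exactly and filters out every cluster point by its upper bound. The filter is safe because a filtered point has a score strictly below LW, and at least n points already have scores of at least LW.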
Funding: Supported by the Chinese National Key Fundamental Research Program (No. G1998030414), the National Natural Science Foundation of China (No. 79990580), and the "985" Program of Tsinghua University.
Abstract: An index structure that enables efficient similarity queries in high-dimensional space is crucial for many applications. This paper discusses the indexing problem for datasets composed of partially clustered data, which arise in many applications. Current index methods are inefficient on partially clustered datasets. The dynamic and adaptive index structure presented here, called a multi-cluster tree (MC-tree), consists of a set of height-balanced trees for indexing. This index structure improves query efficiency in three ways: 1) Most bounding regions achieve uniform distributions, which results in fewer splits and less overlap than a single indexing tree. 2) The clusters in the dataset are detected dynamically when the index is updated. 3) The query process does not involve a sequential scan. The MC-tree was shown to perform better than hierarchical and cluster-based indexes on partially clustered datasets.
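Abstract: This paper describes the construction and application of a multidimensional dataset (cube) in a computer monitoring system for looms. To address two problems of existing loom monitoring systems, namely wasted data resources and low retrieval efficiency, a multidimensional dataset for the loom monitoring system was built with SQL Server Business Intelligence Development Studio, and two ways of browsing multidimensional data were analyzed. The results show that when Microsoft Office Excel 2007 is used to browse and share the multidimensional dataset, user training time is short, browsing is simple, the report development cycle is short, and flexibility and report generation speed are high. It is concluded that a loom computer monitoring system based on a multidimensional dataset can provide help and support for actual production in enterprises.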
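Abstract: In June 2017 the authors published version 1.0 of the remote sensing multidimensional data format interoperability analysis software (MARS-Multidimensional Analysis of Remote Sensing V1.0). Building on it, they now release an updated version of the system (MARS v2.03). This version can process the data format proposed by the authors that links the temporal, spatial, and spectral characteristics of remote sensing products into a whole, namely the "Multi-Dimensional Data Format (MDD)". The format comprises the associated organizational relationships and data organization structures formed by five data storage formats: TSB (Temporal Sequential in Band), TSP (Temporal Sequential in Pixel), TIB (Temporal Interleaved by Band), TIP (Temporal Interleaved by Pixel), and TIS (Temporal Interleaved by Spectrum). The system provides input, storage, analysis, output, and format conversion for .mdd data; image preprocessing, spectral analysis, and remote sensing data classification; and Vegetation Index calculation. Some redundancies and errors in the system have been removed so that it runs more smoothly. The updated version retains the DOI registration number of the original version.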