摘要
为了在海量、高维、动态的半结构化数据集上进行有效的相似搜索,该文提出一种采用聚类技术进行索引构建与更新的多路平衡树——CSS-树以及基于CSS-树的相似搜索与动态更新的算法.CSS-树借鉴SS+-树基于聚类进行节点组织与分裂的基本思想,避免了根据坐标维进行分裂时所要求的维不相关性,同时在节点组织、分裂算法和搜索算法等方面进行了改进,提出了新的搜索剪枝策略.实验表明,该结构及算法对海量半结构化数据相似搜索的效率明显优于传统算法.
A new index, called CSS-tree, is proposed to organize and search dynamic high-dimension vast semi-structure data set. The CSS-tree is a multi-way balance tree, which is combining the benefit of R-tree and SS-tree to deal with high-dimension vast data sets, and the benefit of M-tree to deal with 'metric space' data sets. This paper details the structure of CSS-tree, whose each inner node is composed of a group of index elements including cover center and cover radius of child tree and every leaf is in same level and all data indices is in leaves. The paper give algorithms for similarity search based CSS-tree both range search and k-NN search, and dynamic update algorithms of the CSS-tree. It describes the simply split policy which reference to CF-tree's split policy of BIRTH, and reorganizing algorithms which using clustering technique to keep the index elements that the similar elements are neighbor in the index tree, and avoid the need of independent between feather values. It also describes how to keep minimum cover space and overlap space. Using simulation data sets and using part of 'Chinese Encyclopedia Database' as data set, which is on XML document set, experiments show that the CSS-tree is close to SS+-tree and M-tree in building tree, but CSS-tree outperforms both SS+-tree and M-tree in similarity search in semi-structured data sets.
出处
《计算机学报》
EI
CSCD
北大核心
2002年第11期1219-1226,共8页
Chinese Journal of Computers