Abstract
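Similarity measurement is a core problem in tasks such as clustering. This paper studies similarity measures for XML retrieval results and proposes a new measure that covers both structure and content. On the structural side, two structural similarity measures are proposed: vertical structural similarity and horizontal structural similarity. They are based on different feature sets and capture different aspects of structural similarity. On the content side, a content model that carries structural information is used to describe content, and a content similarity measure is defined on this model. Finally, experiments on both real and synthetic datasets show that the structural and content similarity measures are both highly accurate.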
As XML has become a de facto standard for formatting and exchanging data on the web and in digital library and scientific applications, there is an increasing need to manage, cluster and retrieve XML data. XML information retrieval is one of the most active areas in database and information retrieval research. In information retrieval, organizing the retrieval results is an important and effective technique. For example, result clustering has been studied and shown to improve retrieval quality. When information retrieval meets XML, it is natural to borrow traditional techniques such as result clustering and extend them to XML retrieval. Clustering XML retrieval results, however, is non-trivial, and techniques built for traditional information retrieval cannot be applied directly. The core of clustering is the similarity measure between data objects, and the similarity measure for XML retrieval results is still an open problem.
In this paper, we study similarity measures for XML retrieval results and propose novel structural and content similarity measures. First, to remove redundant information, we compute structural summaries of the document trees, which reduces the original documents. The summary tree (i.e., the structural summary) still carries a lot of structural information. To describe the summary tree comprehensively, we propose two feature sets that reflect its structural features from different perspectives and complement each other. Corresponding to these feature sets, we present a two-dimensional structural similarity measure comprising horizontal structural similarity and vertical structural similarity. Each represents similarity from one particular perspective, and their combination gives an accurate structural similarity measure. For content, we propose a structural content model to describe the content, and present a content similarity measure based on this model. Finally, the overall similarity of two XML retrieval results is composed of the structural similarity measure and the content similarity measure. A comprehensive set of experiments is conducted. Experimental results on real and synthetic datasets show that the proposed structural and content similarity measures are accurate.
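The abstract does not define the two feature sets, the content model, or how the measures are combined, so the following is only a minimal sketch of the overall pipeline under stated assumptions, not the authors' method. It assumes that the structural summary merges same-tag siblings, that vertical features are root-to-node label paths, that horizontal features are the sets of child tags under each parent, that content is modeled as (tag, term) pairs, and that Jaccard similarity with fixed weights stands in for the paper's actual formulas. All function names and the parameters `alpha` and `w_struct` are hypothetical.

```python
import xml.etree.ElementTree as ET


def summarize(elem):
    """Assumed structural summary: merge same-tag siblings into one node.

    Returns a nested dict {child_tag: {grandchild_tag: {...}}}.
    """
    children = {}
    for child in elem:
        _merge(children.setdefault(child.tag, {}), summarize(child))
    return children


def _merge(target, source):
    """Recursively fold the summary `source` into `target`."""
    for tag, sub in source.items():
        _merge(target.setdefault(tag, {}), sub)


def vertical_features(tag, summary, prefix=""):
    """Assumed vertical feature set: every root-to-node label path."""
    path = f"{prefix}/{tag}"
    feats = {path}
    for child_tag, sub in summary.items():
        feats |= vertical_features(child_tag, sub, path)
    return feats


def horizontal_features(tag, summary):
    """Assumed horizontal feature set: (parent tag, sorted child tags)."""
    feats = set()
    if summary:
        feats.add((tag, tuple(sorted(summary))))
    for child_tag, sub in summary.items():
        feats |= horizontal_features(child_tag, sub)
    return feats


def content_terms(elem):
    """Assumed structural content model: terms paired with their element tag."""
    terms = set()
    for node in elem.iter():
        for word in (node.text or "").lower().split():
            terms.add((node.tag, word))
    return terms


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(xml_a, xml_b, alpha=0.5, w_struct=0.5):
    """Weighted combination of structural and content similarity.

    alpha balances vertical vs. horizontal similarity; w_struct balances
    structure vs. content. Both weights are hypothetical.
    """
    a, b = ET.fromstring(xml_a), ET.fromstring(xml_b)
    sum_a, sum_b = summarize(a), summarize(b)
    vert = jaccard(vertical_features(a.tag, sum_a),
                   vertical_features(b.tag, sum_b))
    horiz = jaccard(horizontal_features(a.tag, sum_a),
                    horizontal_features(b.tag, sum_b))
    struct = alpha * vert + (1 - alpha) * horiz
    content = jaccard(content_terms(a), content_terms(b))
    return w_struct * struct + (1 - w_struct) * content


doc_a = ("<book><title>XML retrieval</title>"
         "<author>Li</author><author>Wang</author></book>")
doc_b = "<book><title>XML clustering</title><author>Li</author></book>"
doc_c = "<article><title>XML</title></article>"

print(similarity(doc_a, doc_b))
print(similarity(doc_a, doc_c))
```

In this sketch `doc_a` and `doc_b` share the same summary tree even though `doc_a` repeats the `author` element, so they differ only in content, while `doc_c` differs in both structure and content and scores lower.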
Source
《南京大学学报(自然科学版)》
CAS
CSCD
PKU Core (Peking University Core Journals)
2009, No. 5, pp. 629-637 (9 pages)
Journal of Nanjing University (Natural Science)
Funding
National Natural Science Foundation of China (60763001, 60803105/F020606)
National Social Science Foundation of China (07BTQ025)
Keywords
XML retrieval result
similarity measure
structural similarity
content similarity