
Effective similarity measure for XML retrieval results (cited by: 3)
Abstract: Similarity measurement is a core problem in clustering and related tasks. This paper studies similarity measures for XML retrieval results and proposes a new measure that covers both structure and content. On the structural side, two structural similarity measures are proposed, vertical structural similarity and horizontal structural similarity; they are based on different feature sets and capture different aspects of structure. On the content side, content is described with a structure-aware content model, and a content similarity is defined on that model. Experiments on both real and synthetic datasets show that the structural and content similarities are both highly accurate.

As XML has become a de facto standard for formatting and exchanging data on the web and in digital library and scientific applications, there is an increasing need to manage, cluster, and retrieve XML data. XML information retrieval is one of the most active areas of database and information retrieval research. In information retrieval, organizing the retrieval results is an important and effective technique; result clustering, for example, has been studied and shown to improve retrieval quality. When information retrieval meets XML, it is natural to borrow traditional techniques such as result clustering and extend them to XML retrieval. Clustering XML retrieval results, however, is non-trivial, and techniques built for traditional information retrieval cannot be applied directly. The core of clustering is the similarity measure between data objects, and the similarity measure for XML retrieval results is still an open problem. In this paper, we study similarity measures for XML retrieval results and propose novel structural and content similarity measures. First, to remove redundant information, we compute the structural summaries of the document trees, which reduces the original documents. The summary tree (i.e., the structural summary) still carries a great deal of structural information. To describe the summary tree comprehensively, the paper proposes two feature sets, which reflect the structural features of the summary tree from different perspectives and complement each other. Corresponding to these feature sets, we present a two-dimensional structural similarity measure comprising two similarities: horizontal structural similarity and vertical structural similarity. Each represents similarity from one particular perspective, and their combination yields an accurate structural similarity measure. For content, we propose a structural content model to describe the content, and a content similarity measure is defined on that model. Finally, the overall similarity of two XML retrieval results is composed of the structural similarity measure and the content similarity measure. A comprehensive set of experiments is conducted. Experimental results on real and synthetic datasets show that the proposed structural and content similarity measures are accurate.
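The abstract names the ingredients of the measure (a structural summary, a vertical and a horizontal feature set, a structure-aware content model, and a combined score) but does not give their definitions. The following is only a minimal sketch of how such a composition could look, not the paper's method. Every concrete choice here is an assumption: root-to-node label paths as the vertical features, parent-plus-child-label sets as the horizontal features, (element label, term) pairs as the content model, Jaccard overlap as the comparison, and the weights `w_struct` and `w_vert`. Collecting features into sets stands in for the structural summary, since repeated substructures collapse into one entry.

```python
import xml.etree.ElementTree as ET


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def vertical_features(root):
    """Distinct root-to-node label paths (assumed 'vertical' feature set)."""
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def horizontal_features(root):
    """(label, set of distinct child labels) per node (assumed 'horizontal' feature set)."""
    return {(node.tag, frozenset(child.tag for child in node)) for node in root.iter()}


def content_features(root):
    """(element label, lowercase term) pairs: text terms tagged with their element."""
    terms = set()
    for node in root.iter():
        for word in (node.text or "").split():
            terms.add((node.tag, word.lower()))
    return terms


def similarity(xml_a, xml_b, w_struct=0.5, w_vert=0.5):
    """Weighted mix of structural and content similarity (weights are hypothetical)."""
    a, b = ET.fromstring(xml_a), ET.fromstring(xml_b)
    vert = jaccard(vertical_features(a), vertical_features(b))
    horiz = jaccard(horizontal_features(a), horizontal_features(b))
    structural = w_vert * vert + (1 - w_vert) * horiz
    content = jaccard(content_features(a), content_features(b))
    return w_struct * structural + (1 - w_struct) * content


doc1 = "<paper><title>XML retrieval</title><author>Li</author></paper>"
doc2 = "<paper><title>XML clustering</title><author>Li</author></paper>"
doc3 = "<book><isbn>123</isbn></book>"

print(similarity(doc1, doc1))  # identical results
print(similarity(doc1, doc2))  # same structure, partly different content
print(similarity(doc1, doc3))  # nothing shared
```

In this sketch, two results with the same element layout but different text score high on structure and lower on content, which is the kind of separation a two-part measure is meant to expose. A linear combination is just the simplest way to merge the parts.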
Source: Journal of Nanjing University (Natural Science) [indexed in CAS, CSCD, Peking University Core Journals], 2009, No. 5, pp. 629-637 (9 pages)
Funding: National Natural Science Foundation of China (60763001, 60803105/F020606); National Social Science Foundation of China (07BTQ025)
Keywords: XML retrieval result, similarity measure, structural similarity, content similarity