Effective similarity measure for XML retrieval results (有效的XML检索结果的相似性度量)

Cited by: 3

Abstract

As XML has become a de facto standard for formatting and exchanging data on the web and in digital library and scientific applications, there is an increasing need to manage, cluster and retrieve XML data. XML information retrieval is one of the most active areas in database and information retrieval research. In information retrieval, organizing the retrieval results is an important and effective technique; result clustering, for example, has been studied and shown to improve retrieval quality. When information retrieval meets XML, it is natural to extend traditional techniques such as result clustering to XML retrieval. Clustering XML retrieval results, however, is non-trivial, and techniques built for traditional information retrieval cannot be applied directly. The core of clustering is the similarity measure between data objects, and the similarity measure for XML retrieval results is still an open problem.

In this paper, we study similarity measures for XML retrieval results and propose novel structural and content similarity measures. First, to remove redundant information, we compute the structural summaries of the document trees to reduce the original documents. A summary tree (i.e., a structural summary) still carries a lot of structural information. To describe the summary tree comprehensively, the paper proposes two feature sets, which reflect the structural features of the summary tree from different perspectives and complement each other. Corresponding to these feature sets, we present a two-dimensional structural similarity measure comprising two similarities: horizontal structural similarity and vertical structural similarity. Each represents similarity from one particular perspective, and their combination gives an accurate structural similarity measure. For content, we propose a structural content model to describe the content, and present a content similarity measure based on this model. Finally, the overall similarity measure of two XML retrieval results is composed of the structural similarity measure and the content similarity measure. A comprehensive set of experiments is conducted. Experimental results on both real and synthetic datasets show that the proposed structural and content similarity measures achieve good accuracy.
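The abstract does not give the paper's feature definitions, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes that vertical features are distinct root-to-node tag paths (which also collapses repeated siblings, a crude stand-in for the structural summary), that horizontal features are pairs of distinct sibling tags under the same parent path, that content is modelled as (path, word) pairs, and that each similarity is a Jaccard coefficient. The function names and the weight `alpha` are hypothetical.

```python
import xml.etree.ElementTree as ET


def iter_with_paths(node, prefix=""):
    """Yield (tag path, element) for every element in the tree."""
    path = f"{prefix}/{node.tag}"
    yield path, node
    for child in node:
        yield from iter_with_paths(child, path)


def vertical_features(root):
    # Assumption: distinct root-to-node tag paths. Repeated siblings with
    # the same tag collapse to one path, acting as a crude summary tree.
    return {path for path, _ in iter_with_paths(root)}


def horizontal_features(root):
    # Assumption: pairs of distinct sibling tags under the same parent path.
    pairs = set()
    for path, node in iter_with_paths(root):
        tags = sorted({child.tag for child in node})
        for i, first in enumerate(tags):
            for second in tags[i + 1:]:
                pairs.add((path, first, second))
    return pairs


def content_features(root):
    # Assumption: a word is tied to the path of the element containing it,
    # so the same word under different elements counts as different content.
    return {
        (path, word)
        for path, node in iter_with_paths(root)
        for word in (node.text or "").lower().split()
    }


def jaccard(a, b):
    """Jaccard coefficient; two empty feature sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(xml_a, xml_b, alpha=0.5):
    """Return the component similarities and a weighted overall score.

    alpha (hypothetical) weights structure against content.
    """
    a, b = ET.fromstring(xml_a), ET.fromstring(xml_b)
    vertical = jaccard(vertical_features(a), vertical_features(b))
    horizontal = jaccard(horizontal_features(a), horizontal_features(b))
    content = jaccard(content_features(a), content_features(b))
    structure = (vertical + horizontal) / 2
    overall = alpha * structure + (1 - alpha) * content
    return {
        "vertical": vertical,
        "horizontal": horizontal,
        "content": content,
        "overall": overall,
    }
```

Calling `similarity(doc_a, doc_b)` on two XML strings returns the three component scores together with the combined score, which makes it easy to see whether two results differ in structure, in content, or in both.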

Authors: Liu Xiping (刘喜平), Wan Changxuan (万常选)

Affiliation: School of Information Management, Jiangxi University of Finance and Economics

Source: Journal of Nanjing University (Natural Science) (《南京大学学报（自然科学版）》), indexed in CAS, CSCD and the Peking University Core Journals list; 2009, No. 5, pp. 629-637 (9 pages)

Funding: National Natural Science Foundation of China (60763001, 60803105/F020606); National Social Science Foundation of China (07BTQ025)

Keywords: XML retrieval result; similarity measure; structural similarity; content similarity

Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]
