基于语义和结构的XML文档相似度的计算方法 (cited by: 3)

XML Document Similarity Measure Based on Semantics and Structure


Abstract: Personalized information services learn users' interests in order to provide different information services to different users. XML is a markup language and a widely used standard for representing and exchanging documents on the Web, so measuring the similarity between XML documents is important for personalized recommendation and information retrieval. This paper proposes XMLSim, a method for computing the semantic and structural similarity between XML documents. First, the similarity between a pair of node tags is computed from their semantic similarity and their edit distance (string similarity). Next, after analyzing the partial order among the nodes on a path, the path similarity problem is abstracted as a Maximal Similar Subsequence (MSS) problem, which is solved by dynamic programming to obtain the path similarity NPathSim. Finally, the document similarity XMLSim is obtained as the average of the maximum NPathSim values between the two documents' path sets.
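The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation, and every formula in it is an assumption: the abstract does not say how semantic similarity and edit distance are combined (here the larger of the two is taken), how NPathSim is normalized (here by the longer path length), or whether the final average is symmetric (here it is taken over both directions). The `SYNONYMS` table is a hypothetical stand-in for a real semantic resource, and all function names are illustrative.

```python
from typing import List, Sequence

Path = Sequence[str]  # a root-to-leaf path, written as a sequence of node tags


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def string_sim(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Hypothetical semantic scores; a real system would query a lexical database.
SYNONYMS = {frozenset({"author", "writer"}): 0.9}


def tag_sim(a: str, b: str) -> float:
    """Assumed combination: the larger of semantic and string similarity."""
    a, b = a.lower(), b.lower()
    semantic = 1.0 if a == b else SYNONYMS.get(frozenset({a, b}), 0.0)
    return max(semantic, string_sim(a, b))


def npath_sim(p: Path, q: Path) -> float:
    """Order-preserving alignment of two paths (an LCS-style DP with
    weighted matches), normalized by the longer path length (assumed)."""
    if not p or not q:
        return 0.0
    dp = [[0.0] * (len(q) + 1) for _ in range(len(p) + 1)]
    for i in range(1, len(p) + 1):
        for j in range(1, len(q) + 1):
            dp[i][j] = max(
                dp[i - 1][j],                                   # skip p[i-1]
                dp[i][j - 1],                                   # skip q[j-1]
                dp[i - 1][j - 1] + tag_sim(p[i - 1], q[j - 1]),  # align both
            )
    return dp[-1][-1] / max(len(p), len(q))


def xml_sim(paths_a: List[Path], paths_b: List[Path]) -> float:
    """Average of each path's best NPathSim against the other document's
    path set, taken in both directions (assumed symmetric form)."""
    if not paths_a or not paths_b:
        return 0.0
    best_a = [max(npath_sim(p, q) for q in paths_b) for p in paths_a]
    best_b = [max(npath_sim(q, p) for p in paths_a) for q in paths_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))
```

For example, `xml_sim([["book", "title"], ["book", "author"]], [["book", "title"], ["book", "writer"]])` scores the two documents as highly similar, because `author` and `writer` are matched through the synonym table even though their spellings differ. Because the alignment preserves tag order, the sketch also penalizes paths that contain the same tags in a different ancestor-descendant order.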

Authors: 宋玲 (Song Ling), 吕强 (Lü Qiang), 邓薇 (Deng Wei), 吕晓琳 (Lü Xiaolin)

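Affiliations: School of Control Science and Engineering, Shandong University; Power Grid Maintenance Training Department, State Grid Technology College; Department of Basic Courses, Shandong University of Science and Technology; Business School, Tianjin University of Finance and Economics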

Source: 《中文信息学报》 (Journal of Chinese Information Processing), indexed in CSCD and the Peking University Core Journals list, 2012, No. 5, pp. 59-64 (6 pages)

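Funding: National Natural Science Foundation of China (61170052); Shandong Higher Education Society "Twelfth Five-Year Plan" Higher Education Research Project (YBKT2011063); Shandong Jianzhu University Doctoral Fund (XNBS1028)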

Keywords: XML; similarity; dynamic programming; semantics and structure

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]


