Similarity Measure of Multi-Source XML Data by Means of Data Source-Sensitivity


Abstract: When preprocessed XML data are treated as text and handled with the term frequency-inverse document frequency (TF-IDF) model, using the inverse document frequency (IDF) alone as the term weight has shortcomings. To address this, the paper defines the data source-sensitivity of a term as a correction coefficient for the IDF. Its value depends on the probability that the data containing the term come from different data sources: the higher the probability, the larger the value, and vice versa. A similarity function is then defined on the corrected term weight vectors. Finally, duplicate detection experiments are carried out on simulated and real datasets. The results show that the new method achieves a higher F-measure, which indicates that taking the data source-sensitivity of terms into account improves the effectiveness of the similarity function.

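Authors: Wang Jikui (王继奎), Li Shaobo (李少波)

Affiliations: Chengdu Institute of Computer Applications, Chinese Academy of Sciences; Key Laboratory of Advanced Manufacturing Technology of the Ministry of Education, Guizhou University

Source: Journal of South China University of Technology (Natural Science Edition), 2014, No. 7, pp. 28-32 (5 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.

Funding: National Key Technology R&D Program of China (2012BAF12B14, 2012BAH62F01); Guizhou Provincial Science and Technology Projects (黔科合重大专项字[2012]6021, 黔科合计工字[2012]4009)

Keywords: XML; data integration; text processing; data source-sensitivity

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]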
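
The abstract describes the weighting scheme only in words, so the following Python sketch is one possible reading of it, not the authors' formulas. It assumes that the data source-sensitivity of a term is the probability that two records containing the term, drawn with replacement, come from different sources, and that the similarity function is the cosine of the corrected weight vectors. The names `source_sensitivity`, `build_weights`, and `cosine`, as well as the smoothed IDF, are hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def source_sensitivity(source_counts):
    """Assumed form of the coefficient: probability that two records
    containing the term (drawn with replacement) come from different sources."""
    total = sum(source_counts.values())
    return 1.0 - sum((n / total) ** 2 for n in source_counts.values())


def build_weights(records):
    """records: list of (source_id, list_of_terms) pairs, one per XML record.
    Returns one {term: weight} dict per record, with weight = TF * IDF * sensitivity."""
    n = len(records)
    df = Counter()                      # term -> number of records containing it
    per_source = defaultdict(Counter)   # term -> source -> number of records
    for source, terms in records:
        for t in set(terms):
            df[t] += 1
            per_source[t][source] += 1

    vectors = []
    for _source, terms in records:
        tf = Counter(terms)
        vec = {}
        for t, f in tf.items():
            # Smoothed IDF (assumption) so terms present in every record keep a weight.
            idf = math.log(n / df[t]) + 1.0
            vec[t] = (f / len(terms)) * idf * source_sensitivity(per_source[t])
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse weight vectors; 0.0 if either is all zero."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


# Toy records from two hypothetical sources "s1" and "s2".
records = [
    ("s1", ["xml", "data", "smith"]),
    ("s2", ["xml", "data", "smith"]),
    ("s1", ["xml", "query", "jones"]),
]
vectors = build_weights(records)
sim_ab = cosine(vectors[0], vectors[1])
sim_ac = cosine(vectors[0], vectors[2])
```

Under this assumed definition, a term seen in only one source receives a coefficient of zero, so it does not contribute to the similarity; a smoothed variant might be preferable in practice.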



