Similarity Measure of Multi-Source XML Data by Means of Data Source-Sensitivity
Abstract: When preprocessed XML data are treated as text and processed with the term frequency-inverse document frequency (TF-IDF) model, using the inverse document frequency (IDF) alone as the term weight has shortcomings. To address this, the paper defines the data source-sensitivity of a term as a correction coefficient for the IDF. Its value depends on the probability that the data providing the term come from different data sources: the higher the probability, the larger the value, and vice versa. A similarity function is then defined on the corrected term-weight vectors. Finally, duplicate-detection experiments on multi-source XML data are carried out on simulated and real datasets. The results show that the new method achieves higher F-measure values, which indicates that taking the data source-sensitivity of terms into account improves the effectiveness of the similarity function.
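Source: Journal of South China University of Technology (Natural Science Edition) (《华南理工大学学报(自然科学版)》), indexed in EI, CAS, CSCD, and the Peking University Core Journals list; 2014, No. 7, pp. 28-32 (5 pages).
Funding: National Key Technology R&D Program of China (2012BAF12B14, 2012BAH62F01); Guizhou Provincial Science and Technology Projects (黔科合重大专项字[2012]6021, 黔科合计工字[2012]4009).
Keywords: XML; data integration; text processing; data source-sensitivity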
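The weighting scheme outlined in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything concrete here is an assumption made for illustration: the sensitivity of a term is taken to be the probability that two records containing it, drawn with replacement, come from different sources; the similarity function is plain cosine similarity; and the names `source_sensitivity`, `weight_vectors`, `cosine`, and the toy `records` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def source_sensitivity(records):
    """Assumed sensitivity per term: chance that two records containing the
    term (drawn with replacement) come from different sources.
    This is an illustrative stand-in, not the paper's definition."""
    per_term = defaultdict(Counter)
    for source, tokens in records:
        for term in set(tokens):
            per_term[term][source] += 1
    sens = {}
    for term, by_source in per_term.items():
        n = sum(by_source.values())
        sens[term] = 1.0 - sum((c / n) ** 2 for c in by_source.values())
    return sens


def weight_vectors(records):
    """Build one sparse vector per record: tf * idf * sensitivity."""
    n_docs = len(records)
    df = Counter(term for _, tokens in records for term in set(tokens))
    sens = source_sensitivity(records)
    vectors = []
    for _, tokens in records:
        tf = Counter(tokens)
        vectors.append(
            {t: tf[t] * math.log(n_docs / df[t]) * sens[t] for t in tf}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy data: (source id, tokens extracted from a preprocessed XML record).
records = [
    ("A", ["xml", "similarity", "smith"]),
    ("B", ["xml", "similarity", "smith"]),  # same entity, different source
    ("A", ["xml", "index", "jones"]),
    ("A", ["graph", "index", "lee"]),
]
sens = source_sensitivity(records)
vectors = weight_vectors(records)
```

Under this assumed definition, a term that is only ever supplied by one source gets sensitivity 0 and drops out of the comparison, while terms shared across sources keep part of their IDF weight. Calling `cosine(vectors[0], vectors[1])` compares the two cross-source records, and a duplicate detector would flag pairs whose score exceeds a chosen threshold.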

