
Research on Multi-Truth Finding Based on the EM Algorithm (cited by: 1)
Abstract Many web applications need to fuse information from different data sources. For the same attribute of the same entity, however, different sources may provide several different or even mutually conflicting values. Deciding how reliable each source is and which of the claimed facts are true, known as the truth finding problem, is attracting growing attention. This paper addresses multi-truth finding in web data integration and proposes an iterative method based on Bayesian analysis and maximum likelihood estimation that couples each truth finding step with the estimation of source reliability. First, a likelihood function is constructed from two basic principles: sources that provide more true information are more reliable, and facts provided by reliable sources are more likely to be true. The truth of each fact is treated as a latent variable, and source quality indicators covering both correctness and incorrectness serve as the model parameters. Then the E step (computing the probability that each fact is true) and the M step (estimating the quality of each source) are executed iteratively until the parameters converge. Finally, experiments on real data sets show that the method improves the accuracy of truth finding and effectively resolves multi-valued conflicts during data fusion.
Authors Chen Chao (陈超), Cui Hongxia (崔红霞)
Source Journal of Bohai University: Natural Science Edition (《渤海大学学报(自然科学版)》), CAS, 2017, No. 3, pp. 268-274 (7 pages)
Funding National Natural Science Foundation of China (No. 41371425); Liaoning Province Education Science Planning Project (No. JB17DB016)
Keywords truth finding; Bayesian analysis; EM algorithm; multi-truth; data fusion
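
The E step / M step loop described in the abstract can be illustrated with a small sketch. This is not the authors' model, which this record does not spell out. It assumes a simple two-parameter source model in which each source has a `recall` (probability of asserting a fact that is true) and an `fpr` (probability of asserting a fact that is false), as one possible reading of the "correctness and incorrectness" quality indicators. It also assumes that every source covers every fact, so silence counts as evidence against a fact. The function name `em_truth_finding` and the parameters `prior` and `smooth` are hypothetical.

```python
import math


def em_truth_finding(claims, prior=0.5, max_iter=100, tol=1e-6, smooth=1.0):
    """Illustrative EM loop for multi-truth finding (assumed model, see lead-in).

    claims[s][f] is 1 if source s asserts fact f, else 0.
    Returns (p_true, recall, fpr).
    """
    n_sources, n_facts = len(claims), len(claims[0])
    # Start with sources assumed better than chance; this also breaks the
    # symmetry between the "true" and "false" labels.
    recall = [0.8] * n_sources  # P(source asserts fact | fact is true)
    fpr = [0.2] * n_sources     # P(source asserts fact | fact is false)
    p_true = [prior] * n_facts

    for _ in range(max_iter):
        # E step: posterior probability that each fact is true,
        # accumulated in log space for numerical stability.
        for f in range(n_facts):
            log_t, log_f = math.log(prior), math.log(1.0 - prior)
            for s in range(n_sources):
                if claims[s][f]:
                    log_t += math.log(recall[s])
                    log_f += math.log(fpr[s])
                else:
                    log_t += math.log(1.0 - recall[s])
                    log_f += math.log(1.0 - fpr[s])
            diff = min(log_f - log_t, 700.0)  # guard against exp overflow
            p_true[f] = 1.0 / (1.0 + math.exp(diff))

        # M step: re-estimate source quality from the soft truth labels.
        # Additive smoothing keeps every parameter strictly inside (0, 1).
        total_true = sum(p_true)
        total_false = n_facts - total_true
        new_recall, new_fpr = [], []
        for s in range(n_sources):
            hits = sum(p * c for p, c in zip(p_true, claims[s]))
            false_alarms = sum((1.0 - p) * c for p, c in zip(p_true, claims[s]))
            new_recall.append((hits + smooth) / (total_true + 2.0 * smooth))
            new_fpr.append((false_alarms + smooth) / (total_false + 2.0 * smooth))

        # Stop once no parameter moved by more than tol.
        delta = max(abs(a - b)
                    for a, b in zip(recall + fpr, new_recall + new_fpr))
        recall, fpr = new_recall, new_fpr
        if delta < tol:
            break

    return p_true, recall, fpr


# Toy data (made up for illustration): rows are sources, columns are facts.
toy_claims = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
]
p_true, recall, fpr = em_truth_finding(toy_claims)
accepted = [f for f, p in enumerate(p_true) if p > 0.5]
```

Because several facts about the same entity can exceed the 0.5 threshold at once, the sketch accepts more than one true value per entity, which is the multi-truth setting. The fixed `prior` and the smoothing constant are design simplifications; a fuller model would likely estimate them or handle sources that cover only some entities.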
