Abstract
Many web applications need to fuse data from multiple data sources. However, different sources may provide different or even conflicting values for the same attribute of the same entity. Judging the reliability of data sources and distinguishing true facts from false ones, known as the truth discovery problem, is attracting increasing attention. For the multi-truth discovery problem in web data integration, this paper proposes an iterative method based on Bayesian analysis and maximum likelihood estimation that tightly couples each step of truth discovery with the estimation of source reliability. First, a likelihood function is constructed from two basic principles: sources that provide more true information are more reliable, and facts provided by reliable sources are more likely to be true. In the model, the truth of each fact is a latent variable, and the parameters are source-quality measures covering both correctness and incorrectness. Then the E-step (computing the probability that each fact is true) and the M-step (estimating the quality of the data sources) are executed iteratively until the parameters converge. Finally, experiments on real data sets show that the method improves the accuracy of truth discovery and effectively resolves multi-value conflicts in data fusion.
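The E-step/M-step loop described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's model, whose exact likelihood is not given here. It assumes a simplified setting: each source is described by two hypothetical parameters, a sensitivity (probability of asserting a fact that is true) and a false-positive rate (probability of asserting a fact that is false), sources are treated as conditionally independent, and every fact has the same prior. The function name `em_truth_discovery`, the initial values, and the add-one smoothing in the M-step are all assumptions made for illustration. The smoothing keeps the logarithms finite, which makes the update a smoothed estimate rather than a pure maximum likelihood estimate.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def em_truth_discovery(claims, prior=0.5, max_iter=100, tol=1e-6):
    """Simplified EM sketch (assumed model, not the paper's exact one).

    claims[fact][source] is True if the source asserts the fact and
    False if the source covers the entity but does not assert the fact.
    Returns (p_true, sensitivity, false_positive_rate).
    """
    sources = sorted({s for obs in claims.values() for s in obs})
    sens = {s: 0.7 for s in sources}  # assumed initial P(assert | true)
    fpr = {s: 0.3 for s in sources}   # assumed initial P(assert | false)
    p_true = {}
    for _ in range(max_iter):
        # E-step: posterior probability that each fact is true.
        for fact, obs in claims.items():
            log_odds = math.log(prior) - math.log(1.0 - prior)
            for s, asserted in obs.items():
                if asserted:
                    log_odds += math.log(sens[s]) - math.log(fpr[s])
                else:
                    log_odds += math.log(1.0 - sens[s]) - math.log(1.0 - fpr[s])
            p_true[fact] = _sigmoid(log_odds)
        # M-step: re-estimate source quality from expected counts.
        change = 0.0
        for s in sources:
            hit_t = tot_t = hit_f = tot_f = 0.0
            for fact, obs in claims.items():
                if s not in obs:
                    continue
                p = p_true[fact]
                tot_t += p
                tot_f += 1.0 - p
                if obs[s]:
                    hit_t += p
                    hit_f += 1.0 - p
            # Add-one smoothing (an assumption) keeps values inside (0, 1).
            new_sens = (hit_t + 1.0) / (tot_t + 2.0)
            new_fpr = (hit_f + 1.0) / (tot_f + 2.0)
            change = max(change, abs(new_sens - sens[s]), abs(new_fpr - fpr[s]))
            sens[s], fpr[s] = new_sens, new_fpr
        if change < tol:  # stop once the parameters have converged
            break
    return p_true, sens, fpr
```

Calling `em_truth_discovery` with a nested dictionary of claims returns the per-fact truth probabilities together with the two quality estimates for each source. Because several facts about the same entity can each receive a high probability, the sketch fits the multi-truth setting, where an attribute may have more than one correct value.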
Source
《渤海大学学报(自然科学版)》
CAS
2017, No. 3, pp. 268-274 (7 pages)
Journal of Bohai University:Natural Science Edition
Funding
National Natural Science Foundation of China (No. 41371425)
Liaoning Provincial Education Science Planning Project (No. JB17DB016)