Abstract
The Web has become a vast repository in which information is scattered across different data sources. Different sources often provide conflicting attribute values for the same entity. Finding the true values among these conflicting values is known as the truth finding problem. According to the number of values an attribute can take, object attributes fall into two categories: single-valued attributes and multi-valued attributes. Most existing truth finding algorithms are effective mainly for single-valued attributes. This paper proposes a multi-truth finding method called MTruths for multi-valued attributes. The method models multi-truth finding as an optimization problem whose objective is to maximize the weighted sum of similarities between each object's truths and the observations provided by the data sources. Two methods are proposed to find the optimal truth list for an object: an enumeration-based method and a greedy algorithm. Unlike existing methods, MTruths directly obtains multiple truths for an object. Experiments on two real data sets, books and movies, show that both implementations of MTruths are more accurate, and the greedy algorithm more efficient, than existing truth finding methods.
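The objective described in the abstract can be illustrated with a minimal sketch. The abstract does not name the similarity measure or say how source weights are obtained, so this example assumes Jaccard similarity between value sets and fixed, hand-picked weights; the function names and the toy data are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (an assumption; the abstract names no measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def objective(truth, observations, weights):
    """Weighted sum of similarities between a candidate truth set and each source's claim."""
    return sum(weights[s] * jaccard(truth, obs) for s, obs in observations.items())


def enumerate_truths(observations, weights):
    """Score every non-empty subset of the claimed values (exponential in their number)."""
    candidates = sorted(set().union(*observations.values()))
    best, best_score = frozenset(), float("-inf")
    for k in range(1, len(candidates) + 1):
        for combo in combinations(candidates, k):
            score = objective(frozenset(combo), observations, weights)
            if score > best_score:
                best, best_score = frozenset(combo), score
    return best, best_score


def greedy_truths(observations, weights):
    """Repeatedly add the value with the largest objective; stop when nothing improves."""
    remaining = set().union(*observations.values())
    truth = set()
    score = objective(truth, observations, weights)
    while remaining:
        # Sorted iteration keeps tie-breaking deterministic.
        gains = [(objective(truth | {v}, observations, weights), v) for v in sorted(remaining)]
        new_score, value = max(gains, key=lambda g: g[0])
        if new_score <= score:
            break
        truth.add(value)
        remaining.remove(value)
        score = new_score
    return truth


# Toy data: three hypothetical sources claim author sets for one book.
observations = {"s1": {"A", "B"}, "s2": {"A", "B"}, "s3": {"A", "C"}}
weights = {"s1": 0.9, "s2": 0.8, "s3": 0.3}  # assumed fixed source weights

print(enumerate_truths(observations, weights)[0])
print(greedy_truths(observations, weights))
```

On this toy input both searches select the set {A, B}. The enumeration is exhaustive over the claimed values, while the greedy variant trades that guarantee for far fewer objective evaluations, which matches the efficiency contrast the abstract draws between the two.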
Authors
Ma Ruxia (马如霞)
Meng Xiaofeng (孟小峰)
Wang Lu (王璐)
Shi Yingjie (史英杰)
Ma Ruxia; Wang Lu; Meng Xiaofeng; Shi Yingjie (School of Information, Renmin University of China, Beijing 100872; Department of Education Technology, Capital Normal University, Beijing 100048; School of Information Engineering, Beijing Institute of Fashion Technology, Beijing 100029)
Source
《计算机研究与发展》
EI
CSCD
Peking University Core Journals (北大核心)
2016, No. 12, pp. 2858-2866 (9 pages)
Journal of Computer Research and Development
Funding
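National Natural Science Foundation of China (61379050, 91224008, 61502279)
National High Technology Research and Development Program of China (863 Program) (2013AA013204)
Specialized Research Fund for the Doctoral Program of Higher Education (20130004130001)
Research Funds of Renmin University of China (11XNL010)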
Keywords
truth finding (真值发现)
data conflict (数据冲突)
single-valued attributes (单值属性)
multi-valued attributes (多值属性)
quality of data sources (数据源质量)