With the rocketing progress of the Internet, it is easier for people to get information about the objects that they are interested in. However, this information usually has conflicts. In order to resolve conflicts and...With the rocketing progress of the Internet, it is easier for people to get information about the objects that they are interested in. However, this information usually has conflicts. In order to resolve conflicts and get the true information, truth discovery has been proposed and received widespread attention. Many algorithms have been proposed to adapt to different scenarios. This paper aims to investigate these algorithms and summarize them from the perspective of algorithm models and specific concepts. Some classic datasets and evaluation metrics are given in this paper. Some future directions for readers are also provided to better understand the field of truth discovery.展开更多
There are errors in multi-source uncertain time series data.Truth discovery methods for time series data are effective in finding more accurate values,but some have limitations in their usability.To tackle this challe...There are errors in multi-source uncertain time series data.Truth discovery methods for time series data are effective in finding more accurate values,but some have limitations in their usability.To tackle this challenge,we propose a new and convenient truth discovery method to handle time series data.A more accurate sample is closer to the truth and,consequently,to other accurate samples.Because the mutual-confirm relationship between sensors is very similar to the mutual-quote relationship between web pages,we evaluate sensor reliability based on PageRank and then estimate the truth by sensor reliability.Therefore,this method does not rely on smoothness assumptions or prior knowledge of the data.Finally,we validate the effectiveness and efficiency of the proposed method on real-world and synthetic data sets,respectively.展开更多
In this era of big data, data are often collected from multiple sources that have different reliabilities, and there is inevitable conflict with respect to the various information obtained when it relates to the the s...In this era of big data, data are often collected from multiple sources that have different reliabilities, and there is inevitable conflict with respect to the various information obtained when it relates to the the same object.One important task is to identify the most trustworthy value out of all the conflicting claims, and this is known as truth discovery. Existing truth discovery methods simultaneously identify the most trustworthy information and source reliability degrees and are based on the idea that more reliable sources often provide more trustworthy information,and vice versa. However, there are often semantic constrains defined upon relational database, which can be violated by a single data source. To remove violations, an important task is to repair data to satisfy the constrains,and this is known as data cleaning. The two problems above may coexist, but considering them together can provide some benefits, and to the authors knowledge, this has not yet been the focus of any research. In this paper, therefore, a schema-decomposing based method is proposed to simultaneously discover the truth and to clean the data, with the aim of improving accuracy. Experimental results using real world data sets of notebooks and mobile phones, as well as simulated data sets, demonstrate the effectiveness and efficiency of our proposed method.展开更多
真值发现是数据集成领域具有挑战性的研究热点之一。传统的方法利用数据源与观测值之间的交互关系推断真值,缺乏足够的特征信息;基于深度学习的方法可以有效地进行特征抽取,但其性能依赖于大量手工标注,而在实际应用中很难获取到大量高...真值发现是数据集成领域具有挑战性的研究热点之一。传统的方法利用数据源与观测值之间的交互关系推断真值,缺乏足够的特征信息;基于深度学习的方法可以有效地进行特征抽取,但其性能依赖于大量手工标注,而在实际应用中很难获取到大量高质量的真值标签。为克服以上问题,本文提出一种基于多特征融合的无监督真值发现方法(Unsupervised truth discovery method based on multi-feature fusion,MFOTD)。首先,利用集成学习无监督标注“真值”标签;然后,分别使用预训练模型Bert和独热编码获取观测值的语义特征和交互特征;最后,融合观测值多种特征并使用其“真值”标签构建初始训练集,通过自训练方式训练真值预测模型。在两个真实数据集上的实验结果表明,与已有方法相比,本文所提出的方法具有更高的真值发现准确性。展开更多
为解决传统真值发现算法无法提取文本数据关键语义信息的问题,提出一种基于胶囊网络的文本数据真值发现算法(Truth Discovery of Text Data Based on Capsule Network,Caps-Truth),对传统卷积神经网络(Convolutional Neural Network,CNN...为解决传统真值发现算法无法提取文本数据关键语义信息的问题,提出一种基于胶囊网络的文本数据真值发现算法(Truth Discovery of Text Data Based on Capsule Network,Caps-Truth),对传统卷积神经网络(Convolutional Neural Network,CNN)进行改进,在神经网络模型中构造语义胶囊层替代CNN池化层表征文本语义信息。首先通过CNN卷积层获取文本数据全局特征,利用初级胶囊层将特征信息向量化,再通过语义胶囊层表征文本数据细粒度语义信息,将特征向量输入全连接神经网络挖掘文本数据可信度并获得可靠答案。上述算法在真值发现中引入胶囊网络,利用动态路由算法整合零散语义,有效提高了文本数据真值发现的效果。实验结果表明,Caps-Truth算法优于对比算法。展开更多
真值发现作为整合由不同数据源提供的冲突信息的一种手段,在传统数据库领域已经得到了广泛的研究.然而现有的很多真值发现方法不适用于数据流应用,主要原因是它们都包含迭代的过程.针对一种特殊的数据流——感知数据流上的连续真值发现...真值发现作为整合由不同数据源提供的冲突信息的一种手段,在传统数据库领域已经得到了广泛的研究.然而现有的很多真值发现方法不适用于数据流应用,主要原因是它们都包含迭代的过程.针对一种特殊的数据流——感知数据流上的连续真值发现问题进行了研究.结合感知数据本身及其应用特点,提出一种变频评估数据源可信度的策略,减少了迭代过程的执行,提高了每一时刻多源感知数据流真值发现的效率.首先定义并研究了当感知数据流真值发现的相对误差和累积误差较小时,相邻时刻数据源的可信度变化需要满足的条件,进而给出了一种概率模型,以预测数据源的可信度满足该条件的概率.之后,通过整合上述结论,实现在预测的累积误差以一定概率不超过给定阈值的前提下,最大化数据源可信度的评估周期以提高效率,并将该问题转化为一个最优化问题.在此基础上,提出了一种变频评估数据源可信度的算法——CTF-Stream(continuous truth finding over sensor data streams),CTF-Stream结合历史数据动态地确定数据源可信度的评估时刻,在保证真值发现结果达到用户给定精度的同时提高了效率.最后,通过在真实的感知数据集合上进行实验,进一步验证了算法在处理感知数据流的真值发现问题时的效率和准确率.展开更多
基金Fundamental Research Funds for the Central Universities,China (No. 22D111207)。
文摘With the rocketing progress of the Internet, it is easier for people to get information about the objects that they are interested in. However, this information usually has conflicts. In order to resolve conflicts and get the true information, truth discovery has been proposed and received widespread attention. Many algorithms have been proposed to adapt to different scenarios. This paper aims to investigate these algorithms and summarize them from the perspective of algorithm models and specific concepts. Some classic datasets and evaluation metrics are given in this paper. Some future directions for readers are also provided to better understand the field of truth discovery.
基金National Natural Science Foundation of China(No.62002131)Shuangchuang Ph.D Award(from World Prestigious Universities)of Jiangsu Province,China(No.JSSCBS20211179)。
文摘There are errors in multi-source uncertain time series data.Truth discovery methods for time series data are effective in finding more accurate values,but some have limitations in their usability.To tackle this challenge,we propose a new and convenient truth discovery method to handle time series data.A more accurate sample is closer to the truth and,consequently,to other accurate samples.Because the mutual-confirm relationship between sensors is very similar to the mutual-quote relationship between web pages,we evaluate sensor reliability based on PageRank and then estimate the truth by sensor reliability.Therefore,this method does not rely on smoothness assumptions or prior knowledge of the data.Finally,we validate the effectiveness and efficiency of the proposed method on real-world and synthetic data sets,respectively.
基金partially supported by the Key Research and Development Plan of National Ministry of Science and Technology (No. 2016YFB1000703)the Key Program of the National Natural Science Foundation of China (Nos. 61190115, 61472099, 61632010, and U1509216)+2 种基金National Sci-Tech Support Plan (No. 2015BAH10F01)the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province (No. LC2016026)MOE-Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology
文摘In this era of big data, data are often collected from multiple sources that have different reliabilities, and there is inevitable conflict with respect to the various information obtained when it relates to the the same object.One important task is to identify the most trustworthy value out of all the conflicting claims, and this is known as truth discovery. Existing truth discovery methods simultaneously identify the most trustworthy information and source reliability degrees and are based on the idea that more reliable sources often provide more trustworthy information,and vice versa. However, there are often semantic constrains defined upon relational database, which can be violated by a single data source. To remove violations, an important task is to repair data to satisfy the constrains,and this is known as data cleaning. The two problems above may coexist, but considering them together can provide some benefits, and to the authors knowledge, this has not yet been the focus of any research. In this paper, therefore, a schema-decomposing based method is proposed to simultaneously discover the truth and to clean the data, with the aim of improving accuracy. Experimental results using real world data sets of notebooks and mobile phones, as well as simulated data sets, demonstrate the effectiveness and efficiency of our proposed method.
文摘真值发现是数据集成领域具有挑战性的研究热点之一。传统的方法利用数据源与观测值之间的交互关系推断真值,缺乏足够的特征信息;基于深度学习的方法可以有效地进行特征抽取,但其性能依赖于大量手工标注,而在实际应用中很难获取到大量高质量的真值标签。为克服以上问题,本文提出一种基于多特征融合的无监督真值发现方法(Unsupervised truth discovery method based on multi-feature fusion,MFOTD)。首先,利用集成学习无监督标注“真值”标签;然后,分别使用预训练模型Bert和独热编码获取观测值的语义特征和交互特征;最后,融合观测值多种特征并使用其“真值”标签构建初始训练集,通过自训练方式训练真值预测模型。在两个真实数据集上的实验结果表明,与已有方法相比,本文所提出的方法具有更高的真值发现准确性。
文摘为解决传统真值发现算法无法提取文本数据关键语义信息的问题,提出一种基于胶囊网络的文本数据真值发现算法(Truth Discovery of Text Data Based on Capsule Network,Caps-Truth),对传统卷积神经网络(Convolutional Neural Network,CNN)进行改进,在神经网络模型中构造语义胶囊层替代CNN池化层表征文本语义信息。首先通过CNN卷积层获取文本数据全局特征,利用初级胶囊层将特征信息向量化,再通过语义胶囊层表征文本数据细粒度语义信息,将特征向量输入全连接神经网络挖掘文本数据可信度并获得可靠答案。上述算法在真值发现中引入胶囊网络,利用动态路由算法整合零散语义,有效提高了文本数据真值发现的效果。实验结果表明,Caps-Truth算法优于对比算法。
文摘真值发现作为整合由不同数据源提供的冲突信息的一种手段,在传统数据库领域已经得到了广泛的研究.然而现有的很多真值发现方法不适用于数据流应用,主要原因是它们都包含迭代的过程.针对一种特殊的数据流——感知数据流上的连续真值发现问题进行了研究.结合感知数据本身及其应用特点,提出一种变频评估数据源可信度的策略,减少了迭代过程的执行,提高了每一时刻多源感知数据流真值发现的效率.首先定义并研究了当感知数据流真值发现的相对误差和累积误差较小时,相邻时刻数据源的可信度变化需要满足的条件,进而给出了一种概率模型,以预测数据源的可信度满足该条件的概率.之后,通过整合上述结论,实现在预测的累积误差以一定概率不超过给定阈值的前提下,最大化数据源可信度的评估周期以提高效率,并将该问题转化为一个最优化问题.在此基础上,提出了一种变频评估数据源可信度的算法——CTF-Stream(continuous truth finding over sensor data streams),CTF-Stream结合历史数据动态地确定数据源可信度的评估时刻,在保证真值发现结果达到用户给定精度的同时提高了效率.最后,通过在真实的感知数据集合上进行实验,进一步验证了算法在处理感知数据流的真值发现问题时的效率和准确率.