Funding: This work is funded by the Guangdong Basic and Applied Basic Research Foundation (No. 2021A1515012307, 2020A1515010450), the Guangzhou Basic and Applied Basic Research Foundation (No. 202102021207, 202102020867), the National Natural Science Foundation of China (No. 62072130, 61702220, 61702223), the Guangdong Province Key Area R&D Program of China (No. 2019B010136003, 2019B010137004), the Guangdong Province Universities and Colleges Pearl River Scholar Funded Scheme (2019), the Guangdong Higher Education Innovation Group (No. 2020KCXTD007), and the Guangzhou Higher Education Innovation Group (No. 202032854).
Abstract: Entity matching is a key aspect of knowledge fusion. This study investigates how to identify heterogeneous expressions of the same real-world entity. In recent years, some representative works have applied deep learning to entity matching and achieved good results. However, these methods share a common limitation: they assume that different attribute columns of the same entity are independent, and feeding the model paired entity records causes repeated calculations. In fact, there are often latent relations between different attribute columns of different entities. These relations can improve entity matching, and extracting features from a single entity record avoids the repeated calculations. To use attribute relations to assist entity matching, this paper proposes the Relation-aware Entity Matching method, which embeds attribute relations into the original entity description to form sentences, so that entity matching becomes a sentence-level similarity task in which Sentence-BERT computes the sentence similarity. We conducted experiments on structured, dirty, and textual data and compared the method with recent baselines. The experimental results show that relational embedding helps entity matching on structured and dirty data. Our method achieves good results on most data sets and reduces repeated calculations.
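The idea of turning each record into a relation-bearing sentence and encoding it once can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `RELATIONS` phrases, the sample records, and the helper names are hypothetical, and a bag-of-words count vector stands in for a Sentence-BERT embedding so that the sketch needs only the standard library.

```python
import math
from collections import Counter

# Hypothetical relation phrases that link an attribute column to its value.
RELATIONS = {"title": "is titled", "brand": "is made by", "price": "costs"}

def to_sentence(record):
    """Serialize one record into a sentence that embeds attribute relations."""
    parts = [f"{RELATIONS.get(attr, 'has ' + attr)} {value}"
             for attr, value in record.items() if value]
    return "the entity " + " and ".join(parts)

def embed(sentence):
    """Toy stand-in for a sentence encoder: a bag-of-words count vector."""
    return Counter(sentence.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

records = {
    "a1": {"title": "iPhone 12 64GB", "brand": "Apple", "price": "699"},
    "b1": {"title": "Apple iPhone 12 (64GB)", "brand": "Apple", "price": "699"},
    "b2": {"title": "Galaxy S21", "brand": "Samsung", "price": "799"},
}

# Each record is encoded exactly once; pairs only compare cached vectors.
vectors = {rid: embed(to_sentence(rec)) for rid, rec in records.items()}
sim_match = cosine(vectors["a1"], vectors["b1"])
sim_other = cosine(vectors["a1"], vectors["b2"])
print(sim_match > sim_other)
```

Because every record is embedded independently, comparing one record against many candidates reuses its vector instead of re-running the encoder per pair, which is the saving the abstract refers to. With a real sentence encoder in place of `embed`, the shared relation words would weigh less on the score than they do in this word-count stand-in.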
Funding: This work was supported by the National Natural Science Foundation of China (61871062, 61771082, 61901071), the Science and Technology Research Program of Chongqing Municipal Education Commission (KJQN201800615), and the General Project of Natural Science Foundation of Chongqing (cstc2019jcyj-msxmX0303).
Abstract: In recent years, a large number of intelligent sensing devices have been deployed in the physical world, which makes entity search considerably harder. As the number of intelligent sensing devices grows, the search system becomes less accurate at finding the entities that match a user's request, and the search delay increases. We use mobile edge technology to alleviate this problem by processing user requests on the edge side, and we propose a similar physical entity matching strategy for mobile edge search. First, the raw data collected by each sensor is given a lightweight representation to reduce the storage overhead of the observed data. Second, a method for estimating the matching degree of physical entities is proposed: the similarity between each sensor in the network and a given sensor is estimated, and the user request is matched according to that similarity. Simulation results show that the proposed method effectively reduces the data storage overhead and improves the precision of the sensor search system.
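The two steps described above (a lightweight representation, then similarity-based matching) can be illustrated with a small sketch. The abstract does not specify either technique, so the choices here are assumptions: piecewise mean aggregation is used as the lightweight representation, and an inverse mean absolute difference as the similarity estimate. The sensor names and readings are made up.

```python
def compress(readings, segments):
    """Summarize a series by the mean of each of `segments` equal slices.

    Assumed stand-in for the lightweight representation; it stores
    `segments` numbers instead of len(readings).
    """
    n = len(readings)
    summary = []
    for i in range(segments):
        # Integer arithmetic keeps slice boundaries exact.
        chunk = readings[i * n // segments:(i + 1) * n // segments]
        summary.append(sum(chunk) / len(chunk))
    return summary

def similarity(a, b):
    """Map the mean absolute difference of two summaries into (0, 1]."""
    dist = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return 1.0 / (1.0 + dist)

# Hypothetical temperature readings from three sensors.
sensors = {
    "s1": [20.0, 20.5, 21.0, 21.5, 22.0, 22.5, 23.0, 23.5],
    "s2": [20.1, 20.4, 21.2, 21.4, 22.1, 22.4, 23.1, 23.6],
    "s3": [30.0, 29.0, 28.0, 27.0, 26.0, 25.0, 24.0, 23.0],
}

summaries = {sid: compress(vals, 4) for sid, vals in sensors.items()}
query = summaries["s1"]

# Rank the other sensors by similarity to the given sensor.
ranked = sorted((sid for sid in summaries if sid != "s1"),
                key=lambda sid: similarity(query, summaries[sid]),
                reverse=True)
print(ranked)
```

Only the summaries need to be kept at the edge node, which is where the storage saving would come from under these assumptions.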
Funding: Our research is supported by the National Basic Research Program of China (2012CB316203), the National Natural Science Foundation of China (Grant Nos. 61370101 and U1501252), the Shanghai Knowledge Service Platform Project (ZF1213), and the Innovation Program of Shanghai Municipal Education Commission (14ZZ045).
Abstract: Entity matching, which aims to find records that belong to the same real-world object, has been studied for decades. To avoid verifying every pair of records in a massive data set, a common approach, known as the blocking-based method, selects a small proportion of record pairs for verification at a cost far lower than O(n^2), where n is the size of the data set. Furthermore, executing multiple blocking functions independently is critical, since many more matching records can be found this way and the quality of the query result improves significantly. The MapReduce (MR) framework is widely used to improve the performance and scalability of complicated queries by running many map and reduce tasks in parallel. However, entity matching on the MapReduce framework is non-trivial because of two unavoidable challenges: load balancing and pair deduplication. In this paper, we propose a novel solution, called MrEin, to handle these challenges with support for multiple blocking functions. Existing work can deal with load balancing and pair deduplication separately, but it cannot deal with both challenges at the same time. Theoretical analysis and experimental results on real and synthetic data sets demonstrate the effectiveness and efficiency of the proposed solution.
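Blocking with several blocking functions, and the pair deduplication it makes necessary, can be sketched on a single machine as follows. This is not MrEin and involves no MapReduce or load balancing. The records and the two blocking functions are hypothetical, chosen so that one pair falls into blocks of both functions.

```python
from collections import defaultdict
from itertools import combinations

records = {
    1: {"name": "John Smith", "zip": "10001"},
    2: {"name": "Jon Smith", "zip": "10001"},
    3: {"name": "John Smyth", "zip": "94105"},
    4: {"name": "Mary Jones", "zip": "94105"},
}

# Two hypothetical blocking functions: same zip code, same 2-letter name prefix.
blocking_functions = [
    lambda rec: rec["zip"],
    lambda rec: rec["name"][:2].lower(),
]

def candidate_pairs(records, blocking_functions):
    """Collect record pairs that share a block under any blocking function."""
    pairs = set()  # a set deduplicates pairs produced by more than one function
    for key_of in blocking_functions:
        blocks = defaultdict(list)
        for rid, rec in records.items():
            blocks[key_of(rec)].append(rid)
        for ids in blocks.values():
            # Sorting gives every pair one canonical (small, large) form.
            pairs.update(combinations(sorted(ids), 2))
    return pairs

pairs = candidate_pairs(records, blocking_functions)
all_pairs = len(records) * (len(records) - 1) // 2
print(len(pairs), "candidate pairs instead of", all_pairs)
```

Records 1 and 2 share both a zip code and a name prefix, so both blocking functions emit the pair (1, 2); the set keeps it once. In a distributed setting that shared set does not exist, which is why pair deduplication becomes a challenge of its own.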
Funding: The National Natural Science Foundation of China under Grant Nos. 62002262, 61672142, 61602103, 62072086, and 62072084, and the National Key Research and Development Project of China under Grant No. 2018YFB1003404.
Abstract: Entity matching is a fundamental problem of data integration. It groups records according to the underlying real-world entities. There is a growing trend toward entity matching via deep learning techniques. We design mixed hierarchical deep neural networks (MHN) for entity matching, exploiting semantics from different levels of abstraction in the internal hierarchy of a record. A family of attention mechanisms is used at different stages of entity matching: self-attention focuses on internal dependency, inter-attention targets alignment, and multi-perspective weight attention discriminates importance. In particular, hybrid soft token alignment is proposed to address corrupted data. Attribute order is considered for the first time in deep entity matching. Then, to reduce the need for labeled training data, we propose an adversarial domain adaptation approach (DA-MHN) that transfers matching knowledge between different entity matching tasks by maximizing classifier discrepancy. Finally, we conduct comprehensive experimental evaluations on 10 datasets (seven for MHN and three for DA-MHN), which demonstrate the superiority of the two proposed approaches. MHN clearly outperforms previous studies in accuracy, and each component of MHN is also tested. DA-MHN greatly surpasses existing studies in transferability.
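The notion of inter-attention as soft alignment between the tokens of two records can be illustrated with plain dot-product attention. This is a generic sketch, not the MHN architecture or its hybrid soft token alignment. The two-dimensional token vectors are made up, and `soft_align` is a hypothetical helper name.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_align(tokens_a, tokens_b):
    """For each token vector of A, attend over all token vectors of B.

    Returns one (weights, context) pair per token of A, where context is
    the attention-weighted average of B's token vectors.
    """
    aligned = []
    for query in tokens_a:
        weights = softmax([dot(query, key) for key in tokens_b])
        context = [sum(w * key[i] for w, key in zip(weights, tokens_b))
                   for i in range(len(query))]
        aligned.append((weights, context))
    return aligned

# Made-up 2-d token vectors for two records.
record_a = [[1.0, 0.0], [0.0, 1.0]]
record_b = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

alignment = soft_align(record_a, record_b)
for weights, context in alignment:
    print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

Every token of B receives some weight, so a misspelled or misplaced token can still contribute to the alignment instead of being dropped by an exact match, which is the general motivation for soft alignment on corrupted data.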
Abstract: Entity matching (EM) identifies records that refer to the same entity within or across databases. Existing methods that use structured attribute values (such as numeric, date, or short string values) may fail when the structured information is insufficient to reflect the matching relationships between records. More and more databases now include an unstructured textual attribute containing extra consolidated textual information (CText) about the record, but little work has been done on using CText for EM. Conventional string similarity metrics such as edit distance or bag-of-words are unsuitable for measuring the similarity between CText, since each piece of CText contains hundreds or thousands of words, and existing topic models do not work well either, since there are no obvious gaps between topics in CText. In this paper, we propose a novel cooccurrence-based topic model to identify various sub-topics in each piece of CText, and then measure the similarity between CText along the multiple sub-topic dimensions. To avoid ignoring hidden but important sub-topics, we let the crowd help decide the weights of the different sub-topics for EM. Our empirical study on two real-world datasets, using the Amazon Mechanical Turk crowdsourcing platform, shows that our method outperforms state-of-the-art EM methods and text understanding models.
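Measuring similarity along several sub-topic dimensions and combining the results with weights can be sketched as below. The sketch assumes the sub-topics have already been extracted, so the cooccurrence-based topic model itself is not shown. Jaccard overlap is an assumed per-sub-topic measure, and the weights, sub-topic names, and word sets are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of two word sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def ctext_similarity(topics_a, topics_b, weights):
    """Weighted average of per-sub-topic similarities.

    topics_a / topics_b map a sub-topic name to the set of words that a
    piece of CText contains for it; weights would come from the crowd.
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    score = sum(w * jaccard(topics_a.get(topic, set()), topics_b.get(topic, set()))
                for topic, w in weights.items())
    return score / total

# Hypothetical crowd-derived weights: "specs" matters more than "shipping".
weights = {"specs": 3.0, "shipping": 1.0}

a = {"specs": {"64gb", "oled", "5g"}, "shipping": {"free", "two-day"}}
b = {"specs": {"64gb", "oled", "5g"}, "shipping": {"standard"}}
c = {"specs": {"leather", "strap"}, "shipping": {"free", "two-day"}}

sim_ab = ctext_similarity(a, b, weights)
sim_ac = ctext_similarity(a, c, weights)
print(sim_ab, sim_ac)
```

Records a and b agree on the heavily weighted sub-topic and differ on the light one, while a and c do the opposite, so the weights decide which pair scores higher. With equal weights the two pairs would tie.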