Funding: National Natural Science Foundation of China (No. 61402100); the Fundamental Research Funds for the Central Universities of China (No. 17D111205).
Abstract: The multi-signature method can improve the accuracy of entity resolution, but it introduces redundant computation in a parallel processing framework. This paper proposes multi-sig-er, a multi-signature-based parallel entity resolution method. The method is implemented on a MapReduce-based framework: it first tags each input object with multiple signatures and uses these signatures to generate key-value pairs, then shuffles the pairs to the reduce tasks responsible for similarity computation. Two strategies improve performance: one prunes the candidate pairs produced by the blocking technique, and the other eliminates redundancy by exploiting the transitive property. Both strategies reduce the number of similarity computations without affecting resolution accuracy. Experimental results on real-world datasets show that the method is better suited to large datasets than to small ones, and to complex similarity computation than to simple similarity matching.
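The pipeline described in the abstract above can be illustrated with a minimal single-process sketch. It is not the multi-sig-er implementation: the signature function (three-character token prefixes), the Jaccard similarity, the threshold, and the two redundancy rules (compare a pair only under its smallest shared signature, and skip pairs already linked by transitivity through a union-find structure) are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def signatures(record):
    # Hypothetical signature function: 3-character prefix of each token.
    return {tok[:3] for tok in record.lower().split()}


def jaccard(a, b):
    # Token-level Jaccard similarity, standing in for a costlier measure.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)


class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def resolve(records, threshold=0.5):
    # "Map" step: emit one (signature, record id) pair per signature.
    sigs = [signatures(r) for r in records]
    blocks = defaultdict(list)
    for rid, sig_set in enumerate(sigs):
        for sig in sig_set:
            blocks[sig].append(rid)

    uf = UnionFind(len(records))
    comparisons = 0
    # "Reduce" step: compare candidate pairs inside each signature block.
    for sig in sorted(blocks):
        for a, b in combinations(blocks[sig], 2):
            if sig != min(sigs[a] & sigs[b]):
                continue  # pair is handled under its smallest shared signature
            if uf.find(a) == uf.find(b):
                continue  # already linked through the transitive property
            comparisons += 1
            if jaccard(records[a], records[b]) >= threshold:
                uf.union(a, b)

    clusters = defaultdict(list)
    for rid in range(len(records)):
        clusters[uf.find(rid)].append(rid)
    return sorted(clusters.values()), comparisons


records = ["john smith", "john smith jr", "jane doe", "smith john"]
clusters, comparisons = resolve(records)
print(clusters, comparisons)
```

In this toy run the three "smith" records share two signatures, so a naive reducer would evaluate their three pairs twice; the two rules cut that to two similarity computations while producing the same clusters.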
Funding: the Special Research Fund for the China Postdoctoral Science Foundation (No. 2015M582832); the Major National Science and Technology Program (No. 2015ZX01040201); the National Natural Science Foundation of China (No. 61371196).
Abstract: Existing cross-modal entity resolution methods tend to ignore the high-level semantic correlations between cross-modal data. To address this, we propose a novel cross-modal entity resolution method for images and text that integrates global and fine-grained joint attention mechanisms. First, we map the cross-modal data to a common embedding space using a feature extraction network. Then, we integrate a global joint attention mechanism with a fine-grained joint attention mechanism, so that the model learns both the global semantic characteristics and the local fine-grained semantic characteristics of the cross-modal data; this fully exploits the cross-modal semantic correlation and boosts the performance of cross-modal entity resolution. Experiments on the Flickr-30K and MS-COCO datasets show that the overall R@sum performance improves by 4.30% and 4.54%, respectively, compared with five state-of-the-art methods, demonstrating the superiority of the proposed method.
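The abstract above reports results as R@sum without defining it. In image-text retrieval work this usually means the sum of Recall@K over both retrieval directions, typically for K = 1, 5, 10; the sketch below assumes that reading. It also assumes a square similarity matrix in which the correct match for query `i` is candidate `i`, which simplifies the real datasets (where one image has several captions). The function names are illustrative.

```python
def recall_at_k(sim, k):
    # sim[i][j] is the similarity of query i to candidate j.
    # Assumption: the ground-truth candidate for query i has index i.
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(sim)


def r_sum(sim, ks=(1, 5, 10)):
    # Sum Recall@K over both directions (rows as queries, then columns).
    transposed = [list(col) for col in zip(*sim)]
    return sum(recall_at_k(sim, k) + recall_at_k(transposed, k) for k in ks)


# Toy 3x3 similarity matrix (made-up values, not from the paper).
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
print(recall_at_k(sim, 1), r_sum(sim, ks=(1, 2)))
```

Because R@sum adds up several percentages, its maximum is 600 when six Recall@K values are summed, so a gain of a few points is spread across all six.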
Funding: This work was partially supported by the National Key Technology Research and Development Program of the Ministry of Science and Technology of China under Grant No. 2015BAH10F01, the National Natural Science Foundation of China under Grant Nos. U1509216, 61472099, and 61133002, the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province of China under Grant No. LC2016026, and the Ministry of Education (MOE)-Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology.
Abstract: Data quality is important in many data-driven applications, such as decision making, data analysis, and data mining. Recent studies focus on data cleaning techniques that delete or repair dirty data, which may cause information loss and introduce new inconsistencies. To avoid these problems, we propose EntityManager, a general system for managing dirty data without data cleaning. The system takes the real-world entity as its basic storage unit and retrieves query results according to the quality requirements of users. It is able to handle all kinds of inconsistencies recognized by entity resolution. We describe the EntityManager system in detail, covering its architecture, data model, and query processing techniques. To process queries efficiently, the system adopts novel indices, a similarity operator, and query optimization techniques. Finally, we verify the efficiency and effectiveness of the system and present future research challenges.
Funding: This work was supported by the National Natural Science Foundation of China under Grant Nos. 61872172 and 61772264.
Abstract: Entity resolution (ER) aims to identify whether two entities in an ER task refer to the same real-world thing. Crowd ER uses humans, in addition to machine algorithms, to obtain the truths of ER tasks. However, inaccurate or erroneous results are likely to be generated when humans give unreliable judgments. Previous studies have found that correctly estimating human accuracy or expertise in crowd ER is crucial to truth inference. However, many of them assume that humans have consistent expertise over all tasks, and ignore the fact that humans may have varied expertise on different topics (e.g., music versus sport). In this paper, we deal with crowd ER in the Semantic Web area. We identify multiple topics of ER tasks and model human expertise on different topics. Furthermore, we leverage similar task clustering to enhance the topic modeling and expertise estimation. We propose a probabilistic graphical model that computes ER task similarity, estimates human expertise, and infers the task truths in a unified framework. Our evaluation results on real-world and synthetic datasets show that, compared with several state-of-the-art approaches, our proposed model achieves higher accuracy on task truth inference and is more consistent with the real expertise of humans.
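To make the point about topic-dependent expertise concrete, here is a minimal sketch that swaps the paper's probabilistic graphical model for a much simpler technique: a log-odds weighted vote, which is the naive Bayes decision under the assumptions of independent workers and a uniform prior. The worker names, topics, and accuracy values are hypothetical, and the accuracies are taken as given rather than estimated.

```python
import math


def infer_truth(votes, accuracy, topic):
    """Return True if the weighted vote says the two entities match.

    votes: worker -> bool (True means "same entity").
    accuracy: worker -> {topic: probability of a correct answer},
              each strictly between 0 and 1 (assumed known here).
    """
    score = 0.0
    for worker, says_match in votes.items():
        p = accuracy[worker][topic]
        weight = math.log(p / (1.0 - p))  # log-odds of being correct
        score += weight if says_match else -weight
    return score > 0


# Hypothetical per-topic accuracies for three workers.
accuracy = {
    "w1": {"music": 0.90, "sport": 0.55},
    "w2": {"music": 0.60, "sport": 0.60},
    "w3": {"music": 0.60, "sport": 0.90},
}
votes = {"w1": True, "w2": False, "w3": False}

print(infer_truth(votes, accuracy, "music"))
print(infer_truth(votes, accuracy, "sport"))
```

The same three votes lead to opposite conclusions depending on the task topic: the music expert's single vote outweighs two weak dissenters on a music task, whereas on a sport task the dissenting sport expert dominates. A single per-worker accuracy could not express this.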
Funding: The authors thank the anonymous reviewers for their inspiring doubts and helpful suggestions during the reviewing process. This work was supported by the National Basic Research Program of China (973 Program) (2012CB316201), the Fundamental Research Funds for the Central Universities (N 120816001), and the National Natural Science Foundation of China (Grant Nos. 61472070, 61402213).
Abstract: Entity resolution is a key aspect of data quality and data integration: it identifies which records in the data sources correspond to the same real-world entity. Many existing approaches require manually designed match rules, which demand domain knowledge and are time-consuming to write. We propose a novel genetic-algorithm-based entity resolution approach via active learning. It learns effective match rules by logically combining comparisons of several different attributes with proper thresholds. We use active learning to reduce the amount of manually labeled data and speed up the learning process. An extensive evaluation shows that the proposed approach outperforms state-of-the-art entity resolution approaches in accuracy.
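A match rule of the kind described above, a logical combination of attribute comparisons with thresholds, might look like the following sketch. The attributes, the thresholds, the rule structure, and the use of `difflib.SequenceMatcher` as the similarity function are all assumptions for illustration; in the paper such rules are learned by the genetic algorithm, not written by hand, and the learning step is not shown here.

```python
from difflib import SequenceMatcher


def sim(a, b):
    # Character-level similarity in [0, 1]; a stand-in for any comparator.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_rule(r1, r2):
    # Hypothetical rule: (name >= 0.8 AND city >= 0.9) OR identical email.
    name_ok = sim(r1["name"], r2["name"]) >= 0.8
    city_ok = sim(r1["city"], r2["city"]) >= 0.9
    email_ok = bool(r1["email"]) and r1["email"] == r2["email"]
    return (name_ok and city_ok) or email_ok


a = {"name": "Jon Smith", "city": "Boston", "email": "a@x.org"}
b = {"name": "John Smith", "city": "Boston", "email": "b@x.org"}
c = {"name": "Mary Jones", "city": "Denver", "email": "c@x.org"}
d = {"name": "J. S.", "city": "NYC", "email": "a@x.org"}

print(match_rule(a, b), match_rule(a, c), match_rule(a, d))
```

Representing a rule as a small boolean expression tree over (attribute, comparator, threshold) leaves is what makes it amenable to genetic operators: crossover can swap subtrees and mutation can perturb a threshold.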
Funding: Supported by the National Natural Science Foundation of China under Grant No. 61772268 and the Fundamental Research Funds for the Central Universities of China under Grant Nos. NS2018057 and NJ2018014.
Abstract: Entity resolution (ER) is a significant task in data integration that aims to detect all entity profiles corresponding to the same real-world entity. Because ER has inherently quadratic complexity, blocking was proposed to make it tractable: it offers an approximate solution that clusters similar entity profiles into blocks, so that pairwise comparisons need only be performed inside each block, reducing the computational cost of ER. This paper presents a comprehensive survey of existing blocking technologies. We summarize and analyze the classic blocking methods, with emphasis on different block construction and optimization techniques. We find that traditional blocking ER methods that depend on a fixed schema may not work in highly heterogeneous information spaces. Using schema information flexibly is therefore of great significance for efficiently processing data with the new features of this era. Machine learning is an important tool for ER, but efficient end-to-end machine learning methods still need to be explored. We also summarize the most promising trends for future work, including real-time blocking ER, incremental blocking ER, and deep learning for ER.
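The cost reduction that blocking provides can be made concrete with a short worked count. The numbers below are hypothetical and assume the simplest case of disjoint, equal-sized blocks; overlapping blocks, as produced by many real blocking schemes, add redundant comparisons on top of this.

```latex
% Pairwise comparisons without blocking (n profiles) and with a block collection B
\[
C_{\text{all}} = \binom{n}{2} = \frac{n(n-1)}{2},
\qquad
C_{\text{block}} = \sum_{b \in B} \frac{|b|\,(|b|-1)}{2}
\]
% Example: n = 10000 profiles split into 100 disjoint blocks of 100 profiles each
\[
C_{\text{all}} = \frac{10000 \cdot 9999}{2} = 49\,995\,000,
\qquad
C_{\text{block}} = 100 \cdot \frac{100 \cdot 99}{2} = 495\,000
\]
```

In this example blocking cuts the work by a factor of 101, at the price of missing any true match whose two profiles land in different blocks, which is why the solution is approximate.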