6 articles found
Redundancy Elimination in Multi-signature Based Parallel Entity Resolution
1
Authors: 燕彩蓉, 阮文洁, 徐淑华, 黄永锋. Journal of Donghua University (English Edition) (EI, CAS), 2017, No. 4, pp. 556-562 (7 pages)
The multi-signature method can improve the accuracy of entity resolution. However, it introduces redundant computation in a parallel processing framework. In this paper, a multi-signature based parallel entity resolution method called multi-sig-er is proposed. The method was implemented in a MapReduce-based framework, which first tagged multiple signatures for each input object and used these signatures to generate key-value pairs, then shuffled the pairs to the reduce tasks responsible for similarity computation. To improve performance, two strategies were adopted: one prunes the candidate pairs brought by the blocking technique, and the other eliminates redundancy according to the transitive property. Both strategies reduce the number of similarity computations without affecting the resolution accuracy. Experimental results on real-world datasets show that the method is better suited to large datasets than to small ones, and to complex similarity computation than to simple similarity matching.
Keywords: entity resolution; MapReduce; blocking technique; redundancy elimination
Cross-Modal Entity Resolution for Image and Text Integrating Global and Fine-Grained Joint Attention Mechanism
2
Authors: 曾志贤, 曹建军, 翁年凤, 袁震, 余旭. Journal of Shanghai Jiaotong University (Science) (EI), 2023, No. 6, pp. 728-737 (10 pages)
Existing cross-modal entity resolution methods easily ignore the high-level semantic correlations between cross-modal data. To solve this problem, we propose a novel cross-modal entity resolution method for image and text that integrates global and fine-grained joint attention mechanisms. First, we map the cross-modal data to a common embedding space using a feature extraction network. Then, we integrate the global joint attention mechanism and the fine-grained joint attention mechanism, enabling the model to learn both the global semantic characteristics and the local fine-grained semantic characteristics of the cross-modal data. This fully exploits the cross-modal semantic correlation and boosts the performance of cross-modal entity resolution. Experiments on the Flickr-30K and MS-COCO datasets show that the overall R@sum performance improves by 4.30% and 4.54%, respectively, compared with 5 state-of-the-art methods, which demonstrates the superiority of the proposed method.
Keywords: cross-modal entity resolution; joint attention mechanism; deep learning; feature extraction; semantic correlation
EntityManager: Managing Dirty Data Based on Entity Resolution (Cited by: 2)
3
Authors: Xue-Li Liu, Hong-Zhi Wang, Jian-Zhong Li, Hong Gao. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2017, No. 3, pp. 644-662 (19 pages)
Data quality is important in many data-driven applications, such as decision making, data analysis, and data mining. Recent studies focus on data cleaning techniques that delete or repair the dirty data, which may cause information loss and introduce new inconsistencies. To avoid these problems, we propose EntityManager, a general system to manage dirty data without data cleaning. This system takes the real-world entity as the basic storage unit and retrieves query results according to the quality requirements of users. The system is able to handle all kinds of inconsistencies recognized by entity resolution. We elaborate on the EntityManager system, covering its architecture, data model, and query processing techniques. To process queries efficiently, our system adopts novel indices, a similarity operator, and query optimization techniques. Finally, we verify the efficiency and effectiveness of this system and present future research challenges.
Keywords: dirty data; entity resolution; uncertain attribute; query processing; query optimization
A genetic algorithm based entity resolution approach with active learning (Cited by: 1)
4
Authors: Chenchen SUN, Derong SHEN, Yue KOU, Tiezheng NIE, Ge YU. Frontiers of Computer Science (SCIE, EI, CSCD), 2017, No. 1, pp. 147-159 (13 pages)
Entity resolution is a key aspect of data quality and data integration, identifying which records correspond to the same real-world entity in data sources. Many existing approaches require manually designed match rules to solve the problem, which always requires domain knowledge and is time consuming. We propose a novel genetic algorithm based entity resolution approach via active learning. It is able to learn effective match rules by logically combining comparisons of several different attributes with proper thresholds. We use active learning to reduce the amount of manually labeled data and speed up the learning process. An extensive evaluation shows that the proposed approach outperforms the state-of-the-art entity resolution approaches in accuracy.
Keywords: entity resolution; genetic algorithm; active learning; data quality; data integration
CrowdOLA: Online Aggregation on Duplicate Data Powered by Crowdsourcing (Cited by: 3)
5
Authors: An-Zhen Zhang, Jian-Zhong Li, Hong Gao, Yu-Biao Chen, Heng-Zhao Ma, Mohamed Jaward Bah. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2018, No. 2, pp. 366-379 (14 pages)
Recently there has been an increasing need for interactive human-driven analysis on large volumes of data. Online aggregation (OLA), which provides a quick sketch of massive data before the long wait for the final accurate query result, has drawn significant research attention. However, directly processing OLA on duplicate data leads to incorrect query answers, since sampling from duplicate records leads to an overrepresentation of the duplicate data in the sample. This violates the prerequisite of uniform distributions in most statistical theories. In this paper, we propose CrowdOLA, a novel framework for integrating online aggregation processing with deduplication. Instead of cleaning the whole dataset, CrowdOLA retrieves block-level samples continuously from the dataset and employs a crowd-based entity resolution approach to detect duplicates in the sample in a pay-as-you-go fashion. After cleaning the sample, an unbiased estimator is provided to address the error bias introduced by the duplication. We evaluate CrowdOLA on both real-world and synthetic workloads. Experimental results show that CrowdOLA provides a good balance between efficiency and accuracy.
Keywords: online aggregation; entity resolution; crowdsourcing; cloud computing
Active transfer learning of matching query results across multiple sources (Cited by: 2)
6
Authors: Jie XIN, Zhiming CUI, Pengpeng ZHAO, Tianxu HE. Frontiers of Computer Science (SCIE, EI, CSCD), 2015, No. 4, pp. 595-607 (13 pages)
Entity resolution (ER) is the problem of identifying and grouping different manifestations of the same real-world object. Algorithmic approaches have been developed where most tasks offer superior performance under supervised learning. However, the prohibitive cost of labeling training data is still a huge obstacle for detecting duplicate query records from online sources. Furthermore, the unique combinations of noisy data with missing elements make ER tasks more challenging. To address this, transfer learning has been adopted to adaptively share learned common structures of similarity scoring problems between multiple sources. Although such techniques reduce the labeling cost so that it is linear with respect to the number of sources, their random sampling strategy is not successful enough to handle the ordinary sample imbalance problem. In this paper, we present a novel multi-source active transfer learning framework to jointly select fewer data instances from all sources to train classifiers with constant precision/recall. The intuition behind our approach is to actively label the most informative samples while adaptively transferring collective knowledge between sources. In this way, the classifiers that are learned can be both label-economical and flexible even for imbalanced or quality-diverse sources. We compare our method with the state-of-the-art approaches on real-world datasets. Our experimental results demonstrate that our active transfer learning algorithm can achieve impressive performance with far fewer labeled samples for record matching with numerous and varied sources.
Keywords: entity resolution; active learning; transfer learning; convex optimization