期刊文献+

Crowd-Guided Entity Matching with Consolidated Textual Data

Crowd-Guided Entity Matching with Consolidated Textual Data
原文传递
导出
摘要 Entity matching (EM) identifies records referring to the same entity within or across databases. Existing methods using structured attribute values (such as digital, date or short string values) may fail when the structured information is not enough to reflect the matching relationships between records. Nowadays more and more databases may have some unstructured textual attribute containing extra consolidated textual information (CText) of the record, but seldom work has been done on using the CText for EM. Conventional string similarity metrics such as edit distance or bag-of-words are unsuitable for measuring the similarities between CText since there are hundreds or thousands of words with each piece of CText, while existing topic models either cannot work well since there are no obvious gaps between topics in CText. In this paper, we propose a novel cooccurrence-based topic model to identify various sub-topics from each piece of CText, and then measure the similarity between CText on the multiple sub-topic dimensions. To avoid ignoring some hidden important sub-topics, we let the crowd help us decide weights of different sub-topics in doing EM. Our empirical study on two real-world datasets based on Amzon Mechanical Turk Crowdsourcing Platform shows that our method outperforms the state-of-the-art EM methods and Text Understanding models. Entity matching (EM) identifies records referring to the same entity within or across databases. Existing methods using structured attribute values (such as digital, date or short string values) may fail when the structured information is not enough to reflect the matching relationships between records. Nowadays more and more databases may have some unstructured textual attribute containing extra consolidated textual information (CText) of the record, but seldom work has been done on using the CText for EM. Conventional string similarity metrics such as edit distance or bag-of-words are unsuitable for measuring the similarities between CText since there are hundreds or thousands of words with each piece of CText, while existing topic models either cannot work well since there are no obvious gaps between topics in CText. In this paper, we propose a novel cooccurrence-based topic model to identify various sub-topics from each piece of CText, and then measure the similarity between CText on the multiple sub-topic dimensions. To avoid ignoring some hidden important sub-topics, we let the crowd help us decide weights of different sub-topics in doing EM. Our empirical study on two real-world datasets based on Amzon Mechanical Turk Crowdsourcing Platform shows that our method outperforms the state-of-the-art EM methods and Text Understanding models.
出处 《Journal of Computer Science & Technology》 SCIE EI CSCD 2017年第5期858-876,共19页 计算机科学技术学报(英文版)
关键词 entity matching consolidated textual data crowdsourcing entity matching consolidated textual data crowdsourcing
  • 相关文献

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部