Crowd-Guided Entity Matching with Consolidated Textual Data

Abstract: Entity matching (EM) identifies records referring to the same entity within or across databases. Existing methods that rely on structured attribute values (such as numeric, date, or short string values) may fail when the structured information is insufficient to reflect the matching relationships between records. More and more databases now have an unstructured textual attribute containing extra consolidated textual information (CText) about the record, but little work has been done on using CText for EM. Conventional string similarity metrics such as edit distance or bag-of-words are unsuitable for measuring the similarity between pieces of CText, since each piece contains hundreds or thousands of words, and existing topic models do not work well either, since there are no obvious gaps between topics in CText. In this paper, we propose a novel co-occurrence-based topic model to identify the various sub-topics in each piece of CText, and then measure the similarity between pieces of CText along the multiple sub-topic dimensions. To avoid overlooking important hidden sub-topics, we let the crowd help decide the weights of the different sub-topics in EM. Our empirical study on two real-world datasets, conducted on the Amazon Mechanical Turk crowdsourcing platform, shows that our method outperforms state-of-the-art EM methods and text understanding models.
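The idea of scoring a record pair along several sub-topic dimensions and combining the scores with weights can be illustrated with a minimal sketch. This is not the paper's model: the sub-topic segmentation is assumed to be given already (the co-occurrence-based topic model is not reproduced here), cosine similarity over token counts is a stand-in for whatever per-sub-topic measure the authors use, and the function names, sub-topic labels, and weight values are all hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def weighted_subtopic_similarity(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Combine per-sub-topic similarities into one score in [0, 1].

    rec_a, rec_b: hypothetical mapping sub-topic -> list of tokens assigned to it.
    weights: hypothetical mapping sub-topic -> non-negative weight, standing in
             for weights that would be derived from crowd answers.
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    score = 0.0
    for topic, weight in weights.items():
        sim = cosine(Counter(rec_a.get(topic, [])), Counter(rec_b.get(topic, [])))
        score += weight * sim
    return score / total


# Illustrative records; the sub-topics and tokens are made up.
rec_a = {"cuisine": ["thai", "noodle"], "location": ["downtown"]}
rec_b = {"cuisine": ["thai", "noodle"], "location": ["airport"]}
weights = {"cuisine": 0.7, "location": 0.3}
print(weighted_subtopic_similarity(rec_a, rec_b, weights))
```

In this sketch a pair that agrees only on a heavily weighted sub-topic still scores high, which is the intuition behind letting weights decide which sub-topics matter for matching.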

Authors: Zhi-Xu Li, Qiang Yang, An Liu, Guan-Feng Liu, Jia Zhu, Jia-Jie Xu, Kai Zheng, Min Zhang

Affiliations: School of Computer Science and Technology; Guangdong Key Laboratory of Big Data Analysis and Processing; School of Computer Science and Technology; School of Computer; School of Computer Science and Technology; Beijing Key Laboratory of Big Data Management and Analysis Methods

Source: Journal of Computer Science & Technology (计算机科学技术学报, English edition), SCIE, EI, CSCD; 2017, Issue 5, pp. 858-876 (19 pages)

Keywords: entity matching; consolidated textual data; crowdsourcing

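Classification: P208 [Astronomy and Earth Sciences: Cartography and Geographic Information Engineering]; TP311.13 [Automation and Computer Technology: Computer Software and Theory]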
