Abstract
To address the problem that the same entity is described in different ways across sources on the Internet, an entity matching method based on outlier detection over non-primary attributes is proposed. Non-primary attribute values are used to resolve the ambiguity caused by differing primary attribute values; they also allow non-matching entities to be ruled out quickly, which greatly improves matching efficiency. The method overcomes, to some extent, the drawback that outlier matching based on traditional singular value decomposition cannot be applied to large-scale data. First, a rule-based method coarsely filters the data to reduce the number of candidate entity pairs. Next, an outlier detection model screens the remaining pairs to obtain a preliminary set of entity pairs. Finally, this set is sampled, and machine learning is used to select a suitable matcher and train it to obtain the matched pairs. Experimental results show that the method improves accuracy and recall, verifying its effectiveness.
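The first two stages described in the abstract (rule-based coarse filtering, then outlier-based screening on non-primary attributes) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, since the abstract gives no details: the toy records, the attribute names `city` and `year`, the blocking rule (equal `city`), and the use of a modified z-score on the `year` gap as a stand-in for the paper's outlier detection model. The sampling and matcher-training stage is not shown.

```python
from itertools import product
from statistics import median

# Toy records (made up): "name" plays the role of the primary attribute,
# "city" and "year" are hypothetical non-primary attributes.
source_a = [
    {"id": "a1", "name": "Air China Ltd.", "city": "Beijing", "year": 1988},
    {"id": "a2", "name": "Tianjin Airlines", "city": "Tianjin", "year": 2004},
]
source_b = [
    {"id": "b1", "name": "Air China", "city": "Beijing", "year": 1988},
    {"id": "b2", "name": "China Air Cargo", "city": "Beijing", "year": 2003},
    {"id": "b3", "name": "Tianjin Air", "city": "Tianjin", "year": 2004},
    {"id": "b4", "name": "Okay Airways", "city": "Tianjin", "year": 2005},
]


def coarse_filter(left, right):
    """Rule-based blocking (assumed rule): keep pairs that agree on 'city'."""
    return [(a, b) for a, b in product(left, right) if a["city"] == b["city"]]


def drop_outliers(pairs, threshold=3.5):
    """Discard pairs whose 'year' gap is an outlier by modified z-score.

    This is a simple stand-in, not the outlier model used in the paper.
    """
    gaps = [abs(a["year"] - b["year"]) for a, b in pairs]
    if not gaps:
        return []
    med = median(gaps)
    # Median absolute deviation of the gaps around their median.
    mad = median([abs(g - med) for g in gaps])
    kept = []
    for pair, gap in zip(pairs, gaps):
        if mad == 0:
            # Degenerate spread: anything off the median counts as an outlier.
            is_outlier = gap != med
        else:
            is_outlier = 0.6745 * abs(gap - med) / mad > threshold
        if not is_outlier:
            kept.append(pair)
    return kept


candidates = coarse_filter(source_a, source_b)
survivors = drop_outliers(candidates)
survivor_ids = [(a["id"], b["id"]) for a, b in survivors]
print(survivor_ids)
```

In this toy run the coarse filter reduces eight possible pairs to four, and the outlier step removes the pair whose `year` values are far apart. In the pipeline the abstract describes, the surviving pairs would then be sampled and passed to a trained matcher.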
Authors
曹卫东
王广森
王怀超
CAO Wei-dong; WANG Guang-sen; WANG Huai-chao (College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China)
Source
《计算机工程与设计》
Peking University Core Journal (北大核心)
2019, No. 8, pp. 2247-2252 (6 pages)
Computer Engineering and Design
Funding
Civil Aviation Science and Technology Major Special Fund Project (MHRD20150107, MHRD20160109)
Fundamental Research Funds for the Central Universities (3122014C017)
Keywords
entity matching
non-primary attribute
outlier detection
rough filter
matcher