Abstract
The large number of data sources accessible on the Web makes it convenient for users to obtain the information they need. A necessary step in Web data integration is to accurately identify duplicate Web entities, which appear in different forms across data sources. Existing work on duplicate entity identification mainly operates between two data sources, and the sheer number of Web data sources makes such approaches impractical across multiple sources. To address this problem, an iterative-training-based approach to duplicate Web entity identification is proposed, which can be applied to multiple Web data sources using a small training set. Extensive experiments on multiple Web data sources in two domains, books and computer products, demonstrate the effectiveness of the proposed approach.
Source
《计算机科学与探索》
CSCD
2010, No. 7, pp. 599-607 (9 pages)
Journal of Frontiers of Computer Science and Technology
Funding
National Natural Science Foundation of China, No. 60875033
China Postdoctoral Science Foundation, Nos. 20080440256, 200902014
Keywords
Web entity
duplicate entity identification
Web data integration
iterative training