Abstract
Schema matching is an important research topic in data integration, and uncertainty in both column names and data values is a common situation in schema matching. The prevailing approach is based on mutual information and Euclidean distance, but it does not resolve the mismatches that arise when attributes have identical or nearly identical similarities. To address this problem, this paper proposes a multiple iterative screening method. The method first identifies the attribute pairs in the two relational schemas that can be matched correctly in a single pass, and then selects the optimal attribute pair from among them. It then presents a matching method based on conditional mutual information: the optimal attribute pair is used to compute the conditional mutual information of the unmatched attributes, the Euclidean distances between the attributes are then computed, and the final matching result is obtained, which resolves the mismatch problem. Experimental results show that the proposed algorithm is correct and effective.
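The two information-theoretic quantities named in the abstract can be illustrated with a minimal sketch. It assumes discrete column values and uses the standard entropy identities I(X;Y) = H(X) + H(Y) - H(X,Y) and I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z). The function names and toy columns are hypothetical, and the sketch does not reproduce the paper's iterative screening or its Euclidean-distance step, whose details the abstract does not give.

```python
from collections import Counter
from math import log2


def joint_entropy(*columns):
    """Empirical joint entropy (in bits) of one or more equal-length columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return joint_entropy(x) + joint_entropy(y) - joint_entropy(x, y)


def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (joint_entropy(x, z) + joint_entropy(y, z)
            - joint_entropy(x, y, z) - joint_entropy(z))


# Toy columns (hypothetical data): x and y are independent on their own,
# but become fully dependent once z = x XOR y is known.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
z = [a ^ b for a, b in zip(x, y)]

print(mutual_information(x, y))
print(conditional_mutual_information(x, y, z))
```

In this toy case the plain mutual information cannot distinguish x from y, while conditioning on a third column separates them. This is the general reason a conditioning attribute can help break ties between attributes with the same or similar scores.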
Source
Computer Science (《计算机科学》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2014, No. 8, pp. 85-89 (5 pages)
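Funding
National Natural Science Foundation of China (61272098)
National Basic Research Program of China (973 Program), Ministry of Science and Technology (2012CB316200)
Chinese Polar Environment Comprehensive Investigation and Assessment Programmes (CHINARE2012-04-07)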
Keywords
Uncertainty
Schema matching
Conditional mutual information