Abstract
A domain dictionary is a representation of domain knowledge and an important reference for data normalization and data cleaning. A mapping relationship is the correspondence between two columns of a table. Domain dictionaries are built and extended mainly from Web tables, which requires joining and extending the local mapping relationships scattered across many Web tables. Because Web tables are heterogeneous and suffer from data quality problems, data integration techniques such as schema matching cannot be relied on alone. This paper proposes a domain dictionary extraction algorithm based on mapping relationships. First, approximate string matching is performed with IDF-weighted Jaccard maximum containment and edit distance, and numerical values are discretized with a Gaussian mixture model, which resolves heterogeneity at the data level. Next, candidate tables containing mapping relationships are identified using pointwise mutual information and functional dependencies. Compatibility and mutual exclusion between candidate tables are then defined, and a mapping relationship graph model is constructed to join the candidate tables, so that the domain dictionary is extracted in the form of mapping relationships. Finally, a conflict resolution step is added to ensure the quality of the domain dictionary. In the experiments, the algorithm is compared on a real estate data set with other algorithms that obtain domain knowledge from the Web, and the results verify its effectiveness and reliability.
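The abstract does not give the formula behind the approximate string matching step, so the following is only a minimal sketch of one common reading of IDF-weighted maximum containment: the total IDF weight of the tokens two strings share, divided by the smaller of the two strings' total weights. The function names `idf_weights` and `max_containment`, the smoothed IDF formula, and the default weight for unseen tokens are assumptions made for illustration, not the paper's definitions; edit distance and the Gaussian mixture discretization are not covered here.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Smoothed IDF per token, log(1 + N / df). The smoothing is an assumption."""
    n = len(documents)
    # Count each token once per document (document frequency).
    df = Counter(tok for doc in documents for tok in set(doc))
    return {tok: math.log(1 + n / count) for tok, count in df.items()}


def max_containment(a, b, idf, default_weight=1.0):
    """IDF weight of shared tokens over the smaller of the two total weights.

    Equivalent to max(containment(a in b), containment(b in a)).
    Tokens missing from `idf` get `default_weight` (an assumed fallback).
    """
    set_a, set_b = set(a), set(b)

    def weight(tokens):
        return sum(idf.get(t, default_weight) for t in tokens)

    smaller = min(weight(set_a), weight(set_b))
    if smaller == 0:
        return 0.0  # at least one side has no tokens
    return weight(set_a & set_b) / smaller
```

Under this reading, a short name whose tokens all appear in a longer variant scores 1.0, while names that share only a frequent, low-IDF token score much lower. That is the behavior one would want when matching abbreviated and full entity names across Web tables.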
Source
Hans Journal of Data Mining (《数据挖掘》)
2021, Issue 2, pp. 59-76 (18 pages)