

A Domain Dictionary Extraction Algorithm Based on Mapping Relationships
Abstract: A domain dictionary is a representation of domain knowledge and an important reference for data normalization and data cleaning. A mapping relationship is the correspondence between two columns of a table. Building and extending a domain dictionary relies mainly on Web tables as the data source, which requires joining and extending the local mapping relationships found across many Web tables. However, Web tables are heterogeneous and suffer from data quality problems, so data integration techniques such as schema matching cannot be relied on alone. This paper proposes a domain dictionary extraction algorithm based on mapping relationships. First, IDF-weighted Jaccard maximum containment and edit distance are used for approximate string matching, and a Gaussian mixture model is used to discretize numerical values, which addresses heterogeneity at the data level. Next, pointwise mutual information and functional dependencies are used to identify candidate tables that contain mapping relationships. Compatibility and mutual exclusion between candidate tables are then defined, and a mapping-relationship graph model is constructed to join the candidate tables, yielding a domain dictionary expressed as mapping relationships. Finally, a conflict resolution step is added to ensure the quality of the domain dictionary. In the experiments, the proposed algorithm is compared on a real estate dataset with other algorithms that acquire domain knowledge from the Web, and the results verify its effectiveness and reliability.
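The matching and candidate-scoring measures named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: all function names are hypothetical, the smoothed IDF formula is one common choice, "maximum containment" is interpreted here as the larger of the two directional containment ratios, and functional dependency is approximated by the fraction of rows that agree with the majority right-hand value. The abstract does not specify any of these details.

```python
import math
from collections import Counter, defaultdict


def idf_weights(values):
    """Smoothed IDF per token over a collection of tokenised cell values (assumed formula)."""
    n = len(values)
    df = Counter(tok for toks in values for tok in set(toks))
    return {tok: math.log(1.0 + n / count) for tok, count in df.items()}


def max_containment(a, b, idf, default=1.0):
    """IDF-weighted overlap divided by each side's weight; returns the larger ratio."""
    a, b = set(a), set(b)

    def weight(tokens):
        return sum(idf.get(t, default) for t in tokens)

    wa, wb = weight(a), weight(b)
    if wa == 0 or wb == 0:
        return 0.0
    shared = weight(a & b)
    return max(shared / wa, shared / wb)


def edit_distance(s, t):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


def pmi(pairs, a, b):
    """Pointwise mutual information of value a (left column) and b (right column)."""
    n = len(pairs)
    joint = sum(1 for x, y in pairs if x == a and y == b)
    if joint == 0:
        return float("-inf")
    count_a = sum(1 for x, _ in pairs if x == a)
    count_b = sum(1 for _, y in pairs if y == b)
    return math.log((joint / n) / ((count_a / n) * (count_b / n)))


def fd_support(pairs):
    """Fraction of rows consistent with an approximate dependency left -> right."""
    if not pairs:
        return 0.0
    groups = defaultdict(Counter)
    for a, b in pairs:
        groups[a][b] += 1
    kept = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return kept / len(pairs)
```

In this sketch, a column pair whose `fd_support` is close to 1 and whose value pairs have positive `pmi` would be a plausible candidate for containing a mapping relationship; the thresholds the paper actually uses are not given in the abstract.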
Affiliation: Shenyang Jianzhu University
Source: Hans Journal of Data Mining (《数据挖掘》), 2021, No. 2, pp. 59-76 (18 pages)








