Abstract
Entity resolution is an indispensable step in data integration, data mining, and related technologies; its task is to identify data records that refer to the same real-world entity. Most existing methods decide whether two records describe the same entity by computing the similarity of their attributes. Because this approach requires the record attributes to be aligned in advance, it cannot handle tokens that are misplaced in the wrong attribute, nor can it effectively exploit the semantic and structural information of tokens across attributes, which reduces the accuracy of entity resolution. This paper proposes a token-granularity entity resolution method based on topic heterogeneous graph embedding (THGE-ER). On top of the token, attribute, and record levels, the method uses the LDA model to add a topic level to entity records and constructs a topic heterogeneous graph composed of four types of nodes: tokens, attributes, records, and topics. It then adopts a heterogeneous graph embedding method that distinguishes node types, embedding the semantic and structural information between nodes into token-level embedding vectors, and finally combines a multi-level attention mechanism to make the entity resolution decision. Extensive experiments show that the proposed method performs well.
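As a rough illustration of the four-node-type graph described in the abstract, the following minimal Python sketch builds an undirected graph over token, attribute, record, and topic nodes. It is not the authors' implementation. The function name `build_topic_hetero_graph`, the edge scheme (token-attribute, token-record, record-attribute, record-topic), the whitespace tokenizer, and the sample records are all assumptions made for illustration. The topic assignments are passed in as a plain dictionary standing in for the output of an LDA model, which is not run here.

```python
from collections import defaultdict


def build_topic_hetero_graph(records, record_topics):
    """Build an undirected heterogeneous graph as an adjacency map.

    records:       {record_id: {attribute_name: text}}
    record_topics: {record_id: [topic_id, ...]}  (hypothetical stand-in
                   for topics that an LDA model would assign)
    Nodes are (node_type, value) tuples so that node types stay distinct.
    """
    graph = defaultdict(set)

    def link(a, b):
        # Undirected edge between two typed nodes.
        graph[a].add(b)
        graph[b].add(a)

    for rid, attrs in records.items():
        rec = ("record", rid)
        for name, text in attrs.items():
            attr = ("attribute", name)
            link(rec, attr)
            # Assumed tokenizer: lowercase and split on whitespace.
            for tok in text.lower().split():
                token = ("token", tok)
                link(token, attr)
                link(token, rec)
        for topic_id in record_topics.get(rid, []):
            link(rec, ("topic", topic_id))
    return graph


# Hypothetical sample data: in r2 the brand token sits in the title attribute.
records = {
    "r1": {"title": "Apple iPhone 12", "brand": "Apple"},
    "r2": {"title": "iPhone 12 Apple", "brand": ""},
}
record_topics = {"r1": [0], "r2": [0]}

graph = build_topic_hetero_graph(records, record_topics)
node_types = {node[0] for node in graph}
```

In this sketch the token node `("token", "apple")` is connected to both records even though the brand appears in different attributes, which is the kind of cross-attribute signal the abstract says attribute-aligned similarity measures cannot use.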
Authors
初慧琳
申德荣
窦文周
聂铁铮
寇月
CHU Hui-lin; SHEN De-rong; DOU Wen-zhou; NIE Tie-zheng; KOU Yue (School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China)
Source
《小型微型计算机系统》
CSCD
Peking University Core Journals (北大核心)
2023, No. 7, pp. 1398-1404 (7 pages)
Journal of Chinese Computer Systems
Funding
Supported by the National Natural Science Foundation of China (62072086, 62072084, 62172082).