Abstract
To address the sparse features of individual entity definitions in a knowledge base and the poor generality of manually set similarity thresholds, this paper proposes a person name disambiguation algorithm based on stepwise clustering. First, the person attribute features in the knowledge base's name entity definitions are used as query features, and text retrieval produces an initial knowledge-base-driven clustering, which compensates for the sparse features of a single entity definition. Next, taking the initial clustering results as input, agglomerative hierarchical clustering with an adaptive threshold disambiguates the names covered by the knowledge base. Finally, conditional random fields identify the Other class, and the same adaptive-threshold agglomerative hierarchical clustering clusters the S class, which disambiguates the names not covered by the knowledge base. Experiments on the CLP2012 Chinese person name disambiguation evaluation corpus show that the proposed algorithm disambiguates person names effectively.
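The threshold-controlled agglomerative step can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how the adaptive threshold is computed, so the mean pairwise similarity used here is only an assumed stand-in. The bag-of-words cosine similarity, the average linkage, and the function names `cosine`, `adaptive_threshold`, and `agglomerative` are likewise illustrative assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def adaptive_threshold(vectors):
    """ASSUMPTION: use the mean pairwise similarity as the stopping threshold.

    The abstract does not define the adaptive rule; this is a placeholder that
    derives the threshold from the data instead of a hand-set constant.
    """
    sims = [cosine(a, b) for a, b in combinations(vectors, 2)]
    return sum(sims) / len(sims) if sims else 0.0


def agglomerative(vectors, threshold):
    """Average-linkage agglomerative clustering.

    Repeatedly merges the most similar pair of clusters and stops once the
    best similarity is no longer above the threshold. Returns index lists.
    """
    clusters = [[i] for i in range(len(vectors))]

    def avg_link(c1, c2):
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the pair of clusters with the highest average similarity.
        (x, y), best = max(
            (((x, y), avg_link(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best <= threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x stays valid
    return clusters
```

In this sketch each document mentioning an ambiguous name would be turned into a `Counter` of its tokens, and each resulting cluster would be read as one distinct person. In the pipeline described above, the clusters from the retrieval-based initial step would serve as the starting points instead of single documents.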
Source
Journal of Data Acquisition and Processing (《数据采集与处理》)
CSCD
Peking University Core Journal (北大核心)
2016, Issue 1, pp. 213-222 (10 pages)
Funding
Supported by the National Social Science Fund of China (14BXW028)
Supported by the PLA Military Postgraduate Research Project (2011JY002k-158)
Keywords
person name disambiguation
feature sparsity
text retrieval
agglomerative hierarchical clustering
similarity threshold