Abstract
The documents that a search engine indexes for a queried person name are typically very numerous and are updated incrementally, with both the arrival time and the volume of new documents uncertain. Most existing methods perform global clustering for person name disambiguation; they tend to be inefficient on large-scale data and cannot support incremental clustering. This paper proposes an incremental clustering method for person name disambiguation based on key evidence and E2LSH. For the initial document set, a global clustering method is used, which preserves clustering performance while keeping the number of documents handled by global clustering under control, thereby improving clustering efficiency. For the incremental document set, the proposed key-evidence and E2LSH method generates a candidate document set, which greatly reduces the number of documents whose similarity must be computed and so improves efficiency. Experimental results show that the proposed method effectively improves the efficiency of person name clustering while achieving good clustering performance.
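The incremental step described in the abstract (generate a small candidate set with E2LSH, then compute similarity only against those candidates) can be illustrated with a minimal sketch. It assumes documents are already represented as dense numeric vectors and uses the standard p-stable hash h(v) = floor((a·v + b) / w). The class and function names, the parameters `k`, `num_tables`, `w`, and `threshold`, and the cosine-similarity assignment rule are all illustrative assumptions rather than the paper's settings. The key-evidence step is omitted because the abstract does not define it.

```python
import math
import random
from collections import defaultdict


class E2LSHIndex:
    """Toy p-stable (Gaussian) LSH index; parameters are illustrative assumptions."""

    def __init__(self, dim, k=4, num_tables=8, w=4.0, seed=0):
        rng = random.Random(seed)
        self.w = w
        self.tables = []
        for _ in range(num_tables):
            # k hash functions per table, each h(v) = floor((a.v + b) / w)
            funcs = [
                ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
                for _ in range(k)
            ]
            self.tables.append((funcs, defaultdict(set)))

    def _key(self, funcs, vec):
        # Concatenate the k hash values into one bucket key.
        return tuple(
            math.floor((sum(x * y for x, y in zip(a, vec)) + b) / self.w)
            for a, b in funcs
        )

    def add(self, doc_id, vec):
        for funcs, buckets in self.tables:
            buckets[self._key(funcs, vec)].add(doc_id)

    def candidates(self, vec):
        # Union of the buckets the vector falls into across all tables.
        found = set()
        for funcs, buckets in self.tables:
            found |= buckets.get(self._key(funcs, vec), set())
        return found


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def assign_incremental(doc_id, vec, index, vectors, cluster_of, threshold=0.8):
    """Assign a new document by comparing it only with its LSH candidates."""
    best_id, best_sim = None, threshold
    for cand in index.candidates(vec):
        sim = cosine(vec, vectors[cand])
        if sim >= best_sim:
            best_id, best_sim = cand, sim
    # Join the best candidate's cluster, or start a new cluster named after the doc.
    cluster_of[doc_id] = cluster_of[best_id] if best_id is not None else doc_id
    vectors[doc_id] = vec
    index.add(doc_id, vec)
    return cluster_of[doc_id]
```

In this sketch a new document is compared only with the documents that share at least one bucket with it, so the number of similarity computations depends on bucket occupancy rather than on the total collection size. A larger `k` makes buckets more selective, while more tables raise the chance that truly similar documents collide.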
Source
《情报学报》
CSSCI
Peking University Core Journals (北大核心)
2016, No. 7, pp. 714-722 (9 pages)
Journal of the China Society for Scientific and Technical Information
Funding
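Supported by the National Social Science Fund of China project "Research on System Modeling and Response Strategies for Online Public Opinion Struggle" (14BXW028)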
Keywords
person name disambiguation
incremental clustering
key evidence
E2LSH
large-scale documents