Abstract
Nouns, adjectives, gerunds, and named entities are used as text features. Taking word order and word frequency into account, together with the semantics of the feature terms, the paper proposes a text clustering method based on a revised longest common subsequence (LCSC). Experimental results show that, compared with the traditional cosine-similarity clustering method, LCSC raises the average F-measure on the P-IP metric for person name disambiguation from 74.2% to 84.9%. Compared with the unmodified longest common subsequence method, overall performance also improves by 3.7%.
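As a rough illustration of the baseline the abstract builds on, the sketch below computes a plain longest-common-subsequence similarity between two documents represented as ordered lists of feature terms. It is not the paper's revised method: the abstract does not give the details of the revision (how word frequency and term semantics enter the score), so those are omitted. The function names `lcs_length` and `lcs_similarity` and the normalization `2 * LCS / (len(a) + len(b))` are assumptions chosen for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences.

    Standard dynamic programming, keeping only the previous row.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Matching terms extend the best subsequence of both prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of the two shorter prefixes.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """Order-sensitive similarity in [0, 1].

    Assumed normalization (not taken from the paper): twice the LCS length
    divided by the combined length of the two sequences.
    """
    if not a or not b:
        return 0.0
    return 2.0 * lcs_length(a, b) / (len(a) + len(b))


# Hypothetical feature-term sequences extracted from two web pages
# that mention the same person name.
doc_a = ["professor", "university", "physics", "research"]
doc_b = ["university", "professor", "physics", "lecture"]
score = lcs_similarity(doc_a, doc_b)
```

Unlike a cosine measure over bag-of-words vectors, this score depends on term order: the two sample documents share three terms, but only two of them appear in the same relative order. A pairwise score of this kind could then feed a hierarchical clustering step that groups pages referring to the same person.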
Source
Journal of Huaqiao University (Natural Science) (《华侨大学学报(自然科学版)》)
Indexed in: CAS; Peking University Core Journals (北大核心)
2016, No. 2, pp. 201-206 (6 pages)
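Funding
Major Project of the Fujian Provincial Science and Technology Plan (2011H6016)
Key Project of the Fujian Provincial Science and Technology Plan (2011H0028)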
Keywords
person name disambiguation
text similarity
longest common subsequence
hierarchical clustering