Matching Chinese entity names with multiple features (集成多种特征匹配中文实体名称)
Abstract Accurate entity name matching is widely used in information system integration. In Chinese, name variations and clerical errors make exact string matching unreliable, so matching methods that tolerate these variations and errors are needed. The similarity of Chinese entity names is computed at three levels (character, word, and semantic), and these similarities are combined linearly to integrate the strengths of each. To exploit more matching features, two machine learning methods are introduced: the first learns an optimized ranking and the best cut point from training data; the second uses a Support Vector Machine to decide whether two names refer to the same entity. Experiments on a real dataset of Chinese entity names show that these methods and features effectively improve matching performance.
Author 巩军 (Gong Jun)
Source Computer Engineering and Applications (《计算机工程与应用》), CSCD, 2012, No. 27, pp. 136-141 (6 pages)
Keywords string similarity; name disambiguation; name matching; machine learning
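
The linear combination of similarities and the cut point described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. It makes several assumptions: character-bigram overlap stands in for word-level similarity (a real system would need a Chinese word segmenter), the semantic level is omitted, and the weights and cut point are hypothetical fixed values, whereas the paper obtains its ranking and cut point by training. All function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity: 1 minus the normalized edit distance."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def bigrams(s: str) -> set:
    """Character bigrams; falls back to the whole string if it is too short."""
    return {s[i:i + 2] for i in range(len(s) - 1)} or {s}


def bigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of bigrams (assumed stand-in for word-level similarity)."""
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y)


def combined_similarity(a: str, b: str, weights=(0.5, 0.5)) -> float:
    """Linear combination of the two similarities; weights are hypothetical."""
    w_char, w_bigram = weights
    return w_char * char_similarity(a, b) + w_bigram * bigram_similarity(a, b)


def is_match(a: str, b: str, cut_point: float = 0.6) -> bool:
    """Declare a match when the combined score reaches the (assumed) cut point."""
    return combined_similarity(a, b) >= cut_point
```

With these assumed settings, `is_match("北京大学", "北京大孚")` accepts a one-character clerical error, while `is_match("北京大学", "清华大学")` rejects two different names that share a suffix.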