Abstract
Based on statistics over a large-scale sample database of personal names, this paper studies the usage patterns of surname characters and given-name characters across the various kinds of Chinese personal names. Through statistical analysis of a large-scale corpus, it also obtains, for each surname character, the probability that the character actually serves as a surname in real text, together with its contextual patterns. For Han Chinese personal names, the concept of a multilevel surname threshold is proposed; for personal names of Chinese minority nationalities and transliterated personal names, the concept of a multilevel first-character threshold is proposed. The thresholds are determined using the 3σ rule. Experimental results show that the Chinese personal name recognition model based on multilevel thresholds is effective.
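The abstract does not spell out how the 3σ rule is turned into multiple threshold levels, so the following is only a minimal sketch under stated assumptions, not the paper's method. It assumes the thresholds are placed at μ − kσ (k = 1, 2, 3) over the per-character surname probabilities, clipped at zero. The function names `sigma_thresholds` and `threshold_level`, and the sample probabilities, are hypothetical and invented for illustration.

```python
from statistics import mean, pstdev


def sigma_thresholds(probabilities, max_k=3):
    """Return thresholds mu - k*sigma for k = 1..max_k, clipped at 0.

    Assumption: this placement of levels is illustrative only; the
    abstract says thresholds are chosen by the 3-sigma rule but does
    not define the individual levels.
    """
    mu = mean(probabilities)
    sigma = pstdev(probabilities)  # population standard deviation
    return [max(0.0, mu - k * sigma) for k in range(1, max_k + 1)]


def threshold_level(p, thresholds):
    """Count how many thresholds the probability p falls below.

    0 means p is at or above every threshold; a larger value means the
    character is a weaker surname candidate under this sketch.
    """
    return sum(1 for t in thresholds if p < t)


# Invented probabilities that a character acts as a surname in running
# text; these are NOT figures from the paper.
surname_prob = {"张": 0.8, "王": 0.6, "高": 0.4, "白": 0.2}

levels = sigma_thresholds(list(surname_prob.values()))
ranked = {ch: threshold_level(p, levels) for ch, p in surname_prob.items()}
```

In a recognizer built along these lines, a candidate whose first character sits at a lower level would presumably need stronger contextual evidence before being accepted as a personal name; that design choice is likewise an assumption here.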
Source
《计算机工程与应用》 (Computer Engineering and Applications)
CSCD
Peking University Core Journals (北大核心)
2007, No. 33, pp. 1-3, 18 (4 pages in total)
Funding
The National High-Tech Research and Development Plan of China (863 Program), under Grant No. 2006AA012140
Keywords
natural language processing
unknown word recognition
Chinese personal name recognition
multilevel threshold
3σ rule