Abstract
The main difficulty in Chinese part-of-speech (POS) tagging is determining, in context, the POS of words that belong to more than one category (multi-category words). Because such words make up only a small fraction of the dictionary (about 3%), this paper proposes a bi-state hidden Markov model. In addition to the regular state transition probability matrix, the model logically maintains a dedicated state transition probability matrix for each multi-category word, so the probability of moving from one state to another is no longer independent of the observation. According to the authors, this improves the accuracy of the model.
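The abstract does not give the model's equations, so the following is only a minimal sketch of the idea as described: a Viterbi decoder that uses a word-specific transition matrix when the current word has one and falls back to the regular matrix otherwise. The function name `viterbi_word_conditioned`, the `word_trans` lookup, the rule of selecting the matrix by the current word, and all probabilities in the toy data are assumptions made for illustration, not the authors' implementation.

```python
import math


def viterbi_word_conditioned(words, tags, start, trans, emit, word_trans=None):
    """Viterbi decoding where the transition matrix may depend on the current word.

    start[t]            : probability that the first tag is t
    trans[p][t]         : regular transition probability P(t | p)
    emit[t][w]          : emission probability P(w | t)
    word_trans[w][p][t] : hypothetical word-specific transition P(t | p, w),
                          supplied only for multi-category words
    """
    word_trans = word_trans or {}
    tiny = 1e-12  # floor so that log(0) never occurs

    def lg(p):
        return math.log(max(p, tiny))

    # Initialise the scores with the first word.
    scores = {t: lg(start.get(t, 0.0)) + lg(emit[t].get(words[0], 0.0)) for t in tags}
    backptrs = []

    for w in words[1:]:
        # Assumption: the matrix is chosen by the word being tagged.
        a = word_trans.get(w, trans)
        new_scores, ptr = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: scores[p] + lg(a[p].get(t, 0.0)))
            new_scores[t] = (scores[best_prev]
                             + lg(a[best_prev].get(t, 0.0))
                             + lg(emit[t].get(w, 0.0)))
            ptr[t] = best_prev
        scores = new_scores
        backptrs.append(ptr)

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: scores[t])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1]


# Toy data (made up): "bark" is treated as a multi-category word.
tags = ["N", "V"]
start = {"N": 0.9, "V": 0.1}
trans = {"N": {"N": 0.7, "V": 0.3}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"dogs": 0.6, "bark": 0.4}, "V": {"dogs": 0.1, "bark": 0.9}}
bark_trans = {"bark": {"N": {"N": 0.2, "V": 0.8}, "V": {"N": 0.5, "V": 0.5}}}

sentence = ["dogs", "bark"]
print(viterbi_word_conditioned(sentence, tags, start, trans, emit))              # ['N', 'N']
print(viterbi_word_conditioned(sentence, tags, start, trans, emit, bark_trans))  # ['N', 'V']
```

In this toy example the regular matrix alone tags "bark" as a noun, while the dedicated matrix for "bark" shifts the decision to a verb. Because only a small share of words would need a dedicated matrix, storing the extra matrices in a sparse lookup keyed by word keeps the overhead modest.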
Source
《上海交通大学学报》
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2003, No. 6, pp. 897-900 (4 pages)
Journal of Shanghai Jiaotong University
Keywords
part-of-speech (POS) tagging
hidden Markov model
natural language processing (NLP)