Abstract
This paper studies the resolution of overlapping ambiguity strings (OAS) in Chinese automatic word segmentation using a maximum entropy model. The model outputs one of two classes: either the first two characters form a word, or the last two characters form a word. The features are one context word on each side of the OAS, the OAS itself, and the relative magnitude of the word probabilities of its two possible segmentations. OAS in the training text are found by combining forward maximum matching (FMM) and backward maximum matching (BMM) segmentation, then labeled and used to train the maximum entropy model. Training and testing use the OAS that occur in the People's Daily corpus of January 1998. Accuracy is 98.64% in the closed test and 95.01% in the open test; the latter is 3.76% higher than that of the commonly used word-probability method.
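The FMM/BMM detection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`fmm`, `bmm`, `boundaries`, `find_disagreements`), the `max_len` limit, and the toy lexicon are all assumptions made for illustration. The idea shown is only that spans where the two greedy segmentations place different cuts are candidate ambiguity strings.

```python
def fmm(text, lexicon, max_len=4):
    """Forward maximum matching: take the longest lexicon word from the left."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def bmm(text, lexicon, max_len=4):
    """Backward maximum matching: take the longest lexicon word from the right."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def boundaries(words):
    """Character offsets at which a segmentation places a cut."""
    cuts, pos = set(), 0
    for w in words:
        pos += len(w)
        cuts.add(pos)
    return cuts


def find_disagreements(text, lexicon, max_len=4):
    """Return (start, end, substring) spans where FMM and BMM cut differently."""
    f = boundaries(fmm(text, lexicon, max_len))
    b = boundaries(bmm(text, lexicon, max_len))
    common = sorted((f & b) | {0})
    spans = []
    for start, end in zip(common, common[1:]):
        inner_f = {c for c in f if start < c < end}
        inner_b = {c for c in b if start < c < end}
        # Between two shared cuts, any inner cut signals a disagreement.
        if inner_f != inner_b:
            spans.append((start, end, text[start:end]))
    return spans


# Hypothetical toy lexicon; a real system would load a full dictionary.
toy_lexicon = {"从小", "小学", "开始"}
candidates = find_disagreements("他从小学开始", toy_lexicon)
```

With this toy lexicon, FMM yields 他/从小/学/开始 and BMM yields 他/从/小学/开始, so the sketch reports the three-character span 从小学 as a candidate. Its two readings (从小/学 versus 从/小学) correspond to the two output classes named in the abstract.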
Source
《北京理工大学学报》
EI
CAS
CSCD
PKU Core (北大核心)
2005, No. 7, pp. 590-593 (4 pages)
Transactions of Beijing Institute of Technology
Keywords
Chinese information processing
Chinese automatic word segmentation
overlapping ambiguity strings
maximum entropy model