Abstract
Character-based tagging is widely used in Chinese word segmentation: it recasts segmentation as a sequence tagging task, and taggers based on Conditional Random Fields (CRFs) currently give the best segmentation results. Segmenting command orders is the basis for generating them automatically, but applying a CRFs model to command-order segmentation incurs very high time and space complexity. To improve efficiency, the model is analyzed and a feature subset is chosen with a feature selection algorithm, which effectively reduces the time and space overhead of segmentation. The segmentation output is then post-processed using CRFs confidence to further improve precision. Experimental results show that the feature selection algorithm and the confidence-based post-processing improve Chinese word segmentation performance.
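To make the reformulation concrete, the sketch below shows one common way to turn a segmented sentence into per-character tags and back. It assumes the four-tag B/M/E/S scheme (begin, middle, end, single); the abstract does not say which tag set the paper uses, and the function names `words_to_tags` and `tags_to_words` are hypothetical. A CRFs tagger would be trained to predict such tags from character features; that part is not shown here.

```python
def words_to_tags(words):
    """Convert a segmented sentence into per-character B/M/E/S tags.

    Assumption: the 4-tag scheme (B=begin, M=middle, E=end, S=single).
    """
    tags = []
    for word in words:
        if not word:
            continue  # skip empty strings rather than emit tags for them
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # keep a trailing fragment if the tags end mid-word
        words.append(current)
    return words


if __name__ == "__main__":
    segmented = ["中文", "分词", "是", "基础"]
    tags = words_to_tags(segmented)
    print(tags)
    print(tags_to_words("".join(segmented), tags))
```

Converting in both directions is what lets a sequence tagger be used as a segmenter: training data is mapped to tags, and predicted tags are mapped back to words.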
Source
Information and Electronic Engineering (《信息与电子工程》)
2012, No. 2, pp. 184-187 (4 pages)
Keywords
Chinese word segmentation
Conditional Random Fields
feature selection
confidence