Abstract
To address the shortcomings of the "pipeline" processing used by most existing lexical analysis systems, an integrated, synchronous lexical analysis mechanism is proposed. Building on maximum matching and second-maximum matching (MMSM) segmentation, part-of-speech (POS) information and candidate unknown-word nodes are added to the segmentation directed graph, and the hidden Markov model (HMM) is extended so that word segmentation, ambiguity resolution, unknown word recognition, and POS tagging are all completed synchronously within that graph. Three integrations are realized: word segmentation with POS tagging, unknown word recognition with segmentation, and the handling of unknown words whose POS is uncertain. Because all steps of lexical analysis are completed synchronously, the mechanism makes full use of contextual lexical information to improve overall precision, keeps the system efficient, and avoids conflicts between steps. An open test shows that the system's overall F-score is 98.03%.
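The idea of choosing segmentation and POS tags jointly over one directed graph can be illustrated with a minimal sketch. It is not the paper's MMSM graph construction or its extended HMM: it is a plain first-order HMM decoded with Viterbi over a word/tag lattice. The lexicon `EMIT`, the transition table `TRANS`, the tag set (`v`, `n`, and `x` for candidate unknown words), and all probabilities are made-up assumptions for illustration only.

```python
import math

# Hypothetical toy model (all numbers are illustrative assumptions).
# EMIT[word][tag] approximates P(word | tag).
EMIT = {
    "研究": {"v": 0.02, "n": 0.01},
    "研究生": {"n": 0.005},
    "生命": {"n": 0.01},
    "命": {"n": 0.001},
    "起源": {"n": 0.008, "v": 0.002},
}
# TRANS[(prev_tag, tag)] approximates P(tag | prev_tag); "<s>" marks sentence start.
TRANS = {
    ("<s>", "v"): 0.4, ("<s>", "n"): 0.5, ("<s>", "x"): 0.1,
    ("v", "n"): 0.6, ("v", "v"): 0.2, ("v", "x"): 0.2,
    ("n", "n"): 0.4, ("n", "v"): 0.4, ("n", "x"): 0.2,
    ("x", "n"): 0.4, ("x", "v"): 0.4, ("x", "x"): 0.2,
}
UNKNOWN_EMIT = 1e-6   # assumed emission probability for a candidate unknown word
FLOOR = 1e-8          # assumed probability for transitions missing from TRANS
MAX_LEN = 4           # longest dictionary word considered


def candidates(text, i):
    """Yield (word, tag, emission prob) for every lattice node starting at i."""
    for j in range(i + 1, min(len(text), i + MAX_LEN) + 1):
        word = text[i:j]
        for tag, prob in EMIT.get(word, {}).items():
            yield word, tag, prob
    # A character that is not a dictionary word becomes a candidate unknown word.
    if text[i] not in EMIT:
        yield text[i], "x", UNKNOWN_EMIT


def analyze(text):
    """Return the best-scoring list of (word, tag) pairs for text."""
    n = len(text)
    # best[pos][tag] = (log prob, (prev_pos, prev_tag), word ending at pos)
    best = [dict() for _ in range(n + 1)]
    best[0]["<s>"] = (0.0, None, None)
    for i in range(n):
        for prev_tag, (score, _, _) in best[i].items():
            for word, tag, emit in candidates(text, i):
                trans = TRANS.get((prev_tag, tag), FLOOR)
                new_score = score + math.log(trans) + math.log(emit)
                j = i + len(word)
                if tag not in best[j] or new_score > best[j][tag][0]:
                    best[j][tag] = (new_score, (i, prev_tag), word)
    # Trace back from the best final state.
    pos, tag = n, max(best[n], key=lambda t: best[n][t][0])
    result = []
    while pos > 0:
        _, (prev_pos, prev_tag), word = best[pos][tag]
        result.append((word, tag))
        pos, tag = prev_pos, prev_tag
    return result[::-1]


print(analyze("研究生命起源"))
# [('研究', 'v'), ('生命', 'n'), ('起源', 'n')]
```

Under these toy numbers, the ambiguous prefix 研究生 is resolved in the same pass that assigns tags, rather than by a separate segmentation step followed by tagging. A character missing from the lexicon (for example 猫 in `analyze("研究猫")`) enters the lattice as an `x` node and competes with dictionary words on the same scoring scale.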
Source
《大连理工大学学报》
EI
CAS
CSCD
PKU Core (北大核心)
2010, Issue 6, pp. 1028-1034 (7 pages)
Journal of Dalian University of Technology
Funding
Supported by the Fundamental Research Funds for the Central Universities (DUT10RW202)
Keywords
Chinese lexical analysis
integrative model
maximum matching and second-maximum matching
unknown word
segmentation directed graph