Abstract
Chinese lexical analysis is the foundation of Chinese information processing. The current mainstream techniques are statistical, and in essence they all treat lexical analysis as a sequence labeling problem. Context is the resource on which statistical methods must rely, both to acquire linguistic knowledge and to solve many practical problems in natural language processing. Chinese lexical analysis therefore draws its linguistic knowledge from context, but are the preceding context (the characters before the current position) and the following context (the characters after it) equally important? To avoid relying on guesses based only on subjective experience, we studied the three subtasks of character-tagging-based Chinese lexical analysis, namely word segmentation, POS tagging, and named entity recognition, and compared the effect of the preceding and the following context on the performance of each task. Closed tests were run on several corpora from the international Chinese language processing Bakeoff, with comparative experiments that used separate feature template sets to represent the preceding context and the following context. The results show that, within the character-tagging framework, the following context contributes more than 6 percentage points more to Chinese lexical analysis performance than the preceding context does.
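The abstract does not list the feature templates that were compared, so the following is only a minimal sketch of how preceding-context and following-context template sets could be separated in a character-tagging setup. The window size of 2, the unigram and adjacent-bigram templates, the inclusion of the current character on both sides, and the names `PAD`, `char_at`, and `context_features` are all assumptions made for illustration, not the authors' configuration.

```python
from typing import List

PAD = "<PAD>"  # hypothetical padding symbol for positions outside the sentence


def char_at(chars: List[str], i: int) -> str:
    """Return the character at index i, or PAD if i is out of range."""
    return chars[i] if 0 <= i < len(chars) else PAD


def context_features(chars: List[str], i: int, side: str, window: int = 2) -> List[str]:
    """Build features for position i from one side of the context only.

    side="preceding" uses offsets -window..0 (assumed template set).
    side="following" uses offsets 0..+window (assumed template set).
    """
    if side == "preceding":
        offsets = list(range(-window, 1))
    elif side == "following":
        offsets = list(range(0, window + 1))
    else:
        raise ValueError("side must be 'preceding' or 'following'")

    # Unigram templates: one feature per character in the window.
    feats = [f"C{k}={char_at(chars, i + k)}" for k in offsets]
    # Bigram templates: one feature per pair of adjacent characters in the window.
    feats += [
        f"C{a}C{b}={char_at(chars, i + a)}{char_at(chars, i + b)}"
        for a, b in zip(offsets, offsets[1:])
    ]
    return feats


sentence = list("我爱北京")
preceding = context_features(sentence, 2, "preceding")
following = context_features(sentence, 2, "following")
print(preceding)
print(following)
```

Training the same sequence tagger once with each feature set and comparing the scores on a held-out test set would mirror the kind of comparison the abstract describes.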
Source
《计算机科学》
CSCD
Peking University Core Journals
2012, No. 11, pp. 201-203, 236 (4 pages)
Computer Science
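Funding
Supported by the Specialized Research Fund for the Doctoral Program of Higher Education (20050007023)
and the Young Backbone Teachers Program of Henan Province Higher Education Institutions (2009GGJS-108)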
Keywords
Chinese lexical analysis
Character tagging
Context
Word segmentation
POS tagging
Named entity recognition