Abstract
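This paper presents a word-segmented and part-of-speech (POS) tagged corpus of Archaic Chinese built on the text of 《淮南子》 (Huainanzi), together with its construction process. The corpus is built by automatic segmentation and POS tagging combined with manual correction; in the automatic stage, domain adaptation methods are used to optimize the tagging models, which significantly improves performance on both segmentation and POS tagging. We analyze the lexical characteristics of Archaic Chinese and, on that basis, describe a set of explicit lexical and morphological features, which we apply in our automatic segmentation and POS tagging and which prove especially helpful for the POS tagging system. We summarize and analyze the errors that arise in automatic segmentation and POS tagging, and finally describe the distribution of words and POS tags across the whole corpus. The proposed method was validated in the annotation of 《淮南子》 and provides a reference for later extension to other Archaic Chinese resources. The resulting 《淮南子》 corpus is also a useful resource for future research on Archaic Chinese.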
In this paper, we present a segmented and part-of-speech (POS) tagged Archaic Chinese corpus along with its construction process, which combines automatic segmentation and tagging with manual correction as post-processing. We use both Modern and Archaic Chinese labeled data to train the word segmenter and POS tagger, which are further improved by domain adaptation techniques and by adding linguistic and morphological features derived from the characteristics of the Archaic Chinese language. The experimental results show the effectiveness of our approach. In particular, the domain adaptation techniques and the added features significantly improve POS tagging performance. During manual correction, we categorize the errors resulting from the automatic segmentation and POS tagging process and investigate their sources. Finally, we give statistics on the distributions of words and POS tags in the resulting corpus. Our work is a preliminary study that could easily be extended to annotating other Archaic Chinese texts, and the resulting corpus is a valuable resource for research on the Archaic Chinese language.
Source
《中文信息学报》
CSCD
Peking University Core Journals (北大核心)
2013, Issue 6, pp. 6-15, 81 (11 pages in total)
Journal of Chinese Information Processing
Keywords
Archaic Chinese corpus
word segmentation
part-of-speech tagging
domain adaptation