Abstract
This paper presents a semantic tagging model for Chinese corpora that is based on the Hierarchical Network of Concepts (HNC) theory and combines manual and machine annotation. It first analyzes the content of HNC semantic tagging and, on that basis, defines the tagging workflow. Because the tagging is highly complex, machine tagging is used in the main steps of the workflow to assist the human annotators. Specifically, a maximum entropy model is used for semantic chunk segmentation, reaching a precision of 83.78% and a recall of 91.17%, and an example-based model is used for sentence category judgment, reaching a precision of 51.64%. An HNC semantically tagged corpus has been built with this model and currently contains 400,000 characters.
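To illustrate how precision and recall figures such as those reported for semantic chunk segmentation might be computed, the sketch below scores predicted chunk spans against gold spans by exact match. The span representation (start and end offsets), the function name `chunk_precision_recall`, and the toy data are assumptions made for illustration; the abstract does not describe the authors' actual evaluation procedure.

```python
from typing import Iterable, Tuple

# Hypothetical representation: a chunk is a (start, end) character offset pair.
Span = Tuple[int, int]


def chunk_precision_recall(
    gold: Iterable[Span], predicted: Iterable[Span]
) -> Tuple[float, float]:
    """Score predicted chunks by exact span match against the gold chunks."""
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)  # chunks with both boundaries right
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall


# Toy example: the segmenter wrongly splits the middle chunk in two.
gold = [(0, 2), (2, 5), (5, 9)]
pred = [(0, 2), (2, 4), (4, 5), (5, 9)]
p, r = chunk_precision_recall(gold, pred)
print(f"precision={p:.2%} recall={r:.2%}")
```

Under exact-match scoring, a single wrong boundary costs one gold chunk in recall and adds spurious chunks that lower precision, so over-segmentation and under-segmentation affect the two measures differently.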
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journals (北大核心)
2009, Issue 5, pp. 238-240, 268 (4 pages)
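Funding
Supported by the National Basic Research Program of China (973 Program) project "Research on an Interactive Engine for Natural Language Understanding" (2004CB318104)
and by the Director's Merit-Based Fund ("所长择优基金") of the Institute of Acoustics, Chinese Academy of Sciences (GS13SJJ04)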