Abstract
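Corpus annotation is an important, fundamental task in corpus construction. Working from Sogou logs, this paper exploits the structured nature of XML documents to recast corpus annotation as the rewriting of node attributes. Based on the characteristics of the corpus, it formulates a set of phrase-annotation processing specifications and execution principles that serve the construction of a phrase dictionary for search engines, and it describes the tag set and the processing specifications in detail. Using this specification, 145,645 query strings have been annotated, and the annotation quality is high.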
Corpus annotation is a fundamental task in corpus construction. Based on Sogou logs, this paper develops an annotation specification, tailored to the characteristics of the corpus, for building a phrase dictionary for search engines. In practice, annotation is carried out by filling in node attributes in an XML file. Following the proposed guideline, 145,645 query strings have been annotated with high quality.
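The idea of treating annotation as the filling of node attributes in an XML file can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the paper's schema or tag set, so the element name `query`, the attributes `id` and `label`, the label value `NP`, and the two sample query strings are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: the element name, attribute names, and sample queries
# are invented for illustration; they are not the schema used in the paper.
RAW = """<corpus>
  <query id="1" label="">北京 天气 预报</query>
  <query id="2" label="">手机 壁纸 下载</query>
</corpus>"""


def annotate(xml_text, labels):
    """Fill the assumed `label` attribute of each <query> node.

    `labels` maps a query id (string) to the label chosen by the annotator.
    Nodes whose id is not in `labels` are left unchanged.
    """
    root = ET.fromstring(xml_text)
    for node in root.iter("query"):
        qid = node.get("id")
        if qid in labels:
            node.set("label", labels[qid])
    return root


# "NP" is a placeholder label, not a tag from the paper's tag set.
root = annotate(RAW, {"1": "NP", "2": "NP"})

# Collect the result and list any nodes that still lack a label.
annotated = {q.get("id"): q.get("label") for q in root.iter("query")}
unlabeled = [q.get("id") for q in root.iter("query") if not q.get("label")]
```

Because the query text itself is never modified and only an attribute changes, a check such as the `unlabeled` list above is enough to find items that still need annotation.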
Source
《中文信息学报》
CSCD
PKU Core Journal
2013, No. 2, pp. 47-51 (5 pages)
Journal of Chinese Information Processing
Funding
National Social Science Fund of China (Grant No. 09CYY021)
Keywords
corpus annotation
Sogou logs
phrase dictionary
annotation specification