Abstract
A two-layer part-of-speech tagging method based on word embeddings is proposed. It needs only a small number of hand-crafted features; most features are learned automatically from the word embeddings and from the tag vectors produced by the first layer. The tag set is split into two classes, one serving as the tag set of each layer. Tags that are easy to assign are assigned in the first layer; words tagged there as verbs or nouns, which are harder to tag, are passed to the second layer and labeled with a specific verb or noun subclass. On essays written by Chinese learners of English, the method raises tagging accuracy from 95.23% to 95.63%, exceeding the accuracy of existing word-embedding-based part-of-speech taggers on the same corpus.
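The two-layer control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the coarse tag list, the toy embeddings, and the rule-based stand-ins `toy_layer1` and `toy_layer2` are all hypothetical, and the abstract does not specify the actual classifiers, tag partition, or feature layout. The only point shown is that the second layer sees the word embedding concatenated with a vector encoding the first-layer tag, and runs only on words the first layer marked as noun or verb.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical first-layer tag set: easy tags plus two coarse "hard" classes.
LAYER1_TAGS: List[str] = ["DT", "IN", "NOUN", "VERB"]
HARD_TAGS = {"NOUN", "VERB"}

# Toy 3-dimensional embeddings (made up for illustration only).
TOY_EMBED: Dict[str, List[float]] = {
    "the": [1.0, 0.0, 0.0],
    "students": [0.0, 1.0, 1.0],
    "wrote": [0.0, -1.0, 1.0],
    "essays": [0.0, 1.0, 1.0],
}


def one_hot(tag: str, tagset: Sequence[str]) -> List[float]:
    """Encode a first-layer tag as a one-hot tag vector."""
    return [1.0 if t == tag else 0.0 for t in tagset]


def toy_layer1(vec: Sequence[float]) -> str:
    """Stand-in for a trained first-layer classifier (assumption)."""
    if vec[0] > 0.5:
        return "DT"
    return "NOUN" if vec[1] > 0 else "VERB"


def toy_layer2(features: Sequence[float]) -> str:
    """Stand-in for a trained second-layer classifier (assumption).

    `features` is the word embedding followed by the first-layer tag vector.
    """
    emb, tag_vec = list(features[:3]), list(features[3:])
    coarse = LAYER1_TAGS[tag_vec.index(1.0)]
    marked = emb[2] > 0.5  # toy cue for plural nouns / past-tense verbs
    if coarse == "NOUN":
        return "NNS" if marked else "NN"
    return "VBD" if marked else "VB"


def two_layer_tag(
    words: Sequence[str],
    embed: Callable[[str], List[float]],
    layer1: Callable[[Sequence[float]], str],
    layer2: Callable[[Sequence[float]], str],
) -> List[str]:
    """Tag easy words in layer 1; refine nouns and verbs in layer 2."""
    tags: List[str] = []
    for word in words:
        vec = embed(word)
        coarse = layer1(vec)
        if coarse in HARD_TAGS:
            # Second layer input: embedding + first-layer tag vector.
            tags.append(layer2(vec + one_hot(coarse, LAYER1_TAGS)))
        else:
            tags.append(coarse)
    return tags


if __name__ == "__main__":
    sentence = ["the", "students", "wrote", "essays"]
    print(two_layer_tag(sentence, TOY_EMBED.__getitem__, toy_layer1, toy_layer2))
```

In a real system both stand-in functions would be replaced by trained models, and the second layer would likely also use context words; those details are not recoverable from the abstract.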
Source
《北京邮电大学学报》
EI
CAS
CSCD
PKU Core (北大核心)
2017, No. 2, pp. 16-20 (5 pages)
Journal of Beijing University of Posts and Telecommunications
Keywords
part-of-speech tagging
Chinese learners of English
essays
word vector