Abstract
This paper first preprocesses a large collection of corpus documents with a word segmentation tool, performing word segmentation and part-of-speech tagging. Second, a script extracts words from the tagged corpus according to the parts of speech typical of sentiment words, building a candidate sentiment lexicon; intersecting this candidate lexicon with an external sentiment lexicon yields the benchmark sentiment word list. Third, the Word2Vec tool is used to train word vectors on the self-built candidate sentiment lexicon, and the distance value between sentiment words is computed with reference to the benchmark sentiment word list. Finally, the distance values are compared to identify sentiment words: the larger the value, the higher the semantic similarity between words, so new sentiment words are selected according to this distance.
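The last two steps of the abstract (intersecting with an external lexicon, then ranking candidates by their similarity to benchmark words) can be illustrated with a minimal sketch. It is not the author's implementation: the toy two-dimensional vectors stand in for vectors that would come from Word2Vec training, the function names and the `threshold` parameter are hypothetical, and scoring a candidate by its best cosine similarity to any benchmark word is an assumption, since the abstract only states that a larger value means higher semantic similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (larger = more similar)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_benchmark(candidates, external_lexicon):
    """Benchmark list = candidate words that also appear in the external lexicon."""
    return set(candidates) & set(external_lexicon)


def select_new_words(vectors, candidates, benchmark, threshold):
    """Rank non-benchmark candidates by their best similarity to a benchmark word.

    `threshold` is a hypothetical cut-off; the abstract does not give one.
    """
    selected = []
    for word in candidates:
        if word in benchmark or word not in vectors:
            continue
        best = max(
            (cosine(vectors[word], vectors[b]) for b in benchmark if b in vectors),
            default=0.0,
        )
        if best >= threshold:
            selected.append((word, best))
    # Most similar candidates first.
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Toy vectors (made up for illustration, not trained on any corpus).
vectors = {
    "happy": [0.9, 0.1],
    "glad": [0.8, 0.2],
    "table": [0.0, 1.0],
}
candidates = ["happy", "glad", "table"]
external_lexicon = ["happy", "angry"]

benchmark = build_benchmark(candidates, external_lexicon)
new_words = select_new_words(vectors, candidates, benchmark, threshold=0.8)
print(benchmark, new_words)
```

In this toy run, "happy" forms the benchmark list, "glad" is kept because its vector lies close to "happy", and "table" is dropped because its similarity falls below the assumed threshold.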
Author
HU Chuangye (胡创业)
Xinjiang Normal University, Urumqi, Xinjiang 830054, China
Source
Information & Computer (《信息与电脑》)
2021, No. 17, pp. 50-52 (3 pages)
Funding
Research on Automatic Construction Methods for a Chinese-Uzbek Parallel Corpus (Project No. XJNUSYS2019B10).