Abstract
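Word embedding evaluation is fundamental to word embedding research and comprises intrinsic evaluation and extrinsic evaluation. Extrinsic evaluation applies the learned word embeddings to a specific task and assesses them there; it is the ultimate goal of word embedding research. Intrinsic evaluation assesses a word embedding model by means of an evaluation set that captures semantic similarity or relatedness between words, and it is a commonly used evaluation method. By analyzing how English and Chinese word embedding evaluation sets are constructed, and taking the characteristics of Tibetan into account, this paper studies methods for constructing Tibetan word embedding evaluation sets, builds the evaluation sets TWordSim215 and TWordRel215 for evaluating the similarity and relatedness of Tibetan word embeddings, and analyzes their effectiveness.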
Word embedding evaluation, an essential issue in word embedding research, can be performed by intrinsic or extrinsic evaluation. Intrinsic evaluation, a basic approach, usually requires an evaluation set describing the similarity or relatedness among words. After examining the construction methods of English and Chinese word embedding evaluation sets, this paper investigates the construction of Tibetan word embedding evaluation sets according to the characteristics of Tibetan. The evaluation sets TWordSim215 and TWordRel215 are constructed, and their effectiveness for evaluating the similarity and relatedness of Tibetan word embeddings is analyzed.
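As an illustration of how an intrinsic evaluation set of this kind is typically used, the sketch below scores word pairs by cosine similarity and compares the model scores with human ratings through Spearman rank correlation. The abstract does not state which correlation measure the paper uses, so Spearman is an assumption here, as are the function names, the toy vectors, and the placeholder words `a` to `d`, none of which come from TWordSim215 or TWordRel215.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(embeddings, eval_set):
    """Return (Spearman rho, number of pairs used).

    eval_set holds (word1, word2, human_score) triples; pairs with an
    out-of-vocabulary word are skipped.
    """
    model_scores, human_scores = [], []
    for w1, w2, score in eval_set:
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(score)
    return spearman(model_scores, human_scores), len(human_scores)


# Hypothetical toy data, for illustration only.
toy_embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.2, 0.8],
}
toy_eval_set = [
    ("a", "b", 9.0),
    ("a", "c", 1.0),
    ("a", "d", 2.0),
    ("c", "d", 8.5),
    ("a", "oov", 5.0),  # skipped: "oov" has no vector
]

rho, pairs_used = evaluate(toy_embeddings, toy_eval_set)
print(rho, pairs_used)
```

A higher correlation means the model's similarity ordering agrees more closely with the human ratings. Reporting the number of pairs actually used alongside the correlation makes out-of-vocabulary coverage visible.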
Authors
才智杰
孙茂松
才让卓玛
CAI Zhijie; SUN Maosong; CAI Rangzhuoma (College of Computer Science and Technology, Qinghai Normal University, Xining, Qinghai 810016, China; Qinghai Provincial Key Laboratory of Tibetan Information Processing and Machine Translation, Xining, Qinghai 810008, China; Key Laboratory of Tibetan Information Processing, Ministry of Education, Xining, Qinghai 810008, China; Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China)
Source
《中文信息学报》
CSCD
PKU Core Journal
2019, No. 7, pp. 81-87, 100 (8 pages)
Journal of Chinese Information Processing
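Funding
National Natural Science Foundation of China (61866032, 61163018)
National Social Science Foundation of China (13BYY141, 16BYY167)
Ministry of Education "Chunhui Program" Cooperative Research Project (Z2012093, Z2016077)
Qinghai Province Basic Research Project (2017-ZJ-767, 2019-SF-129)
"Program for Changjiang Scholars and Innovative Research Team in University" Innovative Team Project (IRT1068)
Qinghai Provincial Key Laboratory Project (2013-Z-Y17, 2014-Z-Y32, 2015-Z-Y03)
Key Laboratory of Tibetan Information Processing and Machine Translation Project (2013-Y-17)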
Keywords
natural language processing
Tibetan
word embedding
evaluation set