Abstract
Many natural language processing (NLP) tasks benefit from word embeddings trained on large-scale corpora. Because pre-trained embeddings capture only the general semantic features of a large corpus, they usually need to be fine-tuned when applied to a specific downstream task so that they better suit the target task. However, low-frequency words in the target corpus have too few training samples to receive stable gradient information during fine-tuning, so their embeddings are not updated effectively. In short text classification, these low-frequency words can nevertheless be strong indicators of the class. It is therefore necessary to obtain better task-specific embeddings for low-frequency words. To address this problem, this paper proposes an embedding update algorithm for low-frequency words that is independent of the downstream task model. Using a K-nearest-neighbor-based embedding offset computation, it takes the task-specific information acquired by high-frequency words that are similar to a low-frequency word in the general embedding space and uses it to guide the update of that low-frequency word, yielding a more accurate embedding suited to the current task context. With TextCNN as the baseline model and two general pre-trained embeddings obtained with word2vec and GloVe, the algorithm is evaluated on three public short text datasets. The experimental results show that after the low-frequency word representations are updated with the proposed algorithm, classification accuracy reaches 84.3% to 94%, an improvement of 0.4% to 1.4% over the accuracy before the update. This demonstrates the effectiveness of the algorithm, further confirms the influence of low-frequency words on short text classification results, and offers a reference for research on short text classification.
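The abstract describes the update only at a high level, so the following is a minimal sketch of one way such a neighbor-guided update could look, not the authors' implementation. It assumes that neighbors are chosen by cosine similarity in the pre-trained space, that the offset of a frequent word is its fine-tuned vector minus its pre-trained vector, and that a rare word is moved from its pre-trained vector by the similarity-weighted mean of its neighbors' offsets. The function name `update_rare_embeddings` and the parameters `k` and `min_count` are hypothetical.

```python
import numpy as np


def update_rare_embeddings(pretrained, finetuned, counts, k=3, min_count=5):
    """Shift rare-word vectors using offsets learned by similar frequent words.

    Assumptions (not taken from the paper): cosine similarity in the
    pre-trained space, similarity-weighted averaging, and a fixed frequency
    threshold `min_count` separating rare from frequent words.
    """
    pretrained = np.asarray(pretrained, dtype=float)
    finetuned = np.asarray(finetuned, dtype=float)
    counts = np.asarray(counts)

    frequent = np.flatnonzero(counts >= min_count)
    rare = np.flatnonzero(counts < min_count)
    updated = finetuned.copy()
    if frequent.size == 0 or rare.size == 0:
        return updated

    # Cosine similarity between rare and frequent words (pre-trained space).
    norms = np.linalg.norm(pretrained, axis=1, keepdims=True)
    unit = pretrained / np.clip(norms, 1e-12, None)
    sims = unit[rare] @ unit[frequent].T          # shape (n_rare, n_frequent)

    # Indices (into `frequent`) of the k most similar frequent words.
    k = min(k, frequent.size)
    top = np.argsort(-sims, axis=1)[:, :k]
    top_sims = np.take_along_axis(sims, top, axis=1)

    # Non-negative weights that sum to 1; uniform if no positive similarity.
    weights = np.clip(top_sims, 0.0, None)
    totals = weights.sum(axis=1, keepdims=True)
    weights = np.where(totals > 0, weights / np.clip(totals, 1e-12, None), 1.0 / k)

    # Task-specific offset learned by each frequent word during fine-tuning.
    offsets = finetuned[frequent] - pretrained[frequent]
    shift = (weights[..., None] * offsets[top]).sum(axis=1)

    # Rare words start from their pre-trained vector and borrow the shift.
    updated[rare] = pretrained[rare] + shift
    return updated


# Toy usage: word 2 is rare and close to word 0 in the pre-trained space.
pre = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])
fine = np.array([[1.0, 1.0], [0.0, 1.0], [1.0, 0.1]])
new_vectors = update_rare_embeddings(pre, fine, counts=[50, 40, 1], k=1)
```

Because the sketch only reads two embedding matrices and a frequency count, it does not depend on the classifier that produced the fine-tuned vectors, which matches the model-agnostic framing in the abstract.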
Authors
程婧
刘娜娜
闵可锐
康昱
王新
周扬帆
CHENG Jing; LIU Na-na; MIN Ke-rui; KANG Yu; WANG Xin; ZHOU Yang-fan (School of Computer Science, Fudan University, Shanghai 201203, China; Shanghai Key Laboratory of Intelligent Information Processing, Shanghai 201203, China; META SOTA, Shanghai 200135, China; Microsoft Research, Beijing 100080, China)
Source
《计算机科学》
CSCD
PKU Core Journal
2020, Issue 8, pp. 255-260 (6 pages)
Computer Science
Funding
National Natural Science Foundation of China (61702107)
CERNET Next Generation Internet Technology Innovation Project (NGII20180611).
Keywords
Word embedding
Low-frequency word
Fine-tuning
Short text classification