Abstract
Most sentence embedding models exploit only the literal text of a sentence to build its vector representation, which leaves them unable to discriminate the ubiquitous phenomenon of polysemy. To enhance the semantic expressiveness of sentences, this paper applies a short-text conceptualization algorithm to assign related concepts to every sentence in the corpus and then learns conceptual sentence embeddings (CSE). Owing to the introduced concept information, this semantic representation is more expressive than the sentence embedding models in wide use today. Furthermore, we extend the conceptual sentence embedding model with an attention mechanism, enabling it to discriminatively select relevant words in the context for more efficient prediction. The proposed models are evaluated on language understanding tasks such as text classification and information retrieval, and the experimental results show that they outperform other sentence embedding models.
Most sentence embedding models represent each sentence using only surface word information, which makes these models unable to discriminate ubiquitous homonymy and polysemy. In order to enhance the representation capability of sentences, we employ a short-text conceptualization algorithm to assign associated concepts to each sentence in the text corpus, and then learn conceptual sentence embeddings (CSE). Hence, this semantic representation is more expressive than some widely-used text representation models such as the latent topic model, especially for short texts. Moreover, we further extend the CSE models with an attention mechanism that selects relevant words within the context to make more efficient predictions. In the experiments, we evaluate the CSE models on language understanding tasks including text classification and information retrieval. The experimental results show that the proposed models outperform typical sentence embedding models.
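To make the two ideas in the abstract concrete, a minimal Python sketch follows: a toy isA table (standing in for a probabilistic knowledge base such as Probase) conceptualizes the words of a short text, and the sentence vector is then formed from attention-weighted word embeddings concatenated with the resulting concept vector. Everything here, the vocabulary, the isA probabilities, the embedding dimension, and the scoring function, is an illustrative assumption rather than the authors' CSE model or its training procedure.

```python
# Illustrative sketch only: a toy "conceptual sentence embedding" in the spirit of
# the abstract (conceptualization + attention over context words). It is NOT the
# authors' CSE model; the isA table, dimensions and scoring are invented for clarity.
import numpy as np

rng = np.random.default_rng(0)
DIM = 50

# Toy word and concept embedding tables (normally learned, here random).
vocab = ["apple", "released", "new", "phone", "fruit", "sweet"]
concepts = ["company", "fruit", "device"]
word_emb = {w: rng.normal(size=DIM) for w in vocab}
concept_emb = {c: rng.normal(size=DIM) for c in concepts}

# Toy isA table standing in for a probabilistic knowledge base (e.g. Probase):
# word -> {concept: P(concept | word)}.
isa = {
    "apple": {"company": 0.6, "fruit": 0.4},
    "phone": {"device": 1.0},
    "fruit": {"fruit": 1.0},
}

def conceptualize(tokens):
    """Aggregate and normalize concept scores over all words of the short text."""
    scores = {}
    for t in tokens:
        for c, p in isa.get(t, {}).items():
            scores[c] = scores.get(c, 0.0) + p
    total = sum(scores.values()) or 1.0
    return {c: s / total for c, s in scores.items()}

def conceptual_sentence_embedding(tokens):
    """Attention-weighted word vectors + concept vector (a crude stand-in for CSE)."""
    concept_dist = conceptualize(tokens)
    # Concept vector: expectation of concept embeddings under the concept distribution.
    c_vec = sum(p * concept_emb[c] for c, p in concept_dist.items()) if concept_dist else np.zeros(DIM)
    # Attention: words that align with the concept vector receive larger weights.
    words = [t for t in tokens if t in word_emb]
    logits = np.array([word_emb[w] @ c_vec / np.sqrt(DIM) for w in words])
    attn = np.exp(logits - logits.max())
    attn /= attn.sum()
    w_vec = sum(a * word_emb[w] for a, w in zip(attn, words))
    return np.concatenate([w_vec, c_vec])  # sentence representation

print(conceptual_sentence_embedding(["apple", "released", "new", "phone"]).shape)
```

In the paper the embeddings and the attention weights are learned jointly during training; the sketch only shows the forward data flow of conceptualization followed by attention-weighted composition.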
Authors
王亚珅
黄河燕
冯冲
周强
WANG Ya-Shen; HUANG He-Yan; FENG Chong; ZHOU Qiang (Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications, School of Computer, Beijing Institute of Technology, Beijing 100081; Baidu Inc., Beijing 100085)
Source
《自动化学报》
EI
CSCD
Peking University Core Journals
2020, No. 7, pp. 1390-1400, 11 pages
Acta Automatica Sinica
Funding
Supported by the Key Program of the National Natural Science Foundation of China (61751201).
Keywords
Sentence embedding
short-text conceptualization
attention mechanism
word embedding
semantic representation