Abstract
Text representation is a fundamental task in natural language processing. To address the high dimensionality and sparsity of traditional short text representations, we propose SFCR, a short text representation learning method based on the context of a semantic feature space. Because the initial feature space has very high dimensionality, we first compute the mutual information and co-occurrence relations between terms to obtain an initial similarity, cluster the terms on that basis, and use the cluster centers to represent the reduced semantic feature space. Next, combining the context information of terms within the resulting clusters, we design three similarity measures to compute the similarity between the terms of the text to be represented and the feature terms of the feature space, and from these similarities we build a text mapping matrix for short text representation learning. Experimental results show that the proposed method reflects the semantic information of short texts well and yields a reasonable and effective representation.
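The pipeline in the abstract (term similarity from mutual information and co-occurrence, clustering of terms, then mapping a short text onto the cluster centers) can be illustrated with a minimal Python sketch. This is not the authors' SFCR implementation: the abstract does not specify the similarity formulas, the clustering algorithm, or the three similarity measures, so the sketch assumes normalized pointwise mutual information over document-level co-occurrence, a greedy single-pass clustering with a hypothetical `threshold` parameter, and a single max-similarity rule as a stand-in. All function names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_similarity(docs):
    """Normalized PMI between terms, from document-level co-occurrence.

    Assumption: the abstract only says "mutual information and
    co-occurrence"; NPMI is used here as one plausible stand-in.
    """
    n = len(docs)
    df = Counter()   # number of documents containing each term
    co = Counter()   # number of documents containing each term pair
    for doc in docs:
        terms = sorted(set(doc))
        df.update(terms)
        co.update(combinations(terms, 2))
    sim = {}
    for (a, b), c in co.items():
        if c == n:
            continue  # pair occurs in every document: NPMI is undefined
        pmi = math.log((c * n) / (df[a] * df[b]))
        npmi = pmi / -math.log(c / n)
        if npmi > 0:
            sim[(a, b)] = sim[(b, a)] = npmi
    return sim


def cluster_terms(vocab, sim, threshold):
    """Greedy single-pass clustering (an assumption, not the paper's method).

    A term joins the first cluster whose center is similar enough;
    otherwise it becomes the center of a new cluster.
    """
    clusters = {}
    for term in vocab:
        for center in clusters:
            if sim.get((term, center), 0.0) >= threshold:
                clusters[center].append(term)
                break
        else:
            clusters[term] = [term]
    return clusters


def represent(text, clusters, sim):
    """Map a tokenized short text to one weight per cluster center.

    Each weight is the largest similarity between any term of the text
    and that center (a term identical to the center scores 1.0).
    """
    vector = []
    for center in clusters:
        scores = [1.0 if t == center else sim.get((t, center), 0.0) for t in text]
        vector.append(max(scores, default=0.0))
    return vector


# Toy corpus of tokenized short texts (made up for illustration).
docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["stock", "market"],
    ["stock", "market", "price"],
]
sim = npmi_similarity(docs)
vocab = sorted({t for d in docs for t in d})
clusters = cluster_terms(vocab, sim, threshold=0.4)
vector = represent(["price"], clusters, sim)
```

On this toy corpus the six terms collapse into two clusters, so a short text is represented by a dense two-dimensional vector instead of a sparse six-dimensional bag of words, which is the dimensionality-reduction effect the abstract describes.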
Authors
脱婷
马慧芳
魏家辉
刘海姣
TUO Ting; MA Hui-fang; WEI Jia-hui; LIU Hai-jiao (College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070; Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 514004, China)
Source
《计算机工程与科学》
CSCD
PKU Core Journal (北大核心)
2019, No. 2, pp. 378-384 (7 pages)
Computer Engineering & Science
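Funding
National Natural Science Foundation of China (61762078, 61363058)
Research Project of the Guangxi Key Laboratory of Trusted Software (kx201705)
Northwest Normal University "Student Innovation Ability Program" 2018 funded project (CX2018Y048)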
Keywords
semantic feature space
similarity calculation
text mapping matrix
short text representation