Abstract
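Microblogs contain a large amount of short-text data carrying time, user, and other information, and mining their semantics to achieve accurate search has attracted wide attention. Applying traditional topic models to the semantic modeling of microblog short texts usually runs into two problems. On the one hand, the shortness of microblog texts causes semantic sparsity; on the other hand, because traditional topic models only model information across documents and cannot fully mine the contextual information within a document, they capture only global semantics. To address these problems, this paper proposes a search-oriented semantic modeling method for microblog short texts, which consists of three parts: a short text expansion algorithm based on word vectors, a microblog topic model based on expansion, and microblog search. First, the proposed expansion algorithm builds on word vectors that carry local semantics and expands microblog short texts by computing similarities between words, thereby alleviating the semantic sparsity of short texts and letting local and global semantics complement each other. Second, the proposed topic model takes the expanded long texts as input, further overcomes semantic sparsity by modeling biterms, and uses multiple microblog features (text, time, and user information) to constrain topic generation, which improves the quality of the short-text semantic representation. Finally, based on the generated unified semantic representation, similarities between short texts can be computed to realize microblog search. Multiple groups of experiments were conducted on a real Sina Weibo dataset; the semantic representations produced by the proposed method were analyzed and evaluated and then applied to microblog search, and the experimental results verify the effectiveness of the proposed method.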
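The expansion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy embedding table, the cosine similarity measure, and the `top_k` and `min_sim` parameters are all assumptions introduced here for illustration, since the abstract only states that short texts are expanded by computing similarities between word vectors.

```python
import math

# Toy embedding table; the vectors are made-up values for illustration only.
EMBEDDINGS = {
    "rain": [0.9, 0.1, 0.0],
    "storm": [0.8, 0.2, 0.1],
    "umbrella": [0.7, 0.3, 0.0],
    "football": [0.0, 0.9, 0.4],
    "match": [0.1, 0.8, 0.5],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_words(word, embeddings, top_k=2, min_sim=0.9):
    """Return up to top_k other words whose similarity to `word` is >= min_sim."""
    if word not in embeddings:
        return []
    scored = [
        (cosine(embeddings[word], vec), other)
        for other, vec in embeddings.items()
        if other != word
    ]
    scored = [(sim, other) for sim, other in scored if sim >= min_sim]
    # Highest similarity first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:top_k]]


def expand_short_text(tokens, embeddings, top_k=2, min_sim=0.9):
    """Append each token's similar-word set right after the token itself."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(similar_words(token, embeddings, top_k, min_sim))
    return expanded


if __name__ == "__main__":
    print(expand_short_text(["rain", "football"], EMBEDDINGS))
```

In this sketch, tokens without an embedding are kept unchanged, so the expanded text is never shorter than the original.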
Microblogs contain large amounts of short text data with time and user information. Achieving accurate search by mining the semantics of microblogs has received widespread attention. When traditional topic models are applied to the task of microblog short text semantic modeling, they usually face the following issues. First, traditional topic modeling methods cannot deal with the semantic sparsity caused by the shortness of microblogs. Second, since topic models only acquire semantics at the document level, they cannot mine the local semantics that exist in contexts. As a result, a rough semantic representation leads to inaccurate search results. To obtain a high-quality semantic representation and realize precise search, we propose a Microblog short text semantic modeling method for search (MSSMS), which contains three components: a short text expansion algorithm based on embedding vectors, a microblog topic model based on expansion, and microblog search. The short text expansion algorithm aims to expand short text into long text. To this end, it uses the embedding vectors to construct a similar-word set for each word in the short text. Because the embedding vectors contain local semantics, using the expanded long text as the input of the topic model combines the local semantics contained in the embedding vectors with the global semantics acquired through the topic model. Moreover, because the short texts have been turned into long texts, their semantic sparsity is reduced. In the proposed microblog topic model, to further alleviate the semantic sparsity of short text, we introduce the bi-term pattern, in which the two words of a word-word pair share the same topic. In addition, the proposed microblog topic model models multiple characteristics (text, time, user information) simultaneously. This further improves the quality of the generated semantic representation, because the multiple characteristics constrain the generation of topics. After that, we can acquire the document-topic distributions, topic-word distributions, topic-time Beta distributions, and topic-user distributions. As these multiple characteristics (text, time, user information) are all mapped into the topic semantic space, these distributions can be regarded as a unified semantic representation. Based on the generated unified semantic representation, we can calculate the similarities between short texts. By sorting these similarities, we can realize precise microblog search. Finally, to verify the effectiveness of the proposed MSSMS, we conduct extensive experiments on real-world datasets from Sina Weibo. These experiments fall into two categories: one evaluates the semantic modeling ability of MSSMS, and the other applies MSSMS to microblog search. To evaluate the semantic modeling ability of MSSMS comprehensively, we use not only an objective evaluation metric to measure topic coherence but also subjective evaluation methods to assess the quality of the generated semantic representation. The experimental results show that, compared with the baseline algorithms, the semantic representation generated by the proposed MSSMS method has the highest quality and the MSSMS method has the best semantic modeling ability. In addition, the microblog search experiments verify that the proposed MSSMS method can achieve accurate microblog search.
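Two of the ideas in the abstract, bi-term extraction and similarity-based ranking, can be sketched as follows. This is a rough illustration under stated assumptions rather than the MSSMS implementation: the abstract does not name the similarity measure, so cosine similarity over document-topic distributions is assumed here, and the function names, post identifiers, and topic vectors are hypothetical.

```python
import math
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair of one text as a sorted tuple.

    In a bi-term topic model, both words of a pair share one topic.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def cosine(p, q):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def search(query_topics, doc_topics, top_n=2):
    """Rank documents by similarity of their topic distribution to the query's."""
    scored = [
        (cosine(query_topics, theta), doc_id)
        for doc_id, theta in doc_topics.items()
    ]
    # Most similar first; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_n]]


# Hypothetical document-topic distributions over three topics.
DOC_TOPICS = {
    "post_a": [0.8, 0.1, 0.1],
    "post_b": [0.1, 0.8, 0.1],
    "post_c": [0.6, 0.3, 0.1],
}

if __name__ == "__main__":
    print(extract_biterms(["rain", "storm", "umbrella"]))
    print(search([0.9, 0.05, 0.05], DOC_TOPICS))
```

The sketch ranks on text topics only; a fuller version would also have to account for the topic-time and topic-user distributions that the abstract includes in the unified representation.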
Authors
寇菲菲
杜军平
石岩松
杨从先
崔婉秋
梁美玉
石磊
KOU Fei-Fei; DU Jun-Ping; SHI Yan-Song; YANG Cong-Xian; CUI Wan-Qiu; LIANG Mei-Yu; SHI Lei (Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876)
Source
《计算机学报》
EI
CSCD
Peking University Core Journals (北大核心)
2020, No. 5, pp. 781-795 (15 pages)
Chinese Journal of Computers
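Funding
National Key R&D Program of China (2018YFB1402600)
China Postdoctoral Science Foundation (2019M660564)
National Natural Science Foundation of China (61772083, 61532006, 61877006, 61802028)
Guangxi Science and Technology Major Project (桂科AA18118054)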
Keywords
social network
microblogs
short text
semantic modeling
search