
Multi-granularity Microblog User Interest Portrait Construction Based on NWD Integrated Algorithm (cited by: 2)
Abstract The particular characteristics of microblog text make it difficult to build effective interest portraits of microblog users. To address this, an ensemble algorithm combining new word discovery, a bidirectional long short-term memory network, and gradient boosting (NWD-Bi-LSTM-XGBoost) is proposed. First, to handle the informality of microblog text, a new word discovery (NWD) algorithm based on a support perspective is proposed; it mines the internet slang that is abundant in microblog text, enabling more accurate word segmentation and semantic understanding. Second, the Simhash algorithm is introduced to mitigate the "information overload" in microblog text. Third, to alleviate the feature sparsity caused by the brevity of microblog posts, a bidirectional long short-term memory (Bi-LSTM) model is used to extract semantic features from the posts. Finally, an extreme gradient boosting (XGBoost) model is trained on the static features of microblog users fused with the semantic features of their posts, so that multi-granularity user interest portraits can be constructed effectively. Experimental results show that the coarse-grained (first-level) interest tag model, NWD-Bi-LSTM, achieves a macro-average F1 score (mF1) of up to 83.6% and an area under the ROC curve (AUC) of up to 79.7%, while the fine-grained (second-level) interest tag model, NWD-Bi-LSTM-XGBoost, achieves an mF1 of up to 70.4% and an AUC of up to 63.6%. Compared with the baseline models, integrating the NWD algorithm raises both mF1 and AUC by 3% to 5%, a larger gain than existing new word discovery methods provide.
Authors Zhang Shu; Mo Zan; Liu Jian-hua; Yang Pei-chen; Liu Hong-wei (School of Management, Guangdong University of Technology, Guangzhou 510520, China)
Source Journal of Guangdong University of Technology, CAS, 2020, No. 4, pp. 42-50 (9 pages)
Funding National Natural Science Foundation of China (71671048).
Keywords new word discovery; bidirectional long short-term memory; extreme gradient boosting (XGBoost); multi-granularity; microblog user interest portrait
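The abstract names Simhash as the step that reduces "information overload" but gives no configuration. As a rough illustration only, the sketch below shows how Simhash fingerprints can be used to drop near-duplicate posts. The 64-bit fingerprint size, the use of MD5 as the per-token hash, the unweighted tokens, the Hamming-distance threshold of 3, and the function names `simhash`, `hamming`, and `deduplicate` are all assumptions made for this example, not details taken from the paper.

```python
import hashlib


def simhash(tokens, bits=64):
    """Return a `bits`-bit Simhash fingerprint for a list of string tokens.

    Assumption: every token has weight 1; a real system might weight by TF-IDF.
    """
    weights = [0] * bits
    for tok in tokens:
        # Stable per-token hash; MD5 is used here only as a convenient fixed hash.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 for a set bit and -1 for an unset bit.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(posts, max_distance=3):
    """Keep the first post of each near-duplicate group.

    `posts` is a list of token lists (already segmented text).
    `max_distance` is an assumed threshold, not a value from the paper.
    """
    kept, fingerprints = [], []
    for tokens in posts:
        fp = simhash(tokens)
        if all(hamming(fp, seen) > max_distance for seen in fingerprints):
            kept.append(tokens)
            fingerprints.append(fp)
    return kept
```

Calling `deduplicate` on a list of segmented posts returns the posts whose fingerprints differ from every earlier kept post by more than `max_distance` bits. The linear scan over kept fingerprints is quadratic in the worst case; it is chosen here for clarity, and a production pipeline would typically index fingerprints by bit blocks instead.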