Abstract
The distinctive characteristics of microblog text make it difficult to build effective interest portraits of microblog users. To address this problem, an ensemble algorithm, NWD-Bi-LSTM-XGBoost (new word discovery, bidirectional long short-term memory network, gradient boosting), is proposed. First, to deal with the informality of microblog text, a new word discovery (NWD) algorithm based on a support perspective is proposed to uncover the internet slang that pervades such text, enabling more accurate word segmentation and semantic understanding. Second, the Simhash algorithm is introduced to mitigate "information overload" in microblog text. Third, to alleviate the feature sparsity caused by the brevity of microblog posts, a bidirectional long short-term memory (Bi-LSTM) model is used to extract semantic features from the posts. Finally, an extreme gradient boosting (XGBoost) model is trained on these semantic features combined with the static features of microblog users, so that a multi-granularity user interest portrait can be built effectively. Experimental results show that the coarse-grained (primary) interest tag model, NWD-Bi-LSTM, achieves a macro-average F1 score (mF1) of 83.6% and an area under the ROC curve (AUC) of 79.7%, while the fine-grained (secondary) interest tag model, NWD-Bi-LSTM-XGBoost, achieves 70.4% and 63.6%, respectively. Compared with the benchmark models, integrating the NWD algorithm raises both mF1 and AUC by 3% to 5%, a larger gain than existing new word discovery methods provide.
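The Simhash step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' configuration, so everything here is an assumption made for illustration: whitespace tokenization, unweighted tokens, a 64-bit fingerprint derived from MD5, and a Hamming-distance threshold of 3. The function names `simhash`, `hamming_distance`, and `deduplicate` are hypothetical.

```python
import hashlib


def simhash(tokens, bits=64):
    """Return a `bits`-wide Simhash fingerprint for an iterable of tokens.

    Assumption: every token has weight 1; a real system might weight by TF-IDF.
    """
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(posts, max_distance=3):
    """Keep the first post of each group of near-duplicates.

    Assumption: posts are already segmented into space-separated words.
    """
    kept, fingerprints = [], []
    for post in posts:
        fp = simhash(post.split())
        if all(hamming_distance(fp, seen) > max_distance for seen in fingerprints):
            kept.append(post)
            fingerprints.append(fp)
    return kept
```

Posts whose fingerprints lie within `max_distance` bits of an already kept post are treated as redundant and dropped, which is one plausible way to reduce repeated or reposted content before feature extraction.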
Authors
张舒
莫赞
柳建华
杨培琛
刘洪伟
Zhang Shu; Mo Zan; Liu Jian-hua; Yang Pei-chen; Liu Hong-wei (School of Management, Guangdong University of Technology, Guangzhou 510520, China)
Source
《广东工业大学学报》
CAS
2020, No. 4, pp. 42-50 (9 pages)
Journal of Guangdong University of Technology
Funding
National Natural Science Foundation of China (71671048).
Keywords
new word discovery
bidirectional long short-term memory network
XGBoost (extreme gradient boosting)
multi-granularity
microblog user interest portrait