Automatic construction of a garbage barrage shielding dictionary based on seed words and datasets

Cited by: 3
Abstract: With the growing popularity of barrage (danmaku) videos, barrage comments have become a form of interactive communication among young people in the Internet age. As the number of barrage comments increases, however, blocking garbage barrage has become a problem. Building on the keyword shielding method offered by various video websites, this paper proposes two classes of methods for automatically constructing a shielding dictionary, based on seed words and on datasets respectively. The first class mainly uses Google's natural language processing tool word2vec and pointwise mutual information (PMI): words that are highly similar to the seed words, or that co-occur with them frequently, are added to the shielding dictionary. The second class mainly uses TF-IDF (Term Frequency-Inverse Document Frequency), the LDA (Latent Dirichlet Allocation) topic model, and information gain (IG) to extract keywords from a garbage barrage dataset and build the shielding dictionary from them. Finally, the constructed shielding dictionaries are evaluated. The experimental results show that barrage shielding works best when the dictionary contains 400 to 500 words. The influence of the number of LDA topics and of the dataset size on the shielding effect is also examined.
Authors: WANG Ge (汪舸); WU Fang-jun (吴方君) (School of Information Management, Jiangxi University of Finance and Economics, Nanchang 330013; Jiangxi Key Laboratory of Data and Knowledge Engineering, Jiangxi University of Finance and Economics, Nanchang 330013, China)
Source: Computer Engineering & Science (《计算机工程与科学》), CSCD, Peking University Core Journal, 2020, Issue 7, pp. 1302-1308 (7 pages)
Keywords: barrage; keyword shielding; shielding dictionary; seed words
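
The seed-word approach described in the abstract can be illustrated with a minimal sketch of PMI-based dictionary expansion. This is not the authors' implementation: the function name pmi_expand, the min_count threshold, the comment-level definition of co-occurrence, and the toy English comments are all assumptions made for illustration only.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_expand(comments, seeds, top_k=5, min_count=2):
    """Rank non-seed words by their highest PMI with any seed word.

    comments: list of tokenized comments (each a list of words).
    Co-occurrence is counted per comment, which is an assumption of
    this sketch, not something stated in the abstract.
    """
    n = len(comments)
    word_df = Counter()  # number of comments containing each word
    pair_df = Counter()  # number of comments containing each word pair
    for tokens in comments:
        uniq = sorted(set(tokens))
        word_df.update(uniq)
        pair_df.update(combinations(uniq, 2))

    scores = {}
    for (a, b), c_ab in pair_df.items():
        if c_ab < min_count:
            continue  # skip rare pairs, whose PMI is unreliable
        if (a in seeds) == (b in seeds):
            continue  # keep only pairs with exactly one seed word
        candidate = b if a in seeds else a
        # PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) )
        pmi = math.log2(c_ab * n / (word_df[a] * word_df[b]))
        scores[candidate] = max(scores.get(candidate, float("-inf")), pmi)

    # Highest PMI first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical toy data: six tokenized comments and one seed word.
toy_comments = [
    ["buy", "cheap", "followers"],
    ["cheap", "followers", "link"],
    ["great", "video"],
    ["great", "song", "video"],
    ["buy", "link"],
    ["cheap", "link"],
]
expanded = pmi_expand(toy_comments, seeds={"cheap"})
for word, score in expanded:
    print(f"{word}\t{score:.3f}")
```

In this toy run, "followers" ranks above "link" because it appears almost only alongside the seed word, whereas "link" also appears without it. A real pipeline would first segment the Chinese comments into words and would likely combine this score with word2vec similarity, as the abstract describes.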