
A Short Text Classification Method Fusing BTM Topic Features (cited by: 11)

Improved short text classification method based on BTM topic features
Abstract: Short texts carry few features and often have loose formatting, varied sentence lengths, and relatively complex sentence structure, so traditional text classification algorithms perform poorly on them. This paper proposes a combined feature extraction method for short text classification that fuses BTM topic features with an improved feature weighting scheme. First, starting from TF-IWF, the method lowers the weight given to term frequency and introduces word distribution entropy, yielding a new weighting algorithm. Second, it uses the topic words under each topic of the BTM topic model to supplement documents that contain few words, and takes each document's probability distribution over the topics as a further set of document features. In several groups of classification experiments with the KNN algorithm, the method improved F1 by about 10% over features computed with conventional methods such as TF-IWF, which supports its effectiveness.
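For orientation, the baseline that the abstract builds on can be sketched as follows. This is a minimal illustration of one common form of TF-IWF (term frequency within the document multiplied by the log of total corpus tokens over the term's corpus count). It is not the paper's improved weighting, whose exact formula is not given in the abstract. The function name `tf_iwf` and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter

def tf_iwf(docs):
    """Baseline TF-IWF (one common form; NOT the paper's improved variant).

    docs: list of token lists.
    Returns one {term: weight} dict per document.
    """
    # Corpus-wide occurrence count of every term, and total token count.
    corpus_counts = Counter(tok for doc in docs for tok in doc)
    total = sum(corpus_counts.values())
    weights = []
    for doc in docs:
        counts = Counter(doc)
        n = len(doc)
        # TF = count / document length; IWF = log(total tokens / term's corpus count).
        weights.append({
            t: (c / n) * math.log(total / corpus_counts[t])
            for t, c in counts.items()
        })
    return weights

# Hypothetical toy corpus of three short, already-tokenized documents.
docs = [["apple", "phone", "apple"], ["phone", "price"], ["fruit", "apple"]]
w = tf_iwf(docs)
```

In the approach described above, a vector of such weights would be combined with the document's BTM topic distribution before KNN classification; that step is omitted here because the abstract does not specify how the two parts are joined.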
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal, 2016, No. 13, pp. 95-100 (6 pages)
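Funding: Key Natural Science Research Project of Anhui Higher Education Institutions (No. KJ2013A020); Anhui Provincial Natural Science Foundation (No. 11040606M133)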
Keywords: short text; weight calculation; TF-IWF (Term Frequency-Inverse Word Frequency) method; topic model