Abstract
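To address the unsatisfactory results that traditional text classification algorithms give on short texts, which carry few features, this paper proposes a comprehensive feature extraction method for short text classification that fuses BTM topic features with an improved feature weight calculation. Building on TF-IWF, the method reduces the weight of term frequency and introduces word distribution entropy to derive a new weighting algorithm. The topic words under each topic of the BTM topic model are used to supplement documents with few words, and each document's probability distribution over the topics is selected as a second part of the document features. Multiple groups of classification experiments with the KNN algorithm show that, compared with features computed by traditional methods such as TF-IWF, the method improves F1 by about 10%, which verifies its effectiveness.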
Short texts typically have little content, loose formatting, varied sentence lengths, and relatively complex sentence structure, so traditional classification algorithms perform poorly on them. This paper presents a comprehensive feature extraction method for short text classification that fuses BTM topic features with an improved feature weight calculation. The method has two steps. First, starting from TF-IWF, it reduces the weight of term frequency and introduces a word distribution entropy, which yields a new weighting algorithm. Second, it uses the topic words of the BTM topic model to supplement documents that contain few words, and it takes the probability distribution of each document over the topics as a further set of document features. Experimental results indicate that the method improves F1 by around 10% compared with the original TF-IWF method.
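The weighting idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formula, so everything beyond the standard TF-IWF definition is an assumption made here for illustration: the logarithmic damping of term frequency, the `1 / (1 + entropy)` factor, the choice to measure entropy over class labels, and all function names are hypothetical and should not be read as the authors' algorithm.

```python
import math
from collections import Counter, defaultdict


def tf_iwf(docs):
    """Standard TF-IWF: tf(w, d) * log(total corpus tokens / corpus count of w)."""
    corpus_counts = Counter(w for doc in docs for w in doc)
    total = sum(corpus_counts.values())
    weights = []
    for doc in docs:
        counts = Counter(doc)
        n = len(doc)
        weights.append(
            {w: (c / n) * math.log(total / corpus_counts[w]) for w, c in counts.items()}
        )
    return weights


def class_entropy(docs, labels):
    """Entropy of each word's occurrence distribution over class labels."""
    per_word = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        for w in doc:
            per_word[w][label] += 1
    entropy = {}
    for w, by_class in per_word.items():
        total = sum(by_class.values())
        entropy[w] = -sum((c / total) * math.log(c / total) for c in by_class.values())
    return entropy


def damped_entropy_weights(docs, labels):
    """Assumed variant: log-damped term frequency, IWF, and an entropy penalty.

    Words spread evenly across classes (high entropy) are down-weighted.
    This combination is illustrative only, not the paper's formula.
    """
    corpus_counts = Counter(w for doc in docs for w in doc)
    total = sum(corpus_counts.values())
    ent = class_entropy(docs, labels)
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {
                w: math.log(1 + c) * math.log(total / corpus_counts[w]) / (1 + ent[w])
                for w, c in counts.items()
            }
        )
    return weights
```

Given tokenized documents and their class labels, `damped_entropy_weights(docs, labels)` returns one word-to-weight dictionary per document. In a pipeline like the one the abstract describes, such weights would be combined with per-document topic probabilities from a BTM model before classification with KNN.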
Source
《计算机工程与应用》
CSCD
Peking University Core Journal
2016, No. 13, pp. 95-100 (6 pages)
Computer Engineering and Applications
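Funding
Key Project of Natural Science Research of Universities in Anhui Province (No. KJ2013A020)
Anhui Provincial Natural Science Foundation (No. 11040606M133)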
Keywords
short text
weight calculation
Term Frequency-Inverse Word Frequency (TF-IWF)
topic model