Abstract
With the rapid growth of the Internet, traditional text classification can no longer meet the demands placed on information service systems, and high-accuracy classification algorithms have become an active research topic in recent years as a way to make effective use of large-scale information. Online film reviews are typically short texts: they contain few informative words to extract, while stop words, which do not help classification, make up a relatively large share. This leads to two major difficulties, high vector dimensionality and feature sparsity, which make short texts harder to study. To address the feature sparsity and highly imbalanced samples of short texts, this paper proposes a feature weight calculation method for short texts, named MIDF(t). The method takes into account both the distribution of a feature term within an individual sample and the class characteristics of the text, and improves the precision and recall of short text classification. Experimental results indicate that the proposed method is better suited to short text classification than traditional feature weight calculation methods.
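For orientation, the kind of traditional feature weighting that the abstract compares against can be sketched as follows. This is a minimal illustration that assumes the baseline is plain TF-IDF; the abstract does not name the baseline and does not give the MIDF(t) formula, so nothing below reproduces the proposed method. The function name `tfidf_weights` and the sample reviews are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. The weight is plain TF-IDF
    (an assumed baseline, not the paper's MIDF(t)):
    relative term frequency times log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Hypothetical short film reviews, already tokenized and stop-word filtered.
reviews = [
    ["great", "plot", "great", "cast"],
    ["boring", "plot"],
    ["great", "soundtrack"],
]
weights = tfidf_weights(reviews)
```

With only a handful of tokens per review, most terms appear once and the resulting vectors are sparse, which is the difficulty the abstract describes. Plain TF-IDF also ignores class labels, whereas the abstract states that the proposed weight takes the class characteristics of the text into account.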
Source
Heilongjiang Science (《黑龙江科学》)
2016, Issue 16, pp. 28-29 (2 pages)
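Funding
Heilongjiang Province Philosophy and Social Science Research Planning Project, "English Discourse Sentiment Analysis Based on Fuzzy Support Vector Machines" (13E024)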
Keywords
Short text
Text classification
Feature weight