Abstract
To address two key problems in short text classification, sparse features and strong context dependency, a method based on latent Dirichlet allocation (LDA) is proposed. The topics generated by the model serve two purposes: they distinguish the contexts in which the same word occurs and lower its weight accordingly, and they link different words that share a topic, which reduces sparsity and raises their weight. A short text dataset was built by automatically crawling the titles of Netease (网易) pages, and the titles were classified with K-nearest neighbors (K-NN). Experiments show that the proposed method improves classification performance by about 5% over the traditional vector space model and by about 2.5% over topic-based similarity.
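The idea in the abstract can be illustrated with a small sketch. This is not the authors' formula, which the abstract does not give. It assumes a hypothetical table `WORD_TOPICS` of p(topic | word) values (made up here, in practice taken from a trained LDA model), a hypothetical blending parameter `lam`, and a toy training set `TRAIN`. Shared words are scaled by how well the topic contexts of the two texts agree, and topic agreement alone contributes a nonzero score when the texts share no words.

```python
import math
from collections import Counter

# Hypothetical p(topic | word) table over two topics (finance, sports).
# The values are invented for illustration; a real system would estimate
# them with a trained LDA model.
WORD_TOPICS = {
    "stock": [0.9, 0.1], "market": [0.8, 0.2], "shares": [0.9, 0.1],
    "match": [0.1, 0.9], "goal": [0.1, 0.9], "team": [0.2, 0.8],
    "apple": [0.5, 0.5],  # topically ambiguous word
}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def text_topics(words):
    """Average topic distribution of the words found in WORD_TOPICS."""
    vecs = [WORD_TOPICS[w] for w in words if w in WORD_TOPICS]
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def similarity(a, b, lam=0.5):
    """Word overlap scaled by topic-context agreement, plus a topic-only term."""
    ta, tb = text_topics(a), text_topics(b)
    if ta is None or tb is None:
        return 0.0
    context = cosine(ta, tb)
    sa, sb = set(a), set(b)
    overlap = len(sa & sb) / math.sqrt(len(sa) * len(sb))
    # Shared words count less when the contexts differ; related topics
    # still give a score when no words are shared.
    return context * overlap + lam * context

def classify(query, train, k=3):
    """K-nearest-neighbor majority vote using the similarity above."""
    ranked = sorted(train, key=lambda item: similarity(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy labeled titles (hypothetical data, already tokenized).
TRAIN = [
    (["stock", "market", "rises"], "finance"),
    (["apple", "shares", "fall"], "finance"),
    (["team", "wins", "match"], "sports"),
    (["late", "goal", "wins", "match"], "sports"),
]

print(classify(["shares", "rally"], TRAIN))
print(classify(["goal", "scored"], TRAIN))
```

In this toy setting the query `["shares", "rally"]` has no word in common with `["stock", "market", "rises"]`, so a pure word-overlap measure would score the pair as zero, while the topic term still makes it a near neighbor.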
Source
Journal of Computer Applications (《计算机应用》)
CSCD
PKU Core Journal (北大核心)
2013, Issue 6, pp. 1587-1590 (4 pages)
Funding
National Natural Science Foundation of China (60970061, 61075056, 61103067)
Fundamental Research Funds for the Central Universities
Keywords
short text
classification
K-nearest neighbor (K-NN)
similarity measure
latent Dirichlet allocation (LDA)