Abstract
Short texts differ from traditional documents in that they are brief, sparse in features, and noisy. Feature extension can ease the high sparsity of the vector space model, but it inevitably introduces noise. To address this problem, this paper proposes a high-frequency word expansion method based on LDA. The high-frequency words of each category are extracted to form the feature space of the vector space model, and each short text is represented as a vector using TF-IDF. LDA is then used to obtain the latent topic features of each text, and the high-frequency words corresponding to latent topics whose probability exceeds a given threshold are added to the text, reducing the effect of noise and sparsity. Experiments on Chinese short messages and news titles show that the proposed method achieves higher classification performance than conventional classification methods.
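The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy English tokens, and the smoothed IDF formula are all assumptions made for illustration. The LDA step itself is not reproduced here; the per-document topic probabilities (`topic_probs`) and the per-topic word lists (`topic_words`) are assumed to come from an LDA model trained elsewhere. The abstract also does not state whether expansion happens before or after TF-IDF weighting, so the sketch simply expands first and weights afterwards.

```python
import math
from collections import Counter


def high_frequency_words(docs_by_category, top_n):
    """Collect the top_n most frequent words of each category into one feature list."""
    features = []
    for docs in docs_by_category.values():
        counts = Counter(word for doc in docs for word in doc)
        for word, _ in counts.most_common(top_n):
            if word not in features:
                features.append(word)
    return features


def expand_document(doc, topic_probs, topic_words, features, threshold):
    """Append the feature words of every topic whose probability exceeds the threshold."""
    feature_set = set(features)
    expanded = list(doc)
    for topic_id, prob in enumerate(topic_probs):
        if prob > threshold:
            expanded.extend(w for w in topic_words[topic_id] if w in feature_set)
    return expanded


def tfidf_vector(doc, features, all_docs):
    """Represent one tokenized document as a TF-IDF vector over the feature list."""
    tf = Counter(doc)
    n_docs = len(all_docs)
    vector = []
    for word in features:
        df = sum(1 for d in all_docs if word in d)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF (assumed variant)
        vector.append(tf[word] * idf)
    return vector


# Toy corpus (hypothetical data, already tokenized).
docs_by_category = {
    "sports": [["match", "goal", "team"], ["team", "win", "goal"]],
    "finance": [["stock", "market", "rise"], ["market", "bank", "stock"]],
}
all_docs = [doc for docs in docs_by_category.values() for doc in docs]
features = high_frequency_words(docs_by_category, top_n=2)

# Assumed LDA outputs for one short text: topic probabilities and topic word lists.
short_text = ["team", "win"]
topic_probs = [0.7, 0.2, 0.1]
topic_words = [["goal", "team", "referee"], ["stock", "market"], ["weather"]]

expanded = expand_document(short_text, topic_probs, topic_words, features, threshold=0.5)
vector = tfidf_vector(expanded, features, all_docs)
```

In this toy run only the first topic passes the 0.5 threshold, so only its words that also belong to the high-frequency feature space are added to the short text. Restricting the expansion to that feature space is one reading of how the method limits the noise that feature extension would otherwise introduce.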
Source
《现代图书情报技术》
CSSCI
PKU Core Journal (北大核心)
2013, Issue 6, pp. 42-48 (7 pages)
New Technology of Library and Information Service
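Funding
One of the research outcomes of the National 863 Program project "Multi-source Information Perception Technology and Product Development for the Whole Agricultural Product Supply Chain" (Project No. 2012AA101701-03).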
Keywords
Short-text classification
High-frequency words
LDA
Feature expansion