Abstract
Short texts on the Internet are brief and spread quickly, and they have become an important data resource. Because each text contains only a few words, natural language processing algorithms such as topic models struggle to extract effective feature representations from them, which severely limits the application of these algorithms. To address the characteristics of short-text data, this paper combines the data processing of the biterm topic model and the K-competitive autoencoder. The proposed method builds biterm ("word pair") data to express relationships between words, introduces a competitive relationship into the topic representation to highlight key topics, and uses fully connected layers to model the global relationship among topics. It thereby compensates for the tendency of topic models to ignore word relationships and topic relationships, and strengthens the representational ability of topic features. Short-text classification experiments on two standard datasets (20 Newsgroups and Reuters-21578) show that the method achieves clearly higher classification accuracy than traditional topic models.
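The two ingredients named in the abstract, biterms and a K-competitive step, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `extract_biterms` and `k_competitive` are hypothetical, the biterm definition assumed here is "every unordered pair of words in one short text", and the competition is reduced to keeping the k largest-magnitude activations, omitting any further steps of the original K-competitive layer.

```python
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) from one short text.

    Assumption: the whole short text is treated as a single context window.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def k_competitive(activations, k):
    """Keep the k activations with the largest magnitude and zero the rest.

    Simplified illustration only: any additional processing done by the
    original K-competitive layer is left out.
    """
    if k <= 0:
        return [0.0] * len(activations)
    # Rank neuron indices by absolute activation, strongest first.
    ranked = sorted(range(len(activations)),
                    key=lambda i: abs(activations[i]),
                    reverse=True)
    winners = set(ranked[:k])
    return [a if i in winners else 0.0 for i, a in enumerate(activations)]


# Example usage with a three-word text and four hypothetical topic activations.
biterms = extract_biterms(["short", "text", "topic"])
sparse = k_competitive([0.1, -0.7, 0.3, 0.05], k=2)
```

In this sketch the biterms carry the word co-occurrence information that a single short text is too sparse to provide through word counts alone, and the top-k step plays the role of highlighting the dominant topics.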
Authors
PAN Zhiyong (潘智勇)
ZHAO Gang (赵港)
School of Computer Science and Technology, Beihua University, Jilin, Jilin 132013, China
Source
Intelligent Computer and Applications (《智能计算机与应用》)
2022, No. 9, pp. 32-36 (5 pages)
Funding
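Science and Technology Project of the Education Department of Jilin Province (JJKH20190645KJ)
Science and Technology Development Plan Project of Jilin Province (20210203050SF)
Jilin City Science and Technology Development Plan, Special Program for Training Outstanding Young Talent (20200104075)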
Keywords
short-text classification
topic model
K-competitive relationship
fully-connected layer