Research on Text Classification by Fusing Multi-Granularity Information (融合多粒度信息的文本分类研究) Cited by: 5
Abstract: Current research on Chinese text classification mostly treats text at a single granularity (character, word, sentence, or document) and therefore misses the semantic features available at the other granularities. To extract the core content of a text more effectively, a text classification model that fuses multi-granularity information through an attention mechanism is proposed. The model builds embedding vectors at the character, word, and sentence granularities. For characters and words, a Word2Vec model is trained to convert the data into character vectors and word vectors, and a bidirectional long short-term memory (BiLSTM) network captures their contextual semantic features. For sentences, the FastText model extracts the features contained in the sentence vectors. Each type of feature vector is then fed separately into an attention layer to further capture the semantically important information in the text. Experimental results show that, on three public Chinese datasets, the model achieves higher classification accuracy than models that use a single granularity or any pairwise combination of granularities.
Authors: XIN Miaomiao (辛苗苗); MA Li (马丽); HU Bofa (胡博发). School of Information Engineering, Hebei GEO University, Shijiazhuang 050031, China; Laboratory of Artificial Intelligence and Machine Learning, Hebei GEO University, Shijiazhuang 050031, China
Source: Computer Engineering and Applications (《计算机工程与应用》), indexed in CSCD and the Peking University Core Journals list, 2023, No. 9, pp. 104-111 (8 pages)
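Funding: Key Project of Science and Technology Research in Higher Education Institutions of Hebei Province (ZD2018043); Doctoral Fund of Hebei GEO University (BQ2017045).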
Keywords: multi-granularity; information fusion; text classification; attention mechanism
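
The abstract describes feeding the character, word, and sentence feature vectors separately into an attention layer and then combining them. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. The dot-product scoring, the shared `query` vector, the fusion by concatenation, and all function names are assumptions made for illustration; the abstract does not specify them. The toy input lists stand in for BiLSTM outputs (character and word granularity) and FastText features (sentence granularity).

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(states: List[Vector], query: Vector) -> Vector:
    """Score each state by its dot product with `query` (assumed scoring
    function), normalise the scores with softmax, and return the
    weighted sum of the states."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in states]
    weights = softmax(scores)
    dim = len(states[0])
    return [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]


def fuse(char_vec: Vector, word_vec: Vector, sent_vec: Vector) -> Vector:
    """Fuse the three granularity vectors by concatenation (an assumption;
    the abstract does not state the fusion operator)."""
    return char_vec + word_vec + sent_vec


# Toy stand-ins for per-granularity feature sequences (hypothetical values).
char_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
word_states = [[0.5, 0.5], [2.0, 0.0]]
sent_states = [[0.0, 3.0]]
query = [1.0, 0.0]

fused = fuse(
    attention_pool(char_states, query),
    attention_pool(word_states, query),
    attention_pool(sent_states, query),
)
print(len(fused))  # 6: three 2-dimensional vectors concatenated
```

In a full model, the fused vector would be passed to a classifier layer, and the query would be a learned parameter rather than a fixed list.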