
Short Text Classification with Category Feature Expansion and N-gram Subword Filtration Based on fastText (cited by: 4)
Abstract: To improve the performance of the fastText short text classification model, this work addresses two issues: obtaining high-quality category features, and reducing the interference that N-gram subwords with a low contribution to category discrimination cause when the model learns semantic features with a high contribution to category discrimination. An LDA category feature extraction method based on TF-IDF is proposed to improve the quality of category features, and an N-gram subword filtering method based on lexical information entropy is proposed to filter out the N-gram subwords with a low contribution to category discrimination. On this basis, a short text classification model named EF-fastText is constructed, which focuses more on learning semantic features with a high contribution to category discrimination. Experimental results show that both the TF-IDF-based LDA category feature extraction method and the lexical-information-entropy-based N-gram subword filtering method are effective in improving the performance of the EF-fastText short text classification model.
Authors: LI Zhi-ming (李志明); SUN Yan (孙艳); HE Yi-hao (何宜昊); SHEN Li-min (申利民) (College of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, China; Key Laboratory for Computer Virtual Technology and System Integration of Hebei Province, Qinhuangdao 066004, China; Key Laboratory for Software Engineering of Hebei Province, Qinhuangdao 066004, China; High-end Equipment Industry Technology Research Institute of Hebei Province, Qinhuangdao 066004, China)
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), CSCD, Peking University Core Journal, 2022, No. 8, pp. 1596-1601 (6 pages)
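Funding: Supported by the National Natural Science Foundation of China (61772450), the Key Research and Development Program of Hebei Province (20375001D), and the Key Science and Technology Program for Higher Education Institutions of Hebei Province (ZD2018219).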
Keywords: short text classification; fastText; category feature; lexical information entropy; N-gram
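
The entropy-based subword filtering mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the entropy is taken as the Shannon entropy of each subword's occurrence counts over the class labels, a subword with low entropy is treated as having a high contribution to category discrimination, and the function names (`char_ngrams`, `subword_entropy`, `filter_subwords`) and the threshold are hypothetical rather than the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, n_min=2, n_max=3):
    """Return fastText-style character n-grams of a word wrapped in '<' and '>'."""
    token = f"<{word}>"
    return [token[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(token) - n + 1)]


def subword_entropy(docs, labels, n_min=2, n_max=3):
    """Shannon entropy (in bits) of each subword's distribution over class labels.

    docs is a list of tokenized texts (lists of words); labels holds one class
    label per text. This definition is an assumption, not the paper's formula.
    """
    counts = defaultdict(Counter)
    for words, label in zip(docs, labels):
        for word in words:
            for gram in char_ngrams(word, n_min, n_max):
                counts[gram][label] += 1

    entropy = {}
    for gram, per_class in counts.items():
        total = sum(per_class.values())
        entropy[gram] = -sum((c / total) * math.log2(c / total)
                             for c in per_class.values())
    return entropy


def filter_subwords(entropy, threshold):
    """Keep only subwords whose entropy is at or below the (assumed) threshold."""
    return {gram for gram, h in entropy.items() if h <= threshold}
```

Under these assumptions, a subword that occurs in only one class has entropy 0 and is kept, while a subword spread evenly over all classes has the maximum entropy and is dropped first as the threshold is lowered.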