Abstract
To improve the performance of the fastText short text classification model, this paper studies two problems: obtaining high-quality category features, and reducing the interference that N-gram subwords with a low contribution to category discrimination cause while the model learns semantic features with a high contribution to category discrimination. A TF-IDF-based LDA category feature extraction method is proposed to improve the quality of the category features. An N-gram subword filtering method based on lexical information entropy is proposed to filter out the subwords with a low contribution to category discrimination. On this basis, EF-fastText is constructed, a short text classification model that focuses more on learning semantic features with a high contribution to category discrimination. Experimental results show that both the TF-IDF-based LDA category feature extraction method and the entropy-based N-gram subword filtering method are effective in improving the performance of the EF-fastText short text classification model.
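The abstract does not give the definition of lexical information entropy, so the following is only a minimal sketch of the general idea under stated assumptions: the entropy of a subword is taken to be the Shannon entropy of the class labels it co-occurs with, subwords are fastText-style character n-grams of whitespace-separated tokens, and subwords whose entropy exceeds a threshold are dropped. The function names, the toy corpus, and the threshold value are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of a word wrapped in '<' and '>', as fastText does."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]


def subword_entropy(labeled_texts, n_min=3, n_max=4):
    """Assumed definition: entropy (in bits) of the class distribution of each subword."""
    counts = defaultdict(Counter)
    for text, label in labeled_texts:
        for word in text.split():
            for gram in char_ngrams(word, n_min, n_max):
                counts[gram][label] += 1
    entropy = {}
    for gram, by_label in counts.items():
        total = sum(by_label.values())
        probs = [c / total for c in by_label.values()]
        entropy[gram] = sum(p * math.log2(1 / p) for p in probs)
    return entropy


def filter_subwords(entropy, threshold):
    """Keep subwords whose class distribution is concentrated (low entropy)."""
    return {gram for gram, h in entropy.items() if h <= threshold}


# Toy corpus (hypothetical): "match" occurs in both classes, "goal" in only one.
data = [
    ("goal match", "sports"),
    ("goal team", "sports"),
    ("stock market", "finance"),
    ("stock match", "finance"),
]
entropy = subword_entropy(data)
kept = filter_subwords(entropy, threshold=0.5)
```

In this toy corpus the subword `tch` (from "match") is spread evenly over both classes and is filtered out, while `<go` (from "goal") occurs in a single class and is kept. A real pipeline for Chinese short text would also need word segmentation before n-gram extraction.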
Authors
李志明
孙艳
何宜昊
申利民
LI Zhi-ming; SUN Yan; HE Yi-hao; SHEN Li-min (College of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, China; Key Laboratory for Computer Virtual Technology and System Integration of Hebei Province, Qinhuangdao 066004, China; Key Laboratory for Software Engineering of Hebei Province, Qinhuangdao 066004, China; High-end Equipment Industry Technology Research Institute of Hebei Province, Qinhuangdao 066004, China)
Source
《小型微型计算机系统》
CSCD
PKU Core Journal
2022, No. 8, pp. 1596-1601 (6 pages)
Journal of Chinese Computer Systems
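Funding
Supported by the National Natural Science Foundation of China (61772450)
Supported by the Key Research and Development Program of Hebei Province (20375001D)
Supported by the Key Project of the Science and Technology Program for Higher Education Institutions of Hebei Province (ZD2018219)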