A Comparative Study of Corpus-Based Automatic Text Classification Algorithms and Their Applications
Abstract Using Python and a publicly available Chinese corpus, this study tests how well different algorithmic models classify Chinese text. Subsets with different numbers of categories are selected from the corpus. The texts are first read in a uniform format and cleaned, then split into training, validation, and test sets at a ratio of 2:1:1. Following the steps of text representation, feature extraction, classifier selection, and performance evaluation, a typical representative of each of three model families (bag-of-words, word embedding, and language models) is then applied to Chinese text classification. Deep learning has driven rapid progress in text classification: current mainstream methods can largely meet the classification needs of different tasks, and the BERT language model in particular can greatly improve classification performance.
Authors XU Hexu; WANG Lancheng
Source Journal of Library and Information Science (图书情报导刊), 2021, No. 6, pp. 45-53 (9 pages)
Funding Key Project of the China Society of Indexers, "Research on Automatic Index Compilation Based on Artificial Intelligence" (Project No. CSI20A02).
Keywords text classification; TF-IDF; Word2Vec; BERT; deep learning
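
The first two steps named in the abstract, the 2:1:1 split and a bag-of-words text representation, can be illustrated with a minimal standard-library sketch. It is not the authors' code. The function names (`split_2_1_1`, `fit_idf`, `tfidf_vector`), the toy samples, the assumption that documents are already segmented into tokens, and the smoothed IDF formula are all illustrative choices that do not come from the paper.

```python
import math
import random
from collections import Counter


def split_2_1_1(samples, seed=0):
    """Shuffle samples and split them into train/validation/test at 2:1:1."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) // 2              # two parts out of four
    n_val = (len(items) - n_train) // 2    # one part out of four
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]         # whatever remains
    return train, val, test


def fit_idf(token_docs):
    """Compute a smoothed inverse document frequency from training documents."""
    n_docs = len(token_docs)
    doc_freq = Counter(term for doc in token_docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0
            for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Turn one tokenized document into an L2-normalized sparse TF-IDF vector."""
    counts = Counter(t for t in tokens if t in idf)  # drop unseen terms
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


# Toy, pre-segmented samples (hypothetical data, not the paper's corpus).
samples = [
    (["股票", "市场", "上涨"], "finance"),
    (["银行", "利率", "下调"], "finance"),
    (["基金", "市场", "波动"], "finance"),
    (["央行", "利率", "政策"], "finance"),
    (["球队", "比赛", "获胜"], "sports"),
    (["联赛", "比赛", "进球"], "sports"),
    (["球员", "转会", "球队"], "sports"),
    (["教练", "联赛", "冠军"], "sports"),
]

train, val, test = split_2_1_1(samples, seed=42)
idf = fit_idf([tokens for tokens, _ in train])       # fit on training data only
val_vectors = [tfidf_vector(tokens, idf) for tokens, _ in val]
print(len(train), len(val), len(test))  # 4 2 2
```

The IDF weights are fitted on the training split only, so no information from the validation or test documents reaches the representation. A classifier and an evaluation metric would follow as the next steps. They are left out here because the abstract does not say which ones the study used.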