Abstract
To improve the efficiency and accuracy of Chinese text classification, a deep-learning-based Chinese text classification algorithm is established that takes into account the pictographic nature of Chinese characters and the rapid growth of data volume in the big-data setting. First, because Chinese subcharacter components (glyphs, radicals, strokes, etc.) are pictographic, that is, their shapes carry meaning, a two-channel CBOW (continuous bag-of-words) model based on subcharacter and context features is built to vectorize Chinese text. Second, to address the slow classification speed of the traditional kNN (k-nearest neighbor) algorithm on large data sets, a fast kNN classification algorithm based on LSC (landmark-based spectral clustering) and multi-objective data screening is proposed. Finally, the fast kNN algorithm is used to classify the feature word vectors obtained from the text data. Experimental results show that the improved algorithm broadens its range of application, processes Chinese text data more accurately, and handles big-data problems more quickly, with some improvement in both classification speed and classification performance.
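The general idea behind a clustering-accelerated kNN can be illustrated with a minimal sketch. It is not the authors' algorithm: plain k-means stands in for LSC, and "search only the nearest cluster" stands in for the multi-objective data screening, whose details the abstract does not give. The function names (`kmeans`, `fast_knn_predict`), the parameters `k` and `n_probe`, and the toy data are all hypothetical.

```python
import math
import random
from collections import Counter


def kmeans(points, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in for the LSC step)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, n_clusters)]

    def assign():
        # Each point joins the cluster of its nearest center.
        return [min(range(n_clusters), key=lambda c: math.dist(p, centers[c]))
                for p in points]

    for _ in range(n_iter):
        cluster_ids = assign()
        for c in range(n_clusters):
            members = [p for p, cid in zip(points, cluster_ids) if cid == c]
            if members:  # keep the old center if the cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign()


def fast_knn_predict(x, points, labels, centers, cluster_ids, k=3, n_probe=1):
    """Vote among the k nearest training points inside the n_probe
    clusters whose centers are closest to x (assumed screening rule)."""
    order = sorted(range(len(centers)), key=lambda c: math.dist(x, centers[c]))
    probe = set(order[:n_probe])
    candidates = [(p, y) for p, y, cid in zip(points, labels, cluster_ids)
                  if cid in probe]
    if not candidates:  # fall back to exhaustive kNN
        candidates = list(zip(points, labels))
    candidates.sort(key=lambda py: math.dist(x, py[0]))
    votes = Counter(y for _, y in candidates[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D "document vectors" with made-up class labels.
train_X = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6), (0.4, 0.4),
           (10.0, 10.0), (10.5, 9.8), (9.7, 10.3), (10.2, 10.4)]
train_y = ["sports"] * 4 + ["finance"] * 4

centers, cluster_ids = kmeans(train_X, n_clusters=2)
print(fast_knn_predict((0.3, 0.3), train_X, train_y, centers, cluster_ids))
print(fast_knn_predict((9.9, 10.1), train_X, train_y, centers, cluster_ids))
```

The speed-up in this sketch comes from computing distances to the cluster centers first and then only to the points in the selected clusters, rather than to every training point. Raising `n_probe` searches more clusters, which costs time but reduces the risk of missing true neighbors that sit near a cluster boundary.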
Authors
DING Zhengsheng (丁正生)
MA Chunjie (马春洁)
Xi'an University of Science and Technology, Xi'an 710600, China
Source
Modern Electronics Technique (《现代电子技术》)
2022, No. 1, pp. 100-103 (4 pages)
Funding
National Natural Science Foundation of China (71473194).