Abstract
This paper studies how different word segmentation patterns affect text classification results. Two traditional text representation methods, LDA and LSA, are combined with two classifiers, support vector machine and logistic regression, giving four groups of experiments for comparing the segmentation patterns. The results show that, compared with the traditional segmentation method (pattern one), the second, search-engine-style segmentation pattern (pattern two), which splits words and adds combination words, is more effective for classification. Specifically, with LDA text representations the highest classification accuracy is 95.38% for pattern two and 93.7% for pattern one; with LSA text representations the highest accuracy is 96.44% for pattern two and 95.2% for pattern one.
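The abstract does not say which segmenter or splitting rule the authors used. As a rough illustration of what a search-engine-style pattern adds on top of a traditional segmentation, the sketch below expands each coarse token with the shorter sub-words that appear in a vocabulary, similar in spirit to the search mode offered by segmenters such as jieba. The function name `search_mode_tokens`, the toy vocabulary, the sample tokens, and the 2- and 3-character window sizes are all assumptions made for this example, not details taken from the paper.

```python
def search_mode_tokens(tokens, vocabulary):
    """Expand coarse tokens with shorter in-vocabulary sub-words.

    Hypothetical helper: for each token, emit every 2-character and
    3-character substring that is itself a known word, then the token.
    """
    expanded = []
    for token in tokens:
        # 2-character sub-words are only tried for tokens longer than 2.
        if len(token) > 2:
            for i in range(len(token) - 1):
                sub = token[i:i + 2]
                if sub in vocabulary:
                    expanded.append(sub)
        # 3-character sub-words are only tried for tokens longer than 3.
        if len(token) > 3:
            for i in range(len(token) - 2):
                sub = token[i:i + 3]
                if sub in vocabulary:
                    expanded.append(sub)
        # The original token is always kept.
        expanded.append(token)
    return expanded


# Assumed output of a traditional segmenter ("pattern one") and a toy vocabulary.
pattern_one = ["中国科学院", "研究", "文本分类"]
vocabulary = {"中国", "科学", "学院", "科学院", "文本", "分类"}

pattern_two = search_mode_tokens(pattern_one, vocabulary)
print(pattern_two)
```

Under these assumptions, the pattern-two token list is a superset of the pattern-one list, so the LDA or LSA representation built from it sees both the long words and their components.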
Source
《数学的实践与认识》
Peking University Core Journal (北大核心)
2018, Issue 1, pp. 116-123 (8 pages)
Mathematics in Practice and Theory
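Funding
Supported by the Open Fund of the Key Laboratory of Random Complex Structures and Data Science, Chinese Academy of Sciences
National Natural Science Foundation of China, Major Research Plan cultivation project "Big Data Analysis Methods and Key Technologies for Management Decision-Making" (91546102)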