Text Classification Method Based on a Nonparametric Control Chart (基于非参控制图的文本分类方法)
Abstract: Traditional text classification methods mostly operate on the words themselves. Because the Chinese vocabulary is large, complex, and still growing, using words directly as features tends to produce highly sparse, high-dimensional feature vectors (the curse of dimensionality), which makes processing inefficient and computation difficult. The set of part-of-speech categories, by contrast, is fixed. As Chinese word segmentation and part-of-speech tagging have matured, using parts of speech as the object of text analysis has attracted growing interest from researchers and the general public. This paper proposes a text classification method based on a nonparametric control chart that uses only the number of occurrences of each part of speech in a sentence as classification features. A Chinese natural language processing platform segments each sentence into words and tags their parts of speech; the five most frequent parts of speech are retained, the proportion of each within the sentence is computed, and the proportions are transformed with the isometric log-ratio transformation. Finally, the transformed data are fed into a multivariate sign exponentially weighted moving average (MSEWMA) control chart to determine the text category. The method is conceptually simple, easy to implement, and fast once trained, and experiments confirm that it distinguishes text categories well.
Authors: XIONG Jian (熊健); BAO Yu (鲍玉); XU Peng (徐芃). School of Economics and Statistics, Guangzhou University, Guangzhou 510006, China
Source: Journal of Guangzhou University (Natural Science Edition) (《广州大学学报(自然科学版)》) [CAS], 2020, No. 6, pp. 41-50 (10 pages)
Funding: National Social Science Fund of China, General Program (18BYY082); Ministry of Education Humanities and Social Sciences, General Program (17YJAZH098).
Keywords: text classification; part-of-speech tagging; isometric log-ratio transformation; MSEWMA control chart
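The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the tag names (`n`, `v`, `a`, `d`, `u`) and the function names are hypothetical, the pseudo-count used to avoid zero proportions is an assumption (the abstract does not say how zeros are handled), the isometric log-ratio transformation uses one common orthonormal basis (pivot coordinates), and the control-chart statistic is a simplified sign-EWMA form that assumes the data are already standardized, so it omits the transformation matrix a full MSEWMA chart would estimate from in-control data.

```python
import math
from collections import Counter


def pos_proportions(tags, kept_tags, pseudo_count=0.5):
    """Share of each retained part of speech in one sentence.

    pseudo_count is an assumed zero-replacement step so that every
    proportion is strictly positive before taking logarithms.
    """
    counts = Counter(t for t in tags if t in kept_tags)
    smoothed = [counts[t] + pseudo_count for t in kept_tags]
    total = sum(smoothed)
    return [c / total for c in smoothed]


def ilr(composition):
    """Isometric log-ratio transform (pivot coordinates): D parts -> D-1 values."""
    logs = [math.log(x) for x in composition]
    coords = []
    for i in range(len(logs) - 1):
        rest = logs[i + 1:]          # log-parts that follow part i
        k = len(rest)
        scale = math.sqrt(k / (k + 1))
        # log of part i relative to the geometric mean of the remaining parts
        coords.append(scale * (logs[i] - sum(rest) / k))
    return coords


def sign_ewma_statistics(observations, center, lam=0.1):
    """Simplified sign-EWMA charting statistic for each observation.

    Each observation is reduced to its direction (spatial sign) from the
    assumed in-control center, the directions are smoothed with an EWMA,
    and the squared length of the smoothed vector is scaled by (2 - lam) * p / lam.
    """
    p = len(center)
    w = [0.0] * p
    stats = []
    for x in observations:
        diff = [xi - ci for xi, ci in zip(x, center)]
        norm = math.sqrt(sum(v * v for v in diff))
        sign = [v / norm for v in diff] if norm > 0 else [0.0] * p
        w = [(1 - lam) * wi + lam * si for wi, si in zip(w, sign)]
        stats.append((2 - lam) * p / lam * sum(v * v for v in w))
    return stats


# Hypothetical tagged sentence and retained tag set
kept = ("n", "v", "a", "d", "u")
sentence_tags = ["n", "v", "n", "d", "u", "a", "n", "v"]
features = ilr(pos_proportions(sentence_tags, kept))
```

In use, sentences from a reference category would supply the in-control center, and a new sentence would be flagged as belonging to a different category when its statistic exceeds a control limit chosen for a target false-alarm rate; how that limit and the training set are chosen is not stated in the abstract.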