Abstract
To address automatic genre classification of modern-text reading comprehension passages in a GaoKao (Chinese college entrance examination) question-answering system, statistical analysis is used to examine how literary works and sci-tech articles differ in part-of-speech (POS) distribution, punctuation, and vocabulary usage. Based on these differences, a feature extraction method for genre classification using POS, symbol (character), and lexical features is proposed, together with a feature evaluation method based on within-class and between-class deviation. The resulting features are then used to train a classifier with the support vector machine (SVM) algorithm. Experimental results show that the feature extraction, feature selection, and classification methods are feasible and effective: the classifier achieves an average precision of 96% or above on GaoKao test sets and handles genre classification of modern-text reading materials in the GaoKao Chinese exam well.
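The abstract does not give the formula behind the deviation-based feature evaluation, so the following is only a minimal sketch of one common reading: score each feature by the ratio of between-class to within-class squared deviation and rank features by that score. The punctuation set, the function names, and the exact ratio are illustrative assumptions, not the authors' definitions.

```python
from statistics import mean

# Hypothetical punctuation inventory; the paper's actual symbol list is not
# given in the abstract.
PUNCTUATION = ",。!?;:、"


def punctuation_features(text):
    """Return the relative frequency of each punctuation mark in `text`."""
    total = max(len(text), 1)  # avoid division by zero on empty input
    return [text.count(mark) / total for mark in PUNCTUATION]


def deviation_score(values_a, values_b):
    """Between-class over within-class squared deviation for one feature.

    Assumed scoring rule (a Fisher-style ratio); a higher score means the
    feature separates the two genres better.
    """
    mean_a, mean_b = mean(values_a), mean(values_b)
    overall = mean(values_a + values_b)
    between = (len(values_a) * (mean_a - overall) ** 2
               + len(values_b) * (mean_b - overall) ** 2)
    within = (sum((v - mean_a) ** 2 for v in values_a)
              + sum((v - mean_b) ** 2 for v in values_b))
    if within > 0:
        return between / within
    # No spread inside either class: perfectly separating or useless.
    return float("inf") if between > 0 else 0.0


def rank_features(class_a, class_b):
    """Rank feature indices by deviation score, best first.

    `class_a` and `class_b` are lists of equal-length feature vectors,
    one vector per document.
    """
    n_features = len(class_a[0])
    scores = [
        deviation_score([vec[i] for vec in class_a],
                        [vec[i] for vec in class_b])
        for i in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda i: scores[i], reverse=True)
    return order, scores
```

In a pipeline like the one the abstract describes, the top-ranked features would then be passed to an SVM trainer (for example `sklearn.svm.SVC`, which is a choice made here and not named in the source).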
Authors
苏雪峰
李茹
张虎
SU Xue-feng1, LI Ru2,3, ZHANG Hu2 (1. Department of Electronic Business, Business College of Shanxi University, Taiyuan 030031, China; 2. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China; 3. Key Laboratory of Ministry of Education for Computation Intelligence and Chinese Information Processing, Shanxi University, Taiyuan 030006, China)
Source
《计算机工程与设计》
Peking University Core Journal
2018, No. 6, pp. 1755-1760, 1794 (7 pages in total)
Computer Engineering and Design
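Funding
National High Technology Research and Development Program of China (863 Program) (2015AA015407)
Natural Science Foundation of Shanxi Province (201601D102030)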
Keywords
genre classification
part-of-speech features
character (symbol) features
lexical features
support vector machine