Abstract
This study aims to (a) develop readability indicators that reflect the characteristics of the Chinese language and the textual factors that influence reading comprehension; (b) construct mathematical readability models for Chinese text; and (c) validate the proposed models. Using the 24 readability indicators developed here as predictor variables and the grade levels of 386 textbook articles as the criterion variable, the study builds two readability models, one with stepwise regression and one with a support vector machine (SVM), and then validates both on 96 new articles used as test data. In the stepwise regression model, the important predictors are the number of complex words, the proportion of simple sentences, the mean log frequency of content words, and the number of personal pronouns. In the SVM model, the important predictors identified by the F-score method include the number of complex words, the number of two-character words, the number of characters, and the number of intermediate-stroke characters. On the new articles, the stepwise regression model and the SVM model achieve prediction accuracies of 55.21% and 72.92%, respectively, and both models predict lower-grade texts more accurately than higher-grade texts.
Keywords
accuracy
readability
stepwise regression
support vector machine (SVM) model