Abstract
Building on a feature set for the readability of Chinese as a second language (CSL) texts, this paper compares six machine learning models and applies feature selection algorithms to automatically assess the readability of Hanyu Shuiping Kaoshi (HSK) reading texts. The experiments show that the support vector machine performs best among the six models. The full-feature model, which combines character, lexical, syntactic, and discourse features, reaches a prediction accuracy of 0.876. Predictive power differs across linguistic levels, with lexical features performing best. After redundant features are eliminated, the optimal model retains 18 features from the lexical and character levels; no syntactic or discourse features enter it. The study has implications for selecting and adapting HSK reading texts and for assessing the readability of other text types.
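The idea of scoring feature subsets by classification accuracy and leaving out features that add nothing can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the four feature names, the twelve invented texts, the leave-one-out evaluation, and the greedy forward search. The nearest-centroid classifier stands in for the support vector machine only to keep the example free of dependencies, and the abstract does not say which feature selection algorithm the authors used.

```python
# Toy forward feature selection, standard library only.
# Hypothetical feature names and invented values; not data from the study.
FEATURES = ["lexical_difficulty", "char_complexity",
            "sentence_length", "connective_ratio"]

# (level, feature values in FEATURES order) for 12 made-up texts.
# sentence_length is built to be redundant with lexical_difficulty;
# connective_ratio carries no information about the level.
TEXTS = [
    (0, [1.0, 1.0, 15, 0.2]), (0, [1.2, 1.2, 17, 0.4]),
    (0, [0.8, 1.0, 13, 0.6]), (0, [1.0, 1.2, 15, 0.8]),
    (1, [2.9, 0.9, 34, 0.2]), (1, [3.1, 1.1, 36, 0.4]),
    (1, [2.9, 1.1, 34, 0.6]), (1, [3.1, 0.9, 36, 0.8]),
    (2, [3.0, 3.0, 35, 0.2]), (2, [3.2, 3.2, 37, 0.4]),
    (2, [3.0, 2.8, 35, 0.6]), (2, [3.2, 3.0, 37, 0.8]),
]


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def predict(train, x):
    """Nearest-centroid prediction (a stand-in for the SVM)."""
    by_level = {}
    for level, row in train:
        by_level.setdefault(level, []).append(row)

    def dist2(level):
        c = centroid(by_level[level])
        return sum((a - b) ** 2 for a, b in zip(x, c))

    # Ties go to the lowest level because the candidates are sorted.
    return min(sorted(by_level), key=dist2)


def loo_accuracy(texts, cols):
    """Leave-one-out accuracy using only the feature columns in cols."""
    data = [(level, [row[c] for c in cols]) for level, row in texts]
    hits = 0
    for i, (level, x) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += predict(train, x) == level
    return hits / len(data)


def forward_select(texts, n_features):
    """Greedily add the feature that most improves accuracy.

    Stops as soon as no remaining feature gives a strict improvement,
    so redundant or uninformative features never enter the subset.
    """
    selected, best = [], 0.0
    while True:
        pick = None
        for c in range(n_features):
            if c in selected:
                continue
            acc = loo_accuracy(texts, selected + [c])
            if acc > best:
                best, pick = acc, c
        if pick is None:
            return selected, best
        selected.append(pick)


selected, accuracy = forward_select(TEXTS, len(FEATURES))
selected_names = [FEATURES[i] for i in selected]
print(selected_names, accuracy)
```

On this invented data the search keeps one lexical and one character feature and leaves out the redundant and uninformative ones, which mirrors the shape of the reported result but says nothing about the study's actual features or numbers.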
Authors
杜月明
王亚敏
王蕾
DU Yueming; WANG Yamin; WANG Lei
Source
《语言文字应用》
CSSCI
PKU Core Journal
2022, No. 3, pp. 73-86 (14 pages)
Applied Linguistics
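Funding
Supported by the Major Program of the National Social Science Fund of China, "Research on Teaching Innovation for the China Overview Course at Confucius Institutes Worldwide and the Development of Its Digital Courses" (18ZDA339).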
Keywords
text readability
HSK reading text
linguistic features
machine learning
support vector machine