Funding: 2020 English assessment research project of the National Education Examinations Authority (Ministry of Education) and the British Council, "An Empirical Study into the Validity of a CSE-based Placement Test in China's Transnational Higher Education" (考协[2020]263号).
Abstract: The Versant English Test is a computer-assisted spoken English test that measures the oral proficiency of adult non-native learners of English; it was also the world's first fully automated spoken-language test to use speech recognition and processing technology. This paper first analyzes the features of the Versant English Test from three perspectives: the test interface and procedure, the items and task types, and the score report. It then describes the test's automated scoring technology, and finally introduces the validation studies and other related research conducted during the system's development. We hope this paper will be of help to the many English learners striving to improve their spoken English, and will offer insights for the development of computer-assisted spoken English tests in China.
Abstract: In recent years, with the rapid development of deep learning and speech recognition technology, computer-assisted spoken foreign-language learning based on deep-learning speech recognition has become a hot topic in applied artificial intelligence research. Drawing on state-of-the-art intelligent speech processing theory, this paper first describes the basic principles and algorithms of automated spoken-English assessment, and then, targeting the characteristics of examinee audio from the senior high school entrance examination (zhongkao) and college entrance examination (gaokao) speaking tests, proposes two noise-robust scoring algorithms based on deep neural network acoustic models. In validation experiments on real data from large-scale unified spoken-English examinations in junior and senior high schools, the proposed methods show a substantial performance advantage over the traditional GOP (Goodness of Pronunciation) approach. Part of the technology developed in this research has already been deployed in automated scoring systems for zhongkao, senior high school final, and gaokao mock speaking examinations in many regions of China, with good social benefits.
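For readers unfamiliar with the GOP baseline mentioned above, the sketch below shows one common DNN-based formulation: a phone's score is the mean log posterior the acoustic model assigns to the canonical phone over its aligned frames. This is an illustrative sketch under stated assumptions (frame-level posteriors and phone alignments already available), not the noise-robust algorithms proposed in the paper; the function name and toy data are hypothetical.

```python
# Minimal sketch of a DNN-posterior GOP (Goodness of Pronunciation) score.
# Assumes a DNN acoustic model has already produced per-frame phone
# posteriors, and forced alignment has identified which frames belong
# to the phone being scored.
import numpy as np

def gop_score(frame_posteriors: np.ndarray, phone_id: int) -> float:
    """Mean log posterior of the canonical phone over its aligned frames.

    frame_posteriors: (T, P) array of per-frame phone posteriors for the
        T frames aligned to this phone, over P phones.
    phone_id: index of the canonical (expected) phone.
    """
    log_post = np.log(frame_posteriors[:, phone_id] + 1e-10)  # avoid log(0)
    return float(log_post.mean())

# Toy usage: 4 frames, 3 phones; the canonical phone is phone 1.
posteriors = np.array([[0.1, 0.8, 0.1],
                       [0.2, 0.7, 0.1],
                       [0.1, 0.6, 0.3],
                       [0.3, 0.5, 0.2]])
print(gop_score(posteriors, phone_id=1))  # closer to 0 = better pronounced
```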
Abstract: Low rating reliability has long been the primary concern in school-based oral English achievement tests. In this study, a computer-aided rating system (CARS) was developed to improve inter- and intra-rater reliability through the instantiation of rating criteria, task division and random distribution, online training, reliability verification, and sound-wave "reading" and "writing." A rating experiment was conducted among six raters to compare intra- and inter-rater reliability between traditional rating and rating with CARS; at the end of each round of rating, a conference was held. Both quantitative and qualitative analyses show that CARS can significantly improve inter- and intra-rater reliability, mainly by helping raters use the criteria more accurately and focus more attention on rating. The research also sheds light on further study of ways to improve rating reliability.
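As a point of reference for the reliability comparison reported above, the sketch below computes simple inter- and intra-rater reliability indices using Pearson's r. The abstract does not specify which statistic was used, so both the choice of index and the toy scores here are illustrative assumptions.

```python
# Illustrative inter- and intra-rater reliability via Pearson correlation.
import numpy as np

def pearson_r(x, y) -> float:
    """Pearson correlation between two score vectors of equal length."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.corrcoef(x, y)[0, 1])

# Inter-rater: agreement between two raters on the same examinees.
rater_a = [3, 4, 2, 5, 4, 3]
rater_b = [3, 4, 3, 5, 4, 2]
print("inter-rater r:", pearson_r(rater_a, rater_b))

# Intra-rater: one rater's consistency across two rating rounds.
round_1 = [3, 4, 2, 5, 4, 3]
round_2 = [3, 5, 2, 4, 4, 3]
print("intra-rater r:", pearson_r(round_1, round_2))
```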
Funding: Educational Measurement Research Project sponsored by the 2006 National Education Science Research Plan of the National Education Examinations Authority.
Abstract: As a direct measure of learners' communicative language ability, performance assessment (typically writing and speaking assessment) claims construct validity and strong predictive utility of test scores. However, it is also of common concern that the subjectivity of the rating process, and the potential unfairness to test takers who encounter different writing prompts and speaking tasks, constitute threats to the reliability and validity of test scores, especially in large-scale, high-stakes tests. Appropriate means of quality control for subjective scoring should therefore be held essential in test administration and validation. Based on raw scores from one administration of the PETS Band 3 speaking test held in Hangzhou, the present study investigates and models possible sources of score variability within the framework of the Many-Facet Rasch Model (MFRM). MFRM conceptualizes the probability of an examinee being awarded a certain score as a function of several facets (examinee ability, rater severity, domain difficulty, and the step difficulty between adjacent score categories) and provides estimates of the extent to which the examinee's test score is influenced by those facets. Model construction and data analysis were carried out in FACETS Version 3.58, a computer program for conducting MFRM analysis. The results demonstrate statistically significant differences within each facet. Despite generally acceptable rater consistency across examinees and rating domains, fit statistics indicate unexpected rating patterns in certain raters, such as inconsistency and central tendency, which should be addressed through future rater training. Fair scores for each examinee are also provided, minimizing the variability due to facets other than examinee ability. MFRM thus proves effective in detecting whether each test-method facet functions as intended in performance assessment and in providing useful feedback for quality control of subjective scoring.
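The model described above is the standard many-facet extension of the Rasch rating scale model as estimated by FACETS; a minimal rendering of that formulation, with symbols matching the four facets the abstract lists, is:

```latex
% Many-Facet Rasch Model: the log-odds of examinee n receiving score k
% rather than k-1 from rater j on rating domain i.
\[
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - C_j - D_i - F_k
\]
% B_n: ability of examinee n;      C_j: severity of rater j;
% D_i: difficulty of rating domain i;
% F_k: step difficulty of score category k relative to category k-1.
```

Each facet parameter is estimated on a common logit scale, which is what allows a "fair score" to be computed for each examinee after adjusting for rater severity and domain difficulty.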