Abstract: This study investigates how raters make their scoring decisions when assessing tape-mediated speaking test performance. Twenty-four Chinese EFL teachers were trained and then analytically scored five sample tapes selected from TEM4-Oral, a national EFL speaking test for second-year college English majors in China. Verbal reports of what the raters were thinking while making their scoring decisions were audio-recorded during and immediately after each assessment, and post-scoring interviews supplemented this probe into the scoring process. A qualitative analysis of the data showed that the raters tended to give weight to content, to penalize both grammar and pronunciation errors, and to reward the use of impressive and uncommon words. Moreover, the whole decision-making process proved to be cyclic in nature. A flow chart describing this cyclic process of hypothesis forming and testing was then proposed and discussed.
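The flow chart itself is not reproduced in the abstract. As an illustration only, the cyclic hypothesis-forming-and-testing loop it describes might be sketched in Python as below; every function name, feature, and weight here is a hypothetical assumption for exposition, not the authors' model.

def evidence_score(segment):
    """Map one tape segment's observed features to a provisional band (1-5),
    mirroring the reported rater tendencies: weight content, penalize grammar
    and pronunciation errors, reward impressive and uncommon word choices."""
    score = 1 + 2 * segment["content"]                            # weight given to content
    score -= segment["grammar_errors"] + segment["pron_errors"]   # penalize errors
    score += segment["uncommon_words"]                            # reward uncommon words
    return max(1, min(5, score))

def rate_tape(segments):
    """Cyclic decision making: form an initial score hypothesis from the first
    evidence, then test it against each later segment and revise on conflict."""
    hypothesis = None
    for seg in segments:
        observed = evidence_score(seg)
        if hypothesis is None:
            hypothesis = observed                                 # form hypothesis
        elif observed != hypothesis:
            hypothesis = round((hypothesis + observed) / 2)       # revise, then re-test
    return hypothesis                                             # final analytic score

if __name__ == "__main__":
    tape = [
        {"content": 1, "grammar_errors": 1, "pron_errors": 0, "uncommon_words": 1},
        {"content": 1, "grammar_errors": 0, "pron_errors": 1, "uncommon_words": 0},
    ]
    print(rate_tape(tape))  # final band after the form-test-revise cycle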
Funding: Educational Measurement Research Project, sponsored by the 2006 National Education Science Research Plan of the National Education Examinations Authority.
Abstract: As a direct measure of learners' communicative language ability, performance assessment (typically writing and speaking assessment) claims construct validity and strong predictive utility of test scores. However, it is also a common concern that the subjectivity of the rating process, and the potential unfairness to test takers who encounter different writing prompts and speaking tasks, threaten the reliability and validity of test scores, especially in large-scale, high-stakes tests. Appropriate means of quality control over subjective scoring should therefore be held essential in test administration and validation. Based upon raw scores from one administration of the PETS Band 3 speaking test held in Hangzhou, the present study investigates and models possible sources of score variability within the framework of the Many-Facet Rasch Model (MFRM). MFRM conceptualizes the probability of an examinee being awarded a certain score as a function of several facets (examinee ability, rater severity, domain difficulty, and the step difficulty between adjacent score categories) and provides estimates of the extent to which an examinee's test score is influenced by those facets. Model construction and data analysis were carried out in FACETS Version 3.58, a computer program for conducting MFRM analysis. The results demonstrate statistically significant differences within each facet. Despite generally acceptable rater consistency across examinees and rating domains, fit statistics indicate unexpected rating patterns in certain raters, such as inconsistency and central tendency, which should be addressed through future rater training. Fair scores for each examinee are also provided, minimizing the variability due to facets other than examinee ability. MFRM thus proves effective in detecting whether each test-method facet functions as intended in performance assessment and in providing useful feedback for the quality control of subjective scoring.
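The abstract names the facets but not the model equation, so the exact parameterization used in the study is assumed here to follow the standard MFRM formulation implemented in FACETS (Linacre), in which the log-odds of examinee n receiving score k rather than k-1 from rater j on rating domain i is

    \ln \frac{P_{nijk}}{P_{nij(k-1)}} = B_n - C_j - D_i - F_k

where B_n is the ability of examinee n, C_j the severity of rater j, D_i the difficulty of domain i, and F_k the difficulty of the step from category k-1 to k. On this formulation, a fair score is, roughly, the expected score computed from the examinee's estimated ability alone, with rater severity and domain difficulty effects removed.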