As a direct measure of learners' communicative language ability, performance assessment (typically writing and speaking assessment) claims construct validity and strong predictive utility of test scores. However, it is a common concern that the subjectivity of the rating process and the potential unfairness to test takers who encounter different writing prompts and speaking tasks threaten the reliability and validity of test scores, especially in large-scale, high-stakes tests. Appropriate means of quality control for subjective scoring are therefore essential to test administration and validation. Based on raw scores from one administration of the PETS Band 3 speaking test held in Hangzhou, the present study investigates and models possible sources of score variability within the framework of the Many-Facet Rasch Model (MFRM). The MFRM conceptualizes the probability of an examinee being awarded a certain score as a function of several facets (examinee ability, rater severity, domain difficulty, and the step difficulty between adjacent score categories) and provides estimates of the extent to which an examinee's test score is influenced by each of these facets. Model construction and data analysis were carried out in FACETS Version 3.58, a computer program for conducting MFRM analyses. The results demonstrate statistically significant differences within each facet. Although rater consistency across examinees and rating domains is generally acceptable, fit statistics reveal unexpected rating patterns in certain raters, such as inconsistency and central tendency, which should be addressed through future rater training. Fair scores are also provided for each examinee, minimizing the variability due to facets other than examinee ability. The MFRM thus proves effective in detecting whether each test-method facet functions as intended in performance assessment and in providing useful feedback for the quality control of subjective scoring.
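For reference, the relationship the abstract describes is usually written as Linacre's many-facet extension of the Rasch rating scale model; the formulation below is a standard textbook sketch with conventional symbol names, not the paper's own specification:

\ln\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k

Here P_{nijk} is the probability of examinee n receiving score k (rather than k-1) from rater j on rating domain i; B_n is examinee ability, D_i is domain difficulty, C_j is rater severity, and F_k is the step difficulty of moving from category k-1 to category k. Under this formulation, a fair score for an examinee is the model-expected score with the non-ability facets held at their average values, which is how FACETS-style analyses typically adjust raw scores for, e.g., differences in rater severity.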
Funding: Educational Measurement Research Project sponsored by the 2006 National Education Science Research Plan of the National Education Examinations Authority