
A Study of Score Variability in Language Performance Assessment: A Many-Facet Rasch Model Approach (in English)  Cited by: 4

STUDY OF SOURCES OF SCORE VARIABILITY IN PERFORMANCE ASSESSMENT USING MFRM: A CASE OF THE SPEAKING TEST IN PETS BAND 3
Abstract: In language performance assessment, score variability reflects not only differences in examinee ability but also the influence of various other factors. To ensure test reliability and validity, identifying and estimating the different sources of variability helps researchers better understand the sources and magnitude of measurement error. Based on the scores from one administration of the PETS Band 3 speaking test at a single test center, this paper examines score variability using the Many-Facet Rasch Model. The results show that rater severity, the rating method, the rating criteria, and the rating scale may all introduce measurement error and thus contribute to differences in examinees' scores. As an extension of the classical Rasch model, the Many-Facet Rasch Model can jointly analyze multiple sources of error in test scores and is an effective tool for test quality control.

As a direct measure of learners' communicative language ability, performance assessment (typically writing and speaking assessment) claims construct validity and strong predictive utility of test scores. However, it is also of common concern that the subjectivity of the rating process, and the potential unfairness to test takers who encounter different writing prompts and speaking tasks, constitute threats to the reliability and validity of test scores, especially in large-scale, high-stakes tests. Appropriate means of quality control for subjective scoring should therefore be held essential in test administration and validation. Based upon raw scores from one administration of the PETS Band 3 speaking test held in Hangzhou, the present study investigates and models possible sources of score variability within the framework of the Many-Facet Rasch Model (MFRM). MFRM conceptualizes the probability of an examinee being awarded a certain score as a function of several facets (examinee ability, rater severity, domain difficulty, and step difficulty between adjacent score categories) and provides estimates of the extent to which the examinee's test score is influenced by those facets. Model construction and data analysis were carried out in FACETS Version 3.58, a computer program for conducting MFRM analysis. The results demonstrate statistically significant differences within each facet. Despite generally acceptable rater consistency across examinees and rating domains, fit statistics indicate unexpected rating patterns in certain raters, such as inconsistency and central tendency, which should be addressed through future rater training. Fair scores for each examinee are also provided, minimizing the variability due to facets other than examinee ability. MFRM thus proves effective in detecting whether each test-method facet functions as intended in performance assessment and in providing useful feedback for quality control of subjective scoring.
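The model the abstract describes can be sketched numerically. In a rating-scale MFRM, the log-odds of receiving score k rather than k-1 is examinee ability minus rater severity minus domain difficulty minus the k-th step difficulty (all in logits), and a FACETS-style "fair score" is the expected score with the non-ability facets held at their average values. The sketch below illustrates this logic; all parameter values are made-up assumptions, not estimates from the PETS data.

```python
import math

def mfrm_category_probs(ability, rater_severity, domain_difficulty, steps):
    """Score-category probabilities under a rating-scale Many-Facet Rasch Model.

    The log-odds of category k versus k-1 is
        ability - rater_severity - domain_difficulty - steps[k-1]
    (all in logits); category 0 is the reference category.
    """
    logits = [0.0]  # cumulative log-odds; category 0 fixed at 0
    for step in steps:
        logits.append(logits[-1] + ability - rater_severity
                      - domain_difficulty - step)
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]

def expected_score(ability, rater_severity, domain_difficulty, steps):
    """Expected score sum(k * P(k)); with the rater and domain facets set to
    their mean values this corresponds to a FACETS-style fair score."""
    probs = mfrm_category_probs(ability, rater_severity, domain_difficulty, steps)
    return sum(k * p for k, p in enumerate(probs))

# Illustrative 0-4 scale: a severe rater (0.8 logits) vs. the average rater (0.0)
steps = [-1.2, -0.4, 0.5, 1.3]
observed = expected_score(1.5, rater_severity=0.8, domain_difficulty=0.0, steps=steps)
fair = expected_score(1.5, rater_severity=0.0, domain_difficulty=0.0, steps=steps)
# The fair score exceeds the observed expected score because the severe
# rater's 0.8-logit penalty has been removed.
```

This is how fair scores minimize variability due to facets other than ability: the same examinee measure is re-expressed against average rater severity and average domain difficulty, so two examinees who happened to meet raters of different severity become directly comparable.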
Authors: 张洁 (Zhang Jie), 何莲珍 (He Lianzhen)
Source: Chinese Journal of Applied Linguistics (中国应用语言学), 2008, No. 4, pp. 40-49, 128 (11 pages in total)
Funding: Educational Measurement Research Project sponsored by the 2006 National Education Science Research Plan of the National Education Examinations Authority
Keywords: PETS speaking test, quality control of scoring, many-facet Rasch model (MFRM)
