
A Study of Score Variability in Language Performance Assessment: A Many-Facet Rasch Model Approach (in English)  Cited by: 4

STUDY OF SOURCES OF SCORE VARIABILITY IN PERFORMANCE ASSESSMENT USING MFRM: A CASE OF THE SPEAKING TEST IN PETS BAND 3
Abstract: In language performance assessment, score variability reflects not only differences in examinee ability but also the influence of various other factors. To ensure test reliability and validity, identifying and estimating the different sources of variability helps researchers better understand the sources and magnitude of measurement error. Based on the scores from one administration of the PETS Band 3 speaking test at a single test center, this paper examines score variability using the Many-Facet Rasch Model. The results show that rater severity, the rating method, the rating criteria, and the rating scale may all introduce measurement error and thus contribute to differences in examinees' scores. As an extension of the classical Rasch model, the Many-Facet Rasch Model can jointly analyze multiple sources of error in test scores and is an effective tool for test quality control.

As a direct measure of learners' communicative language ability, performance assessment (typically writing and speaking assessment) claims construct validity and strong predictive utility of test scores. However, it is also of common concern that the subjectivity of the rating process, and the potential unfairness to test takers who encounter different writing prompts and speaking tasks, constitute threats to the reliability and validity of test scores, especially in large-scale, high-stakes tests. Appropriate means of quality control for subjective scoring should therefore be held essential in test administration and validation. Based upon raw scores from one administration of the PETS Band 3 speaking test held in Hangzhou, the present study investigates and models possible sources of score variability within the framework of the Many-Facet Rasch Model (MFRM). MFRM conceptualizes the probability of an examinee being awarded a certain score as a function of several facets (examinee ability, rater severity, domain difficulty, and step difficulty between adjacent score categories) and provides estimates of the extent to which the examinee's test score is influenced by those facets. Model construction and data analysis were carried out in FACETS Version 3.58, a computer program for conducting MFRM analysis. The results demonstrate statistically significant differences within each facet. Despite generally acceptable rater consistency across examinees and rating domains, fit statistics indicate unexpected rating patterns in certain raters, such as inconsistency and central tendency, which should be addressed through future rater training. Fair scores for each examinee are also provided, minimizing the variability due to facets other than examinee ability. MFRM thus proves effective in detecting whether each test-method facet functions as intended in performance assessment and in providing useful feedback for quality control of subjective scoring.
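The model the abstract describes can be sketched numerically. In a rating-scale MFRM, the log-odds of receiving score k rather than k-1 is examinee ability minus rater severity minus domain difficulty minus the k-th step difficulty (all in logits), and a FACETS-style "fair score" is the expected score with the non-ability facets held at their average values. The sketch below illustrates this logic; all parameter values are made-up assumptions, not estimates from the PETS data.

```python
import math

def mfrm_category_probs(ability, rater_severity, domain_difficulty, steps):
    """Score-category probabilities under a rating-scale Many-Facet Rasch Model.

    The log-odds of category k versus k-1 is
        ability - rater_severity - domain_difficulty - steps[k-1]
    (all in logits); category 0 is the reference category.
    """
    logits = [0.0]  # cumulative log-odds; category 0 fixed at 0
    for step in steps:
        logits.append(logits[-1] + ability - rater_severity
                      - domain_difficulty - step)
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]

def expected_score(ability, rater_severity, domain_difficulty, steps):
    """Expected score sum(k * P(k)); with the rater and domain facets set to
    their mean values this corresponds to a FACETS-style fair score."""
    probs = mfrm_category_probs(ability, rater_severity, domain_difficulty, steps)
    return sum(k * p for k, p in enumerate(probs))

# Illustrative 0-4 scale: a severe rater (0.8 logits) vs. the average rater (0.0)
steps = [-1.2, -0.4, 0.5, 1.3]
observed = expected_score(1.5, rater_severity=0.8, domain_difficulty=0.0, steps=steps)
fair = expected_score(1.5, rater_severity=0.0, domain_difficulty=0.0, steps=steps)
# The fair score exceeds the observed expected score because the severe
# rater's 0.8-logit penalty has been removed.
```

This is how fair scores minimize variability due to facets other than ability: the same examinee measure is re-expressed against average rater severity and average domain difficulty, so two examinees who happened to meet raters of different severity become directly comparable.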
Authors: 张洁 (Zhang Jie), 何莲珍 (He Lianzhen)
Source: Chinese Journal of Applied Linguistics (中国应用语言学), 2008, No. 4, pp. 40-49, 128 (11 pages in total)
Funding: Educational Measurement Research Project sponsored by the 2006 National Education Science Research Plan of the National Education Examinations Authority
Keywords: PETS speaking test, quality control of scoring, many-facet Rasch model (MFRM)
