Funding: Supported by the Excellent Young Science and Technology Talent Cultivation Special Project of the China Academy of Chinese Medical Sciences (CI2023D006), the National Natural Science Foundation of China (82121003 and 82022076), the Beijing Natural Science Foundation (2190023), the Shenzhen Fundamental Research Program (JCYJ20220818103207015), and the Guangdong Provincial Key Laboratory of Human Digital Twin (2022B1212010004).
Abstract: In medical image segmentation, it is often necessary to collect opinions from multiple experts to make the final decision. This clinical routine helps to mitigate individual bias. However, when data are annotated by multiple experts, standard deep learning models are often not applicable. In this paper, we propose a novel neural network framework called Multi-rater Prism (MrPrism) to learn medical image segmentation from multiple labels. Inspired by iterative half-quadratic optimization, MrPrism combines the tasks of assigning multi-rater confidences and calibrated segmentation in a recurrent manner. During this process, MrPrism learns the inter-observer variability while taking into account the image's semantic properties, and finally converges to a self-calibrated segmentation result that reflects inter-observer agreement. Specifically, we propose the Converging Prism (ConP) and Diverging Prism (DivP) to iteratively process the two tasks. ConP learns calibrated segmentation from the multi-rater confidence maps estimated by DivP, and DivP generates multi-rater confidence maps from the segmentation masks estimated by ConP. Experimental results show that the two tasks mutually improve each other through this recurrent process. The final converged segmentation result of MrPrism outperforms state-of-the-art (SOTA) methods on a wide range of medical image segmentation tasks. The code is available at https://github.com/WuJunde/MrPrism.
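The ConP/DivP alternation described above can be illustrated with a toy sketch. This is not the paper's network: as stand-ins, DivP is replaced by a per-rater pixel-agreement score and ConP by a confidence-weighted vote, iterated until the fused segmentation stops changing. All function names, the majority-vote initialization, and the stopping rule are illustrative assumptions.

```python
import numpy as np

def div_p(seg, rater_masks):
    # DivP stand-in: rater confidence = pixel-wise agreement of each
    # rater's mask with the current calibrated segmentation.
    agree = (rater_masks == seg).mean(axis=(1, 2))  # one score per rater
    return agree / agree.sum()                      # normalized weights

def con_p(weights, rater_masks):
    # ConP stand-in: calibrated segmentation = confidence-weighted
    # majority vote over the rater masks.
    fused = np.tensordot(weights, rater_masks.astype(float), axes=1)
    return (fused >= 0.5).astype(int)

def mr_prism_iterate(rater_masks, n_iter=10):
    # Initialize with a plain majority vote, then alternate the two
    # tasks until the segmentation is self-consistent (converged).
    seg = (rater_masks.mean(axis=0) >= 0.5).astype(int)
    for _ in range(n_iter):
        w = div_p(seg, rater_masks)
        new_seg = con_p(w, rater_masks)
        if np.array_equal(new_seg, seg):
            break
        seg = new_seg
    return seg, w
```

In this toy version, raters whose masks agree with the current consensus gain weight, which in turn sharpens the next consensus; the fixed point plays the role of the self-calibrated result.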
Funding: Supported by the Youth Foundation of the Ministry of Education of China for Humanity and Social Science Research (15YJC740004), the Fundamental Research Funds for the Central Universities in China (16LZUJBWZY032 and LZUJBWZY069), and the Fund of the School of Foreign Languages of LZU (16LZUWYXSTD002).
Abstract: This study consists of two questionnaire surveys, conducted in two stages, that investigate the factors high-stakes exam essay raters believe affect their rating behavior. The raters were all university Chinese teachers of English majors. Seventy-three participants in stage one and 75 in stage two responded to the same questionnaire. Both exploratory factor analysis and confirmatory factor analysis were used in the data analysis. Results showed six broad factors interfering with the rating process: the rating scale, rater training, rating supervision, rater characteristics, eye-catching text features, and the rating condition. The interaction of these factors reflected the tension between the constraints imposed by the test institution and raters' own knowledge and understanding of essay rating. This study may shed light on measures for improving essay rating quality.
Funding: Funded by the China National Planning Office of Philosophy and Social Science (No. 08XYY007).
Abstract: This exploratory study investigates 1) whether and how quantitative measures of writing can be applied to uncover raters' specific tendencies in their scoring of EFL writing; 2) how knowledge of raters' tendencies and scoring results can help identify the best way of combining raters' scores; and 3) how well the writing scores predicted from quantitative writing performance measures match the real scores given by raters. Based on a tentative CAF (complexity, accuracy, fluency) framework of writing measures, raters' performance and tendencies in scoring were observed, and certain patterns of similarity as well as difference were found among the raters. The results of multiple linear regressions indicate that all raters give prior attention to accuracy in their scoring. Differences among raters are also obvious. Regarding the combination of different raters' scores, the study finds that the weighted average is the best of the three ways of combining scores for this group of raters, because it yields better predicted scores than the "pure average". It is even slightly better than the results obtained by facet analysis on some important indices, such as R-squared and the Durbin-Watson value. The match between the predicted scores and the real scores is well over 50 percent. The results are further discussed in relation to the application of writing performance measures (wpm) and possible improvements to the wpm framework. The methodological, theoretical, and practical implications of the study are also touched upon.
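The contrast between a "pure average" and a weighted combination of rater scores can be shown numerically. In the sketch below the essay scores and the reference (criterion) scores are invented, and the weights are fitted by ordinary least squares against that reference; this is only one way to operationalize a weighted average, and the study's actual weighting scheme may differ.

```python
import numpy as np

# Hypothetical data: 6 essays scored by 3 raters, plus a reference
# (criterion) score against which combinations are validated.
scores = np.array([
    [72, 70, 78],
    [65, 60, 66],
    [80, 82, 85],
    [58, 55, 60],
    [90, 88, 92],
    [70, 68, 74],
], dtype=float)
reference = np.array([74., 64., 83., 58., 91., 71.])

# "Pure average": equal weight for every rater.
pure_average = scores.mean(axis=1)

# Weighted average: per-rater weights fitted by least squares
# against the reference scores.
w, *_ = np.linalg.lstsq(scores, reference, rcond=None)
weighted_average = scores @ w

def rss(pred):
    # Residual sum of squares versus the reference scores.
    return float(((pred - reference) ** 2).sum())
```

In-sample, the fitted weights can never do worse than the equal-weight average, because the equal-weight combination is itself one point in the least-squares search space; whether the advantage survives on new essays is exactly the kind of question the study's prediction-matching analysis addresses.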
Abstract: The coefficient of reliability is often estimated from a sample that includes few subjects, so the precision of this estimate is expected to be low. Measures of precision such as bias and variance depend heavily on the assumption of normality, which may not be tenable in practice. Expressions for the bias and variance of the reliability coefficient in the one- and two-way random effects models, using the multivariate Taylor expansion, have been obtained under the assumption of normality of the scores (Atenafu et al. [1]). In the present paper we derive analytic expressions for the bias and variance, and hence the mean square error, when the measured responses are not normal under the one-way data layout. Similar expressions are derived for the two-way data layout. We assess the effect of departure from normality on the sample size requirements and on the power of Wald's test for specified hypotheses. We analyze two data sets and draw comparisons with results obtained via bootstrap methods. The estimated bias and variance based on the bootstrap method were quite close to those obtained by the first-order approximation using the Taylor expansion, indicating that for the given data sets the approximations are quite adequate.
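For the one-way layout, the bootstrap comparison mentioned above can be sketched as follows. The sketch computes the intraclass correlation ICC(1) from the one-way ANOVA mean squares and resamples subjects (rows) with replacement; the simulated data, sample sizes, and replication count are illustrative, not those of the paper.

```python
import numpy as np

def icc_oneway(x):
    # ICC(1) for a subjects-by-raters matrix under the one-way
    # random effects model: (MSB - MSW) / (MSB + (k - 1) * MSW).
    n, k = x.shape
    subj_means = x.mean(axis=1)
    msb = k * subj_means.var(ddof=1)                              # between-subject MS
    msw = ((x - subj_means[:, None]) ** 2).sum() / (n * (k - 1))  # within-subject MS
    return (msb - msw) / (msb + (k - 1) * msw)

def bootstrap_bias_var(x, n_boot=2000, seed=0):
    # Resample subjects with replacement and return the bootstrap
    # estimates of the bias and variance of the ICC point estimate.
    rng = np.random.default_rng(seed)
    point = icc_oneway(x)
    reps = np.array([icc_oneway(x[rng.integers(0, len(x), len(x))])
                     for _ in range(n_boot)])
    return reps.mean() - point, reps.var(ddof=1)
```

Resampling whole subjects (rather than individual scores) preserves the within-subject correlation structure, which is what the reliability coefficient measures.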
Abstract: This study investigates how raters make scoring decisions when assessing tape-mediated speaking test performance. Twenty-four Chinese EFL teachers were trained before analytically scoring five sample tapes selected from TEM4-Oral, a national EFL speaking test designed for college English major sophomores in China. The raters' verbal reports on what they were thinking about while making their scoring decisions were audio-recorded and collected during and immediately after each assessment. Post-scoring interviews supplemented the probe of the scoring process. A qualitative analysis of the data showed that the raters tended to give weight to content, to penalize both grammar and pronunciation errors, and to reward the use of impressive and uncommon words. Moreover, the whole decision-making process was shown to be cyclic in nature. A flow chart describing this cyclic process of hypothesis forming and testing is then proposed and discussed.