When comparing two measurement methods, it is of interest not only to estimate the bias and the limits of agreement between the two methods (inter-rater agreement), but also to assess these characteristics for each method on its own. Agreement between two methods may well be poor simply because one method has wide limits of agreement while the other has narrow ones. In that case, the method with the narrow limits of agreement would be statistically superior, although practical or other considerations could change this assessment. What counts as narrow or wide limits of agreement, or as a small or large bias, is a matter of practical judgment.

The difference between two ratings is evaluated relative to its standard error, $(x_1 - x_2)/S_{diff}$, with $x_1$ and $x_2$ the compared values and $S_{diff} = \sqrt{2\,SEM^2}$. The latter is the standard error of the difference between two test scores and therefore describes the spread of the distribution of differences if no true differences occurred. SEM was calculated as $SEM = s_1\sqrt{1 - r_{xx}}$, with $s_1$ = SD and $r_{xx}$ = reliability of the measure.

Figure 1. Analysis approach. A total of 53 rating pairs were included in the analysis and divided into two rating subgroups (shown as rounded boxes in the top row). On the left side of the figure, the purpose of each applied statistical analysis is stated as a research question.

The next column shows the analyses conducted within the parent-teacher rating subgroup (n = 34); the right column shows the corresponding analyses for the mother-father rating subgroup (n = 19). The centre column lists the tests carried out for the entire study population as well as the between-group comparisons. The dotted arrows mark analyses performed for the diverging ratings identified using the test-retest reliability reported in the manual (no reliably diverging ratings were identified when the ICC was used to calculate the critical difference between ratings).

In the formula for the intra-class correlation coefficient (ICC), $\sigma^2_{bt}$ is the variance of ratings between children, $\sigma^2_{in}$ is the variance within children, and $k$ is the number of raters. For all ICCs, confidence intervals were calculated to determine whether they differed from one another.

So far, we have reported results on inter-rater reliability and on the number of diverging ratings within and between subgroups, using two different but equally legitimate reliability estimates. We also examined factors that could influence the likelihood of obtaining two statistically diverging ratings and described the magnitude of the observed differences. These analyses focused on inter-rater reliability and agreement, as well as related measures.
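The quantities discussed so far can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the example numbers, and the choice of the one-way random-effects ICC written in terms of mean squares are assumptions made for illustration, and the 1.96 multiplier for the limits of agreement is the conventional 95% value rather than something stated in the text.

```python
import math
from statistics import mean, stdev


def standard_error_of_measurement(sd, reliability):
    """SEM = s1 * sqrt(1 - rxx)."""
    return sd * math.sqrt(1.0 - reliability)


def standard_error_of_difference(sem):
    """Sdiff = sqrt(2 * SEM^2)."""
    return math.sqrt(2.0 * sem ** 2)


def standardized_difference(x1, x2, sd, reliability):
    """(x1 - x2) / Sdiff for two ratings of the same child."""
    sem = standard_error_of_measurement(sd, reliability)
    return (x1 - x2) / standard_error_of_difference(sem)


def icc_oneway(ratings):
    """One-way random-effects ICC (assumed variant), from mean squares.

    ratings: one tuple of k ratings per child.
    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW)
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = mean(r for child in ratings for r in child)
    child_means = [mean(child) for child in ratings]
    # Between-children mean square
    msb = k * sum((m - grand_mean) ** 2 for m in child_means) / (n - 1)
    # Within-children mean square
    msw = sum(
        (r - m) ** 2 for child, m in zip(ratings, child_means) for r in child
    ) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


def limits_of_agreement(pairs):
    """Bias and conventional 95% limits of agreement for rating pairs."""
    diffs = [a - b for a, b in pairs]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread


if __name__ == "__main__":
    # Hypothetical values: SD = 10, reliability = 0.9, ratings 60 vs. 50
    print(standardized_difference(60, 50, sd=10, reliability=0.9))
    # Hypothetical rating pairs (rater A, rater B) for three children
    example_pairs = [(10, 8), (12, 11), (9, 9)]
    print(icc_oneway(example_pairs))
    print(limits_of_agreement(example_pairs))
```

Whether a standardized difference counts as a reliable divergence depends on the chosen threshold and on which reliability estimate (for example, the manual's test-retest reliability or the ICC) is plugged in as `reliability`, which is exactly the comparison the analysis above draws.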

In this final section, we turn to Pearson correlation coefficients to examine the linear relationship between ratings and its strength within and between rating subgroups.

To give a realistic and targeted example of the research strategy described above, we used the ELAN vocabulary scale (Bockmann and Kiese-Himmel, 2006), a German parent questionnaire designed to screen children's early expressive vocabulary. The instrument consists of a checklist of 250 individual words: for each item on the list, the rater decides whether or not the child actively uses it. General questions about demographic background and child development supplement the vocabulary information. Children who attended regular daycare were rated by a daycare teacher and a parent; children cared for exclusively in their families were rated by both parents. . . .
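As a rough illustration of the correlation step, the sketch below scores a checklist and computes the Pearson correlation between two raters' scores. It assumes that a child's vocabulary score is simply the number of checked words, which the text does not state explicitly; the function names and the data are hypothetical.

```python
import math


def vocabulary_score(checklist):
    """Count the items marked as actively used (assumed scoring rule)."""
    return sum(1 for used in checklist if used)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical scores for five children, one list per rater
    parent_scores = [120, 85, 200, 40, 150]
    teacher_scores = [110, 90, 180, 55, 160]
    print(pearson_r(parent_scores, teacher_scores))
```

A high correlation only shows that the two raters order the children similarly; it does not by itself show that their scores agree, which is why it is reported separately from reliability and agreement.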