Identifying the most meaningful measure of test quality

Much discussion has taken place over the past few decades concerning what is the most important criterion for evaluating the quality of a test: is it validity? Is it reliability? Is it authenticity? Or is it something different from these, or perhaps some combination of them?

Validity, or more specifically construct validity, has received perhaps the most attention and support as the supreme test quality. If we define it as “measuring what a test is designed to measure,” then it is clearly vital that a test succeed in measuring the skill (e.g., academic writing ability) or knowledge area (e.g., understanding the pragmatic rules relating to international business negotiations) that it was designed to measure. The Scholastic Aptitude Test (SAT) has often been panned for its supposed inability to accurately assess students’ readiness for post-secondary study. The paper-based TOEFL was so poorly regarded for its lack of validity as a test of academic English proficiency that its creator, the Educational Testing Service (ETS), overhauled the content and design of the test so thoroughly that the current iBT TOEFL bears little if any resemblance to its “ancestor”: it features direct measurement of writing ability in the form of actual writing tasks and semi-direct measurement of speaking (each test-taker speaks into a microphone and has their voice recorded). The iBT TOEFL, far more than the original, merits the designation of a “valid” test.

Reliability, however, has also been touted as an essential ingredient; the focus on psychometrics reflected in tests such as the paper-based TOEFL and the original Michigan English test featured multiple-choice items measuring knowledge of discrete points of grammar and specific vocabulary terms. Tests were long in order to provide coverage of a wide range of knowledge “points,” and the multiple-choice design of items enabled quick, accurate scoring and, therefore, “trustworthy” results. While stakeholders such as teachers and admissions officers bemoaned the lack of validity in psychometric tests, reliability advocates could point to the significant challenge of accurately scoring direct speaking and writing tasks for hundreds or thousands of test-takers on a single administration.

The reliability vs. validity debate ignored the fact that there are other important criteria for evaluating the quality of a test or other means of assessment. The growing popularity of communicative language teaching in the 1980s increased the desire to find ways to directly and accurately test the skills of speaking and writing, absent in tests such as the paper TOEFL. The introduction of tests such as the IELTS, with a focus on measuring both receptive and productive communicative skills, highlighted the fact that other criteria affect test quality. Authenticity concerns the degree to which a test task reflects real use of language, whether that be in a college classroom, office, or coffee shop, and whether the task be understanding a portion of a lecture or written instructions for getting to a hotel, or writing a letter of complaint to a hotel in response to unsatisfactory service. Practicality addresses the question of how to carry out assessment, including grading, in a manner that does not exceed the available resources, including those of time and energy.

When Bachman and Palmer (1996)1 identified what they considered to be the qualities of test usefulness, they included the four already mentioned, but added interactiveness and impact to their list. Interactiveness refers to the extent to which a test or test task accounts for learners’ topical knowledge, language ability, and affective factors such as interest and feelings (i.e., sensitivity to particular topics). Impact is concerned with the effect that a test has not only on learners but also at the classroom, school, and even societal level. The gatekeeping effects of admissions tests such as the TOEFL impact not only learners, but also teachers and language schools, as practitioners and institutions seek to find effective ways to prepare learners for these tests. College entrance tests in countries such as Japan have had a similar effect.
The decades-long proliferation of cram schools such as juku in Japan and hagwon in South Korea bears testimony to the fact that test results can significantly impact a student’s long-term prospects.

Yet even these qualities do not address all the concerns learners and other stakeholders have with regard to tests and other forms of assessment. The issue of fairness is also clearly a matter of great concern. Reports of test-taker accent having an impact on scores obtained on speaking tests have invited the ire of both learners and others invested in their educational success. An international research study on the effect of accent familiarity on ratings given to samples of responses on an IELTS speaking task indicated that familiarity does indeed affect a rater’s perception of the quality of learner speech2. Group discussions with non-native teachers of English (i.e., those for whom English is not their native tongue) reveal their generally strong distrust of human raters of their speaking ability and their resulting preference for automated scoring of speaking, a feature of tests such as the Pearson Test of English (PTE) Academic. The grading of writing has likewise been called into question; in response to this, evaluation of test-takers’ responses to the two writing tasks on the iBT TOEFL involves a combination of automated (“e-rater”) and human scoring to maximize the reliability, validity, and thus the fairness of the scores assigned.

There are clearly several factors to consider when evaluating the quality of an assessment tool or approach. None by itself can cover all the concerns raised about testing. In light of this, let us ponder a general quality that takes all of the above into account. If we consider what people identify as most important when evaluating a teacher, a church leader, or a politician or political party, the word we so often hear is credibility, that is, their trustworthiness or, as the morphology of the term implies, their “believability.” This believability gives meaning, and thus value, to what they say. In the realm of education, the credibility of a teaching approach or a test in the eyes of students, parents, and other stakeholders builds trust and adds value to what is revealed. With the preceding discussion in mind, I propose the following model of test credibility:

[Figure: proposed model of test credibility]

This is, of course, a subjective model. It is, however, based on two decades of experience as a language teacher, as well as what I have learned during the past decade as an instructor in a graduate-level TESOL program. It is comprehensive enough to encompass the criteria discussed above, and in even greater detail than I have presented here. Finally, it gives central focus to a concern that matters not only to those of us in the field, but also to others, such as students, parents, and other stakeholders.

1Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford University Press.

2Carey, M. D., Mannell, R. H., & Dunn, P. K. (2011). Does a rater’s familiarity with a candidate’s pronunciation affect the rating in oral proficiency interviews? Language Testing, 28(2), 201–219.
