Balancing the qualities of Test Usefulness

It is one thing to understand the criteria by which we may design and evaluate language tests and other assessment tools; it is another to apply them as we seek to create or adopt tests that meet these qualities.

Bachman and Palmer’s qualities of test usefulness¹ are a valuable resource for teachers; however, they also present teachers with a challenging balancing act. While five of these six qualities (validity, reliability, authenticity, interactiveness, and practicality) all affect the quality of our assessments, their differing characteristics mean that prioritizing one or two weakens some of the others.

Take a multiple-choice test of grammatical or lexical knowledge, for example, such as the “Structure and Written Expression” section of the original paper-based (PBT) TOEFL test. As veteran TESOL professionals may remember, it consisted of items calling on test-takers either to identify the incorrect word or phrase in a sentence or to choose the word(s) that best completed a sentence.²

Carefully designed multiple-choice tests have only one correct answer per question and are often machine-scored. They boast high reliability, even when scored by hand, and are practical because they can be scored quickly and accurately. Their practicality is further boosted for learners who are familiar with the format, as millions worldwide are, not only for English but also for other languages and other school subjects. Such tests can also claim high construct validity as measures of language knowledge.

However, accurate knowledge of a language and correct use of it are two very different skills, and as many teachers will attest, students may have a high level of knowledge of English grammatical rules yet be unable to write or speak more than one or two sentences with any degree of accuracy or confidence. This testifies to the fact that multiple-choice tests of grammar and/or vocabulary knowledge have very low construct validity as measures of writing or speaking ability, or even of reading comprehension. By virtue of their design, they are likewise very weak in terms of authenticity, since the task of completing a multiple-choice test bears no resemblance to the types of tasks a business-English student will perform in the workplace or that an international student will face in a regular course at a Canadian, Australian, or British university. They also tend to have very limited interactiveness (the degree to which a test taps into test-takers’ language ability and topical knowledge), since they measure topical knowledge only at the sentence level, except in cases where a multiple-choice test is designed around a single subject.

A very different challenge faces teachers who seek to design direct tests of writing or speaking. By a “direct test”, I mean a test containing one or more tasks that require test-takers either to speak with one or more people, or into a microphone (“semi-direct”), or to complete one or more academic essays (iBT TOEFL) or tasks such as a job-application cover letter or a letter of complaint about a product (IELTS General Training, for example). Such tests, assuming they are well designed, boast high construct validity as tests of language performance (speaking, writing), as well as high authenticity, depending on their resemblance to targeted “real world” tasks. They likewise have high interactiveness, since they require test-takers to draw on both language ability and topical knowledge in the context of extended, authentic tasks. Both the iBT TOEFL and the IELTS represent attempts to create tests that are highly valid, authentic, and interactive. But attainment of these qualities comes at a steep price in terms of practicality and reliability.

This is due in large part to the fact that scoring direct tests of speaking or writing can be extremely time-consuming at two stages in the test-design and grading process. First, direct tests of language performance are graded using rubrics, and rubrics must be detailed enough to provide an adequate measure of language ability for a particular group of test-takers (e.g. international students seeking admission to English-medium universities), yet user-friendly enough that those scoring the tests can do so as practically and reliably as possible. For a single teacher of an elementary-level or even intermediate-level writing course, this is a relatively simple task, since the amount of language that learners produce at that level is very limited, and such tasks would not be considered high-stakes in terms of the test-takers’ futures. The rubric, if there is one, would be correspondingly simple and thus easy to use. For large-scale tests designed for, say, entire programs, however, the challenge is formidable, all the more so because rubrics are often task-specific. The iBT TOEFL and the IELTS (both the Academic and General Training versions), for example, each contain two writing tasks, and each task has a separate, detailed rubric. A look at the rubrics included in the links below testifies to the impracticality of the process.³

The complexity of a detailed rubric means that raters for a large-scale test must be trained to use it efficiently and reliably; this training is the second stage of the overall scoring process that makes it time-consuming and impractical. Even for a single teacher of a writing or speaking course, learning to use a rubric efficiently is an acquired skill, particularly for novice instructors. However, efficiency is critical for both intra-rater (single-rater) and inter-rater reliability; the process must be efficient so that it can be consistent and therefore reliable. Yet even well-trained raters face a foe that presents the most significant challenge to the reliable grading of speaking and writing tests: fatigue. Whether one is interviewing test-takers to assess their speaking ability or grading a set of paragraphs or essays, concentration is vital for maintaining consistency in assessment, and the mental energy required creates considerable fatigue, whether one is an individual teacher or a trained rater.

For many of us, it is this fatigue, more than any other factor, that affects how we plan assessment, because we recognize the effects it has on grading, particularly when we assess students’ language performance. We want our tests to be valid, whatever skill(s) we are testing, but considerations of time, mental energy, and fatigue mean that we must perform a balancing act, weighing the relative importance of practicality against that of validity, authenticity, and interactiveness, without losing sight of reliability. Performing this balancing act well is an acquired skill, and one that can take years to master.

¹Bachman, L. F., & Palmer, A. S. (1996). Language Testing in Practice. Oxford: Oxford University Press.




***To find out more about test design and other issues related to language assessment, contact Gordon Moulden to arrange a workshop or presentation.


