NOTES

1 For an analysis of the research base and the methods of investigating the effectiveness of computer-assisted instruction (CAI) and computer-assisted language learning (CALL), see Chapelle (1997) and Dunkel (1991).

2 In a computer-adaptive test (CAT), the computer presents each test item, scores the response, and then selects the next item most appropriate for the candidate's skill level. Questions that are too easy or too difficult for the candidate are not presented (Green, Bock, Humphreys, Linn, & Reckase, 1984). Adaptive testing thus seeks to present only items that are appropriate for the test taker's estimated level of ability.
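
As a rough illustration of this select-score-update cycle, consider the following sketch; the item bank, the fixed-step ability update, and all names are invented for illustration and are not drawn from any of the systems cited in these notes.

```python
# Hypothetical sketch of the adaptive loop described in note 2: pick the
# unadministered item whose difficulty best matches the current ability
# estimate, score the response, and nudge the estimate accordingly.

def nearest_item(ability, remaining):
    """Choose the item whose difficulty is closest to the ability estimate."""
    return min(remaining, key=lambda item: abs(item["difficulty"] - ability))

def administer_cat(item_bank, ask, max_items=20):
    ability = 0.0                        # provisional ability estimate (logits)
    remaining = list(item_bank)
    for _ in range(max_items):
        if not remaining:                # item bank exhausted
            break
        item = nearest_item(ability, remaining)
        remaining.remove(item)
        correct = ask(item)              # present the item and score the response
        ability += 0.5 if correct else -0.5   # crude fixed-step update
    return ability
```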

3 Studying the relationship between the construct-irrelevant variable of computer familiarity and the construct-relevant variable of performance level on a set of test items in the computer-based Test of English as a Foreign Language (TOEFL), Taylor, Jamieson, Eignor, and Kirsch (1998) found "no practical differences" (p. 26) between computer-unfamiliar and computer-familiar examinees on the computer-based tests of listening, structure, reading, or total scores under the following conditions: (a) when examinees had been administered a TOEFL computer-based testing tutorial and (b) when the language ability of the examinees had been taken into account. Brown (1997) raises a number of concerns about computer equipment, such as limited screen capacity and poor graphical capabilities, that could be viewed as potential disadvantages of using computers in language testing.

4 The Ohio State University is designing a multimedia CAT in French, German, and Spanish. CATs of listening comprehension proficiency in Hausa, ESL, and Russian have also been designed by Dunkel (1996) and a research team at Georgia State University and The Pennsylvania State University. Developers at Brigham Young University have for many years been actively engaged in developing CATs of second/foreign language proficiency (Madsen, 1991). More recently, the Defense Language Institute's English Language Center has been investigating and implementing computer-automated assembly of test forms, according to specific content and psychometric specifications, for use in its large-scale testing program. The automatic assembly of pre-equated language tests has cost-saving implications for large-scale testing programs and systematic test content variation (Henning, Johnson, Boutin, & Rice, 1994).

5 In a computer-based test (CBT), items are presented to the test taker in a fixed and linear fashion. They are not selected according to the examinee's previous right-wrong response patterns as are items in a CAT.

6 As Brown (1997) notes, in CAT "traditional time limits are not necessary. Students can be given as much time as they need to finish a given test because no human proctor needs to wait around for them to finish the test" (p. 46).

7 The well-known unidimensional mathematical models in IRT (e.g., the one-, two-, and three-parameter models) handle dichotomously scored (right/wrong) data. Tung (1985) presents a rather accessible discussion of these commonly used IRT models for language teachers and testers. Briefly, one-, two-, and three-parameter models indicate the difficulty, discrimination, and guessing values, respectively, of test items in the item pool. A number of multidimensional IRT models (e.g., models for items with response formats other than right/wrong, or models that allow for multiple attempts or examinee responses for a single item) have been designed to handle open-ended response formats. For an in-depth, albeit technical, discussion of unidimensional IRT models, see Hambleton and Swaminathan (1985); for an explication of multidimensional IRT models, see van der Linden and Hambleton (1997).
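
The three unidimensional models mentioned above can be summarized by the three-parameter logistic item response function, of which the one- and two-parameter models are special cases (no guessing parameter; a common discrimination value). The sketch below simply evaluates that function; the parameter values in the example are invented for illustration.

```python
import math

def irt_probability(theta, a=1.0, b=0.0, c=0.0):
    """Three-parameter logistic probability of a correct response at ability theta.
    a = discrimination, b = difficulty, c = (pseudo-)guessing.
    With c = 0 this is the two-parameter model; with c = 0 and a held constant
    it reduces to the one-parameter (Rasch) model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Illustrative values only: a moderately discriminating, slightly difficult item
# on which low-ability examinees still have a 20% chance of guessing correctly.
print(irt_probability(theta=0.5, a=1.2, b=0.8, c=0.2))
```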

8 The Ohio State University is developing Web-based multimedia computer-adaptive language tests in French, German, and Spanish for placement purposes.

9 Listening comprehension CATs in Hausa, ESL, and Russian have been created with funding by the U.S. Department of Education and the National Endowment for the Humanities (Dunkel, 1996). The Hausa listening CAT is presently being used as a placement exam for Americans studying this language at the University of Kansas, and the ESL CAT is being trialed at Georgia State University.

10 Alderson, Clapham, and Wall (1995) suggest that developers of placement, progress, achievement, proficiency, and diagnostic tests offer extensive and comprehensive specifications for their assessment instruments so that users can determine exactly what abilities are being measured, what test methods are being employed, and what scoring and evaluation criteria are being used (p. 19). For a fuller discussion of how test specifications can be drawn up, see Bachman (1990) and Alderson, Clapham, and Wall (1995).

11 According to Bachman and Palmer (1996), reliability can be defined as consistency of measurement whereas construct validity pertains to the meaningfulness and appropriateness of the interpretations that can be made from the test scores. Authenticity is "the degree of correspondence of the characteristics of a given language test task to the characteristics of a TLU [Target Language Use] task" (p. 39).

12 Issues not taken up in this article, though highly deserving of careful consideration and further examination, include the trialing of items to obtain the CAT item calibrations (IRT statistics). See Wainer (1990) and Alderson, Clapham, and Wall (1995) for an illuminating discussion of how to trial and determine item calibrations for items in the CAT item bank.

13 According to Wainer (1990, p. 114), an adaptive test terminates when one or more of the following stopping rules is met: (a) when a target measurement precision level has been attained, (b) when a pre-selected number of items has been given, or (c) when a predetermined amount of time has elapsed. Any of these stopping rules, or a mixture thereof, can be used to halt a CAT.
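
In code, a termination check combining these rules might look like the following sketch; the threshold values and names are illustrative, not prescribed by Wainer (1990).

```python
def should_stop(standard_error, items_given, elapsed_seconds,
                target_se=0.3, max_items=30, time_limit=3600):
    """Return True when any of the three stopping rules in note 13 is met.
    All thresholds here are illustrative, not recommended values."""
    return (standard_error <= target_se        # (a) target precision attained
            or items_given >= max_items        # (b) preset number of items given
            or elapsed_seconds >= time_limit)  # (c) preset time elapsed
```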

14 A cut-score is usually a predetermined criterion that divides scores into groups based on the examinees' level of performance. According to Green et al. (1995), a cut-score is often used to separate scores into pass/fail or mastery/non-mastery groups.

15 Alderson, Clapham, and Wall (1995) contend that the one-parameter (Rasch) model requires a minimum of 100 students for pretesting of the item bank. The Rasch model is concerned with two aspects of a test: person ability and item difficulty. The two-parameter model requires a sample of at least 200 students for trialing of the test, adding item discrimination, as well as person ability and item difficulty, to the analysis of trialed test items. In addition to everything the one- and two-parameter models do, the three-parameter model takes examinee guessing into account to determine item calibrations (IRT statistics). The three-parameter model requires a data set of at least 1,000 students.

16 For discussion of a heuristic for selecting CAT items from the item bank, based on both content and statistical properties, see Stocking, Swanson, and Pearlman (1993).
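
The sketch below is not the Stocking, Swanson, and Pearlman heuristic itself, only a simplified illustration of the same idea: each candidate item is scored by its statistical information at the current ability estimate, with a penalty (invented here) when its content area is already over-represented relative to the content targets.

```python
import math

def item_information(theta, a, b):
    """Fisher information of a two-parameter item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def pick_next_item(remaining, ability, counts, targets, penalty=0.5):
    """Simplified content-aware selection: balance statistical optimality
    (information) against content coverage (counts vs. targets per area)."""
    def score(item):
        info = item_information(ability, item["a"], item["b"])
        over = max(0, counts.get(item["area"], 0) - targets.get(item["area"], 0))
        return info - penalty * over
    return max(remaining, key=score)
```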

17 Green et al. (1995) note that if the purpose of the CAT is to estimate the examinees' proficiency levels, the stopping rule is usually a function of the conditional standard error of measurement for the proficiency estimate, whereas with a stopping rule used to make pass/fail decisions, the "stopping rule may instead focus on whether the [examinee's] score falls outside a pre-specified confidence band" (p. 10).
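
For the pass/fail case, the confidence-band check Green et al. describe might be sketched as follows; the function name is hypothetical and the 1.96 multiplier simply assumes a conventional 95% band.

```python
def pass_fail_decision(ability, standard_error, cut_score, z=1.96):
    """Stop and decide once the confidence band around the ability estimate
    no longer contains the cut-score; otherwise continue testing.
    Returns 'pass', 'fail', or None (keep administering items)."""
    lower = ability - z * standard_error
    upper = ability + z * standard_error
    if lower > cut_score:
        return "pass"
    if upper < cut_score:
        return "fail"
    return None
```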

18 Wainer (1990) notes that in the Educational Testing Service's Computerized Mastery Test, examinees are initially administered two of the 10-item testlets (20 items); if they receive very high (or very low) scores, they pass (or fail). Examinees who receive less definitive scores on the two initial testlets are administered additional randomly chosen testlets until either a pass/fail decision can be reached or the pool is exhausted. "There is no adaptation of difficulty in the CMT model; its only adaptive features involve the stopping rule. Nevertheless, the computerized testlet version shortens the test for many examinees without reducing the precision of a pass-fail decision" (p. 129).
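
Only the control flow of the following sketch follows the testlet procedure described in this note; the cumulative proportion-correct cut-offs and all names are invented for illustration and are not taken from the CMT itself.

```python
import random

def computerized_mastery_test(testlets, administer, pass_cut=0.8, fail_cut=0.5):
    """Give two 10-item testlets, then add randomly chosen testlets until a
    pass/fail decision is possible or the pool is exhausted (see note 18).
    `administer` returns the number correct (0-10) on one testlet."""
    pool = list(testlets)
    random.shuffle(pool)                 # additional testlets are chosen at random
    correct, given = 0, 0
    for testlet in pool:
        correct += administer(testlet)
        given += 1
        if given < 2:
            continue                     # always administer at least two testlets
        proportion = correct / (10 * given)
        if proportion >= pass_cut:
            return "pass"
        if proportion <= fail_cut:
            return "fail"
    return "no decision"                 # testlet pool exhausted
```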

19 It should be noted that some examinees may present unusual item-response patterns as a result of unstable performance or random guessing at answers. When this occurs, calculation of adequate estimates of the examinees' proficiency levels can be difficult. CAT systems can anticipate such problems by tracking and recording the number of items attempted, the response choices made and the examinee's pattern of responses, and the final score achieved. Administrators can then examine individual test results for unusual patterns and make provision for them, for instance, by allowing examinees to retake the test (Green et al., 1995).
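
The record keeping suggested here amounts to logging each interaction per examinee; a minimal sketch with invented field names follows.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ExamineeRecord:
    """Minimal log of the information note 19 suggests a CAT should retain."""
    examinee_id: str
    responses: List[dict] = field(default_factory=list)   # one entry per item
    final_score: Optional[float] = None

    def log_response(self, item_id: str, choice: str, correct: bool) -> None:
        self.responses.append({"item": item_id, "choice": choice, "correct": correct})

    @property
    def items_attempted(self) -> int:
        return len(self.responses)
```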