2015 | 3(134) | 77–97
Article title

Item analysis and evaluation using a four-parameter logistic model

The four-parameter logistic model (4PLM) assumes that even high-ability examinees can make mistakes (e.g. due to carelessness). This assumption is reflected in an upper asymptote (d-parameter) of the IRT logistic curve that lies below one. Research on the 4PLM has been hampered because the model has been considered conceptually and computationally complicated, and its usefulness has been questioned. After 25 years, following the introduction of appropriate software, the psychometric characteristics of the 4PLM and its usefulness can be assessed more reliably. The aim of this article is to show whether the 4PLM can be used to detect item-writing flaws (which introduce construct-irrelevant variance into the measurement). The analysis was conducted in two steps: (a) qualitative: assessing whether items comply with the chosen item-writing guidelines; (b) quantitative: fitting the 4PLM and comparing its results with the qualitative analysis to determine whether the same items were detected as flawed. Other IRT models (3PLM and 2PLM) were also fitted to check the validity of the results. Flawed items can be detected by means of qualitative analysis as well as by the 4PLM and simpler IRT models. The 4PLM is discussed from the perspective of practical use in educational research.
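To make the role of the upper asymptote concrete, the following is a minimal sketch of the 4PLM item response function in its conventional parameterization. The abstract names only the d-parameter; the symbols a (discrimination), b (difficulty) and c (lower asymptote), the function name, and all numeric values are illustrative assumptions, not taken from the article.

```python
import math


def p_correct_4pl(theta, a, b, c, d):
    """Probability of a correct response under the 4PL model.

    theta: examinee ability
    a: discrimination, b: difficulty,
    c: lower asymptote (guessing), d: upper asymptote (carelessness).
    Setting d = 1 gives the 3PL model; d = 1 and c = 0 gives the 2PL model.
    """
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))


if __name__ == "__main__":
    # Hypothetical item: even very able examinees answer correctly
    # with probability approaching d = 0.95 rather than 1.
    for theta in (-3.0, 0.0, 3.0, 6.0):
        p = p_correct_4pl(theta, a=1.5, b=0.0, c=0.2, d=0.95)
        print(f"theta={theta:+.1f}  P(correct)={p:.3f}")
```

With d fixed at 1 the same function yields the 3PLM curve, which is how the simpler models mentioned in the abstract relate to the 4PLM.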
  • Educational Research Institute