The Effect of Multidimensional Test Design and Calibration Methods on Multidimensional Computerized Adaptive Testing Applications
Abstract
A test can be designed for many purposes, including ranking people along a continuum or providing diagnostic information about examinees. However, a common problem arises in reporting diagnostic subscores when items measure unintended dimensions or are designed for multidimensional purposes. Multidimensional computer adaptive testing (MCAT) can measure multiple dimensions efficiently by using multidimensional IRT (MIRT) applications. Several studies have examined MCAT item selection methods with the aim of improving the accuracy of domain and overall ability score estimates. A review of the literature shows that most studies compared item selection methods under many conditions but did not consider the structure of the test design or multidimensional calibration strategies. In contrast with previous studies, this study employed unidimensional and multidimensional calibration approaches and various test designs (simple and complex), which allows the evaluation of domain and subscore ability estimates across multiple real test conditions. The purpose of this study is to compare MCAT item selection methods for estimating domain and overall ability scores with respect to test design, number of items per dimension, and calibration approach within the MCAT framework.
In this study, four factors were manipulated: test design, number of items per dimension, calibration strategy, and item selection method. For each SS, CLS, or CHS design, 1000x3 and 1000x45 matrices of true ability parameters were randomly generated from a multivariate normal distribution. Using the generated item and ability parameters, dichotomous item responses were generated with the compensatory multidimensional three-parameter logistic (M3PL) IRT model with specified correlations. A three-dimensional item bank was simulated with simple and complex structures. The dimensions were correlated at ρ = 0.2, 0.5, and 0.8. Three calibration strategies were examined: separate unidimensional (SU) calibration and two multidimensional calibrations, Bock and Aitkin’s EM (BAEM) and the Metropolis-Hastings Robbins-Monro algorithm. Three MCAT item selection procedures were also examined: minimum angle, minimizing the error variance of the composite score with the optimized weight, and Kullback–Leibler (KL) information. The accuracy of MCAT domain and composite ability scores was evaluated using the absolute bias (ABSBIAS), the correlation, and the root mean square error (RMSE) between true and estimated ability scores.
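The data-generation and evaluation steps described above can be illustrated with a minimal Python sketch. It is not the study's actual code: the item parameter ranges, the simple-structure loading pattern, the function names `simulate_m3pl` and `accuracy`, and the definition of absolute bias as the mean absolute difference are all assumptions made for illustration.

```python
import numpy as np


def simulate_m3pl(n_persons=1000, n_items=45, n_dims=3, rho=0.5, seed=0):
    """Simulate dichotomous responses under a compensatory M3PL model.

    All item parameter distributions below are hypothetical choices.
    """
    rng = np.random.default_rng(seed)

    # True abilities: multivariate normal, unit variances, common correlation rho.
    cov = np.full((n_dims, n_dims), rho)
    np.fill_diagonal(cov, 1.0)
    theta = rng.multivariate_normal(np.zeros(n_dims), cov, size=n_persons)

    # Simple structure (assumed): each item loads on exactly one dimension.
    a = np.zeros((n_items, n_dims))
    a[np.arange(n_items), np.arange(n_items) % n_dims] = rng.uniform(0.8, 2.0, n_items)
    d = rng.normal(0.0, 1.0, n_items)      # intercepts
    c = rng.uniform(0.0, 0.25, n_items)    # lower asymptotes (guessing)

    # Compensatory M3PL: P = c + (1 - c) / (1 + exp(-(a . theta + d)))
    logit = theta @ a.T + d
    p = c + (1.0 - c) / (1.0 + np.exp(-logit))
    responses = (rng.random(p.shape) < p).astype(int)
    return theta, responses


def accuracy(true, est):
    """Absolute bias (assumed: mean |est - true|), RMSE, and correlation."""
    true = np.asarray(true, dtype=float)
    est = np.asarray(est, dtype=float)
    err = est - true
    return {
        "absbias": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "corr": float(np.corrcoef(true, est)[0, 1]),
    }
```

Calling `simulate_m3pl(rho=0.8)` would give one replication for the high-correlation condition; a complex-structure bank could be sketched by letting items load on more than one column of `a`.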
The results suggest that the calibration approach, the multidimensional test structure, and the number of items per dimension have a significant effect on the performance of item selection methods for both domain and overall score estimates. As the model became more complex, absolute bias decreased significantly for both domain and overall scores.
Different item selection methods performed best under different test designs. For the SS test design, V1 item selection had the lowest absolute bias for overall scores under both SU and BAEM calibration when the correlation between dimensions was moderate (0.5) and the test was long (N=45). For the CLS test design, Vol item selection had the lowest absolute bias for overall scores under SU calibration when the correlation between dimensions was high (0.8) and the test was long (N=45); under BAEM calibration, Vol item selection had the lowest absolute bias when the correlation was low (0.2) and the test was long (N=45). For the CHS test design, V1 item selection had the lowest absolute bias for overall scores under SU calibration when the correlation between dimensions was high (0.8) and the test was long (N=45); under BAEM calibration, KL item selection had the lowest absolute bias when the correlation was low (0.2) and the test was long (N=45).