Abstract
The main purpose of this study is to compare, in terms of evaluation criteria, the results obtained from equating a multidimensional test at the test level and at the subtest level across various application steps. The research was conducted with simulation data. Because it seeks to determine the methods and conditions that minimize equating error, the study is basic research. The common-item nonequivalent groups (CINEG) design was used to equate the two test forms. Equating was carried out in six applications, defined by whether it was performed at the test level or the subtest level and by the parameter estimation path followed. The performance of these six applications was examined according to the correlation between subtests, the number of items in the subtests, the proportion of common items in the subtests, sample size, the difficulty difference between the tests and subtests, and the scale transformation method. Under these conditions, equating was carried out with the IRT true score equating method. To evaluate the accuracy of the equating results, RMSE (equating error), BIAS (equating bias), and SE (standard error) values were calculated for the item and ability parameters. R software was used for data generation, IRTPRO 4.2 for estimating item and ability parameters, IRTEQ for unidimensional equating, and LinkMIRT for multidimensional equating. The highest error values were obtained when the parameters of the multidimensional test were estimated with a unidimensional 3PL IRT model before equating. These were followed by the error values obtained when the parameters were estimated with a multidimensional 3PL IRT model. The smallest error values were obtained when the parameters of each subtest were estimated separately with a unidimensional 3PL IRT model before equating.
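The abstract names RMSE, BIAS, and SE as the evaluation criteria but does not state how they were computed. The following R code is a minimal sketch, assuming the usual Monte Carlo definitions of these criteria over simulation replications; the names equating_criteria, est, and true are hypothetical and this is not the authors' code.

```r
# Minimal sketch, assuming the usual Monte Carlo definitions of the criteria.
# est:  R x n matrix of equated estimates (R replications, n evaluation points)
# true: length-n vector of generating ("true") values
# All names here are hypothetical; this is not the authors' code.
equating_criteria <- function(est, true) {
  mean_est <- colMeans(est)                    # mean estimate at each point
  bias <- mean(mean_est - true)                # BIAS: average signed deviation
  se   <- mean(sqrt(colMeans(sweep(est, 2, mean_est)^2)))  # SE: spread around the mean
  rmse <- mean(sqrt(colMeans(sweep(est, 2, true)^2)))      # RMSE: total equating error
  c(BIAS = bias, SE = se, RMSE = rmse)
}

# Illustrative use with fabricated ability values:
set.seed(1)
true <- seq(-3, 3, length.out = 31)
est  <- matrix(rnorm(100 * 31, mean = rep(true, each = 100), sd = 0.2),
               nrow = 100, ncol = 31)
equating_criteria(est, true)
```

At each evaluation point the mean squared error decomposes into squared bias plus error variance, which is why the three criteria are typically reported together.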
References
Ackerman, T. (1989). Unidimensional IRT calibration of compensatory and noncompensatory items. Applied Psychological Measurement, 13, 113-127.
Ackerman, T. A. (1994). Using multidimensional item response theory to understand what items and tests are measuring. Applied Measurement in Education, 7(4), 255-278.
Ackerman, T. A., Gierl, M. J., & Walker, C. M. (2003). Using multidimensional item response theory to evaluate educational and psychological tests. Educational Measurement: Issues and Practice, 22(3), 37-51.
Andersson, B. (2014). Contributions to Kernel Equating. Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Social Sciences 106. 24 pp. Uppsala: Acta Universitatis Upsaliensis.
Andrews, B. J. (2011). Assessing first-and second-order equity for the common item nonequivalent groups design using multidimensional IRT (Doctoral dissertation). University of Iowa, Iowa.
Angoff, W. H. (1971). Scales, norms and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (pp. 508-600). Washington: American Council on Education.
Baker, F. B., & Al-Karni, A. (1991). A comparison of two procedures for computing IRT equating coefficients. Journal of Educational Measurement, 28(2), 147-162.
Beguin, A. A. (2000). Robustness of equating high-stakes tests (Doctoral dissertation). University of Twente, The Netherlands.
Beguin, A. A., Hanson, B. A., & Glas, C. A. (2000). Effect of multidimensionality on separate and concurrent estimation in IRT equating. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.
Birnbaum, A. (1957). Efficient design and use of tests of ability for various decision-making problems (Series Report No. 58-16. Project No. 7755-23). Randolph Air Force Base, TX: USAF School of Aviation Medicine.
Birnbaum, A. (1958). On the estimation of mental ability (Series Report No. 15, Project No. 7755-23). Randolph Air Force Base, TX: USAF School of Aviation Medicine.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397-472). Reading, MA: Addison-Wesley.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more latent categories. Psychometrika, 37, 29-51.
Brennan, R. L. (2012). Utility indexes for decisions about subscores. Center for Advanced Studies in Measurement and Assessment (CASMA). Research Report 33.
Bulut, O. (2013). Between-person and within-person subscore reliability: Comparison of unidimensional and multidimensional IRT models (Doctoral dissertation). University of Minnesota, Minnesota.
Cai, L., Thissen, D., & du Toit, S. H. C. (2017). IRTPRO 4.2 for Windows [Computer software]. Skokie, IL: Scientific Software International.
Cao, L. (2008). Mixed-format test equating: Effects of test dimensionality and common item sets (Doctoral dissertation). University of Maryland, College Park.
Chalmers, R. P. (2012). Mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48, 1-29.
Chu, K. L., & Kamata, A. (2000). Nonequivalent group equating via 1-P HGLLM. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
Cook, L. L., & Eignor, D. R. (1991). An NCME instructional module on IRT equating methods. Educational Measurement: Issues and Practice, 10(1), 37-45.
Cook, L. L., & Petersen, N. S. (1987). Problems related to the use of conventional and item response theory equating methods in less than optimal circumstances. Applied Psychological Measurement, 11, 225-244.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart & Winston.
de Ayala, R. J. (2009). The theory and practice of item response theory. New York: Guilford Press.
Dongyang, L. (2009). Developing a common scale for testlet model parameter estimates under the common-item nonequivalent groups design (Doctoral dissertation). University of Maryland, Maryland.
Dorans, N. J. (1990). Equating methods and sampling designs. Applied Measurement in Education, 3(1), 3-17.
Dorans, N. J., & Holland, P. W. (2000). Population invariance and the equatability of tests: basic theory and the linear case. Journal of Educational Measurement, 37(4), 281-306.
Dorans, N. J., Moses, T. P., & Eignor, D. R. (2010). Principles and practices of test score equating. ETS Research Report, 41.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. London: Lawrence Erlbaum Associates, Publishers.
Felan, G. D. (2002). Test equating: Mean, linear, equipercentile and item response theory. Paper presented at the Annual Meeting of the Southwest Educational Research Association, Austin.
French, D. J. (1996). The utility of Stocking & Lord’s equating procedure for equating norm-referenced and criterion-referenced tests with both dichotomous and polytomous components (Doctoral dissertation). University of Texas, Texas.
Gök, B. (2012). Denk olmayan gruplarda ortak madde deseni kullanılarak madde tepki kuramına dayalı eşitleme yöntemlerinin karşılaştırılması (Doktora tezi). Hacettepe Üniversitesi, Ankara.
Haberman, S. J. (2008). When can subscores have value? Journal of Educational and Behavioral Statistics, 33, 204-229.
Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22(3), 144-149.
Hagge, S. L. (2010). The impact of equating method and format representation of common items on the adequacy of mixed-format test equating using nonequivalent groups (Doctoral dissertation). University of Iowa, Iowa.
Haladyna, T. M., & Kramer, G. A. (2004). The validity of subscores for a credentialing test. Evaluation and The Health Professions, 27, 349-368.
Hambleton, R. K., & Jones, R. W. (1993). An NCME instructional module on comparison of classical test theory and item response theory and their applications to test development. Educational Measurement: Issues and Practice, 12, 38-47.
Hambleton, R. K., & Murray, L. (1983). Some goodness of fit investigations for item response models. In R. K. Hambleton (Ed.), Applications of item response theory (pp. 74-94). Vancouver: Educational Research Institute of British Columbia.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Han, K. T. (2007). WinGen3: Windows software that generates IRT parameters and item responses [computer program]. Amherst, MA: Center for Educational Assessment, University of Massachusetts Amherst.
Han, K. T. (2008). Impact of item parameter drift on test equating and proficiency estimates (Doctoral dissertation). University of Massachusetts Amherst, Amherst.
Han, K. T. (2009). IRTEQ: Windows application that implements IRT scaling and equating [computer program]. Applied Psychological Measurement, 33(6), 491-493.
Han, T., Kolen, M. J., & Pohlmann, J. (1997). A comparison among IRT true and observed score equating and traditional equipercentile equating. Applied Measurement in Education, 10, 105-121.
Hanson, B. A., & Beguin, A. A. (2002). Obtaining a common scale for item response theory item parameters using separate versus concurrent estimation in the common item equating design. Applied Psychological Measurement, 26, 3-24.
Harris, D. J., & Crouse, J. D. (1993). A study of criteria used in equating. Applied Measurement in Education, 6(3), 195-240.
Harris, D. J., & Kolen, M. J. (1986). Effect of examinee group on equating relationships. Applied Psychological Measurement, 10, 35-43.
Harwell, M., Stone, C. A., Hsu, T., & Kirisci, L. (1996). Monte Carlo studies in item response theory. Applied Psychological Measurement, 20(2), 101-125.
He, Y. (2011). Evaluating equating properties for mixed-format tests (Doctoral dissertation). University of Iowa, Iowa.
Heh, V. K. (2007). Equating accuracy using small samples in the random groups design (Doctoral dissertation). Ohio University, Ohio.
Holland, P. W., Dorans, N. J., & Petersen, N. S. (2007). Equating test scores. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics: Psychometrics (Vol. 26, pp. 169-197). Amsterdam: Elsevier B. V.
Hu, H., Rogers, W.T., & Vukmirovic, Z. (2008). Investigation of IRT-based equating methods in the presence of outlier common items. Applied Psychological Measurement, 32, 311-333.
Huggins, A. C. (2012). The effect of differential item functioning on population invariance of item response theory true score equating (Doctoral dissertation). University of Miami, Coral Gables.
Jodoin, M. G. (2003). Measurement efficiency of innovative item formats in computer-based testing. Journal of Educational Measurement, 40(1), 1-15.
Karasar, N. (2010). Bilimsel araştırma yöntemi. Ankara: Nobel Yayınları.
Keller III, R. R. (2007). A comparison of item response theory true score equating and item response theory-based local equating (Doctoral dissertation). University of Massachusetts, Massachusetts.
Kilmen, S. (2010). Madde tepki kuramına dayalı test eşitleme yöntemlerinden kestirilen eşitleme hatalarının örneklem büyüklüğü ve yetenek dağılımına göre karşılaştırılması (Doktora tezi). Ankara Üniversitesi, Ankara.
Kim, D., Choi, S. W., Lee, G., & Um, K. R. (2008). A Comparison of the common item and random-groups equating designs using empirical data. International Journal of Selection and Assessment, 16(2), 83-92.
Kim, S. H., & Cohen, A. S. (2002). A comparison of linking and concurrent calibration under the graded response model. Applied Psychological Measurement, 26(1), 25-41.
Kim, S. Y. (2018). Simple structure MIRT equating for multidimensional tests (Doctoral dissertation). University of Iowa, Iowa.
Kim, S., & Kolen, M. J. (2006). Robustness to format effects of IRT linking methods for mixed-format tests. Applied Measurement in Education, 19(4), 357-381.
Kim, S., & Lee, W. (2004). IRT scale linking methods for mixed-format tests (ACT Research Report 2004-5). Iowa City, IA: ACT, Inc.
Kim, S., & Lee, W. C. (2006). An extension of four IRT linking methods for mixed-format tests. Journal of Educational Measurement, 43(1), 53-76.
Klein, L. W., & Jarjoura, D. (1985). The importance of content representation for common-item equating with non-random groups. Journal of Educational Measurement, 22, 197-206.
Kolen, M. J. (1981). Comparison of traditional and item response theory methods for equating tests. Journal of Educational Measurement, 18, 1-11.
Kolen, M. J. (1988). Traditional equating methodology. Educational Measurement Issues and Practice, 7(4), 29-36.
Kolen, M. J. (2007). Data collection designs and linking procedures. In N. J. Dorans, M. Pommerich & P. W. Holland (Eds.), Linking and aligning scores and scales. Statistics for social and behavioral sciences. New York: Springer.
Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking: Methods and practices (3rd edition). New York: Springer.
Kolen, M.J., & Harris, D.J. (1990). Comparison of item preequating using IRT and equipercentile methods. Journal of Educational Measurement, 27(1), 27-39.
Lee, E. (2013). Equating multidimensional tests under a random groups design: A comparison of various equating procedures (Doctoral dissertation). University of Iowa, Iowa.
Lee, G., & Fitzpatrick, A. R. (2008). A new approach to test score equating using item response theory with fixed c-parameters. Asia Pacific Education Review, 3, 248-261.
Lee, W. C., & Ban, J. C. (2010). A comparison of IRT linking procedures. Applied Measurement in Education, 23(1), 23-48.
Lehman, R. S., & Bailey, D. E. (1968). Digital computing: Fortran IV and its applications in behavioural science. New York: John Wiley.
Lim, E. (2016). Subscore equating with the random groups design (Doctoral dissertation). University of Iowa, Iowa.
Livingston, S. A. (2004). Equating test scores (Without IRT) (2nd edition). Educational Testing Service.
Livingston, S. A., Dorans, N. J., & Wright, N. K. (1990). What combination of sampling and equating methods works best? Applied Measurement in Education, 3(1), 73-95.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Loyd, B. H., & Hoover, H. D. (1980). Vertical equating using the Rasch model. Journal of Educational Measurement, 17(3), 179-193.
Lyren, P. E., & Hambleton, R. K. (2011). Consequences of violated equating assumptions under the equivalent groups design. International Journal of Testing, 11(4), 308-323.
Marco, G. L. (1977). Item characteristic curve solutions to three intractable testing problems. Journal of Educational Measurement, 14(2), 139-160.
McDonald, R. P. (1997). Normal-ogive multidimensional model. In W. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 258-269). New York: Springer-Verlag.
McKinley, R. L., & Reckase, M. D. (1981). A comparison of procedures for constructing large item pools (Research Report 81-3). Missouri: University of Missouri, Department of Educational Psychology.
Meng, Y. (2012). Comparison of Kernel equating and item response theory equating methods (Doctoral dissertation). University of Massachusetts, Boston.
Milli Eğitim Bakanlığı (MEB). (2017). Akademik becerilerin izlenmesi ve değerlendirilmesi (ABİDE) 2016 8. sınıflar raporu. Retrieved from https://odsgm.meb.gov.tr/meb_iys_dosyalar/2017_11/30114819_iY-web-v6.pdf
Mohandas, R. (1996). Test equating, problems and solutions: Equating English test forms for the Indonesian junior secondary school final examination administered in 1994 (Doctoral dissertation). Flinders University of South Australia, Australia.
Monaghan, W. (2006). The facts about subscores. ETS Research Report No. RDC-04.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16(2), 159-176.
Norman-Dvorak, R. L. (2009). A comparison of kernel equating to the test characteristic curve method (Doctoral dissertation). University of Nebraska, Lincoln, Nebraska.
Nozawa, Y. (2008). Comparison of parametric and nonparametric IRT equating methods under the common-item nonequivalent groups design (Doctoral dissertation). The University of Iowa, Iowa City.
Ogasawara, H. (2000). Asymptotic standard errors of IRT equating coefficients using moments. Economic Review (Otaru University of Commerce), 51(1), 1-23.
Öztürk Gübeş, N. (2014). Testlerin boyutluluğunun, ortak madde formatının, yetenek dağılımının ve ölçek dönüştürme yöntemlerinin karma testlerin eşitlenmesine etkisi (Doktora tezi). Hacettepe Üniversitesi, Ankara.
Petersen, N. S., Marco, G. L., & Stewart, E. E. (1982). A test of the adequacy of linear score equating method. In P. W. Holland & D. B. Rubin (Eds.), Test equating (pp. 71-135). New York: Academic Press.
Puhan, G. (2010). A comparison of chained linear and poststratification linear equating under different testing conditions. Journal of Educational Measurement, 47(1), 54-75.
Puhan, G., & Liang, L. (2011). Equating subscores under the nonequivalent anchor test (NEAT) design. Educational Measurement: Issues and Practice, 30, 23-35.
Ramsay, J. O. (1991). Kernel smoothing approaches to nonparametric item characteristic curve estimation. Psychometrika, 56, 611-630.
Rasch, G. (1966). An item analysis which takes individual differences into account. British Journal of Mathematical & Statistical Psychology, 19(1), 49-57.
Reckase, M. D. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4, 207-230.
Reckase, M. D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21(1), 25-36.
Reckase, M. D. (2009). Multidimensional item response theory. New York: Springer.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 34(4, Pt. 2), 100.
Sarkar, D. (2018). Lattice: Trellis graphics for R. Retrieved from https://cran.r-project.org/web/packages/lattice/index.html
Schmeiser, C. B., & Welch, C. J. (2006). Test development. In R. L. Brennan (Ed.), Educational Measurement (4th edition) (pp. 307-353). Washington, DC: American Council on Education.
Shin, M. (2015). An investigation of subtest score equating methods under classical test theory and item response theory frameworks (Doctoral dissertation). University of Massachusetts Amherst, Amherst.
Sinharay, S. (2010). How often do subscores have added value? Results from operational and simulated data. Journal of Educational Measurement, 47(2), 150-174.
Sinharay, S., & Haberman, S. J. (2011). Equating of augmented subscores. Journal of Educational Measurement, 48, 122-145.
Sinharay, S., Haberman, S., Holland, P., & Lewis, C. (2012). A note on the choice of an anchor test in equating. ETS Research Report RR-12-14. Princeton, NJ: Educational Testing Service.
Sinharay, S., & Holland, P. (2006). The correlation between the scores of a test and an anchor test. ETS Research Report.
Sinharay, S., & Holland, P. W. (2007). Is it necessary to make anchor tests mini versions of the tests being equated or can some restrictions be relaxed? Journal of Educational Measurement, 44(3), 249-275.
Skaggs, G. (1990). To match or not to match samples on ability for equating: A discussion of five articles. Applied Measurement in Education, 3(1), 105-113.
Skaggs, G., & Lissitz, R. W. (1986). IRT Test equating: relevant issues and a review of recent research. Review of Educational Research, 56(4), 495-529.
Skorupski, W. P., & Carvajal, J. (2010). A comparison of approaches for improving the reliability of objective level scores. Educational and Psychological Measurement, 70(3), 357-375.
Spence, P. D. (1996). The effect of multidimensionality on unidimensional equating with item response theory (Doctoral dissertation). University of Florida, Florida.
Speron, E. (2009). A comparison of metric linking procedures in item response theory (Doctoral dissertation). University of Illinois, Chicago, Illinois.
Stahl, J. A., & Masters, J. (2009). Variable pass rates resulting from equating short tests. Paper presented at the Annual Meeting of the American Educational Research Association, San Diego, CA.
Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7(2), 201-210.
Stone, C. A. (1992). Recovery of marginal maximum likelihood estimates in the two-parameter logistic response model: An evaluation of MULTILOG. Applied Psychological Measurement, 16, 1-16.
Stout, W. (1987). A nonparametric approach for assessing latent trait dimensionality. Psychometrika, 52, 589-617.
Sukin, T., & Keller, L. (2008). The effect of deleting anchor items on the classification of examinees. Paper presented at the Northeastern Educational Research Association (NERA) annual conference. NERA Conference Proceedings.
Tate, R. (2000). Performance of a proposed method for the linking of mixed format tests with constructed response and multiple-choice items. Journal of Educational Measurement, 37(4), 329-346.
Tian, F. (2011). A comparison of equating/linking using the Stocking-Lord method and concurrent calibration with mixed-format tests in the non-equivalent groups common-item design under IRT (Doctoral dissertation). Boston College, Boston.
Tong, Y., & Kolen, M.J. (2005). Assessing equating results on different equating criteria. Applied Psychological Measurement, 29, 418-432.
Traub, R. E. (1983). A priori considerations in choosing an item response model. In R. K. Hambleton (Ed.), Applications of item response theory. Vancouver, BC: Educational Research Institute of British Columbia.
Tsai, T. H. (1997). Estimating minimum sample sizes in random groups equating. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago.
von Davier, A. A. (2008). New results on the linear equating methods for the non-equivalent-groups design. Journal of Educational and Behavioral Statistics, 33(2), 186-203.
von Davier, A. A., & Wilson, C. (2008). Investigating the population sensitivity assumption of item response theory true-score equating across two subgroups of examinees and two test formats. Applied Psychological Measurement, 32(1), 11-26.
von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004). The kernel method of equating. New York: Springer.
Walker, C. M., Azen, R., & Schmitt, T. (2006). Statistical versus substantive dimensionality: The effect of distributional differences on dimensionality assessment using DIMTEST. Educational and Psychological Measurement, 66(5), 721-738.
Wang, T. (2006). Standard errors of equating for equipercentile equating with log-linear pre-smoothing using the delta method (CASMA Research Report, No. 14). Center for Advanced Studies in Measurement and Assessment, Iowa.
Wang, T., Lee, W. C., Brennan, R. L., & Kolen, M. J. (2008). A comparison of the frequency estimation and chained equipercentile methods under the common-item non-equivalent groups design. Applied Psychological Measurement, 32, 632-651.
Weeks, J. P. (2010). Plink: An R package for linking mixed-format tests using IRT-based methods. Journal of Statistical Software, 35(12), 1-33.
Woldbeck, T. (1998). Basic concepts in modern methods of test equating. Paper presented at the annual meeting of the Southwest Psychological Association, New Orleans.
Wolkowitz, A. A. (2008). A comparison of classical test theory and item response theory methods for equating number-right scored to formula scored assessments (Doctoral dissertation). University of Kansas, Lawrence.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Yang, W. L., & Houang, R. T. (1996). The effect of anchor length and equating method on the accuracy of test equating: Comparisons of linear and IRT-based equating using an anchor-item design. Paper presented at the annual meeting of the American Educational Research Association, New York.
Yao, L. (2009). LinkMIRT: Linking of multivariate item response model. Monterey, CA: Defense Manpower Data Center.
Yao, L. (2011). Multidimensional linking for domain scores and overall scores for nonequivalent groups. Applied Psychological Measurement, 35, 48-66.
Yao, L. (2016). The BMIRT toolkit. Monterey, CA: Defense Manpower Data Center.
Yao, L., & Boughton, K. A. (2009). Multidimensional linking for tests with mixed item types. Journal of Educational Measurement, 46(2), 177-197.
Yıldırım, A., & Şimşek, H. (2010). Sosyal bilimlerde nitel araştırma yöntemleri (7. baskı). Ankara: Seçkin Yayıncılık.
Zeng, L. (1991). Standard errors of linear equating for the single-group design (ACT Research Report Series 91-4).
Zhang, B. (2009). Application of unidimensional item response models to tests with item sensitive to secondary dimensions. The Journal of Experimental Education, 77(2), 147-166.
Zhang, J. (2012). Calibration of response data using MIRT models with simple and mixed structures. Applied Psychological Measurement, 36(5), 375-398.
Zhu, W. (1998). Test equating: What, why, who? Research Quarterly for Exercise and Sport, 69(1), 11-23.