Sandip Sinharay
  • Hightstown, New Jersey, United States
ABSTRACT With an increase in the number of online tests, the number of interruptions during testing due to unexpected technical issues seems to be on the rise. For example, interruptions occurred during several recent state tests. When interruptions occur, it is important to determine the extent of their impact on the examinees' scores. Researchers such as Hill and Sinharay et al. examined the impact of interruptions at an aggregate level. However, there is a lack of research on the assessment of impact of interruptions at an individual level. We attempt to fill that void. We suggest four methodological approaches, primarily based on statistical hypothesis testing, linear regression, and item response theory, which can provide evidence on the individual-level impact of interruptions. We perform a realistic simulation study to compare the Type I error rate and power of the suggested approaches. We then apply the approaches to data from the 2013 Indiana Statewide Testing for Educational Progress-Plus (ISTEP+) test that experienced interruptions.
ABSTRACT Generalized residuals are a tool employed in the analysis of contingency tables to examine possible sources of model error. They have typically been applied to log-linear models and to latent-class models. A general approach to generalized residuals is developed for a very general class of models for contingency tables. To illustrate their use, generalized residuals are applied to models based on item response theory (IRT) models. Such models are commonly applied to analysis of standardized achievement or aptitude tests. To obtain a realistic perspective on application of generalized residuals, actual testing data are employed.
The nonequivalent groups with anchor test (NEAT) design, also known as the common item, nonequivalent groups design (Kolen & Brennan, 2004), is used in equating scores of several large-scale tests such as the SAT® and the certification examinations conducted by the American Society for Quality. The two observed-score equating (OSE) methods popular with the NEAT design are chain equating (CE) and poststratification equating (PSE). Here, we consider their nonlinear versions, that is, the frequency estimation equipercentile equating (FEEE) for PSE, and the chained equipercentile equating (CEE) method for CE (see Kolen & Brennan, 2004, for further details on these methods).
Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from: Research Publications Office Mail Stop 7-R ETS Princeton, NJ 08541 ... Detecting item fit for common dichotomous item ...
ABSTRACT Brennan noted that users of test scores often want (indeed, demand) that subscores be reported, along with total test scores, for diagnostic purposes. Haberman suggested a method based on classical test theory (CTT) to determine if subscores have added value over the total score. One way to interpret the method is that a subscore has added value only if it has a better agreement than the total score with the corresponding subscore on a parallel form. The focus of this article is on classification of the examinees into “pass” and “fail” (or master and nonmaster) categories based on subscores. A new CTT‐based method is suggested to assess whether classification based on a subscore is in better agreement, than classification based on the total score, with classification based on the corresponding subscore on a parallel form. The method can be considered as an assessment of the added value of subscores with respect to classification. The suggested method is applied to data from several operational tests. The added value of subscores with respect to classification is found to be very similar, except at extreme cutscores, to their added value from a value‐added analysis of Haberman.
In many educational tests which involve constructed responses, a traditional test score is obtained by adding together item scores obtained through holistic scoring by trained human raters. For example, this practice was used until 2008 in the case of GRE® General Analytical Writing and until 2009 in the case of TOEFL® iBT Writing. With use of natural language processing, it is possible to obtain additional information concerning item responses from computer programs such as e-rater®. In addition, available information relevant to examinee performance may include scores on related tests. We suggest application of standard results from classical test theory to the available data to obtain best linear predictors of true traditional test scores. In performing such analysis, we require estimation of variances and covariances of measurement errors, a task which can be quite difficult in the case of tests with limited numbers of items and with multiple measurements per item. As a c...
The application of the existing test statistics to determine differential item functioning (DIF) requires large samples, but test administrators often face the challenge of detecting DIF with small samples. One advantage of a Bayesian approach over a frequentist approach is that the former can incorporate, in the form of a prior distribution, existing information on the inference problem at hand. Sinharay, Dorans, Grant, and Blew (2009) suggested the use of information from past data sets as a prior distribution in a Bayesian DIF analysis. This paper suggests an extension of the method of Sinharay et al. (2009). The suggested extension is compared to the existing DIF detection methods in a realistic simulation study.
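The idea of carrying past information into a DIF analysis through a prior distribution can be illustrated with a generic conjugate update. The sketch below assumes a normal prior for an item's DIF parameter and an approximately normal estimate with a known standard error; the function name `posterior_normal` and all numbers are hypothetical, and this is not necessarily the exact procedure of the papers cited above.

```python
def posterior_normal(prior_mean, prior_var, estimate, se):
    """Combine a normal prior with a normally distributed estimate.

    prior_mean, prior_var: prior for the DIF parameter (assumed to be
        summarized from past data sets; hypothetical inputs here).
    estimate, se: current-sample DIF estimate and its standard error.
    Returns the posterior mean and posterior variance.
    """
    if prior_var <= 0 or se <= 0:
        raise ValueError("prior_var and se must be positive")
    prior_precision = 1.0 / prior_var
    data_precision = 1.0 / se ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    # Precision-weighted average of the prior mean and the estimate.
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * estimate)
    return post_mean, post_var


# Illustrative values only: a noisy small-sample estimate is pulled
# toward the prior mean, and the posterior variance shrinks.
mean, var = posterior_normal(prior_mean=0.0, prior_var=1.0,
                             estimate=2.0, se=1.0)
print(mean, var)
```

With a small sample the standard error is large, so the posterior leans more heavily on the prior, which is the sense in which past data can stabilize small-sample DIF estimates.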
The l_z statistic (Drasgow et al. in Br J Math Stat Psychol 38:67-86, 1985) is one of the most popular person-fit statistics (Armstrong et al. in Pract Assess Res Eval 12(16):1-10, 2007). Snijders (Psychometrika 66:331-342, 2001) derived the asymptotic null distribution of l_z when the examinee ability parameter is estimated. He also suggested the l_z* statistic, which is the asymptotically correct standardized version of l_z. However, Snijders (Psychometrika 66:331-342, 2001) only considered tests with dichotomous items. In this paper, the asymptotic null distribution of l_z is derived for mixed-format tests (those that include both dichotomous and polytomous items). The asymptotically correct standardized version of l_z, which can be considered as the extension of l_z* to such tests, is suggested. The Type I error rate and power of the suggested statistic are examined from several simulated datasets. The suggested statistic is computed using a real dataset. The suggested statistic appears to be a satisfactory tool for assessing person fit for mixed-format tests.
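For orientation, the person-fit statistic of Drasgow et al. (1985), usually written l_z, can be sketched for the dichotomous case as the standardized log-likelihood of a response pattern. The sketch assumes the item success probabilities are known for the examinee (that is, ability is treated as known); it does not implement the correction of Snijders (2001) for estimated ability or the mixed-format extension described in the abstract. The function name `lz_statistic` is hypothetical.

```python
import math


def lz_statistic(responses, probs):
    """Standardized log-likelihood person-fit statistic (dichotomous items).

    responses: 0/1 item scores for one examinee.
    probs: model-implied probabilities of a correct response, assumed
        known here (no correction for an estimated ability).
    """
    if len(responses) != len(probs):
        raise ValueError("responses and probs must have equal length")
    log_lik = expected = variance = 0.0
    for u, p in zip(responses, probs):
        if not 0.0 < p < 1.0:
            raise ValueError("probabilities must lie strictly in (0, 1)")
        log_p, log_q = math.log(p), math.log(1.0 - p)
        log_lik += u * log_p + (1 - u) * log_q        # observed log-likelihood
        expected += p * log_p + (1.0 - p) * log_q     # its expectation
        variance += p * (1.0 - p) * (log_p - log_q) ** 2  # its variance
    if variance == 0.0:
        raise ValueError("variance is zero (all probabilities equal 0.5)")
    return (log_lik - expected) / math.sqrt(variance)


# Illustrative probabilities only: a pattern that agrees with the model
# scores higher than one that contradicts it.
probs = [0.8, 0.6, 0.3]
print(lz_statistic([1, 1, 0], probs))
print(lz_statistic([0, 0, 1], probs))
```

Large negative values indicate a response pattern that is less likely than expected under the model, which is the usual signal of person misfit.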
Diagnostic scores are of increasing interest due to their potential remedial and instructional benefit. Naturally, the number of testing programs that report diagnostic scores is on the rise, as is the number of research works on such scores. This paper starts by showing examples of diagnostic subscores reported by operational testing programs. Then this paper provides a discussion of existing
ABSTRACT With an increase in the number of online tests, interruptions during testing due to unexpected technical issues seem unavoidable. For example, interruptions occurred during several recent state tests. When interruptions occur, it is important to determine the extent of their impact on the examinees’ scores. There is a lack of research on this topic due to the novelty of the problem. This article is an attempt to fill that void. Several methods, primarily based on propensity score matching, linear regression, and item response theory, were suggested to determine the overall impact of the interruptions on the examinees’ scores. A realistic simulation study shows that the suggested methods have satisfactory Type I error rate and power. Then the methods were applied to data from the Indiana Statewide Testing for Educational Progress-Plus (ISTEP+) test that experienced interruptions in 2013. The results indicate that the interruptions did not have a significant overall impact on the student scores for the ISTEP+ test.
... TOEIC "Bridge" test as a measure of English language skill was evaluated using several approaches, such as factor analysis and computation of correlations between the TOEIC "Bridge" scores and other measures like local English test scores, student self-assessment scores ...
