
Is This Test Fair? A Practical Introduction to Differential Item Functioning

When two students answer the same question differently, not because of what they know, but because of who they are. This article introduces the concept, the detection methods, and the practical implications for instrument development.

I Nyoman Indhi Wiradika
Published May 17, 2026

Consider two elementary school students: Student A, a female student who grew up in a mountain farming village, and Student B, a male student who lives near the coast. Their teacher administers an ecological awareness test. Both students possess a genuine and comparable understanding of the natural environment. Their true ecological knowledge is, by any reasonable standard, equivalent.

Then comes item 12: "Why is it important to protect mangrove forests for coastal communities?" Student B responds with confidence and detail, drawing on direct experience. Student A struggles, not because she lacks ecological awareness, but because mangroves are simply absent from her lived environment.

The item has not measured a difference in ability. It has measured a difference in context familiarity. This is the core problem that Differential Item Functioning (DIF) analysis is designed to detect, and addressing it is one of the most consequential steps in constructing assessments that are genuinely valid across groups.

What exactly is DIF?

DIF occurs when examinees from different groups who have the same level of the underlying trait still have different probabilities of answering an item correctly. The critical qualifier is "same level of the trait." DIF is not simply about different average scores between groups, as that could reflect a real ability difference. DIF concerns whether the item is functioning consistently across groups once ability is held constant.

DIF is not about whether groups score differently. It is about whether the item measures the same thing for all groups.

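In IRT terms, an item shows DIF when its Item Characteristic Curve (ICC) is not the same for two groups: one group has a higher probability of success than the other, even at identical ability levels. The two groups used in DIF analysis have standard names: the reference group, which serves as the baseline for comparison (often the majority group), and the focal group, whose item performance is examined for possible disadvantage.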

Impact (Not DIF)
  • Groups score differently on average
  • Reflects real differences in ability
  • The ICC is identical across groups
  • The item is functioning correctly
  • No revision needed
Differential Item Functioning
  • Groups differ even at the same ability
  • Reflects bias in the item itself
  • ICC curves diverge or cross
  • The item is measuring something extra
  • Review and possibly revise the item

Seeing DIF through the ICC

The Item Characteristic Curve provides the most direct means of visualizing DIF. On a properly functioning item, examinees with ability θ = 1.0 should have the same probability of a correct response regardless of gender, grade, or background. When this condition is violated, the ICC curves for the two groups will diverge.
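
As a minimal sketch of this comparison, the snippet below evaluates a 2PL ICC for two groups at the same ability level. The helper name icc_2pl and all parameter values are illustrative assumptions, not estimates from any real instrument.

```r
# 2PL item characteristic curve: P(correct | theta) for discrimination a, difficulty b
icc_2pl <- function(theta, a, b) {
  1 / (1 + exp(-a * (theta - b)))
}

# Hypothetical parameters: same discrimination, but the item is
# 0.8 units (on the theta scale) easier for the focal group
a     <- 1.2
b_ref <- 0.5
b_foc <- b_ref - 0.8

theta <- 1.0                        # identical ability in both groups
p_ref <- icc_2pl(theta, a, b_ref)
p_foc <- icc_2pl(theta, a, b_foc)

round(c(reference = p_ref, focal = p_foc, gap = p_foc - p_ref), 3)
```

A non-zero gap at a fixed θ is what DIF means in this framework; on an item without DIF the two probabilities would coincide.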

Visualization 01
Uniform DIF: same discrimination, different difficulty
The gold curve (focal group) is shifted left; at every ability level, the focal group has a higher probability of a correct response. The advantage runs in the same direction across the full ability range (a constant shift on the logit scale, although the gap in probability varies with θ), which is the defining characteristic of uniform DIF. The shaded region between the curves represents the magnitude of the differential functioning.
Gap at θ = 0: +0.340
MH D-DIF: +2.82 (positive, because the item favors the focal group)
Direction: Favors Focal
ETS Category: C
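
The area between the two curves can be approximated numerically. The sketch below assumes a 2PL item with equal discrimination in both groups and hypothetical difficulty values; it is not the calculation behind the figure above.

```r
icc_2pl <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))

# Hypothetical parameters: equal discrimination, item easier for the focal group
a     <- 1.2
b_ref <- 0.5
b_foc <- -0.4

# Focal-minus-reference gap in P(correct) at each ability level
gap <- function(theta) icc_2pl(theta, a, b_foc) - icc_2pl(theta, a, b_ref)

# Signed area between the curves; the range is wide enough that both tails are ~0
area <- integrate(gap, lower = -10, upper = 10)$value
area   # close to b_ref - b_foc = 0.9 when the discriminations are equal
```

When the discriminations are equal, the signed area reduces to the difference in difficulty, which is one reason uniform DIF is comparatively easy to summarise with a single number.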

Two types of DIF

DIF is not homogeneous in its manifestation. Researchers distinguish between two forms, and the distinction has consequences for both the choice of detection method and the interpretation of findings.

Uniform DIF

The advantage of one group over the other is consistent in direction across all ability levels. The focal group's ICC is shifted toward the easier or the harder side along the entire ability continuum, so the two curves do not cross. Uniform DIF is more readily detected and more straightforward to interpret substantively.

Non-uniform DIF

The advantage is not consistent in direction; it reverses at some point on the ability scale. At lower ability levels the focal group may outperform the reference group, while at higher ability levels the pattern is reversed. The two ICC curves intersect.
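
A small sketch of how crossing curves arise under a 2PL model when the groups differ in discrimination. The parameter values are hypothetical and were chosen so that the curves cross at θ = 1.

```r
icc_2pl <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))

# Hypothetical parameters: the groups differ in discrimination, so the curves cross
a_ref <- 1.8; b_ref <- 0.5
a_foc <- 0.8; b_foc <- -0.125

# The curves meet where the two logits are equal:
#   a_ref * (theta - b_ref) = a_foc * (theta - b_foc)
theta_cross <- (a_ref * b_ref - a_foc * b_foc) / (a_ref - a_foc)

# Focal-minus-reference gap on either side of the crossing point
theta_grid <- c(-1, 0, theta_cross, 2, 3)
gap <- icc_2pl(theta_grid, a_foc, b_foc) - icc_2pl(theta_grid, a_ref, b_ref)

data.frame(theta = theta_grid, gap = round(gap, 3))
```

The gap is positive below the crossing point (focal group ahead) and negative above it (reference group ahead), which is the reversal described above.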

Visualization 02
Non-uniform DIF: ICC curves that intersect
The two curves intersect. At lower ability levels, the focal group (gold) has the higher probability of a correct response. At higher ability levels, the reference group (navy) predominates. This reversal, which defines non-uniform DIF, indicates that the item interacts with ability in a manner that differs across groups. The standard Mantel-Haenszel procedure has little power to detect this pattern, because advantages in opposite directions largely cancel when pooled across ability levels; Lord's chi-square and logistic regression are better suited to it.
DIF Type: Non-uniform
Crossing Point: θ ≈ 1.0
Low θ Advantage: Focal
High θ Advantage: Reference

Why DIF Matters

Even small DIF can compound across a multi-item instrument. If five items each favor one group by a small margin, the cumulative effect on total scores can be substantial, particularly in high-stakes examinations where marginal score differences determine promotion, scholarship eligibility, or admission decisions. DIF analysis is therefore not merely a technical consideration; it is a matter of measurement validity and equity.
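
The compounding argument can be made concrete with a little arithmetic. The sketch below assumes a hypothetical 20-item test in which five items are each slightly harder for the focal group; the item numbers and the size of the shift are invented for illustration.

```r
icc_2pl <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))

# Hypothetical 20-item test; a = 1 throughout, difficulties spread over [-2, 2]
b_ref <- seq(-2, 2, length.out = 20)
b_foc <- b_ref
dif_items <- c(3, 7, 10, 14, 18)
b_foc[dif_items] <- b_ref[dif_items] + 0.3   # each slightly harder for the focal group

# Expected total score at a given theta = sum of item success probabilities
expected_score <- function(theta, b) sum(icc_2pl(theta, a = 1, b = b))

theta <- 0   # same ability in both groups
per_item_gap <- icc_2pl(theta, 1, b_ref[dif_items]) - icc_2pl(theta, 1, b_foc[dif_items])
total_gap    <- expected_score(theta, b_ref) - expected_score(theta, b_foc)

round(per_item_gap, 3)   # each gap is small on its own
round(total_gap, 3)      # the gaps add up in the expected total score
```

Because the expected total score is the sum of the item probabilities, per-item gaps in the same direction accumulate rather than average out.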

Three methods for detecting DIF

Several statistical procedures have been developed to detect DIF, each with distinct requirements, assumptions, and statistical targets. Three methods are most widely employed in educational measurement research:

Classical
Mantel-Haenszel
Statistic: χ² / ln(α)
No IRT model required. Matches students by observed total score. Best for detecting uniform DIF. Widely used and well-validated.
IRT-Based
Lord's Chi-Square
Statistic: χ²(df = 2)
Compares full item parameter vectors between groups. Sensitive to both uniform and non-uniform DIF. Requires fitting a separate IRT model per group.
Regression
Logistic Regression
Statistic: ΔR² / LRT
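Models the response as a function of ability, group membership, and their interaction: the group effect captures uniform DIF and the interaction captures non-uniform DIF. Detects both DIF types within a single modelling framework. Yields effect sizes (ΔR²) directly.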

In practice, researchers often run Mantel-Haenszel as a first screen for its simplicity and interpretability, then follow up with Lord's chi-square or logistic regression for items that are flagged or where non-uniform DIF is suspected.
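
To make the logistic regression approach concrete, here is a sketch using base R's glm() on simulated data. The data-generating values, the variable names, and the choice of the summed anchor items as matching score are all assumptions made for illustration; only the likelihood-ratio comparison is shown, not ΔR².

```r
set.seed(1)

n     <- 1000
group <- rep(0:1, each = n / 2)            # 0 = reference, 1 = focal
theta <- rnorm(n)

# Hypothetical studied item with uniform DIF: harder for the focal group
y <- rbinom(n, 1, plogis(1.2 * (theta - 0.8 * group)))

# Matching variable: total score on 14 simulated DIF-free anchor items
b_anchor    <- seq(-2, 2, length.out = 14)
anchors     <- sapply(b_anchor, function(b) rbinom(n, 1, plogis(theta - b)))
match_score <- rowSums(anchors)

m0 <- glm(y ~ match_score,         family = binomial)  # no DIF
m1 <- glm(y ~ match_score + group, family = binomial)  # adds uniform DIF
m2 <- glm(y ~ match_score * group, family = binomial)  # adds non-uniform DIF

# Likelihood-ratio tests between the nested models
lrt <- anova(m0, m1, m2, test = "Chisq")
lrt
```

A significant step from m0 to m1 points to uniform DIF; a significant step from m1 to m2 points to non-uniform DIF. The difR package provides difLogistic() for running this kind of comparison across all items of an instrument.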

The DIF analysis workflow

  1. Collect Response Data. Item responses plus a group label for each examinee.
  2. Define Groups. Reference vs. focal group (gender, grade, region…).
  3. Condition on Ability. Total score (MH) or θ (IRT methods).
  4. Test & Classify. Run MH or Lord's χ², then apply the ETS criteria.
  5. Interpret & Revise. Review flagged items and consult content experts.

Worked example: DIF analysis in R

The following example demonstrates a complete DIF analysis in R, from data simulation to graphical inspection. The context is drawn from educational measurement research: an ecological awareness instrument administered to elementary school students, with gender as the grouping variable of interest.

Packages needed

This tutorial uses difR (DIF detection), mirt (IRT modelling and ICC plots), and ggplot2 (visualization). All are available on CRAN.

1. Install and load packages
R setup
# Run once to install
install.packages(c("difR", "mirt", "ggplot2"))

library(difR)
library(mirt)
library(ggplot2)

2. Simulate an ecological awareness dataset

The simulation generates 300 examinees (150 male, 150 female) responding to 15 items. DIF is introduced into items 5 and 10 by shifting the difficulty parameter for the female group.

R simulate data
set.seed(2024)

N      <- 300    # total students
n_item <- 15     # number of items

# group: 0 = male (reference), 1 = female (focal)
group <- rep(0:1, each = N / 2)

# simulate latent ability
theta <- rnorm(N, mean = 0, sd = 1)

# item parameters (same for both groups by default)
a_ref <- runif(n_item, 0.8, 1.8)   # discrimination
b_ref <- seq(-2, 2, length.out = n_item)  # difficulty

# introduce DIF on items 5 and 10
b_foc     <- b_ref
b_foc[5]  <- b_ref[5]  - 0.9   # item 5: easier for females (uniform DIF)
b_foc[10] <- b_ref[10] + 0.7   # item 10: harder for females (uniform DIF)

# helper: simulate binary responses from 2PL model
sim_item <- function(theta_vec, a, b) {
  p <- 1 / (1 + exp(-a * (theta_vec - b)))
  rbinom(length(theta_vec), 1, p)
}

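# \(j) is the anonymous-function shorthand for function(j) (requires R >= 4.1)
resp_male   <- sapply(1:n_item, \(j) sim_item(theta[group == 0], a_ref[j], b_ref[j]))
resp_female <- sapply(1:n_item, \(j) sim_item(theta[group == 1], a_ref[j], b_foc[j]))

# stack rows in the same order as `group`: males first, then females
data_resp <- rbind(resp_male, resp_female)
colnames(data_resp) <- paste0("Item", 1:n_item)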

# sanity check
dim(data_resp)     # should be 300 x 15
table(group)       # 150 male, 150 female

3. Mantel-Haenszel DIF detection

The Mantel-Haenszel (MH) method is the traditional starting point. It matches students by their total score and then compares item performance between groups at each score level. The output includes a chi-square test and the MH odds ratio (α), which is then converted to the MH D-DIF scale for interpretation.

R Mantel-Haenszel
mh_result <- difMH(
  Data       = data_resp,
  group      = group,
  focal.name = 1,       # 1 = female (focal group)
  correct    = TRUE     # Yates' continuity correction
)

# Print item-level results
print(mh_result)

# Plot chi-square statistics across items
# Items above the dashed line are flagged as DIF
plot(mh_result, main = "Mantel-Haenszel DIF Statistics")

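The printed output shows the MH chi-square statistic, its p-value, and the effect size (the MH odds ratio and the corresponding MH D-DIF value) for each item. Items are flagged when the chi-square statistic is significant; the size of |D-DIF| then indicates whether the flagged DIF is practically important. The sign of D-DIF gives the direction: negative values mean the item is harder for the focal group (female) than for matched members of the reference group, so it favors the reference group (male); positive values favor the focal group.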
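
The MH odds ratio and the D-DIF value can also be computed by hand, which makes the matching logic explicit. The following base R sketch uses simulated data with one studied item that is harder for the focal group; sample size, item parameters, and variable names are illustrative assumptions.

```r
set.seed(42)

n     <- 2000
group <- rep(0:1, each = n / 2)            # 0 = reference, 1 = focal
theta <- rnorm(n)

# 10 DIF-free matching items plus one studied item that is harder for the focal group
b_anchor <- seq(-1.5, 1.5, length.out = 10)
anchors  <- sapply(b_anchor, function(b) rbinom(n, 1, plogis(theta - b)))
y        <- rbinom(n, 1, plogis(theta - 0.6 * group))
score    <- rowSums(anchors) + y           # total score, studied item included

# Accumulate the 2x2 table terms within each score stratum
num <- 0
den <- 0
for (s in sort(unique(score))) {
  in_s <- score == s
  A <- sum(in_s & group == 0 & y == 1)     # reference, correct
  B <- sum(in_s & group == 0 & y == 0)     # reference, incorrect
  C <- sum(in_s & group == 1 & y == 1)     # focal, correct
  D <- sum(in_s & group == 1 & y == 0)     # focal, incorrect
  num <- num + A * D / sum(in_s)
  den <- den + B * C / sum(in_s)
}

alpha_mh <- num / den                # > 1: odds of success are higher in the reference group
d_dif    <- -2.35 * log(alpha_mh)    # ETS delta scale
c(alpha_MH = alpha_mh, MH_D_DIF = d_dif)
```

Because the simulated item is harder for the focal group, the odds ratio comes out above 1 and the D-DIF value is negative.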

4. Lord's chi-square (IRT-based)

Lord's chi-square fits a separate IRT model for each group and compares the full item parameter vectors. Because it compares both the discrimination (a) and difficulty (b) parameters simultaneously, it is sensitive to both uniform and non-uniform DIF.

R Lord's chi-square
lord_result <- difLord(
  Data       = data_resp,
  group      = group,
  focal.name = 1,
  model      = "2PL"    # 2-parameter logistic
)

print(lord_result)
plot(lord_result, main = "Lord's Chi-Square DIF Statistics")
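
For intuition, Lord's statistic for a single item can be written as a Wald-type quadratic form in the difference between the two groups' parameter estimates. The estimates and covariance matrices below are invented for illustration, and are assumed to be on a common scale already; this is a sketch of the formula, not of difR's internal code.

```r
# Hypothetical estimates (a, b) for one item, already on a common scale
est_ref <- c(a = 1.20, b = 0.10)
est_foc <- c(a = 1.15, b = 0.75)

# Hypothetical sampling covariance matrices of those estimates
cov_ref <- matrix(c(0.020, 0.002,
                    0.002, 0.015), nrow = 2, byrow = TRUE)
cov_foc <- matrix(c(0.025, 0.003,
                    0.003, 0.018), nrow = 2, byrow = TRUE)

# Quadratic form: d' (cov_ref + cov_foc)^{-1} d
d        <- est_foc - est_ref
lord_chi <- as.numeric(t(d) %*% solve(cov_ref + cov_foc) %*% d)

# Two parameters are compared under the 2PL, hence df = 2
p_value  <- pchisq(lord_chi, df = 2, lower.tail = FALSE)
critical <- qchisq(0.95, df = 2)

c(chi_square = lord_chi, critical = critical, p_value = p_value)
```

With these made-up numbers the statistic exceeds the 0.05 critical value, driven almost entirely by the difference in difficulty.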

5. Visualize ICC by group for flagged items

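Statistical flagging alone is insufficient for diagnosis. Plotting the ICC separately for each group provides direct visual evidence of the nature and magnitude of DIF. Group-specific 2PL models are fitted using the mirt package, from which ICC probabilities are extracted for graphical comparison. Because each model is identified by fixing its own group's ability distribution at mean 0 and variance 1, the two curves are directly comparable only when the groups have similar ability distributions, as they do in this simulation; with real data, the scales should be linked first (for example, through anchor items).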

R ICC visualization
# Fit separate 2PL models for each group
mod_male   <- mirt(data_resp[group == 0, ], 1, itemtype = "2PL", verbose = FALSE)
mod_female <- mirt(data_resp[group == 1, ], 1, itemtype = "2PL", verbose = FALSE)

# Extract ICC probabilities for flagged Item 5
theta_seq <- seq(-3, 3, by = 0.01)
p_male    <- probtrace(extract.item(mod_male,   5), theta_seq)[, 2]
p_female  <- probtrace(extract.item(mod_female, 5), theta_seq)[, 2]

# Build tidy data frame
df_icc <- data.frame(
  theta = rep(theta_seq, 2),
  prob  = c(p_male, p_female),
  group = rep(c("Male (Reference)", "Female (Focal)"),
              each = length(theta_seq))
)

# Plot
ggplot(df_icc, aes(theta, prob, color = group)) +
  geom_line(linewidth = 1.3) +
  scale_color_manual(
    values = c("Male (Reference)" = "#032454",
               "Female (Focal)"   = "#fcc209")
  ) +
  labs(
    title    = "ICC Comparison: Item 5",
    subtitle = "Uniform DIF: female students have consistently higher P(correct)",
    x        = expression("Ability " * theta),
    y        = "P(Correct)",
    color    = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")

Interpreting the results

Statistical significance alone does not indicate practical importance. The most widely used classification system for contextualising DIF magnitude was developed by the Educational Testing Service (ETS) and operates on the MH D-DIF scale.

Reference Table
ETS DIF Classification System
Category   MH D-DIF Criterion     Interpretation                          Recommended Action
A          |D-DIF| < 1.0          Negligible DIF. No practical concern.   Retain the item as-is
B          1.0 ≤ |D-DIF| < 1.5    Moderate DIF. Warrants attention.       Flag for expert panel review; contextual judgement needed
C          |D-DIF| ≥ 1.5          Large DIF. Serious concern.             Revise, replace, or remove from the scoring key
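The MH D-DIF is computed as −2.35 × ln(αMH), where αMH is the Mantel-Haenszel odds ratio (the odds of a correct response in the reference group relative to the focal group, pooled across score levels). Negative values indicate that the item favors the reference group; positive values indicate that it favors the focal group. The full ETS rules also use significance tests: an item is placed in category C only when |D-DIF| is at least 1.5 and significantly greater than 1.0, and in category A when |D-DIF| is below 1.0 or not significantly different from zero. For Lord's chi-square, compare the statistic with the chi-square critical value at the chosen significance level (typically 0.05).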
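
A short sketch of the conversion and the magnitude thresholds from the table. The function names mh_d_dif and ets_category and the example odds ratios are hypothetical, and the classifier uses the magnitude of D-DIF only, without any significance test.

```r
# Convert an MH odds ratio to the ETS delta scale
mh_d_dif <- function(alpha_mh) -2.35 * log(alpha_mh)

# Magnitude-only ETS category (significance conditions are ignored in this sketch)
ets_category <- function(d_dif) {
  as.character(cut(abs(d_dif),
                   breaks = c(0, 1.0, 1.5, Inf),
                   labels = c("A", "B", "C"),
                   right  = FALSE))
}

alpha_example <- c(1.10, 0.60, 2.50)   # hypothetical odds ratios
d_example     <- mh_d_dif(alpha_example)

data.frame(alpha_MH = alpha_example,
           MH_D_DIF = round(d_example, 2),
           category = ets_category(d_example))
```

The three example items fall into categories A, B, and C respectively.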

A statistically significant result alone does not justify item removal. Category C items should be subjected to substantive expert review to determine the source of differential functioning. In some cases the cause is identifiable from the item content (e.g., culturally or geographically specific references); in others it requires deeper investigation. The statistical flag initiates the review process; it does not conclude it.

Practical tip

Run DIF analysis iteratively with purify = TRUE in difMH(). This progressively removes flagged items from the ability-matching step, preventing contaminated anchor items from masking additional DIF, a frequent problem in the initial pass.
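
A sketch of what the purified call might look like next to a single-pass call. It requires the difR package and uses a compact simulated dataset whose parameters are assumptions made for illustration (Rasch-type items, DIF on items 5 and 10).

```r
library(difR)

set.seed(2024)
n     <- 300
group <- rep(0:1, each = n / 2)            # 0 = reference, 1 = focal
theta <- rnorm(n)

# Hypothetical 15-item data; items 5 and 10 carry DIF for the focal group
b     <- seq(-2, 2, length.out = 15)
shift <- replace(numeric(15), c(5, 10), c(-0.9, 0.7))
resp  <- sapply(1:15, function(j) rbinom(n, 1, plogis(theta - b[j] - shift[j] * group)))
colnames(resp) <- paste0("Item", 1:15)

# Single pass: every item contributes to the matching score
mh_single <- difMH(Data = resp, group = group, focal.name = 1)

# Purified: flagged items are dropped from the matching score and the test is repeated
mh_purified <- difMH(Data = resp, group = group, focal.name = 1,
                     purify = TRUE, nrIter = 10)

mh_single$DIFitems      # items flagged in the single pass
mh_purified$DIFitems    # items flagged after iterative purification
```

The two lists of flagged items need not agree; when they differ, the purified result is usually the one to carry forward to content review.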

Applied example: gender and grade DIF in an ecological awareness instrument

Wiradika and Mahendra (2026) applied DIF analysis to an ecological awareness instrument administered to elementary school students. The study examined two grouping variables, gender (male vs. female) and grade level, and so assessed measurement fairness along two dimensions within the same instrument.

The findings carry direct implications for instrument development. Items referencing specific ecological contexts, such as coastal ecosystems, may systematically advantage students whose lived experience aligns with those contexts. This form of structural bias is undetectable through inspection of aggregate scores alone; DIF analysis provides the methodological basis for identifying and addressing such disparities.

The study this post is based on
Wiradika, I. N. I., & Mahendra, G. S. (2026). Gender and Grade DIF Analysis of Elementary School Students' Ecological Awareness Instrument. Jurnal Evaluasi Dan Pembelajaran, 8(1), 1–11.
https://doi.org/10.52647/jep.v8i1.493

Responding to DIF: practical recommendations

Detection constitutes only the initial phase of DIF analysis. The appropriate response depends on the severity of the finding and the testing context:

  1. Convene a content review panel. Subject matter experts from both the reference and focal groups should examine the flagged item and assess whether a substantive rationale exists for the observed differential performance, independent of the groups' true ability levels.
  2. Revise the item stem or context. Where DIF is attributable to a contextually specific reference (e.g., mangrove, irrigation channel, fishing practice), revision toward a more ecologically neutral stimulus is the preferred course of action. Parallel versions targeting different contexts may also be considered.
  3. Remove items only as a last resort. Item removal reduces instrument length and reliability; revision is ordinarily preferable. When removal is necessary, the item should be retired from the item bank rather than simply excluded from scoring, and replaced in subsequent test versions.
  4. Document and report. DIF findings constitute part of the validity evidence for an instrument. Published reports should specify which items were flagged, the magnitude and direction of DIF, and the remedial actions taken.
A fair assessment does not assume group equivalence. It ensures that each item measures the intended construct consistently, for all examinees.

Conclusion

DIF analysis operates at the interface of psychometric rigour and measurement equity. The central question it addresses, whether an item measures the intended construct equivalently across groups, is methodologically tractable but substantively demanding. The Mantel-Haenszel procedure, Lord's chi-square, and logistic regression approach this question from different statistical vantage points; used in combination, they constitute a principled framework for equitable test development.

For researchers developing or validating educational assessments, regardless of domain, DIF analysis should be incorporated as a standard component of instrument evaluation. The difR package provides an accessible implementation in R. The more demanding task, however, is the substantive interpretation: identifying why an item exhibits differential functioning and determining the appropriate response.

References

[1] Wiradika, I. N. I., & Mahendra, G. S. (2026). Gender and Grade DIF Analysis of Elementary School Students' Ecological Awareness Instrument. Jurnal Evaluasi Dan Pembelajaran, 8(1), 1–11. https://doi.org/10.52647/jep.v8i1.493
[2] Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test Validity (pp. 129–145). Lawrence Erlbaum Associates.
[3] Lord, F. M. (1980). Applications of item response theory to practical testing problems. Lawrence Erlbaum Associates.
[4] Magis, D., Béland, S., Tuerlinckx, F., & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42(3), 847–862. https://doi.org/10.3758/BRM.42.3.847
[5] Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. https://doi.org/10.18637/jss.v048.i06
[6] Penfield, R. D., & Camilli, G. (2007). Differential item functioning and item bias. In C. R. Rao & S. Sinharay (Eds.), Handbook of Statistics: Vol. 26. Psychometrics (pp. 125–167). Elsevier. https://doi.org/10.1016/S0169-7161(06)26005-X
[7] Zwick, R., & Ercikan, K. (1989). Analysis of differential item functioning in the NAEP history assessment. Journal of Educational Measurement, 26(1), 55–66. https://doi.org/10.1111/j.1745-3984.1989.tb00318.x