Is This Test Fair? A Practical Introduction to Differential Item Functioning
When two students answer the same question differently, not because of what they know, but because of who they are. This article introduces the concept, the detection methods, and the practical implications for instrument development.
Consider two elementary school students: Student A, a female student who grew up in a mountain farming village, and Student B, a male student who lives near the coast. Their teacher administers an ecological awareness test. Both students possess a genuine and comparable understanding of the natural environment. Their true ecological knowledge is, by any reasonable standard, equivalent.
Then comes item 12: "Why is it important to protect mangrove forests for coastal communities?" Student B responds with confidence and detail, drawing on direct experience. Student A struggles, not because she lacks ecological awareness, but because mangroves are simply absent from her lived environment.
The item has not measured a difference in ability. It has measured a difference in context familiarity. This is the core problem that Differential Item Functioning (DIF) analysis is designed to detect, and addressing it is one of the most consequential steps in constructing assessments that are genuinely valid across groups.
What exactly is DIF?
DIF occurs when examinees from different groups who have the same level of the underlying trait still have different probabilities of answering an item correctly. The critical qualifier is "same level of the trait." DIF is not simply about different average scores between groups, as that could reflect a real ability difference. DIF concerns whether the item is functioning consistently across groups once ability is held constant.
DIF is not about whether groups score differently. It is about whether the item measures the same thing for all groups.
In IRT terms, we say an item shows DIF when its Item Characteristic Curve (ICC) is not the same for two groups. One group has a higher probability of success than the other, even at identical ability levels. The two groups used in DIF analysis have standard names:
- Reference group: the majority or baseline group (e.g., male students, higher grade)
- Focal group: the group whose performance is under scrutiny (e.g., female students, lower grade)
The contrast between a genuine group difference and DIF can be summarized as follows.
A real group difference (not DIF):
- Groups score differently on average
- Reflects real differences in ability
- ICCs coincide across groups
- The item is functioning correctly
- No revision needed
DIF:
- Groups differ even at the same ability
- May reflect bias in the item itself
- ICCs are shifted apart or cross
- The item is measuring something extra
- Review and possibly revise the item
Seeing DIF through the ICC
The Item Characteristic Curve provides the most direct means of visualizing DIF. On a properly functioning item, examinees with ability θ = 1.0 should have the same probability of a correct response regardless of gender, grade, or background. When this condition is violated, the ICC curves for the two groups will diverge.
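The condition can be checked numerically. The following is a minimal sketch, assuming a two-parameter logistic (2PL) item; the helper `icc_2pl` and every parameter value are illustrative assumptions, not estimates from any real instrument.

```r
# 2PL item characteristic curve: P(correct | theta)
icc_2pl <- function(theta, a, b) {
  1 / (1 + exp(-a * (theta - b)))
}

theta <- 1.0   # the same ability level for both examinees
a     <- 1.2   # assumed discrimination, shared by both groups
b_ref <- 0.0   # assumed difficulty for the reference group
b_foc <- 0.7   # assumed difficulty for the focal group

p_ref <- icc_2pl(theta, a, b_ref)
p_foc <- icc_2pl(theta, a, b_foc)

# Same ability, different probability of success: the signature of DIF
round(c(reference = p_ref, focal = p_foc, gap = p_ref - p_foc), 3)
```

With these illustrative values the reference-group probability is about 0.77 and the focal-group probability about 0.59, even though both examinees sit at θ = 1.0.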
Two types of DIF
DIF is not homogeneous in its manifestation. Researchers distinguish between two forms, and the distinction has consequences for both the choice of detection method and the interpretation of findings.
Uniform DIF
The advantage of one group over the other is consistent in direction across all ability levels. The focal group's ICC is shifted along the ability axis, so the item is easier or harder for that group throughout the entire ability continuum. Uniform DIF is more readily detected and more straightforward to interpret substantively.
Non-uniform DIF
The size of the advantage changes across the ability scale and, in the clearest case, reverses direction. At lower ability levels the focal group may outperform the reference group, while at higher ability levels the pattern is reversed, so the two ICCs intersect.
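The two forms can be told apart by looking at the sign of the gap between the group curves. This sketch assumes 2PL items with made-up parameters: uniform DIF is produced by shifting only the difficulty, non-uniform DIF by changing only the discrimination. The helper names `icc_2pl` and `crosses` are hypothetical.

```r
icc_2pl <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))
theta_grid <- seq(-3, 3, by = 0.1)

# Uniform DIF: same discrimination, shifted difficulty (assumed values)
gap_uniform <- icc_2pl(theta_grid, a = 1.2, b = 0.0) -
  icc_2pl(theta_grid, a = 1.2, b = 0.6)

# Non-uniform DIF: same difficulty, different discrimination (assumed values)
gap_nonuniform <- icc_2pl(theta_grid, a = 1.6, b = 0.0) -
  icc_2pl(theta_grid, a = 0.8, b = 0.0)

# TRUE when the reference-minus-focal gap takes both signs,
# i.e. the two curves cross somewhere on the grid
crosses <- function(gap, tol = 1e-12) any(gap > tol) && any(gap < -tol)

crosses(gap_uniform)     # FALSE: one group is ahead everywhere
crosses(gap_nonuniform)  # TRUE: the advantage reverses
```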
Even small DIF can compound across a multi-item instrument. If five items each favor one group by a small margin, the cumulative effect on total scores can be substantial, particularly in high-stakes examinations where marginal score differences determine promotion, scholarship eligibility, or admission decisions. DIF analysis is therefore not merely a technical consideration; it is a matter of measurement validity and equity.
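A rough illustration of how small item-level effects accumulate, assuming five 2PL items that are each 0.3 logits harder for the focal group. All parameter values are invented for the example.

```r
icc_2pl <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))

theta <- 0                        # one fixed ability level
a     <- rep(1.0, 5)              # assumed discriminations
b_ref <- c(-1, -0.5, 0, 0.5, 1)   # assumed reference-group difficulties
shift <- 0.3                      # assumed small DIF per item
b_foc <- b_ref + shift            # each item slightly harder for the focal group

# Gap in P(correct) per item, and the gap in expected total score
gap_per_item <- icc_2pl(theta, a, b_ref) - icc_2pl(theta, a, b_foc)
total_gap    <- sum(gap_per_item)

round(gap_per_item, 3)
round(total_gap, 3)
```

In this setup no single item differs by more than 0.08 in probability, yet the expected total scores of two equally able examinees differ by roughly a third of a point, which can matter near a cut score.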
Three methods for detecting DIF
Several statistical procedures have been developed to detect DIF, each with distinct requirements, assumptions, and statistical targets. Three methods are most widely employed in educational measurement research: the Mantel-Haenszel procedure, Lord's chi-square, and logistic regression.
In practice, researchers often run Mantel-Haenszel as a first screen for its simplicity and interpretability, then follow up with Lord's chi-square or logistic regression for items that are flagged or where non-uniform DIF is suspected.
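The logistic regression approach can be sketched with base R alone. The example below simulates its own small data set (all values are assumptions) and compares three nested models for one studied item: matching score only, matching score plus group (uniform DIF), and their interaction (non-uniform DIF). It uses `glm()` rather than a dedicated routine; difR offers `difLogistic()` for the same purpose.

```r
set.seed(1)
n     <- 1000
group <- rep(0:1, each = n / 2)   # 0 = reference, 1 = focal
theta <- rnorm(n)

# Nine DIF-free items plus one studied item that is harder for the focal group
b      <- seq(-1.5, 1.5, length.out = 9)
anchor <- sapply(b, function(bj) rbinom(n, 1, plogis(theta - bj)))
item   <- rbinom(n, 1, plogis(theta - 0.8 * group))

total <- rowSums(anchor) + item   # matching variable (total score)

m0 <- glm(item ~ total,         family = binomial)  # no DIF
m1 <- glm(item ~ total + group, family = binomial)  # adds uniform DIF
m2 <- glm(item ~ total * group, family = binomial)  # adds non-uniform DIF

# Likelihood-ratio tests: m0 vs m1 targets uniform DIF,
# m1 vs m2 targets non-uniform DIF
anova(m0, m1, m2, test = "Chisq")
```

A significant improvement from `m0` to `m1` points to uniform DIF; a further significant improvement from `m1` to `m2` points to non-uniform DIF.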
The DIF analysis workflow
Worked example: DIF analysis in R
The following example demonstrates a complete DIF analysis in R, from data simulation to graphical inspection. The context is drawn from educational measurement research: an ecological awareness instrument administered to elementary school students, with gender as the grouping variable of interest.
This tutorial uses difR (DIF detection), mirt (IRT modelling and ICC plots), and ggplot2 (visualization). All are available on CRAN.
# Run once to install
install.packages(c("difR", "mirt", "ggplot2"))
library(difR)
library(mirt)
library(ggplot2)
The simulation generates 300 examinees (150 male, 150 female) responding to 15 items. DIF is introduced into items 5 and 10 by shifting the difficulty parameter for the female group.
set.seed(2024)
N <- 300 # total students
n_item <- 15 # number of items
# group: 0 = male (reference), 1 = female (focal)
group <- rep(0:1, each = N / 2)
# simulate latent ability
theta <- rnorm(N, mean = 0, sd = 1)
# item parameters (same for both groups by default)
a_ref <- runif(n_item, 0.8, 1.8) # discrimination
b_ref <- seq(-2, 2, length.out = n_item) # difficulty
# introduce DIF on items 5 and 10
b_foc <- b_ref
b_foc[5] <- b_ref[5] - 0.9 # item 5: easier for females (uniform DIF)
b_foc[10] <- b_ref[10] + 0.7 # item 10: harder for females (uniform DIF)
# helper: simulate binary responses from 2PL model
sim_item <- function(theta_vec, a, b) {
  p <- 1 / (1 + exp(-a * (theta_vec - b)))
  rbinom(length(theta_vec), 1, p)
}
resp_male <- sapply(1:n_item, \(j) sim_item(theta[group == 0], a_ref[j], b_ref[j]))
resp_female <- sapply(1:n_item, \(j) sim_item(theta[group == 1], a_ref[j], b_foc[j]))
data_resp <- rbind(resp_male, resp_female)
colnames(data_resp) <- paste0("Item", 1:n_item)
# sanity check
dim(data_resp) # should be 300 x 15
table(group) # 150 male, 150 female
The Mantel-Haenszel (MH) method is the traditional starting point. It matches students by their total score and then compares item performance between groups at each score level. The output includes a chi-square test and the MH odds ratio (α), which is then converted to the MH D-DIF scale for interpretation.
mh_result <- difMH(
  Data = data_resp,
  group = group,
  focal.name = 1, # 1 = female (focal group)
  correct = TRUE  # Yates' continuity correction
)
# Print item-level results
print(mh_result)
# Plot chi-square statistics across items
# Items above the dashed line are flagged as DIF
plot(mh_result, main = "Mantel-Haenszel DIF Statistics")
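The printed output shows the MH chi-square statistic, p-value, and the MH D-DIF value for each item. difR flags an item when its chi-square statistic is significant; the D-DIF value then indicates how large the effect is. D-DIF is defined as -2.35 ln(α), with the reference group's odds in the numerator of α, so the sign gives the direction: negative values mean the item is harder for the focal group (female) and therefore favors the reference group (male), while positive values favor the focal group.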
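To make the quantities in that output concrete, here is a sketch that computes the MH common odds ratio and D-DIF by hand for one item. The data are simulated inside the block with assumed parameters, and the studied item is made harder for the focal group. The 2×2 cell labels follow the usual convention of putting the reference group's odds in the numerator.

```r
set.seed(1)
n      <- 1000
group  <- rep(0:1, each = n / 2)                     # 0 = reference, 1 = focal
theta  <- rnorm(n)
b      <- seq(-1.5, 1.5, length.out = 9)
anchor <- sapply(b, function(bj) rbinom(n, 1, plogis(theta - bj)))
item   <- rbinom(n, 1, plogis(theta - 0.8 * group))  # harder for the focal group
total  <- rowSums(anchor) + item                     # matching score

# Accumulate one 2x2 table (group x response) per total-score stratum
num <- 0
den <- 0
for (k in unique(total)) {
  s <- total == k
  A <- sum(item[s] == 1 & group[s] == 0)  # reference, correct
  B <- sum(item[s] == 0 & group[s] == 0)  # reference, incorrect
  C <- sum(item[s] == 1 & group[s] == 1)  # focal, correct
  D <- sum(item[s] == 0 & group[s] == 1)  # focal, incorrect
  num <- num + A * D / sum(s)
  den <- den + B * C / sum(s)
}

alpha_mh <- num / den             # common odds ratio across strata
d_dif    <- -2.35 * log(alpha_mh) # MH D-DIF on the ETS delta scale

c(alpha = alpha_mh, D_DIF = d_dif)
```

Because the simulated item is harder for the focal group, α should come out above 1 and D-DIF below 0.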
Lord's chi-square fits a separate IRT model for each group and, once the two sets of estimates have been placed on a common scale, compares the full item parameter vectors. Because it compares both the discrimination (a) and difficulty (b) parameters simultaneously, it is sensitive to both uniform and non-uniform DIF.
lord_result <- difLord(
  Data = data_resp,
  group = group,
  focal.name = 1,
  model = "2PL" # 2-parameter logistic
)
print(lord_result)
plot(lord_result, main = "Lord's Chi-Square DIF Statistics")
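For reference, the statistic behind Lord's test can be written compactly. The notation below is introduced here for illustration: for item j, v̂ is the vector of estimated parameters in the reference (R) or focal (F) group after linking, and Σ̂ is its estimated covariance matrix.

```latex
\chi^2_j =
  (\hat{v}_{jR} - \hat{v}_{jF})^{\top}
  \left(\hat{\Sigma}_{jR} + \hat{\Sigma}_{jF}\right)^{-1}
  (\hat{v}_{jR} - \hat{v}_{jF}),
\qquad
\hat{v}_{jG} = (\hat{a}_{jG},\, \hat{b}_{jG})^{\top}, \quad G \in \{R, F\}
```

Under the 2PL model the vector has two elements, so the statistic is referred to a chi-square distribution with 2 degrees of freedom.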
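Statistical flagging alone is insufficient for diagnosis. Plotting the ICC separately for each group provides direct visual evidence of the nature and magnitude of DIF. Group-specific 2PL models are fitted using the mirt package, from which ICC probabilities are extracted for graphical comparison. Because each model fixes its own group's ability distribution to mean 0 and variance 1, the two curves are directly comparable only when the groups have similar ability distributions, as they do in this simulation; otherwise the parameters must first be linked to a common scale.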
# Fit separate 2PL models for each group
mod_male <- mirt(data_resp[group == 0, ], 1, itemtype = "2PL", verbose = FALSE)
mod_female <- mirt(data_resp[group == 1, ], 1, itemtype = "2PL", verbose = FALSE)
# Extract ICC probabilities for flagged Item 5
theta_seq <- seq(-3, 3, by = 0.01)
p_male <- probtrace(extract.item(mod_male, 5), theta_seq)[, 2]
p_female <- probtrace(extract.item(mod_female, 5), theta_seq)[, 2]
# Build tidy data frame
df_icc <- data.frame(
  theta = rep(theta_seq, 2),
  prob = c(p_male, p_female),
  group = rep(c("Male (Reference)", "Female (Focal)"),
              each = length(theta_seq))
)
# Plot
ggplot(df_icc, aes(theta, prob, color = group)) +
  geom_line(linewidth = 1.3) +
  scale_color_manual(
    values = c("Male (Reference)" = "#032454",
               "Female (Focal)" = "#fcc209")
  ) +
  labs(
    title = "ICC Comparison: Item 5",
    subtitle = "Uniform DIF: female students have consistently higher P(correct)",
    x = expression("Ability " * theta),
    y = "P(Correct)",
    color = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "top")
Interpreting the results
Statistical significance alone does not indicate practical importance. The most widely used classification system for contextualising DIF magnitude was developed by the Educational Testing Service (ETS) and operates on the MH D-DIF scale.
| Category | MH D-DIF Criterion | Interpretation | Recommended Action |
|---|---|---|---|
| A | \|D-DIF\| < 1.0 | Negligible DIF. No practical concern. | Retain the item as-is |
| B | 1.0 ≤ \|D-DIF\| < 1.5 | Moderate DIF. Warrants attention. | Flag for expert panel review; contextual judgement needed |
| C | \|D-DIF\| ≥ 1.5 | Large DIF. Serious concern. | Revise, replace, or remove from the scoring key |
A statistically significant result alone does not justify item removal. Category C items should be subjected to substantive expert review to determine the source of differential functioning. In some cases the cause is identifiable from the item content (e.g., culturally or geographically specific references); in others it requires deeper investigation. The statistical flag initiates the review process; it does not conclude it.
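The magnitude thresholds in the table can be applied mechanically. This is a small sketch with a hypothetical helper, `classify_ets`, and made-up D-DIF values. It encodes only the absolute-value cut-offs shown above; the full ETS rules also take statistical significance into account.

```r
# Map MH D-DIF values to the A/B/C magnitude categories
classify_ets <- function(d_dif) {
  cut(abs(d_dif),
      breaks = c(0, 1.0, 1.5, Inf),
      labels = c("A", "B", "C"),
      right  = FALSE)   # intervals are [0, 1), [1, 1.5), [1.5, Inf)
}

d_dif <- c(Item1 = -0.35, Item2 = 1.20, Item3 = -1.80)  # made-up values
data.frame(d_dif, category = classify_ets(d_dif))
```

Here Item1 falls in category A, Item2 in B, and Item3 in C; the sign is ignored for classification and used only to read the direction of the effect.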
Run DIF analysis iteratively with purify = TRUE in difMH(). This progressively removes flagged items from the ability-matching step, preventing contaminated anchor items from masking additional DIF, a frequent problem in the initial pass.
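A purified run might look like the sketch below. The data are simulated inside the block with assumed parameters (the last two of ten items are made harder for the focal group); `purify` and `nrIter` are arguments of `difMH()`, with `nrIter` capping the number of purification iterations.

```r
library(difR)

set.seed(1)
n     <- 600
group <- rep(0:1, each = n / 2)    # 0 = reference, 1 = focal
theta <- rnorm(n)
b     <- seq(-1.5, 1.5, length.out = 10)
shift <- c(rep(0, 8), 0.8, 0.8)    # assumed DIF on the last two items

# Simulate Rasch-type responses; DIF items are harder for the focal group
resp <- sapply(seq_along(b), function(j)
  rbinom(n, 1, plogis(theta - b[j] - shift[j] * group)))
colnames(resp) <- paste0("Item", seq_along(b))

mh_pure <- difMH(
  Data = resp,
  group = group,
  focal.name = 1,
  purify = TRUE,  # re-match on items not currently flagged
  nrIter = 10     # maximum number of purification iterations
)

mh_pure$DIFitems  # items still flagged after purification
```

Comparing this list with the result of an unpurified run shows whether contaminated matching scores were hiding or creating flags.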
Applied example: gender and grade DIF in an ecological awareness instrument
Wiradika and Mahendra (2026) applied DIF analysis to an ecological awareness instrument administered to elementary school students. The study examined two grouping variables: gender (male vs. female) and grade level, providing a multidimensional assessment of measurement fairness within the same instrument.
The findings carry direct implications for instrument development. Items referencing specific ecological contexts, such as coastal ecosystems, may systematically advantage students whose lived experience aligns with those contexts. This form of structural bias is undetectable through inspection of aggregate scores alone; DIF analysis provides the methodological basis for identifying and addressing such disparities.
Responding to DIF: practical recommendations
Detection constitutes only the initial phase of DIF analysis. The appropriate response depends on the severity of the finding and the testing context:
- Convene a content review panel. Subject matter experts from both the reference and focal groups should examine the flagged item and assess whether a substantive rationale exists for the observed differential performance, independent of the groups' true ability levels.
- Revise the item stem or context. Where DIF is attributable to a contextually specific reference (e.g., mangrove, irrigation channel, fishing practice), revision toward a stimulus that is equally familiar to all groups is the preferred course of action. Parallel versions targeting different contexts may also be considered.
- Remove items only as a last resort. Item removal reduces instrument length and reliability; revision is ordinarily preferable. When removal is necessary, the item should be retired from the item bank rather than simply excluded from scoring, and replaced in subsequent test versions.
- Document and report. DIF findings constitute part of the validity evidence for an instrument. Published reports should specify which items were flagged, the magnitude and direction of DIF, and the remedial actions taken.
A fair assessment does not assume group equivalence. It ensures that each item measures the intended construct consistently, for all examinees.
Conclusion
DIF analysis operates at the interface of psychometric rigour and measurement equity. The central question it addresses, whether an item measures the intended construct equivalently across groups, is methodologically tractable but substantively demanding. The Mantel-Haenszel procedure, Lord's chi-square, and logistic regression approach this question from different statistical vantage points; used in combination, they constitute a principled framework for equitable test development.
For researchers developing or validating educational assessments, regardless of domain, DIF analysis should be incorporated as a standard component of instrument evaluation. The difR package provides an accessible implementation in R. The more demanding task, however, is the substantive interpretation: identifying why an item exhibits differential functioning and determining the appropriate response.