Introduction 🔗
Evaluating agreement between an AI/ML medical device and a clinical reference standard is essential for regulatory acceptance. While metrics like AUC, sensitivity/specificity, and Dice scores describe discriminative performance, they do not address whether two measurement methods can be used interchangeably. High discriminative performance can coexist with unacceptable systematic bias or clinically meaningful measurement deviation. Bland-Altman analysis is the FDA-preferred method to quantify agreement, bias, and limits of acceptable error.
What is Bland-Altman Analysis? 🔗
Bland–Altman analysis is a statistical method used to evaluate the agreement between two quantitative measurement techniques, such as assessing whether measurements obtained from an AI-enabled medical device can be considered interchangeable with those from an established clinical reference standard. In the context of clinical validation, it is essential to demonstrate that a new method is both reliable and reproducible. Because every measurement contains some degree of error and neither method can be assumed perfectly correct, Bland–Altman analysis provides a principled way to quantify the extent of disagreement and determine whether it is clinically acceptable.
The approach is typically implemented through the Bland–Altman plot, also known as a difference plot: a graphical tool for comparing two measurement techniques and assessing the agreement between paired measurements. The plot displays the difference between the two measurements on the y-axis and their average on the x-axis. In essence, a Bland–Altman plot is a scatter plot of differences against means, enabling visualization of agreement patterns and identification of potential systematic bias.
Why is a Bland–Altman Plot Required? 🔗
The fundamental objective in evaluating two measurement methods is to determine whether they are sufficiently comparable that one may replace the other with acceptable accuracy for the intended clinical application. Traditional statistical approaches such as correlation coefficients or linear regression are often misused to evaluate agreement. However, high correlation does not imply good agreement. Correlation measures the strength of association, not agreement. Two methods can correlate strongly while still exhibiting clinically unacceptable bias or variability.
For example, if one device systematically overestimates all values by a fixed amount, the correlation between the methods may be nearly 1.0, even though the measurements cannot be used interchangeably. Therefore, correlation and regression should not be used as primary tools for method comparison, as emphasized by Bland and Altman.
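To see this concretely, the short sketch below uses simulated numbers (all values hypothetical) in which the second method reads exactly 10 units higher than the reference: the correlation is essentially perfect, yet the constant bias makes the methods non-interchangeable.

import numpy as np

rng = np.random.default_rng(0)
reference = rng.normal(120, 15, 100)   # hypothetical reference readings
device = reference + 10                # second method overestimates every value by 10 units

r = np.corrcoef(reference, device)[0, 1]   # correlation is essentially 1.0 despite the offset
bias = np.mean(device - reference)         # systematic bias of 10 units

print(f"Pearson r = {r:.3f}, bias = {bias:.1f}")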
Bland–Altman analysis addresses these limitations by directly evaluating bias (systematic difference) and random error (dispersion of differences). The method compares paired measurements by plotting the difference between two methods (M1 − M2) against their average ((M1 + M2)/2). Reference lines provide clinically interpretable context:
- Horizontal line at zero difference (no bias)
- Mean difference (bias) line
- Limits of Agreement (LOA): Bias ± 1.96 × SD, the interval within which approximately 95% of the differences are expected to fall if they are normally distributed
This approach allows assessment of:
- Magnitude of disagreement
- Direction of systematic error
- Presence of proportional bias (difference changing with magnitude)
- Outliers
Illustrative Example 🔗
Consider the example from Altman and Bland (1983) comparing two systolic blood pressure measurement methods. The dataset includes 25 paired observations. The initial scatter plot of Method 1 vs. Method 2 suggests high agreement.

The Bland–Altman plot below assesses how well the two measurement methods agree by plotting the difference between them (M1 − M2) against their average. The central horizontal line shows the mean difference (bias), indicating whether one method tends to give higher values. The limits of agreement, shown as dashed lines at the mean difference ± 1.96 SD, indicate the range in which about 95% of differences fall and are used to judge whether the methods are interchangeable. In this plot, the mean difference is above zero, showing that M1 tends to read higher than M2, and the limits of agreement are wide, with differences increasing at higher blood pressure levels. This suggests poor agreement and possible proportional bias between the two methods.

Although the scatter plot of the two measurement methods suggested high agreement because the points fell close to the line of equality, the Bland–Altman plot reveals a more realistic picture. Scatter plots can hide important differences, especially when both methods are strongly correlated, which is common in biological measurements. The Bland–Altman plot shows that despite the high correlation, the actual differences between the two methods are large and increase at higher blood pressure values, indicating systematic bias and poor interchangeability. This contrast illustrates why correlation alone is insufficient for assessing agreement and why Bland–Altman analysis is the preferred method for evaluating whether two measurement methods can truly be used interchangeably.
Methodology 🔗
Bland and Altman introduced the Limits of Agreement (LOA) framework to quantify the expected variability between two methods. The idea is to examine how much two measurements differ for the same subject and whether that level of disagreement is acceptable.
Key Calculations 🔗
- Bias (mean difference)
Bias = mean(A − B)
- Limits of Agreement (LOA)
LOA = Bias ± 1.96 × SD
Where:
- Bias indicates systematic difference between methods
- SD is the standard deviation of the paired differences
- LOA is the interval within which approximately 95% of the differences are expected to fall (a short numeric sketch of these calculations follows)
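A minimal numeric sketch of these calculations, using hypothetical paired differences:

import numpy as np

# Hypothetical paired differences (method A minus method B)
diff = np.array([2.0, -1.5, 3.0, 0.5, 1.0, -0.5, 2.5, 1.5])

bias = diff.mean()        # systematic difference between the methods
sd = diff.std(ddof=1)     # sample SD of the paired differences
loa_low = bias - 1.96 * sd
loa_high = bias + 1.96 * sd

print(f"Bias = {bias:.2f}, LOA = [{loa_low:.2f}, {loa_high:.2f}]")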
Plot design 🔗
- X-axis: mean of the two measurements (A + B) / 2
- Y-axis: difference between measurements (A − B)
How to Interpret the Plot 🔗
A well-behaved comparison typically shows:
- Around 95% of points falling within the LOA
- No systematic trend or proportional bias as measurement magnitude increases
- Random scatter without clear patterns or shape change
How can a Bland-Altman Plot be used? 🔗
The Bland-Altman plot can be used to evaluate agreement, identify systematic bias, and find outliers in the data.
Evaluate agreement 🔗
The primary use of the Bland-Altman plot is to determine whether two methods agree well enough to be used interchangeably in a clinical or practical setting. The Limits of Agreement (LOA) define the expected range for 95% of individual differences. If the LOA are narrow enough that differences within them are clinically unimportant, the methods can be considered interchangeable; if the LOA are too wide, the random error between the methods is too great for them to be considered equivalent, even if the average bias is zero.
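A minimal sketch of this decision, assuming a hypothetical pre-specified clinical tolerance of ±5 units and hypothetical differences:

import numpy as np

clinical_tolerance = 5.0   # hypothetical maximum clinically acceptable difference

# Hypothetical paired differences between the two methods
diff = np.array([1.2, -0.8, 2.5, 0.3, -1.1, 1.9, 0.7, -0.2, 1.4, 0.9])
bias = diff.mean()
sd = diff.std(ddof=1)
loa_low, loa_high = bias - 1.96 * sd, bias + 1.96 * sd

# Treat the methods as interchangeable only if both limits sit inside the tolerance
interchangeable = abs(loa_low) <= clinical_tolerance and abs(loa_high) <= clinical_tolerance
print(f"LOA = [{loa_low:.2f}, {loa_high:.2f}] -> interchangeable: {interchangeable}")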
Identify Systematic and Proportional Bias 🔗
The plot helps diagnose consistent errors present in one measurement technique compared to the other.
- Systematic Bias: Indicated by the mean difference line being consistently above or below zero. This suggests that one method always measures higher or lower than the other on average.
- Proportional Bias (Trends): This is identified visually if the data points form a funnel shape or have a clear slope. Proportional bias means the magnitude of the difference changes as the magnitude of the measurement changes (e.g., the methods agree for low values but not for high values). A regression-based check is sketched below.
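One common way to quantify a trend is to regress the differences on the means; a slope clearly different from zero suggests proportional bias. A minimal sketch with simulated (hypothetical) data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
method_a = rng.normal(100, 15, 120)                 # hypothetical measurements
method_b = 0.9 * method_a + rng.normal(0, 3, 120)   # hypothetical comparator with proportional disagreement

diff = method_a - method_b
mean_vals = (method_a + method_b) / 2

# Regress the differences on the means; the slope quantifies the proportional bias
slope, intercept, r_value, p_value, std_err = stats.linregress(mean_vals, diff)
print(f"slope = {slope:.3f}, p-value = {p_value:.4g}")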
Find outliers in the data 🔗
Points that fall outside the 95% limits of agreement are considered potential outliers. It is important to investigate these specific measurements to understand if there were errors in the data collection or if the individual had unique characteristics affecting the measurements.
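A simple way to flag these points for review is to list the observations whose differences fall outside the LOA; the sketch below uses simulated (hypothetical) data:

import numpy as np

rng = np.random.default_rng(2)
method_a = rng.normal(100, 10, 80)               # hypothetical measurements
method_b = method_a + rng.normal(1.0, 3.0, 80)   # hypothetical comparator

diff = method_a - method_b
bias, sd = diff.mean(), diff.std(ddof=1)
loa_low, loa_high = bias - 1.96 * sd, bias + 1.96 * sd

# Indices of observations falling outside the limits of agreement
outside = np.where((diff < loa_low) | (diff > loa_high))[0]
print(f"Points outside the LOA: {outside.tolist()}")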
Example plot 🔗

Example of when Bland–Altman matters: two glucose methods can track together yet still disagree by clinically meaningful amounts.
Simple plot 🔗
A simple example of a Bland-Altman plot is the comparison of blood sugar measurements from two different measuring systems. In this case the x-axis is the mean of the two measurements and the y-axis is the difference between the two measuring systems. The plot shows the agreement or disagreement between the two measurement techniques.

Examples Demonstrating Different Agreement Scenarios 🔗
The image illustrates four different Bland–Altman plots demonstrating how patterns in measurement differences reveal agreement (or lack thereof) between two measurement methods or two observers. In all plots, the x-axis represents the mean of the paired measurements, and the y-axis represents the difference between the methods (Test2 − Test1). The solid horizontal line shows the mean bias, and the dashed lines indicate the limits of agreement (LOA).
Example A — Good Agreement, Minimal Bias, Narrow LOA 🔗
- The mean difference is close to zero, indicating no systematic bias.
- Most points lie tightly within the limits of agreement.
- Suggests that the two methods can be used interchangeably for clinical or measurement purposes.
Example B — Systematic Bias, Narrow LOA 🔗
- Points cluster close together with narrow LOA, showing high precision.
- However, the bias is significantly above zero, meaning one method consistently measures higher than the other.
- Strong repeatability, but measurements are not interchangeable without bias correction.
Example C — Minimal Bias, Wide LOA 🔗
- Mean bias is near zero, indicating little systematic shift.
- The spread of differences is large (broad LOA), meaning high variability between measurements.
- Agreement is poor due to lack of precision, even though average levels match.
Example D — Minimal Bias, Very Wide LOA (Worst Case) 🔗
- Very low bias but extremely broad LOA, indicating major inconsistency.
- Measurements are unreliable and not clinically acceptable.
- Demonstrates how low bias alone does not guarantee agreement.

Examples Demonstrating Different Biases 🔗
- Case of an absolute systematic error: On average, TEST2 measures about 5.3 units lower than TEST1, a systematic bias of 5.3 units. However, the difference does not appear to depend on measurement size. Agreement may be acceptable depending on the clinical tolerance for ±13.5 units; if accuracy within that range is acceptable, the methods may be interchangeable.
- Case of a proportional error: Agreement is poor. The amount of disagreement depends on measurement magnitude. These methods should not be used interchangeably without applying a correction.
- Case where the variation of at least one method depends strongly on the magnitude of measurements: The variability of the differences depends on the magnitude of the measurement, meaning agreement worsens as values increase. Even though the bias is small, the variation is not constant, which violates one of the assumptions of standard Bland–Altman analysis.
Example Code 🔗
- Using a plain Python function
import numpy as np
import matplotlib.pyplot as plt
def bland_altman_analysis(method_a, method_b, plot=True):
    """
    Perform Bland–Altman agreement analysis between two measurement methods.

    Parameters
    ----------
    method_a : array-like
        Measurements from method A (e.g., AI model).
    method_b : array-like
        Measurements from method B (e.g., clinical reference).
    plot : bool, optional
        If True, generates the Bland–Altman plot.

    Returns
    -------
    results : dict
        Dictionary containing bias, SD, and LOA bounds.
    """
    method_a = np.array(method_a)
    method_b = np.array(method_b)

    # Pairwise differences and means
    diff = method_a - method_b
    mean_vals = (method_a + method_b) / 2

    # Core Bland–Altman statistics
    bias = np.mean(diff)
    sd = np.std(diff, ddof=1)
    loa_low = bias - 1.96 * sd
    loa_high = bias + 1.96 * sd

    results = {
        "bias": bias,
        "sd": sd,
        "loa_low": loa_low,
        "loa_high": loa_high
    }

    # Bland–Altman plot
    if plot:
        plt.figure(figsize=(8, 6))
        plt.scatter(mean_vals, diff, alpha=0.5)
        plt.axhline(bias, linestyle='--', label="Bias")
        plt.axhline(loa_low, linestyle='--', label="Lower LOA")
        plt.axhline(loa_high, linestyle='--', label="Upper LOA")
        plt.xlabel("Mean of Measurements")
        plt.ylabel("Difference (A - B)")
        plt.title("Bland–Altman Plot")
        plt.grid(alpha=0.3)
        plt.legend()
        plt.show()

    return results
# Generate example simulated data
np.random.seed(42)
true_values = np.random.normal(100, 10, 200)
method_a = true_values
method_b = true_values + np.random.normal(1.5, 4, 200)
results = bland_altman_analysis(method_a, method_b, plot=True)
print(results)
- Using the Pingouin library
import numpy as np
import pingouin as pg
import matplotlib.pyplot as plt
# Generate example simulated data
np.random.seed(42)
true_values = np.random.normal(100, 10, 200)
method_a = true_values
method_b = true_values + np.random.normal(1.5, 4, 200)
pg.plot_blandaltman(method_a, method_b)
plt.title("Bland–Altman Plot")
plt.show()
Pitfalls in Bland-Altman analysis 🔗
One of the critical problems in Bland-Altman analysis is the need to meet the assumption of normality. The continuous measurement variables themselves need not be normally distributed, but their differences should be. If the normality assumption is not met, the data may be logarithmically transformed.
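A quick way to check this assumption is a normality test on the differences, repeated on log-transformed measurements if the raw differences are skewed. A minimal sketch with simulated (hypothetical) data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
method_a = rng.lognormal(mean=4.5, sigma=0.2, size=100)                # hypothetical skewed measurements
method_b = method_a * rng.lognormal(mean=0.02, sigma=0.05, size=100)   # hypothetical comparator

# Normality of the raw differences
diff = method_a - method_b
print("Shapiro-Wilk p-value (raw differences):", stats.shapiro(diff).pvalue)

# If the raw differences are clearly non-normal, analyse log-transformed measurements instead
log_diff = np.log(method_a) - np.log(method_b)
print("Shapiro-Wilk p-value (log-scale differences):", stats.shapiro(log_diff).pvalue)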
Another problem arises from the sample size. Studies comparing measurement methods should be adequately sized to conclude that the effects are generally valid. If the sample size is inadequate, it is possible to obtain a low mean bias and narrow limits of agreement that do not generalize; such methods cannot be recommended for general use without verification by other studies. To calculate the required sample size, a maximum allowed difference derived from other studies should be specified.
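One common approach, based on Bland and Altman's approximation that the standard error of a limit of agreement is roughly √(3·SD²/n), is to choose n so that the 95% confidence interval around each limit is narrower than the allowed margin. The sketch below assumes a hypothetical expected SD of the differences and a hypothetical precision requirement:

import numpy as np

expected_sd = 4.0          # hypothetical expected SD of the paired differences
max_ci_half_width = 1.5    # hypothetical maximum acceptable 95% CI half-width for each LOA

# CI half-width of a limit of agreement ~ 1.96 * sqrt(3 * sd^2 / n);
# solve for n so that the half-width stays below the allowed margin
n = int(np.ceil(3 * (1.96 * expected_sd / max_ci_half_width) ** 2))
print(f"Approximate required sample size: {n}")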
Another problem in Bland-Altman analysis is repeated-measures designs. The standard Bland-Altman analysis is not an appropriate method for comparing repeated measurements from the same subjects. However, it can be adapted by incorporating a random-effects model, and some statistical software packages can perform Bland-Altman analysis for repeated-measures designs.
Conclusion 🔗
If your device outputs a number, FDA is going to ask the obvious question: how far off is it, and when does that matter clinically?
Bland-Altman is the fastest way to turn that into a clearance-ready story. It connects your measurement performance to a pre-defined clinical tolerance, and it supports the claims, labeling, and study design you need for a credible 510(k) or De Novo package.
Done right, this is not "stats for stats' sake." It is risk reduction. Fewer avoidable review questions. Fewer late changes to claims. Less chance you run a study that cannot support the business outcome: FDA clearance.
Want to de-risk FDA clearance for a quantitative AI/ML device? Our FDA pre-sub service helps you set clinically acceptable LOA up front, pressure-test claims, and lock the validation plan before you spend real money on the pivotal study.