More biomarkers, better diagnosis?

Why adding dimensions to a diagnostic panel can quietly work against you, and why the AI argument, which I hear from time to time, doesn’t change that much.

The instinct is clear: if one biomarker tells you something, five tell you more, hey! And if five aren’t enough, let’s add machine learning to sort out the complexity of all those biomarker data points. It sounds almost trivial, but the geometry, as far as I can tell, says otherwise.

A sphere that loses its volume

I arrived at this topic through a video I watched a few weeks ago, in which a mathematician explained how the volume of a sphere depends on the number of dimensions. Really fascinating. I then connected that geometrical observation to my academic research, multiplex assay development for the IVD field.

Briefly, the volume of the unit ball (radius 1) in n dimensions is given by π^(n/2) / Γ(n/2 + 1). As the number of dimensions grows, the Gamma function in the denominator eventually outpaces the numerator, so the volume peaks at dimension five, then decreases monotonically toward zero.
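
To see the peak for yourself, here is a minimal sketch that evaluates the formula for a ball of radius 1 using Python's standard library. The function name `unit_ball_volume` and the range of dimensions are my own choices for illustration.

```python
import math


def unit_ball_volume(n: int) -> float:
    """Volume of the radius-1 ball in n dimensions: pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)


# Evaluate the formula for dimensions 1 to 20 (an arbitrary illustrative range).
volumes = {n: unit_ball_volume(n) for n in range(1, 21)}
peak_dim = max(volumes, key=volumes.get)

for n in (1, 2, 3, 5, 6, 10, 20):
    print(f"n={n:2d}  volume={volumes[n]:.4f}")
print("largest volume at n =", peak_dim)
```

For n = 2 and n = 3 this gives back the familiar π and 4π/3, which is a quick sanity check that the formula is the right one.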

The consequence was disorienting to me when I first learned it: in high-dimensional spaces, almost all of a sphere’s volume concentrates in a thin shell near the surface. The interior is, for practical purposes, empty. Adding a dimension doesn’t expand the space you can usefully occupy; instead, it redistributes almost everything to the boundary.
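
The shell effect follows from one line of arithmetic: volume scales as radius to the power n, so the inner ball of radius (1 - ε) holds a fraction (1 - ε)^n of the total. The sketch below computes what is left for the outer shell. The 5% shell thickness is an arbitrary value I picked for illustration.

```python
def shell_fraction(n_dims: int, thickness: float = 0.05) -> float:
    """Fraction of a ball's volume lying in the outer shell.

    `thickness` is the shell depth relative to the radius (0.05 means
    the outermost 5%). Since volume scales as r**n, the inner ball
    keeps (1 - thickness)**n of the total volume.
    """
    return 1.0 - (1.0 - thickness) ** n_dims


for n in (2, 6, 20, 100):
    print(f"n={n:3d}  share of volume in outer 5% shell: {shell_fraction(n):.1%}")
```

With six dimensions, roughly a quarter of the volume already sits in that outer 5%; at one hundred dimensions it is more than 99%.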

To say it quickly: in high dimensions, there is no inside, just a big surface.

This isn’t just a curiosity of pure mathematics: it’s the geometric foundation of what statisticians call the curse of dimensionality! As dimensions increase, the volume of your space grows so quickly that any fixed dataset becomes vanishingly sparse. The neighbors dissolve, the distances homogenize, and the structure flattens into, mmmh, noise.
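
The "distances homogenize" part can be checked with a toy experiment. The sketch below scatters random points uniformly in a unit cube and compares the nearest and farthest distance from one query point. Everything here is an assumption for illustration (uniform data, 200 points, the `relative_contrast` name); real patient data are not uniform, which is exactly the point of the next section.

```python
import math
import random


def relative_contrast(n_dims: int, n_points: int = 200, seed: int = 0) -> float:
    """(farthest - nearest) / nearest distance from a query point to a random cloud.

    Points are drawn uniformly in the unit cube. A large value means the
    nearest neighbor is meaningfully closer than the farthest point.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


for n in (2, 6, 20, 200):
    print(f"n={n:3d}  relative contrast = {relative_contrast(n):.2f}")
```

As the number of dimensions grows, the contrast shrinks: the nearest and the farthest point end up at nearly the same distance, so "nearest neighbor" stops meaning much.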

Your patients are points in that space

In a multiplex immunoassay panel, each patient is a point in an n-dimensional space, one axis per biomarker. In the case of traumatic brain injury, during the acute phase, it could be GFAP, UCH-L1, S100β, NSE, NfL, and Tau. In this scenario, that means 6 dimensions. The question is what happens when you add more.

Naively, we could say something like «more axes, so richer representation and better separation between healthy and pathological populations». But what actually happens depends entirely on how those new axes affect the geometry of the clinical populations.

Patients are not uniformly distributed across that high-dimensional space. They occupy a structured, low-dimensional region, a submanifold, in the language that Tenenbaum, de Silva & Langford (2000) and Roweis & Saul (2000) introduced for nonlinear dimensionality reduction, and that has since become standard in computational biology (e.g. Pe’er and colleagues for single-cell mass cytometry). The submanifold’s shape is determined by the biology. Healthy controls cluster in one zone (well, we hope so) while severe TBI occupies another. And during the recovery phase, TBI patients follow trajectories that evolve over time: NfL peaks days after injury, while IL-6 surges with inflammation and then clears within days.

The question is not how many biomarkers, but which dimensions reveal the structure of that submanifold.

A biomarker earns its place in a panel by doing one of two things: (i) extending a discriminative direction so that clinically distinct populations separate further, or (ii) resolving a previously hidden or collapsed temporal dimension. A biomarker that correlates strongly with one already in the panel doesn’t add a new axis; worse, it inflates the cloud in an already-occupied direction. Fantastic result: maximum geometric cost, zero diagnostic gain. This intuition is precisely what Peng, Long & Ding (2005) formalised in their mRMR criterion: maximise relevance to the outcome, minimise redundancy with already-selected features.
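
To make the relevance-minus-redundancy idea concrete, here is a minimal greedy sketch. Two loud caveats: Peng, Long & Ding define mRMR with mutual information, whereas I swap in the absolute Pearson correlation as a crude stand-in to stay within the standard library, and the data are synthetic. The marker names (`marker_a`, `marker_a_copy`, `marker_b`, `noise`) are hypothetical and do not correspond to real analytes.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def greedy_mrmr(features, outcome, n_select):
    """Greedy selection: relevance to the outcome minus mean redundancy
    with already-selected features (|Pearson r| as a proxy for both)."""
    relevance = {name: abs(pearson(vals, outcome)) for name, vals in features.items()}
    selected = []
    while len(selected) < n_select:
        best_name, best_score = None, -math.inf
        for name, vals in features.items():
            if name in selected:
                continue
            redundancy = 0.0
            if selected:
                redundancy = sum(abs(pearson(vals, features[s])) for s in selected) / len(selected)
            score = relevance[name] - redundancy
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected


# Synthetic cohort: outcome is 0 (control) or 1 (case).
rng = random.Random(0)
outcome = [i % 2 for i in range(2000)]
marker_a = [2.0 * y + rng.gauss(0, 1) for y in outcome]
features = {
    "marker_a": marker_a,
    "marker_a_copy": [a + rng.gauss(0, 0.3) for a in marker_a],  # near-duplicate of marker_a
    "marker_b": [1.5 * y + rng.gauss(0, 1) for y in outcome],    # independent measurement noise
    "noise": [rng.gauss(0, 1) for _ in outcome],                  # unrelated to the outcome
}

selected = greedy_mrmr(features, outcome, n_select=2)
print(selected)
```

The near-duplicate marker is highly relevant on its own, yet it loses the second slot to `marker_b`, because almost all of its relevance is already covered by the first pick.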

What the «AI can do it!» argument misses

The appeal to machine learning goes like this: classical statistics struggles with high-dimensional, correlated data, but modern machine learning can handle it. Therefore, add more biomarkers freely and let the model figure out what matters! Geez, so easy, no!? AI will save the world, again!

There’s some truth to this. Regularization, dimensionality reduction, and ensemble methods do mitigate the curse of dimensionality. But they do not eliminate it, and they introduce their own constraints that are often overlooked.

A model trained on n-dimensional inputs still needs sufficient samples to estimate the structure of the underlying distribution. In a disease with heterogeneous presentation, limited biobank access, and batch effects across sites (hello, TBI with the recovery phase, for instance), the required sample size grows faster than most clinical studies can meet. Regularization penalizes complexity, but it cannot create separation that isn’t there. It compresses dimensions that don’t contribute, which is exactly what good biomarker selection would have done upstream, before training.

The deeper issue is that adding a non-classifying biomarker doesn’t give the model more signal; it gives it more noise to suppress. A well-designed four-biomarker panel with strong geometric separation will outperform a ten-biomarker panel with six redundant axes, regardless of what sits on top of the data. The “geometry” is upstream of the algorithm.
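
A toy simulation can illustrate the direction of the effect, although it proves nothing about real panels. In the sketch below, four axes carry a shift between cases and controls, and the added axes are pure noise, which is my simplification of "non-classifying". I deliberately use a 1-nearest-neighbor classifier because it has no regularization at all; a penalized model would be hurt less. All names and parameter values are assumptions for illustration.

```python
import math
import random


def one_nn_accuracy(n_noise_axes, n_signal_axes=4, shift=1.0,
                    n_train=100, n_test=400, seed=0):
    """Accuracy of a 1-nearest-neighbor classifier on synthetic patients.

    Signal axes are shifted by `shift` for cases (label 1); noise axes
    follow the same distribution in both groups.
    """
    rng = random.Random(seed)

    def draw(label):
        signal = [rng.gauss(shift * label, 1.0) for _ in range(n_signal_axes)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise_axes)]
        return signal + noise

    def sample(n):
        # Alternate labels so both groups are equally represented.
        return [(draw(i % 2), i % 2) for i in range(n)]

    train, test = sample(n_train), sample(n_test)
    correct = 0
    for x, label in test:
        # Predict the label of the closest training patient.
        _, predicted = min(train, key=lambda t: math.dist(x, t[0]))
        correct += predicted == label
    return correct / n_test


for extra in (0, 6, 40):
    print(f"4 informative + {extra:2d} noise axes: accuracy = {one_nn_accuracy(extra):.2f}")
```

The training set stays the same size in every run; only the number of uninformative axes changes, so any drop in accuracy comes from the added dimensions alone.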

The selection problem, stated plainly

Choosing a diagnostic panel is a geometric problem (no, I wanted to do biology, not math!) because you are trying to find the minimal set of axes that maximally separates the clinical submanifolds and aligns those separations with the directions that matter for the decision you need to make. In formal language, this is a problem of intrinsic dimension estimation (Levina & Bickel, 2004) combined with discriminative feature selection, a long-standing area of statistics and machine learning, with Guyon & Elisseeff (2003) as the entry point.

This reframes what biomarker validation should look like. The question isn’t only whether a new candidate correlates with the outcome. It’s whether it adds a direction in the space that is not already covered and whether that direction carries discriminating information. Orthogonality to existing biomarkers is necessary but not sufficient. The new axis must also separate populations.

IVD specialists develop this intuition empirically. Certain combinations work while others add noise. The instinct resists formal articulation, and what the geometry provides isn’t a correction to that intuition but the language to express why it’s right.

A practical asymmetry

There’s an asymmetry worth stating explicitly. Adding a classifying biomarker to a panel is strictly beneficial, as it adds a new dimension of separation and improves diagnostic resolution. Adding a redundant biomarker is never neutral: it increases the problem’s dimensionality, the sample size required to characterize the distribution, and the assay’s cost and complexity, while providing no countervailing gain.

The geometry doesn’t tell you which biomarkers to choose, as that task requires biology, clinical context, and validation data. But it does tell you what you’re optimizing for and makes the cost of the wrong choice legible in a way that “we measured more things” doesn’t.

Header image by Galina Nelyubova

Postscript, Scientific context (updated 2026-05-26)

After writing this post, I started to dig into the literature that formalises these intuitions, with the help of a colleague (well, an AI, but a useful one) who traced the connections outside of my usual field. A few things became clearer, and I want to record them transparently rather than rewrite the post as if I’d known all along.

What I called the *submanifold* of patients is the same object that the manifold learning community has been studying since Tenenbaum, de Silva & Langford (2000) and Roweis & Saul (2000) launched the field with two back-to-back Science papers. The underlying idea is the so-called *manifold hypothesis*: real-world high-dimensional data tend to lie on low-dimensional structures embedded in the ambient space.

What I called *classifying vs. redundant biomarkers* is the same distinction that Peng, Long & Ding (2005) formalised information-theoretically as the mRMR criterion: maximise relevance to the outcome, minimise redundancy with already-selected features. The geometric intuition and the mutual-information formalisation are two faces of the same object.

What I called the *intrinsic dimension* of the clinical submanifold is the formal quantity that Levina & Bickel (2004) provided a maximum-likelihood estimator for. If you have a dataset of n-dimensional measurements but the underlying biology only varies along k independent directions (with k ≪ n), Levina-Bickel gives you a principled way to estimate k from data.
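
As I understand it, the estimator works from nearest-neighbor distances: for each point, it compares the distance to the farthest of its nearest neighbors with the distances to the closer ones, and averages the resulting local estimates. Below is my own minimal reading of that formula, not a reference implementation. The synthetic data (two latent factors spread over six measured axes), the mixing weights, and the choice of 10 neighbors are all assumptions for illustration, and the code assumes no two points coincide.

```python
import math
import random


def levina_bickel_dimension(points, n_neighbors=10):
    """Average of local maximum-likelihood intrinsic-dimension estimates.

    For each point, with T_1 <= ... <= T_K the distances to its K nearest
    neighbors, the local estimate is (K - 1) / sum_j log(T_K / T_j), j < K.
    Assumes no duplicate points (a zero distance would break the log).
    """
    estimates = []
    for i, x in enumerate(points):
        dists = sorted(math.dist(x, y) for j, y in enumerate(points) if j != i)
        t = dists[:n_neighbors]
        log_ratios = [math.log(t[-1] / tj) for tj in t[:-1]]
        estimates.append((n_neighbors - 1) / sum(log_ratios))
    return sum(estimates) / len(estimates)


# Synthetic panel: 6 measured axes driven by only 2 latent factors.
# Each row gives one axis's weights on the two factors (hypothetical values).
mixing = [[1, 0], [0, 1], [1, 1], [1, -1], [0.5, 0], [0, 0.5]]
rng = random.Random(0)
points = []
for _ in range(300):
    latent = (rng.gauss(0, 1), rng.gauss(0, 1))
    points.append([w0 * latent[0] + w1 * latent[1] for w0, w1 in mixing])

estimate = levina_bickel_dimension(points)
print(f"measured axes: 6, estimated intrinsic dimension: {estimate:.2f}")
```

The estimate should land near 2 rather than 6, although the exact value depends on the number of neighbors and the sample size, so I would read it as an order of magnitude and not as a precise count.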

And the closest existing application of all this to multiparameter biology is the impressive work of Dana Pe’er and colleagues on mass cytometry, where individual cells become points in high-dimensional protein-expression space and the biology is read off the structure of the submanifolds they form. PhenoGraph (Levine et al., 2015) and PHATE (Moon et al., 2019) are the two papers I’d start with.

So this post is not a discovery but a small attempt at making the geometric framing legible in the language of multiplex clinical biomarker panels (mTBI, sepsis, cardio, and so on). And I still have a lot of work ahead to fully master and integrate all those concepts in my research. It’s a beautiful world to discover.