
Deep data, not big data

Bigger isn’t always better. In drug discovery, Dr Michael Ritchie argues that the future belongs not to those with the most data, but to those who understand its biological depth.


Artificial intelligence (AI) has become a constant presence in discussions about drug discovery and development. Every week seems to bring a new claim about algorithms that can predict binding affinities, optimise ADME properties, or even design first-in-class molecules. Yet, for all the excitement, the gap between computational prediction and biological reality remains wide.

As someone who has spent years at the interface between preclinical biology and data science, I’ve come to believe that this gap has less to do with algorithms and more to do with the data we use to train them. In other words, the future of AI in drug discovery won’t be determined by who has the biggest datasets, but by who has the most biologically rich ones.

Quantity has replaced quality in many AI pipelines

The industry’s early embrace of AI has been shaped by the availability of large, public datasets. These resources have been invaluable for developing and benchmarking algorithms, but they also share a common limitation: they tend to be broad but shallow. Gene expression profiles may lack paired proteomic data. Cell line models may have unclear passage histories. Clinical datasets often include incomplete annotations or limited context about the underlying biology.



Machine learning models are powerful pattern recognisers, but they are also hostages to their training inputs. If the underlying data are noisy, incomplete, or biologically inconsistent, even the most advanced architectures will return outputs that look statistically elegant yet fail to reproduce in vivo.

We often talk about the promise of big data in biology, but more isn’t always better. The truth is, when everything is included, signal and noise become indistinguishable. What we gain in scale, we lose in biological resolution.

We’ve seen this pattern before. Early drug-repurposing models trained on uncurated transcriptomic data produced hit lists that were mathematically impressive but biologically implausible. When those same models were retrained on smaller but well-annotated datasets, their predictive accuracy improved substantially. The takeaway is consistent: machine learning is only as powerful as the biological truth embedded in its training data.
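To make that takeaway concrete, here is a toy sketch with purely synthetic data (all sizes and error rates are illustrative, not drawn from any real study): a large training set whose positive labels are systematically mis-annotated is compared against a smaller, correctly annotated one, with both models evaluated on clean held-out data.

```python
# Toy sketch (synthetic data, illustrative only): systematic annotation error
# in a large dataset versus a smaller, correctly annotated one.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make(n):
    """30 'expression' features; the phenotype depends on the first five."""
    X = rng.normal(size=(n, 30))
    y = (X[:, :5].sum(axis=1) > 0).astype(int)
    return X, y

X_big, y_big = make(5000)
# Simulate systematic annotation error: 45% of true positives recorded as negative.
flip = (y_big == 1) & (rng.random(5000) < 0.45)
y_big_noisy = np.where(flip, 0, y_big)

X_small, y_small = make(400)   # smaller, but correctly annotated
X_test, y_test = make(2000)    # clean held-out evaluation set

acc_big = LogisticRegression(max_iter=1000).fit(X_big, y_big_noisy).score(X_test, y_test)
acc_small = LogisticRegression(max_iter=1000).fit(X_small, y_small).score(X_test, y_test)
print(f"large noisy dataset:   {acc_big:.2f}")
print(f"small curated dataset: {acc_small:.2f}")
```

Despite being more than ten times larger, the noisy dataset yields the weaker model, because the systematic mislabelling shifts the decision boundary the model learns.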

Depth creates context, and context drives discovery

True biological depth means capturing multiple layers of information from the same model system: genotype, transcriptome, proteome, and even the dynamic changes that occur in response to perturbation. It means linking these data back to functional outcomes rather than treating them as isolated features.


When we integrate these layers, patterns begin to emerge that aren’t visible in any single dataset. Subtle regulatory mechanisms, compensatory pathways, and context-specific vulnerabilities become apparent only when the data reflect the complexity of real biology.

One example of this is multi-omic analysis of resistant tumours. By comparing genomic mutations with proteomic signalling and phenotypic response, researchers have identified compensatory survival mechanisms that would never have been apparent from DNA data alone. These insights don’t just improve mechanistic understanding, they reveal entirely new druggable nodes.

This is where AI can become transformative, not as a replacement for biology, but as a means of revealing what we might otherwise miss. The most meaningful models are built not on the largest datasets, but on those with the highest fidelity to the systems we aim to study.

The challenge of biological relevance

The challenge, of course, is that high-resolution biological data are difficult and expensive to generate. Deep multi-omic characterisation requires careful experimental design, validated reference models, and a disciplined approach to data integration.


This is a very different exercise from scraping public repositories or merging datasets with inconsistent metadata. In my experience, the most successful AI applications in drug discovery are those that start with a defined biological question and build a dataset around it, rather than starting with a dataset and asking what the algorithm can find.

That distinction matters. When data are collected with a purpose, they carry the biological signatures of that purpose. When data are assembled opportunistically, those signatures become blurred, and the models that follow often inherit that ambiguity.

One key aspect that often gets overlooked is validation. In experimental science, we understand the importance of independent replication, but in AI-driven discovery, validation is frequently reduced to cross-validation within the same dataset. This can lead to overfitting that looks like accuracy but is really just self-consistency. True validation means testing on independent biological data, ideally from separate studies, models, or institutions. That’s much harder to do when comparable, high-fidelity datasets are scarce, which is precisely why we need to generate more of them.
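A small synthetic example makes the distinction visible (everything here is illustrative: the "studies", feature counts, and effect sizes are invented). A technical covariate happens to track the phenotype in one study but not in an independent one; cross-validation within the confounded study looks excellent, while accuracy on the independent study collapses toward what the weak biological signal actually supports.

```python
# Toy sketch (synthetic data, illustrative only): within-dataset
# cross-validation versus validation on an independent study.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def make_study(n, confounded):
    signal = rng.normal(size=(n, 5)) * 0.4     # weak biological signal
    y = (signal.sum(axis=1) > 0).astype(int)
    batch = rng.normal(size=(n, 45))           # technical/batch features
    if confounded:
        batch += y[:, None] * 1.5              # confounder tracks the label here only
    return np.hstack([signal, batch]), y

X_a, y_a = make_study(300, confounded=True)    # training study
X_b, y_b = make_study(300, confounded=False)   # independent study

model = RandomForestClassifier(n_estimators=200, random_state=0)
cv_acc = cross_val_score(model, X_a, y_a, cv=5).mean()   # self-consistency
ext_acc = model.fit(X_a, y_a).score(X_b, y_b)            # true validation
print(f"within-study cross-validation: {cv_acc:.2f}")
print(f"independent-study accuracy:    {ext_acc:.2f}")
```

The model has learned the batch artefact, not the biology, and only the independent study exposes that.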

From algorithmic power to biological insight

We are entering a period where model complexity is outpacing data quality. Deep learning architectures can process enormous amounts of information, but they also magnify the flaws within it. The sophistication of the model does not compensate for the absence of biological truth.

I sometimes think of this as a kind of computational mirage. The outputs look convincing, the predictions correlate neatly, and the visualisations glow with interpretive confidence, until they are tested in a real biological system. It’s only then that we see how brittle some of these predictions can be.


To close that gap, we need to anchor our machine learning efforts in data that carry biological meaning. That includes multi-omic datasets that link genotype to phenotype, proteomic maps that capture post-translational regulation, and functional data that connect molecular change to observable behaviour.

For example, integrating phosphoproteomic and transcriptomic data has improved target prioritisation in kinase inhibitor discovery. Combining cell-surface proteomics with gene expression has identified immune signatures predictive of therapeutic response. These efforts share a common principle: the biological system defines the data architecture, not the other way around.
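The underlying principle can be sketched in a few lines with synthetic data (the "transcriptomic" and "phosphoproteomic" layers, effect sizes, and model choice are all illustrative): when a phenotype is driven jointly by two mechanisms, each captured by a different omic layer, a model trained on the combined, per-layer standardised features outperforms either layer alone.

```python
# Toy sketch (synthetic data, illustrative only): two omic layers carrying
# complementary signal about a shared phenotype.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 400

# Latent phenotype driven jointly by two mechanisms.
mech_rna = rng.normal(size=n)    # captured by the transcriptomic layer
mech_phos = rng.normal(size=n)   # captured by the phosphoproteomic layer
y = ((mech_rna + mech_phos) > 0).astype(int)

rna = mech_rna[:, None] + rng.normal(size=(n, 20))    # 20 noisy readouts per layer
phos = mech_phos[:, None] + rng.normal(size=(n, 20))

def accuracy(X):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    scaler = StandardScaler().fit(X_tr)               # scale fitted on training data only
    model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)
    return model.score(scaler.transform(X_te), y_te)

acc_rna = accuracy(rna)
acc_phos = accuracy(phos)
acc_both = accuracy(np.hstack([rna, phos]))           # integrated layers
print(f"transcriptomics alone:   {acc_rna:.2f}")
print(f"phosphoproteomics alone: {acc_phos:.2f}")
print(f"integrated:              {acc_both:.2f}")
```

Neither layer can see the mechanism carried by the other; only the integrated feature space reflects the joint dependency that actually determines the phenotype.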

Integration, not isolation

One of the most overlooked aspects of data science in drug discovery is the importance of integration. Biology does not operate in silos, and neither should our data.

When we combine sequencing data with proteomics, imaging, and phenotypic outcomes, we create a foundation for models that understand relationships rather than isolated features. This integrated approach doesn’t just improve prediction accuracy, it changes the kinds of questions we can ask.

Instead of asking whether a compound binds to a target, we can ask how that binding event reshapes downstream signalling or alters the tumour microenvironment. We can move from correlation to causation, from description to mechanism.

Integrated datasets also serve as a natural check against spurious results. When different data layers converge on the same biological interpretation, confidence increases. When they diverge, that tension often reveals something interesting, a hidden mechanism, a context-specific dependency, or a false assumption worth revisiting.

That is the real promise of AI in biology, not to automate discovery, but to illuminate it.

The responsibility of interpretation

With that promise comes responsibility. As AI becomes more embedded in preclinical and clinical decision-making, we must remain cautious about what these models truly represent. A model can only be as good as the data and assumptions it’s built upon.

Transparency, reproducibility, and biological validation are critical safeguards. Without them, even the most elegant AI can mislead. As an industry, we must ask not just whether our models perform statistically, but whether they perform biologically.

That also extends to interpretability. As algorithms become more complex, we risk creating black boxes that even their developers can’t fully explain. Explainable AI approaches are beginning to help, but they work best when paired with well-annotated biological data. The interpretability challenge isn’t purely computational, it’s biological.

A reflective conclusion

As the field continues to evolve, I find myself returning to a simple question: when we build models on data that are shallow, fragmented, or context-poor, are we really modelling biology, or just modelling our own limitations?

Perhaps it’s time to look beyond the largest datasets and focus instead on the ones that are most representative of the living systems we aim to treat. Because in the end, the goal isn’t to make AI smarter. It’s to make our understanding of disease more truthful, reproducible, and predictive – so that when an algorithm tells us something new, we can believe that biology would agree.

Meet the author

Mike Ritchie has over 20 years of experience in oncology research and drug development. He earned his PhD in Biochemistry from Temple University School of Medicine and completed a postdoctoral fellowship in Neuroscience at Harvard Medical School. Earlier in his career, he led therapeutic programmes at Pfizer, and for the past decade at Champions Oncology he has advanced the use of patient-derived models and multi-omic platforms. A recognised thought leader, he shares insights on translational models, radiopharmaceuticals and AI-driven analytics to accelerate oncology pipelines and reduce risk.
