Making science run at the speed of thought: the reality of AI in drug discovery – Part 1

Everyone talks about AI speeding up drug discovery, but Eric Ma explains why, without clean data and statistical discipline, it can actually do the opposite.

Eric Ma, a data science lead for Moderna’s research data science and artificial intelligence team, opens our conversation with a mission statement that sounds almost utopian: “Make science run at the speed of thought.” Yet as our discussion unfolds, it becomes clear that this isn’t naive optimism, but a hard-won vision tempered by years of wrestling with the messy realities of applying machine learning to drug discovery. Leading a team of six that works directly with bench scientists across immunology, chromatography and protein engineering, Ma has earned his stripes navigating the treacherous gap between AI’s promise and biology’s stubborn complexity.

The economics of machine learning: a catch-22

One of the most striking insights from Ma’s work – published in ACS Catalysis following his protein engineering project at Novartis – concerns the economics of when machine learning (ML) actually makes sense. The logic is brutally simple yet often overlooked: supervised machine learning models (oracle models) require substantial data to become accurate. But here’s the paradox: if your assay is expensive to run, you cannot generate enough data to train a good model. Conversely, if your assay is cheap enough that you can generate lots of data points, why would you need a machine learning model at all? You could simply brute-force your way through the problem space.

“You just have to accept the risk that a second round of screening may not be needed,” Ma explains, describing scenarios where cheap, high-fidelity assays eliminate the need for predictive models altogether.

This leaves a narrow sweet spot where ML is genuinely valuable: expensive assays for which you have historical data, or situations requiring sophisticated uncertainty quantification on small datasets. Even then, Ma cautions, the historical data approach – often considered the ‘nirvana’ solution – comes with its own set of problems.

The hidden crisis in historical data

For large pharma and biotech companies sitting on decades of assay data, the temptation to apply machine learning is irresistible. However, Ma reveals a disturbing truth: much of this historical data is built on shaky foundations.

“The assay may drift over time simply because of the operator changing,” he notes. Over ten years, machines change, people change, software changes – yet we make assumptions that IC50 values remain comparable across these shifts. The problem runs deeper than simple experimental variation. What if the software license for JMP wasn’t renewed one year, forcing a switch to a custom Python 2.7 script for calculating IC50 values? What if that script lived in someone’s home directory and was never archived? These are not hypothetical scenarios – they are the reality of how science has been conducted until recently.
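To make the drift problem concrete, consider how an IC50 is typically derived: by fitting a four-parameter logistic curve to dose-response data. The sketch below (hypothetical data and parameter choices, using SciPy) shows how many tunable decisions – the initial guesses, the curve model, the optimiser settings – sit inside a calculation that downstream databases record as a single number.

```python
import numpy as np
from scipy.optimize import curve_fit

def four_pl(conc, bottom, top, ic50, hill):
    """Four-parameter logistic dose-response model."""
    return bottom + (top - bottom) / (1 + (conc / ic50) ** hill)

# Hypothetical dose-response measurements (concentration in nM, % response)
conc = np.array([0.1, 0.3, 1, 3, 10, 30, 100, 300])
resp = np.array([98, 95, 85, 65, 40, 20, 8, 3])

# p0 (initial guesses) is one of the 'knobs' a script author chooses;
# two scripts with different settings can report different IC50 values.
params, _ = curve_fit(four_pl, conc, resp, p0=[0, 100, 10, 1], maxfev=10000)
bottom, top, ic50, hill = params
print(f"Fitted IC50: {ic50:.2f} nM")
```

If one script fits this model in JMP and its replacement fits it in Python with different defaults, the stored summary values are no longer strictly comparable – which is exactly the drift Ma describes.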

“We’re given a database of just the summarised value; not the underlying measurement values; not the values of the controls measured in the same experiment,” Ma explains. Without this metadata, proper statistical estimation becomes nearly impossible. Machine learning models trained on such data are thus being built on fundamentally unstable ground.
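What would a record look like if it did preserve that metadata? A minimal sketch, with entirely hypothetical field names, of an assay record that keeps the raw per-well measurements, the same-plate controls, and the provenance fields that explain drift – rather than only the summarised value:

```python
from dataclasses import dataclass

@dataclass
class AssayRecord:
    """Hypothetical schema: keep raw replicates, controls and provenance,
    not just the summary IC50 most databases retain."""
    compound_id: str
    ic50_nm: float              # the summarised value
    raw_responses: list         # underlying per-well measurements
    positive_control: list      # controls measured in the same experiment
    negative_control: list
    operator: str               # provenance fields that explain drift
    instrument_id: str
    software_version: str

rec = AssayRecord(
    compound_id="CMPD-001",
    ic50_nm=5.4,
    raw_responses=[98, 95, 85, 65, 40, 20, 8, 3],
    positive_control=[99, 98],
    negative_control=[2, 1],
    operator="analyst_a",
    instrument_id="reader_07",
    software_version="jmp-16.1",
)
```

With records like this, the IC50 can be re-estimated from raw values under a consistent statistical procedure; with only the summary column, that door is closed.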

Statistical systems require discipline

The solution, according to Ma, lies in what he calls “statistical discipline in statistical systems.” It is not enough that individual scientists understand statistics or can write R scripts; the problem is systemic. A scientist may understand what the maths should do in its own sandbox, yet fail to grasp that they are building a system: every parameter choice, every software version, every operator change has a causal impact on the output.

“If you don’t log those hyperparameters, all the knobs that you can tune – if you don’t document that, you’re left with holes and gaps in what exactly happened,” Ma emphasises.
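Logging those knobs need not be elaborate. A minimal sketch (hypothetical function and parameter names) of appending each analysis run – its tunable settings plus basic environment details – to a line-delimited JSON log:

```python
import json
import platform
import datetime

def log_run(params: dict, path: str = "run_log.jsonl") -> dict:
    """Append one analysis run's 'knobs' and environment to a JSONL log."""
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "params": params,  # every tunable setting used in this run
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

# Hypothetical knobs for an IC50 pipeline run
entry = log_run({
    "curve_model": "4PL",
    "outlier_rule": "3-sigma",
    "normalisation": "plate-median",
})
```

Dedicated experiment trackers do this more thoroughly, but even this level of discipline closes the “holes and gaps” Ma describes.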

In Ma’s estimation, this situation has only begun to improve in roughly the past five years. The biotech and pharma industries have been late to adopt the traceability and workflow orchestration that tech companies take for granted. When leadership isn’t staffed early with quantitative people who understand the full lifecycle of data – from the bench to digital systems – traceability is not baked into early systems by design.

Consequently, even as data generation becomes cheaper through automation and high-throughput techniques, these fundamental issues don’t disappear. Whether running pooled screens with next-generation sequencing or arrayed assays in 96-well plates, computational models are still needed in the loop. Experimental metadata must still be tracked. Crucially, if functional assays are involved rather than selection assays, computation becomes unavoidable.

“You still need the computational stuff,” Ma states bluntly. The economics may be solved by cheaper data generation, but the systems problem remains.

Stay tuned for Part 2, where Eric Ma discusses how automation, large language models and disciplined systems design can help science move closer to running at the speed of thought.

Meet the expert 

Eric Ma – Senior Principal Data Scientist, Moderna

As Senior Principal Data Scientist at Moderna, Eric leads the Data Science and Artificial Intelligence (Research) team to accelerate science to the speed of thought. Before joining Moderna, he worked at the Novartis Institutes for Biomedical Research, conducting biomedical data science research with a focus on applying Bayesian statistical methods to support the discovery of new medicines for patients. Earlier in his career, he was an Insight Health Data Fellow in the summer of 2017 and completed his doctoral thesis in the Department of Biological Engineering at the Massachusetts Institute of Technology (MIT) in the spring of the same year.

Eric is also an open-source software developer and has led the development of pyjanitor, a clean API for data cleaning in Python, and nxviz, a visualisation package for NetworkX. He is a core developer for both NetworkX and PyMC. In addition, he contributes to the wider data science community through coding, blogging, teaching and writing.