
The data fragmentation problem holding drug discovery back

Posted: 5 January 2026

The DMTA cycle depends on clear data flow, yet most labs still work across disconnected systems. Sean McGee, Director of Product at Certara, explains how better infrastructure and AI can help teams work faster and make decisions with more confidence.

Digital transformation has accelerated many areas of early drug discovery, but one longstanding challenge remains: data across the Design–Make–Test–Analyse (DMTA) cycle is still highly fragmented. The DMTA cycle underpins modern discovery workflows, moving from hypothesis generation (design) to compound or reagent production (make) to experimental evaluation (test) and then to interpreting results to decide the next steps (analyse). Each stage generates critical data, yet it is often stored in separate systems, making it difficult for scientists to form a clear, continuous view of a programme. As a result, teams spend significant time finding, aligning and contextualising information before they can move to the next experiment or decision.

For Sean McGee, Director of Product at Certara, this gap represents both a practical challenge and an opportunity to rethink how data flows across discovery teams. His work focuses on building technology that brings together design environments, lab systems, analysis tools and AI models into a unified workflow.

“At the heart of all biopharmaceutical R&D lies how data is generated, processed and used to make decisions,” McGee says. “My focus has been on finding different ways to incorporate artificial intelligence (AI) – in all its various forms – into the different applications and services offerings provided by Certara.”

 

His goal is to ensure that AI and automation support the entire discovery loop, rather than sitting as isolated features. “My primary goal is to look at mapping our capabilities onto workflows that cross between teams, departments and functions, such as how to drive data from first-in-human studies through to final regulatory submissions.”

Before joining Certara, McGee spent a decade working closely with discovery teams at Benchling and BIOVIA, helping researchers accelerate molecules from early discovery into development. These experiences gave him a detailed view of the operational gaps that slow down translational research, particularly when teams must transfer data between systems, groups and external partners.

Closing data gaps in the DMTA cycle

The DMTA cycle depends on a continuous flow of information. A hypothesis guides what to design, synthesis produces the compounds or reagents, assays generate the experimental data and analysis determines what should happen next. In many organisations, however, each stage is supported by its own system.

“Previously, all these pieces of data lived in different systems,” McGee explains. “ELNs, LIMS, inventory management tools, CRO file dumps and many more environments.”

Certara’s approach, particularly following its integration of Chemaxon technologies, focuses on creating a traceable, end-to-end data structure. “Certara helps affiliate all these different pieces of information with given biopharmaceutical candidates and the experiments and analyses that produce it.”

He notes that this structure is already changing how teams connect their work across the DMTA cycle. “A given set of experiments can be affiliated with a DMTA cycle in Design Hub, which then tracks all the compounds that are produced for those experiments. It then tracks the stages of testing that a given candidate is going through as a part of this project and integrates with the rich analysis capabilities offered by D360.”

This creates a unifying interface that allows chemists, biologists and data analysts to work from the same information, accelerating decision-making. “This creates a unified flow of data across the DMTA cycle and sparks its conversion into knowledge as it presents it to the different researchers affiliated with that project.”
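To make the idea of a traceable structure concrete, the sketch below links compounds and assay results to a single DMTA cycle so that every result can be traced back to the cycle and the system it came from. It is an illustration only: the class names, fields and identifiers are hypothetical and do not represent the data model or API of Design Hub, D360 or any other Certara product.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AssayResult:
    compound_id: str
    assay: str
    value: float
    source: str  # hypothetical label for the originating system, e.g. "ELN" or "LIMS"


@dataclass
class DMTACycle:
    cycle_id: str
    hypothesis: str
    compound_ids: list[str] = field(default_factory=list)
    results: list[AssayResult] = field(default_factory=list)

    def add_result(self, result: AssayResult) -> None:
        # Reject results for compounds that were never registered to this cycle,
        # so every stored result stays traceable to a designed compound.
        if result.compound_id not in self.compound_ids:
            raise ValueError(f"{result.compound_id} is not part of {self.cycle_id}")
        self.results.append(result)

    def results_for(self, compound_id: str) -> list[AssayResult]:
        # Gather everything known about one compound, whatever system produced it.
        return [r for r in self.results if r.compound_id == compound_id]


# Illustrative usage with made-up identifiers and values.
cycle = DMTACycle("cycle-1", "Improve solubility", compound_ids=["CMP-1", "CMP-2"])
cycle.add_result(AssayResult("CMP-1", "solubility", 42.0, "LIMS"))
cycle.add_result(AssayResult("CMP-1", "potency", 7.2, "CRO file"))
```

The design choice worth noting is that the cycle, not the source system, is the unit of organisation: results from different systems sit side by side under the compound they describe.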

How AI is shaping compound design and prioritisation

As discovery teams handle more modality types and more complex datasets, AI has become central to managing, interpreting and generating hypotheses from those data. For McGee, AI’s role is not just prediction, but amplification: extending the number of ideas teams can explore, even under resource constraints.

“AI and machine learning are catalysts for deeper exploration,” he says. “The DMTA cycle, and by extension biopharmaceutical discovery, are multiparameter optimisation questions.”

Most programme decisions, he notes, happen under constraint: limited budget, limited time or limited experimental bandwidth. “Many groups have been exploring property prediction to help prioritise more promising candidates or generate new ones for testing in the lab based on their project’s history.”
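One common way to frame prioritisation under constraint is to combine several predicted properties into a single weighted score and then pick only as many candidates as the lab can test. The sketch below assumes a simple linear desirability scale; the property names, ranges, weights and compound values are all invented for illustration and are not drawn from McGee's comments or any Certara tool.

```python
def desirability(value, worst, best):
    """Map a property value linearly onto [0, 1]: 0 at `worst`, 1 at `best`.

    `worst` may be larger than `best` for properties where lower is better.
    """
    if worst == best:
        raise ValueError("worst and best must differ")
    score = (value - worst) / (best - worst)
    return max(0.0, min(1.0, score))


def score_candidate(props, criteria):
    """Weighted average of per-property desirabilities.

    `criteria` maps property name -> (worst, best, weight).
    """
    total_weight = sum(weight for _, _, weight in criteria.values())
    weighted = sum(
        weight * desirability(props[name], worst, best)
        for name, (worst, best, weight) in criteria.items()
    )
    return weighted / total_weight


def prioritise(candidates, criteria, budget):
    """Return the ids of the `budget` highest-scoring candidates."""
    ranked = sorted(
        candidates,
        key=lambda c: score_candidate(c["props"], criteria),
        reverse=True,
    )
    return [c["id"] for c in ranked[:budget]]


# Hypothetical criteria: potency weighted twice as heavily as lipophilicity.
criteria = {
    "predicted_pIC50": (5.0, 9.0, 2.0),  # higher is better
    "predicted_logP": (5.0, 1.0, 1.0),   # lower is better
}
candidates = [
    {"id": "A", "props": {"predicted_pIC50": 8.0, "predicted_logP": 4.0}},
    {"id": "B", "props": {"predicted_pIC50": 7.0, "predicted_logP": 1.0}},
    {"id": "C", "props": {"predicted_pIC50": 6.0, "predicted_logP": 3.0}},
]
shortlist = prioritise(candidates, criteria, budget=2)  # → ['B', 'A']
```

Here the `budget` argument stands in for limited experimental bandwidth: candidate B edges ahead of the more potent A because its lipophilicity sits at the ideal end of the assumed range.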

Large language models (LLMs) also have a growing role. “GPTs aid researchers in exploring larger swaths of unstructured content, like biopharmaceutical literature and internal files, to help identify new routes for exploration.”

Beyond potency and ADMET predictions, McGee sees machine learning beginning to inform wider aspects of candidate assessment. “Other groups are looking at how machine learning can be used to extrapolate physicochemical and assay data into secondary intelligence, extending predictions beyond ADMET properties into information around a given compound’s PK/PD, bioavailability, or even performance in a simulated clinical trial.”

The boundary on what AI can achieve, he argues, is not conceptual. “The only limitations are the data we have and our imagination.”

Interoperability challenges in modern discovery

Many organisations now investigate a broader range of therapeutic modalities: peptides, ADCs, glycans, engineered biologics and hybrid constructs. These blur the traditional lines between chemical and biological entities, introducing new informatics challenges.

“One of the biggest challenges at the moment is the evolving data landscape around the types of modalities that organisations are exploring,” McGee explains. “The lines between what constitutes a biologic versus a small molecule drug are blurring.”

As researchers expand their toolkits, data management becomes more complex. “How do you register a peptide; does it go in our chemical or biological registration system? How do you affiliate an ADC antibody with its payload and linker, especially since many teams store each piece across different systems?”
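As a rough illustration of the ADC question, one option is a composite record that does not copy the antibody, linker or payload, but instead points to wherever each component is already registered. The sketch below assumes that approach; the registry names and identifiers are hypothetical and the structure is not taken from any particular registration system.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentRef:
    registry: str   # hypothetical system name, e.g. "biologics" or "chemistry"
    entity_id: str  # identifier of the component within that system


@dataclass(frozen=True)
class ADCRecord:
    adc_id: str
    antibody: ComponentRef
    linker: ComponentRef
    payload: ComponentRef

    def components_by_registry(self) -> dict[str, list[tuple[str, str]]]:
        # Group the three parts by the system that holds them, so a lookup
        # service knows which registries to query for the full picture.
        grouped: dict[str, list[tuple[str, str]]] = {}
        for role in ("antibody", "linker", "payload"):
            ref = getattr(self, role)
            grouped.setdefault(ref.registry, []).append((role, ref.entity_id))
        return grouped


# Illustrative usage with made-up identifiers.
adc = ADCRecord(
    adc_id="ADC-001",
    antibody=ComponentRef("biologics", "MAB-17"),
    linker=ComponentRef("chemistry", "LNK-4"),
    payload=ComponentRef("chemistry", "PAY-9"),
)
lookup = adc.components_by_registry()
```

Keeping references rather than copies means each component still has a single source of truth, while the composite record supplies the affiliation between them.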

Automation, he argues, is a necessary step towards resolving these incompatibilities. “Further automation in the lab and in the computational space will help remove the barriers between all these various data silos.”

Standardisation plays an important role. “Increasing adoption of notations like HELM facilitates a common representation and language around these ‘crossover’ modalities.”

Ultimately, McGee believes automation forces clarity in the R&D process. “Automation, as a concept, will drive organisations towards making decisions around how to conduct their R&D at scale in a way that aligns with how they do their work.”

Improving decision-making across the DMTA cycle

When data is more connected and automation supports routine tasks, scientists can focus on interpretation and hypothesis generation rather than data retrieval.

“Greater automation in both wet and dry labs facilitates the handoffs of data affiliated with different candidates as they make their way through successive DMTA cycles.”

McGee highlights three common questions that automation helps teams answer more quickly:

  • “What were we hoping to accomplish with this assay?”
  • “What were the top performing compounds in this round of experimentation?”
  • “What sorts of tests can we conduct next?”

By reducing manual information gathering, organisations can “spend less time trying to find information, increase the capacity of the amount of information their labs can produce, and more rapidly identify and prioritise the next round of experimentation.”

The effect is cumulative: “It simplifies the decision-making process, drives more efficient innovation and accelerates the highest-performing compounds to development, the clinic, and eventually to the patients who need them.”

Democratising AI in drug discovery

Looking beyond DMTA, McGee believes the next frontier in AI is not model sophistication but accessibility.

“To me, the greatest opportunity for AI is an increase in democratisation,” he says. Many prediction methods are already well characterised; the challenge is building, updating and deploying them efficiently.

“Their primary limitation comes in the amount of time and effort required to gather enough data to create a viable model.”

As a result, many teams rely on broad, generalised models rather than ones optimised for their specific projects. “Improving model lifecycle management and machine learning operations (MLOps) can help data science teams serve more groups faster.”

For researchers, the benefit is models that fit their projects more closely. “They can generate models that are more precisely tuned to a given project or update them based on new data as it becomes available. For end users, this also results in greater trust and therefore adoption of AI.”

The future: agentic, supportive and transparent AI

The next generation of lab automation tools, McGee predicts, will not simply execute tasks. They will help scientists think through experiments.

“The next generation of lab automation should include greater capacity to aid decision-making by automating or suggesting types of analyses that can be done on a given set of data.”

Rather than relying solely on familiar experiments or known chemical modifications, scientists could receive suggestions from an AI system acting as an additional colleague. “Agentic AI, if used in a transparent and traceable way, could help step in as another ‘colleague’ in the lab.”

This support does not replace human expertise. “The researcher still makes the final decision, but I believe that AI can help positively shift perspectives.”

 

Meet the expert

Sean McGee, MS, Director of Product, Certara

Sean McGee is the Director of Product within Certara’s AI group. He has previously led strategy and go-to-market efforts for software platforms including Benchling’s laboratory informatics system and the AI and molecular modelling portfolio for Dassault Systèmes’ BIOVIA brand. At Certara, he guides the development of AI-focused capabilities that extend across R&D workflows. McGee earned his Master of Science at the University of Notre Dame, where he explored the scientific and commercial applications of medical devices designed to aid in identifying child abuse.

 
