A UK-led consortium has released the first major dataset and AI model from the OpenBind project, giving researchers worldwide open-access tools to predict protein-drug interactions. 

The OpenBind project, led by Diamond Light Source, has released its first public dataset alongside a predictive AI system known as OpenBind v1. Both resources are now freely available to researchers worldwide.

Tackling a critical data shortage

Artificial intelligence has already immeasurably improved some areas of biology, particularly protein structure prediction. However, its impact on drug discovery has been more limited, largely due to a lack of high-quality experimental data showing how potential drugs bind to disease-related proteins at the atomic level.

OpenBind aims to address this gap by producing large-scale, standardised datasets specifically designed for AI applications. The initiative brings together structural biologists and AI specialists and was supported during its early stages by the UK government’s Department for Science, Innovation and Technology.

“AlphaFold2 revolutionised protein structure prediction by leveraging decades of experimental data on protein structures in the PDB,” said Professor Mohammed Alquraishi of Columbia University. “The equivalent of such a dataset for protein-drug complexes does not yet exist, but OpenBind aims to create it, and in the process create the next generation of computational tools for modeling interactions between drugs and proteins.”

Rapid progress in data generation

The first release demonstrates how quickly the OpenBind pipeline can now produce high-quality data. In just seven months, the team generated 800 detailed measurements, a process that previously could take years.

This progress has been made possible through an integrated system combining automated chemistry, precise binding measurements and high-throughput crystallography at Diamond’s XChem fragment screening facility. Data processing and AI model training were supported by the UK’s Isambard-AI computing cluster.

The result is a streamlined pipeline capable of producing consistent and reproducible datasets.

Researcher Jasmin Aschenbrenner loading samples on the crystallography beamline at Diamond Light Source. Credit: Stuart March-DNDi

Building a foundation for AI-driven discovery

Researchers say the initial dataset has already provided valuable insights into how to optimise data collection and processing. Standardised workflows, strong metadata practices and automation have all proved essential in ensuring quality and consistency.

“High-quality experimental data is essential for developing new and improved AI models and this first data release shows that OpenBind now has this foundation in place,” said Dr Fergus Imrie of the University of Oxford. “We’re enabling AI to improve model performance and guide future experiments, helping to accelerate discovery. The lessons from these early cycles are already helping us improve the speed, consistency and reproducibility of the pipeline, which will be critical as OpenBind grows.”

Collaboration and future ambitions

Reaching this milestone took a large collaborative effort across multiple institutions and disciplines. Scientists involved say the achievement highlights the importance of coordinated expertise in tackling complex scientific challenges.

“We couldn’t have made such rapid progress without the contributions of our consortium members and operational team,” said Professor Frank von Delft, Principal Beamline Scientist at Diamond Light Source. “Their expertise and commitment have enabled us to reach this ambitious milestone. We will now implement the lessons from this foundation phase to ramp up a long-term operation that links high-volume production of AI data with active discovery projects.”

Expanding the scope

Looking ahead, OpenBind plans to scale up its efforts by including more biological targets, larger sets of chemical compounds and deeper datasets. It also intends to launch community-wide blind challenges to test how well AI models perform on newly generated data.

Ultimately, the initiative aims to establish a global open data platform capable of supporting faster, more accurate and more equitable drug development, potentially changing the way new treatments are discovered.