A new open-access repository of carbon-nitrogen bond-forming reactions, described as the largest of its kind, aims to address critical data gaps in AI-assisted drug discovery whilst revealing unexpected chemical insights that could reshape pharmaceutical synthesis.

Researchers at the University of Michigan have created what is being described as the largest body of chemical reaction data ever assembled, providing scientists and artificial intelligence systems with a powerful new resource that could help speed up the development of medicines.

The open-access database contains more than 50,000 carefully designed chemistry experiments and is intended to address one of the biggest challenges facing AI-assisted drug discovery: the lack of large, high-quality datasets on chemical reactions.

Developing new medicines often requires thousands of experiments to identify the most effective recipe for a safe and affordable drug. The process is typically slow, labour-intensive and dependent on catalysts made from hard-to-source precious metals.

While AI is increasingly being used to support drug discovery, its effectiveness depends on the availability of extensive data. Researchers say the new database could help fill that gap.

Building a foundation for AI-driven chemistry

The project was led by Tim Cernak and his team at the University of Michigan College of Pharmacy. The database focuses on reactions that form carbon-nitrogen bonds, which are essential components in many pharmaceutical compounds.

According to the researchers, the dataset is only the beginning of what could become a much larger library of chemical reaction conditions designed to support future AI models.

“Building the platform that could pull this off has taken over a decade but it’s still just scratching the surface,” said Cernak, Associate Professor of Medicinal Chemistry at the College of Pharmacy.

The data has now been made freely available through the Open Reaction Database, a platform dedicated to sharing chemical reaction information with the scientific community.

Reducing reliance on precious metals

The database could help identify more efficient methods for producing medicines and reduce reliance on expensive catalysts based on precious metals.

The study compared the performance of catalysts based on three metals commonly used in chemical synthesis: palladium, nickel and copper.

Palladium is widely used in drug manufacturing but global supplies are concentrated in a small number of countries, creating potential supply chain vulnerabilities.

The research found that some reactions performed just as effectively with nickel catalysts and, in certain cases, with copper catalysts. Both metals are more widely available around the world, potentially offering more sustainable and resilient alternatives for pharmaceutical manufacturing.

“The latest drugs in the pipeline are raising the bar of sophistication for chemical synthesis. At the same time, supply chains for precious metals and other critical reaction components are being exposed as risks,” Cernak said. “Big data drops like this one are going to be needed to build the predictive models that can make better drugs faster.”

Unexpected discoveries emerge

Beyond supporting AI development, the large-scale dataset has already revealed chemical behaviour that would have been difficult to detect through traditional approaches.

“One key takeaway was that large, systematically designed reaction datasets can uncover patterns that are difficult to see from traditional scope studies alone,” Cernak said. “For example, I never would have predicted that the highly reactive intermediate molecules called arynes could form at such low temperatures but it was hard to ignore when we saw it hundreds of times. This is exciting as a possibility to synthesise drugs without precious metal catalysts.”

The team hopes the database will encourage further discoveries and help accelerate the development of new medicines through data-driven chemistry.