Scientific workflow for hypothesis testing in drug discovery: Part 2 of 3

Date: 2025-01-14

In part two of the step-by-step scientific workflow for drug discovery series, Dr Raminderpal Singh and Nina Truter describe the functions of the workflow previously outlined and include key considerations.

Drug discovery scientists spend their days developing and testing complex hypotheses, leveraging data and expertise through workflows that utilise available tools. Operating a workflow, such as the one described in Figure 1, involves several key considerations that affect both the accuracy of results and the efficiency of the research process.

Defining the research question

A well-defined research question is the cornerstone of an effective scientific workflow in drug discovery. The more specific your question, the easier it becomes to identify relevant data and design subsequent steps in your workflow. This initial phase often involves an iterative process: refining your question, conducting a literature review, and assessing available data to ensure the right level of specificity and relevance of your research question. AI tools like ChatGPT can help refine your question and provide an overview of the research landscape before commencing a full literature review.

Hypothesis generation

The hypothesis generation process is equally important. Before undertaking data analysis, a hypothesis must be developed based on literature reviews and public datasets. While the scientific question guides the entire investigation, research without a clear hypothesis can become unfocused and drift into open-ended exploration. A well-defined hypothesis allows researchers to assess datasets critically and keeps their analysis grounded in the biological context. To structure the hypothesis, it helps to sketch a rough map of the variables that influence the outcome of the scientific question, drawing on the literature review and logical reasoning. This map can then be used as a ‘checklist’ when assessing whether a dataset contains the variables needed to answer the research question.
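As a minimal sketch of the ‘checklist’ idea, the variable map can be written down as a set of names and compared against the column headers of a candidate dataset. The variable names, the column headers and the function check_dataset_columns below are all hypothetical and chosen only for illustration; they are not taken from any real dataset.

```python
# Hypothetical variable map for a question about longevity drugs in rodents.
# These names are illustrative assumptions, not taken from a real dataset.
REQUIRED_VARIABLES = {"compound", "dose_mg_per_kg", "species", "lifespan_days"}


def check_dataset_columns(columns, required=REQUIRED_VARIABLES):
    """Return (present, missing) sets of required variables for one dataset."""
    # Normalise headers so 'Compound ' and 'compound' count as the same variable.
    normalised = {column.strip().lower() for column in columns}
    present = required & normalised
    missing = required - normalised
    return present, missing


# Column headers as they might appear in a downloaded spreadsheet (invented).
present, missing = check_dataset_columns(["Compound", "Species", "Sex", "Lifespan_days"])
print("Missing variables:", sorted(missing))
```

A dataset with missing variables is not necessarily unusable, but the gap shows what would have to come from a second dataset or be dropped from the hypothesis.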

Data identification

When searching for public data, tools like Perplexity.ai can help identify relevant databases. Posing a question such as, ‘Which database should I use to search for data on the effects of longevity drugs in rodents?’ tends to return more specific, source-backed answers, whereas ChatGPT and Claude.ai provide more general information. Google Dataset Search and PubMed’s ‘Associated Data’ feature can identify datasets linked to publications. Once you have identified a potentially useful dataset, Claude.ai can summarise its experimental methods to help you decide whether it fits your research question. Cataloguing candidate datasets in a spreadsheet, with a broad description of the contents of each, helps streamline the selection process. In some cases, combining multiple datasets may be necessary to address your research question comprehensively.
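The catalogue of candidate datasets can be kept as a simple CSV file that opens in Excel. The sketch below shows one possible layout, assuming a set of column names (name, source, organism, data_type, description, fits_question) that are our own suggestion rather than a standard. The single entry is a placeholder, not a real dataset.

```python
import csv
import io

# Suggested catalogue columns; these are assumptions and can be adapted.
CATALOGUE_FIELDS = ["name", "source", "organism", "data_type", "description", "fits_question"]


def build_catalogue_csv(entries):
    """Return CSV text with one row per candidate dataset."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CATALOGUE_FIELDS)
    writer.writeheader()
    for entry in entries:
        # Leave any field the entry does not provide blank.
        writer.writerow({field: entry.get(field, "") for field in CATALOGUE_FIELDS})
    return buffer.getvalue()


# Placeholder entry for illustration only.
entries = [
    {
        "name": "example_dataset_A",
        "source": "placeholder repository",
        "organism": "mouse",
        "data_type": "survival",
        "description": "Placeholder entry; replace with a real dataset.",
        "fits_question": "to review",
    },
]
catalogue_text = build_catalogue_csv(entries)
print(catalogue_text)
```

Writing catalogue_text to a file with a .csv extension gives a spreadsheet that can be extended by hand as new datasets are found.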

Understanding data

Before initiating analysis, ample time should be spent reviewing the raw data. Browsing through datasets, often in Excel format, can clarify how the data were generated, helping you choose appropriate analysis methods and the requisite sanity checks. For less familiar data types, ChatGPT can help explain the experimental method and suggest potential validation steps. Alternatively, search for review papers or papers that use a similar method to understand how it was applied in that context.

Visualisation is another powerful tool for understanding data, and experimenting with different methods can provide varied perspectives. ChatGPT can also help you decide which visualisation options are available and what information each will provide, based on the data and your research question. Additionally, running analyses on both the raw (‘uncleaned’) and ‘cleaned’ versions of the dataset helps assess the impact of outliers and can guide decisions on whether to include or exclude them.
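The raw-versus-cleaned comparison can be sketched in a few lines. The example below assumes one common cleaning rule, dropping values more than 1.5 interquartile ranges outside the quartiles; this rule is our assumption and not a method prescribed by the workflow, and the measurements are invented for illustration.

```python
import statistics


def iqr_filter(values, k=1.5):
    """Keep values within k interquartile ranges of the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [value for value in values if low <= value <= high]


def summarise(values):
    """Return simple summary statistics for side-by-side comparison."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


# Invented measurements with one obvious outlier (12.5).
raw = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 12.5]
cleaned = iqr_filter(raw)

print("raw:    ", summarise(raw))
print("cleaned:", summarise(cleaned))
```

If the mean shifts substantially between the two versions while the median barely moves, the result is being driven by a few extreme values, and the decision to keep or exclude them should be justified from how the data were generated rather than from the statistics alone.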

Analysing and interpreting results

When it comes to data analysis, Claude.ai offers tools that can support specific steps of the analysis process. Although ChatGPT is helpful as an initial step towards understanding results, it should be used as a tool for generating literature review ideas and hypotheses, not as a fact-based system. The scientific question should remain the anchor of the interpretive process, alongside your understanding of the raw data and the output of your analyses. Here, it is helpful to toggle between two mindsets: that of a creative scientist, useful for opening avenues of exploration, and that of a critic, for assessing the merit of those avenues. In the next article, we will discuss the tools scientists can use to execute the workflow described in Figure 1.

Exploratory investigations and missed opportunities

Often, datasets are generated for a specific research question, but they may contain additional information that could be useful for answering new or unrelated questions. This is particularly true for large public datasets, where the breadth of data available can sometimes be overwhelming. Researchers may miss opportunities to generate new insights simply because they are focused on their initial question and do not have the resources to explore other possibilities.

Additionally, exploratory analyses can be valuable for identifying new biological markers or hypotheses. For instance, a dataset generated to study protein expression in one context might also reveal valuable information about other biological pathways or processes. However, exploratory investigations can be resource-intensive, both in terms of time and computational power. Researchers should therefore balance their focused analysis with the potential for broader discoveries.

For more insights, refer to Part 1 of this series, where we explore foundational concepts for early drug discovery workflows.