Machine learning’s growing importance in researching cells

Life sciences is fundamentally governed by large, complicated, and chaotic datasets with difficult-to-model interactions. Those in life sciences have relied on statistical modelling, predictive algorithms, and empirically derived data for decades to build on the insight of earlier generations of scientists and to refine techniques. This differs somewhat from physics, which classically derives its predictions from theory and maps them onto probabilities; life sciences has for many years leaned on imperfect approximations and existing large datasets to generate testable predictions.

This is especially true in domains such as protein structure, predictive binding kinetics, and even in larger systemic investigations like cell migration models or disease progression. It can be argued that much of the life sciences automation we know and use today grew out of the necessity for large datasets to capture the inherent variability of even model organisms.

Machine learning’s role in life sciences research

As we move towards more generalised AI models, neural networks and natural language interfaces, we’re starting to see machine learning take on higher-order reasoning and data analysis “sense making.” Traditional scientific inquiry has typically been about asking specific questions of a specific model system under specific conditions. We’re starting to open the door to more generalised questions that yield testable, meaningful conclusions without asking specific questions of our data.

One obvious example of this is image analysis. Machine learning can reduce an image to data patterns and descriptive mathematical paths, and even elucidate features that might not be perceptible to even the most well-trained scientist, because you may not know what you’re looking for. The human capacity for analysis in something like a confocal image stack can only be so robust for a given amount of time invested. We as humans look “for” things based on contextual knowledge of the experiment and report back what we see. Inherently there’s some bias there, no matter how talented the microscopist.

Algorithms, however, can be trained to simply look “at” images as agnostic data and report back in a less biased fashion. Another good example is almost any process optimisation or screening domain, whether it’s clone screening, media formulation, drug screens, etc. These are laborious solution spaces to search, and they often involve best-guess statistical models and factor analyses to determine the most cost-effective screen that could be run to obtain a set of conditions sufficiently optimised to move forward with. Machine learning has enabled us to create feedback loops where these processes come close to training themselves to find the “best” solution in fewer iterations. Of course, properly defining and measuring “best” in the context of a machine learning algorithm is always the trick.
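
As a rough illustration of that kind of feedback loop, here is a minimal sketch of a closed-loop screen in Python, using a Gaussian-process surrogate model to suggest the next media formulation to test. The library choice (scikit-learn), the two factors, and the measure_titre stand-in for the wet-lab measurement are all assumptions made for the example, not a description of any particular product or workflow.

```python
# A minimal closed-loop screening sketch: a Gaussian-process surrogate suggests
# the next condition to test, the "lab" measures it, and the model is refit.
# The toolchain (scikit-learn) and measure_titre() stand-in are assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

# Candidate media formulations: glucose (g/L) and glutamine (mM) on a coarse grid.
glucose = np.linspace(1.0, 10.0, 10)
glutamine = np.linspace(0.5, 8.0, 10)
candidates = np.array([[g, q] for g in glucose for q in glutamine])

def measure_titre(condition):
    """Hypothetical stand-in for a wet-lab measurement (a plate run in practice)."""
    g, q = condition
    return -(g - 6.0) ** 2 - 0.5 * (q - 4.0) ** 2 + rng.normal(scale=0.5)

# Seed the loop with a handful of random conditions.
tested_idx = list(rng.choice(len(candidates), size=5, replace=False))
X = candidates[tested_idx]
y = np.array([measure_titre(c) for c in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(10):
    gp.fit(X, y)
    mean, std = gp.predict(candidates, return_std=True)
    # Upper-confidence-bound acquisition: favour high predicted titre and high uncertainty.
    ucb = mean + 1.5 * std
    ucb[tested_idx] = -np.inf          # don't repeat conditions already run
    nxt = int(np.argmax(ucb))
    tested_idx.append(nxt)
    X = np.vstack([X, candidates[nxt]])
    y = np.append(y, measure_titre(candidates[nxt]))

best = candidates[int(np.argmax(y))]
print(f"Best condition after {len(y)} runs: glucose={best[0]:.1f} g/L, glutamine={best[1]:.1f} mM")
```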

One of the more niche applications of generative AI such as ChatGPT and other large language models is in “plain language” troubleshooting and early experimental design. Often you learn so much more from the mistakes and challenges of others when perfecting a specific technique or chasing down an investigative possibility. Large language models are exceptional at collecting vast amounts of disparate information from esoteric websites, forums, book chapters, review articles, and even open access journals, and then cramming the sum total of that information into a plain language summary that does a fair job of approximating human knowledge on the subject.

For example, try asking ChatGPT: “What are the most common challenges and failures when performing [insert experimental technique here]?” The accuracy of the answer may surprise you, because ChatGPT doesn’t particularly care about presenting the technique as foolproof, as a manufacturer’s literature might be incentivised to do, and it doesn’t need to be frustratingly concise, as a manuscript might. Large language models are even starting to replace the classic “seminal paper library” as a means of digesting, amalgamating, and communicating the general breadth of knowledge on a subject to bring new investigators up to speed quickly.
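
For readers who would rather script that kind of query than type it into a chat window, a minimal sketch using the OpenAI Python SDK might look like the following. The model name and the example technique are illustrative placeholders, and an OPENAI_API_KEY environment variable is assumed.

```python
# Minimal sketch of a "plain language" troubleshooting query against an LLM API.
# Assumes the openai Python SDK (v1+) and an OPENAI_API_KEY in the environment;
# the model name and the example technique are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

technique = "Western blotting of low-abundance membrane proteins"  # placeholder

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat-capable model; this choice is an assumption
    messages=[
        {"role": "system",
         "content": "You are an experienced bench scientist helping troubleshoot protocols."},
        {"role": "user",
         "content": f"What are the most common challenges and failures when performing {technique}? "
                    "List likely causes and practical fixes."},
    ],
)

print(response.choices[0].message.content)
```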

I believe democratisation of any technology is generally a good thing, provided we recognise and implement the proper guardrails. We’ve already seen some cases of bad actors using AI-generated images of particularly well-endowed rats in manuscripts that are entirely nonsensical and inaccurate. Perhaps that says more about peer review than it does about AI, but it’s a genuine concern that AI will facilitate the proliferation of “bad science” and create a basal level of noise that makes it difficult to tease out the truth.

Doomsaying aside, the jury is still out on just how transformative generative AI will be for life sciences research: whether it will be just another enabling technology that clears bandwidth for more meaningful pursuits, or whether it will fundamentally alter how we approach problem solving by forcing us to adopt complementary modes of research that are more “AI-friendly.” It’s too early to tell, but I’m optimistic.

The increasing use of machine learning in researching cells and diseases

Investigating the origins of a particular pathology is always an arms race with complexity, and with reducing that complexity to questions you can actually answer in a lifetime. Machine learning working in concert with automation has a huge role to play here. The more you make complexity mundane, the closer you get to meaningful answers.

We can see this philosophy in practice in the recent explosion of 3D culture methods, organoids, and on-chip devices that mimic the biological context of disease with much higher fidelity than conventional 2D culture. Liquid handling automation shines in this domain because culture workflows are long and laborious, and to stay relevant they often must be planned around the state of the culture rather than convenient work-day cycles. Robots simply don’t care if it’s 2 am on a Sunday morning when they’re passaging cells.

More generally speaking, liquid handling automation is quickly earning the trust of scientists to do the “dirty work” of even highly complex workflows and free up human capital to focus on abstracted problems. This dovetails nicely with machine learning, because larger and larger datasets can be generated under more well-characterised, if not controlled, conditions to train feedback algorithms. The results of these multivariate datasets then inform which organoid models are yielding actionable information, and under what conditions. Humans are freed to focus on the higher-order “why” questions as opposed to factor-level concerns of “which”, “when” and “how much.”

Trends on the horizon

We’re still very much in the “hype” phase of ChatGPT-like services, and in the near term, with existing open-source tools, I’d expect to see domain-specific specialisation of these large language models for use in life sciences. I think that will be the first application, even if not the most exciting one. Beyond that, I’m looking for two things: first, a trend shift towards multi-omics and otherwise massively parallel experiments, and second, a closer marriage of human intuition and in-silico predictive power to inform upstream experimental design and streamline analysis.

To the first point, imagine a situation where every experiment included phenotypic, genomic, transcriptomic, and proteomic data. I think we’re trending toward a world where that approach actually makes sense, and the classic research question shifts from a focused “Does X impact Y?” to a more generalised, system-wide “What’s happening here?”, with machine learning enabling that deep analysis by pointing human researchers at what’s relevant in the stream of information. When asking those open-ended questions we really want to have as much data along as many axes as possible, to give us the best chance of hitting on something crucial to prophylactic care, diagnosis, or therapeutic design.

To the second point, regarding the marriage of human intuition with in-silico prediction, we all have limited time, resources, money, and expertise bandwidth. I expect the next year of AI and machine learning innovations in life sciences to help direct those resources down avenues that may not have been immediately obvious. Giving scientists the ability to ask “what if…” type questions using predictive software, and to know the confidence of those predictions, has the potential to drastically accelerate the search for drug targets, proteins of interest, biomarkers, etc. Already we see these technologies deployed to make more sophisticated high-throughput screens, but I imagine we’ll begin to see similar percolation down towards basic research to inform experimental designs before a scientist even steps up to the wet bench.
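
One lightweight way to attach a confidence estimate to a “what if…” prediction is to look at the spread across an ensemble of models. The sketch below does this with a random forest in scikit-learn on made-up screening data; the factors, the response, and the toolchain are all assumptions for illustration only.

```python
# Sketch: a "what if" prediction with a rough confidence band, taken from the
# spread of individual trees in a random-forest ensemble. Data, features, and
# the scikit-learn toolchain are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Made-up historical screen: two assay factors -> measured response.
X_hist = rng.uniform(0.0, 1.0, size=(200, 2))
y_hist = 3.0 * X_hist[:, 0] - 1.5 * X_hist[:, 1] ** 2 + rng.normal(scale=0.2, size=200)

model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_hist, y_hist)

# "What if we pushed factor 1 high and factor 2 low?"
candidate = np.array([[0.9, 0.1]])
per_tree = np.array([tree.predict(candidate)[0] for tree in model.estimators_])

print(f"Predicted response: {per_tree.mean():.2f} "
      f"(approx. band: {np.percentile(per_tree, 2.5):.2f} to {np.percentile(per_tree, 97.5):.2f})")
```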

Improving analysis of complex datasets, including multi-omics

Multi-omics is all about relational analysis in large datasets, something humans alone are terrible at. We don’t know what we don’t know, so we miss a lot. Machine learning doesn’t have this problem; it can agnostically search for and characterise patterns or relations across any investigational axes, and even contrive combinatorial factors, as in Principal Component Analysis. Many of these techniques are entirely “unsupervised” in the sense that they are broadly applicable with minimal guidance from a human operator. While these kinds of “big data” analyses have historically been the realm of computational biologists, many research groups simply don’t have access to that skill set, or, if they do, that person or group doesn’t have bandwidth for high-risk exploratory work. AI tools, along with machine learning, are beginning to democratise access to these kinds of analyses.
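
As a concrete illustration of the unsupervised techniques mentioned above, the sketch below runs Principal Component Analysis on a simulated samples-by-features matrix using scikit-learn. The simulated data and the hidden two-group structure are assumptions made purely so the example runs end to end.

```python
# Unsupervised sketch: PCA on a simulated samples-by-features "omics" matrix.
# The data are simulated (two hidden groups of samples) purely so the example runs;
# scikit-learn is an assumed toolchain, not one named in the article.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# 40 samples x 500 features, with a subtle shift in 50 features for half the samples.
X = rng.normal(size=(40, 500))
X[20:, :50] += 1.5  # hidden group structure no one told the algorithm about

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=5)
scores = pca.fit_transform(X_scaled)

print("Variance explained by each component:",
      np.round(pca.explained_variance_ratio_, 3))
print("PC1 separates the two hidden groups:",
      np.round(scores[:20, 0].mean(), 2), "vs", np.round(scores[20:, 0].mean(), 2))
```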

Biological context is everything when it comes to understanding disease, and the ugly truth is that experimentation often necessarily strips away elements of context in the pursuit of factor control. Omics approaches represent a general departure from conventional control schemes by allowing more biological context and variability to remain in place, because omics experiments are themselves large-scale characterisations of the factors one would otherwise have needed to control. Of course, that’s a somewhat reductive explanation, but it’s close enough to the general case to be useful.

Now, multi-omics takes this a step further and has the potential to characterise systems either vertically, as a directly correlative stack in the case of genomics > transcriptomics > proteomics, or as complementary technologies, as in the case of genomics and metagenomics informing the speciation, diversity, and taxonomy of gut microbiota for microbiome study and profiling.

Ultimately, multi-omics represents an opportunity to generate high-fidelity, richly informative datasets that are often internally orthogonal. These datasets can then be used to make surgically precise predictions regarding promising pathways, targets, or therapeutic modalities. So why doesn’t everyone do it? A cursory power analysis often reveals that staggeringly large numbers of biological replicates, wells, or conditions are needed to make statistically relevant inferences, and while liquid handling automation can certainly address many of those challenges, you are still left with massive amounts of data that are difficult to resolve directly. However, as machine learning techniques mature and grow alongside our ability to generate large datasets, we’re becoming more adept at untangling the riddles and arriving at answers to questions we didn’t even think to ask, and that’s the real promise of integrated multi-omics.
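
To make the power-analysis point concrete, here is a minimal sketch using statsmodels; the effect sizes, significance level, and target power are illustrative assumptions rather than values from any particular study.

```python
# Sketch of the "cursory power analysis" mentioned above, using statsmodels.
# Effect sizes, alpha, and target power are illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

for effect_size in (0.8, 0.5, 0.2):  # large, medium, small (Cohen's d)
    n = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.8)
    print(f"Cohen's d = {effect_size}: ~{n:.0f} replicates per group")
```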

About the author

Ian Shoemaker

Senior Application Scientist

Beckman Coulter Life Sciences

Ian Shoemaker has nearly 15 years of translational lab automation and instrumentation experience in personalised medicine and clinical molecular diagnostics. At Beckman Coulter Life Sciences, he supports the applications development team in NGS, cell-based assays, and proteomics workflows.