Machine learning’s growing importance in researching cells

Life sciences is fundamentally governed by large, complicated, and chaotic datasets with difficult-to-model interactions. Those in life sciences have relied on statistical modelling, predictive algorithms, and empirically derived data for decades to build on the insight of earlier generations of scientists and to refine techniques. This differs somewhat from physics, which classically derives its predictions from theory and maps them onto probabilities; life sciences has for many years leaned on imperfect approximations and existing large datasets to generate testable predictions.

This is especially true in domains such as protein structure, predictive binding kinetics, and even in larger systemic investigations like cell migration models or disease progression. It can be argued that much of the life sciences automation we know and use today grew out of the necessity for large datasets to capture the inherent variability of even model organisms.

Machine learning’s role in life sciences research

As we move towards more generalised AI models, neural networks and natural language interfaces, we’re starting to see machine learning take on higher-order reasoning and data analysis “sense making.” Traditional scientific inquiry has typically been about asking specific questions of a specific model system under specific conditions. We’re starting to open the door to more generalised questions that yield testable, meaningful conclusions without asking specific questions of our data.

One obvious example of this is image analysis. Machine learning can reduce an image to data patterns and descriptive mathematical paths, and even elucidate features that might not be perceptible to even the most well-trained scientist, because you may not know what you’re looking for. The human capacity for analysis in something like a confocal image stack can only be so robust for a given amount of time invested. We as humans look “for” things based on contextual knowledge of the experiment and report back what we see. Inherently there’s some bias there, no matter how talented the microscopist.

Algorithms, however, can be trained to simply look “at” images as agnostic data and report back in a less biased fashion. Another good example is almost any process optimisation or screening domain, whether it’s clone screening, media formulation, drug screens, etc. These are laborious solution spaces to search, and they often involve best-guess statistical models and factor analyses to determine the most cost-effective screen that could be run to obtain a set of conditions sufficiently optimised to move forward with. Machine learning has enabled us to create feedback loops where these processes come close to training themselves to find the “best” solution in fewer iterations. Of course, properly defining and measuring “best” in the context of a machine learning algorithm is always the trick.
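
As a rough illustration of that kind of feedback loop, here is a minimal sketch of a closed-loop screen in Python, using a Gaussian-process surrogate model to suggest the next media formulation to test. The library choice (scikit-learn), the two factors, and the measure_titre stand-in for the wet-lab measurement are all assumptions made for the example, not a description of any particular product or workflow.

```python
# A minimal closed-loop screening sketch: a Gaussian-process surrogate suggests
# the next condition to test, the "lab" measures it, and the model is refit.
# The toolchain (scikit-learn) and measure_titre() stand-in are assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

# Candidate media formulations: glucose (g/L) and glutamine (mM) on a coarse grid.
glucose = np.linspace(1.0, 10.0, 10)
glutamine = np.linspace(0.5, 8.0, 10)
candidates = np.array([[g, q] for g in glucose for q in glutamine])

def measure_titre(condition):
    """Hypothetical stand-in for a wet-lab measurement (a plate run in practice)."""
    g, q = condition
    return -(g - 6.0) ** 2 - 0.5 * (q - 4.0) ** 2 + rng.normal(scale=0.5)

# Seed the loop with a handful of random conditions.
tested_idx = list(rng.choice(len(candidates), size=5, replace=False))
X = candidates[tested_idx]
y = np.array([measure_titre(c) for c in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(10):
    gp.fit(X, y)
    mean, std = gp.predict(candidates, return_std=True)
    # Upper-confidence-bound acquisition: favour high predicted titre and high uncertainty.
    ucb = mean + 1.5 * std
    ucb[tested_idx] = -np.inf          # don't repeat conditions already run
    nxt = int(np.argmax(ucb))
    tested_idx.append(nxt)
    X = np.vstack([X, candidates[nxt]])
    y = np.append(y, measure_titre(candidates[nxt]))

best = candidates[int(np.argmax(y))]
print(f"Best condition after {len(y)} runs: glucose={best[0]:.1f} g/L, glutamine={best[1]:.1f} mM")
```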

One of the more niche applications of generative AI such as ChatGPT and other large language models is in “plain language” troubleshooting and early experimental design. Often you learn so much more from the mistakes and challenges of others when perfecting a specific technique or chasing down an investigative possibility. Large language models are exceptional at collecting vast amounts of disparate information from esoteric websites, forums, book chapters, review articles, and even open access journals, and then cramming the sum total of that information into a plain language summary that does a fair job of approximating human knowledge on the subject.

For example, try asking ChatGPT: “What are the most common challenges and failures when performing [insert experimental technique here]?” The accuracy of the answer may surprise you, because ChatGPT doesn’t particularly care about presenting the technique as foolproof, as a manufacturer’s literature might be incentivised to do, and it doesn’t need to be frustratingly concise, as a manuscript might. Large language models are even starting to replace the classic “seminal paper library” as a means of digesting, amalgamating, and communicating the general breadth of knowledge on a subject to bring new investigators up to speed quickly.
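
For readers who would rather script that kind of query than type it into a chat window, a minimal sketch using the OpenAI Python SDK might look like the following. The model name and the example technique are illustrative placeholders, and an OPENAI_API_KEY environment variable is assumed.

```python
# Minimal sketch of a "plain language" troubleshooting query against an LLM API.
# Assumes the openai Python SDK (v1+) and an OPENAI_API_KEY in the environment;
# the model name and the example technique are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

technique = "Western blotting of low-abundance membrane proteins"  # placeholder

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat-capable model; this choice is an assumption
    messages=[
        {"role": "system",
         "content": "You are an experienced bench scientist helping troubleshoot protocols."},
        {"role": "user",
         "content": f"What are the most common challenges and failures when performing {technique}? "
                    "List likely causes and practical fixes."},
    ],
)

print(response.choices[0].message.content)
```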

I believe democratisation of any technology is generally a good thing, provided we recognise and implement the proper guardrails. We’ve already seen some cases of bad actors using AI-generated images of particularly well-endowed rats in manuscripts that are entirely nonsensical and inaccurate. Perhaps that says more about peer review than it does about AI, but it’s a genuine concern that AI will facilitate the proliferation of “bad science” and create a basal level of noise that makes it difficult to tease out the truth.

Doomsaying aside, the jury is still out on just how transformative generative AI will be for life sciences research: whether it will be just another enabling technology that clears bandwidth for more meaningful pursuits, or whether it will fundamentally alter how we approach problem solving by forcing us to adopt complementary modes of research that are more “AI-friendly.” It’s too early to tell, but I’m optimistic.

The increasing use of machine learning in researching cells and diseases

Investigating the origins of a particular pathology is always an arms race with complexity, and with reducing that complexity to questions you can actually answer in a lifetime. Machine learning working in concert with automation has a huge role to play here. The more you make complexity mundane, the closer you get to meaningful answers.

We can see this philosophy in practice in the recent explosion of 3D culture methods, organoids, and on-chip devices that mimic the biological context of disease with much higher fidelity than conventional 2D culture. Liquid handling automation shines in this domain because culture workflows are long and laborious, and to stay relevant they often must be planned around the state of the culture rather than convenient work-day cycles. Robots simply don’t care if it’s 2 am on a Sunday morning when they’re passaging cells.

More generally speaking, liquid handling automation is quickly earning the trust of scientists to do the “dirty work” of even highly complex workflows and free up human capital to focus on abstracted problems. This dovetails nicely with machine learning, because larger and larger datasets can be generated under more well-characterised, if not controlled, conditions to train feedback algorithms. The results of these multivariate datasets then inform which organoid models are yielding actionable information, and under what conditions. Humans are freed to focus on the higher-order “why” questions as opposed to factor-level concerns of “which”, “when” and “how much.”

Trends on the horizon

We’re still very much in the “hype” phase of ChatGPT-like services, and in the near term, with existing open-source tools, I’d expect to see domain-specific specialisation of these large language models for use in life sciences. I think that will be the first application, even if not the most exciting one. Beyond that, I’m looking for two things: first, a trend shift towards multi-omics and otherwise massively parallel experiments, and second, a closer marriage of human intuition and in-silico predictive power to inform upstream experimental design and streamline analysis.

To the first point, imagine a situation where every experiment included phenotypic, genomic, transcriptomic, and proteomic data. I think we’re trending toward a world where that approach actually makes sense, and the classic research question shifts from a focused “Does X impact Y?” to a more generalised, system-wide “What’s happening here?”, with machine learning enabling that deep analysis by pointing human researchers at what’s relevant in the stream of information. When asking those open-ended questions we really want to have as much data along as many axes as possible, to give us the best chance of hitting on something crucial to prophylactic care, diagnosis, or therapeutic design.

To the second point, regarding the marriage of human intuition with in-silico prediction, we all have limited time, resources, money, and expertise bandwidth. I expect the next year of AI and machine learning innovations in life sciences to help direct those resources down avenues that may not have been immediately obvious. Giving scientists the ability to ask “what if…” type questions using predictive software, and to know the confidence of those predictions, has the potential to drastically accelerate the search for drug targets, proteins of interest, biomarkers, etc. Already we see these technologies deployed to make more sophisticated high-throughput screens, but I imagine we’ll begin to see similar percolation down towards basic research to inform experimental designs before a scientist even steps up to the wet bench.
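
One lightweight way to attach a confidence estimate to a “what if…” prediction is to look at the spread across an ensemble of models. The sketch below does this with a random forest in scikit-learn on made-up screening data; the factors, the response, and the toolchain are all assumptions for illustration only.

```python
# Sketch: a "what if" prediction with a rough confidence band, taken from the
# spread of individual trees in a random-forest ensemble. Data, features, and
# the scikit-learn toolchain are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Made-up historical screen: two assay factors -> measured response.
X_hist = rng.uniform(0.0, 1.0, size=(200, 2))
y_hist = 3.0 * X_hist[:, 0] - 1.5 * X_hist[:, 1] ** 2 + rng.normal(scale=0.2, size=200)

model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_hist, y_hist)

# "What if we pushed factor 1 high and factor 2 low?"
candidate = np.array([[0.9, 0.1]])
per_tree = np.array([tree.predict(candidate)[0] for tree in model.estimators_])

print(f"Predicted response: {per_tree.mean():.2f} "
      f"(approx. band: {np.percentile(per_tree, 2.5):.2f} to {np.percentile(per_tree, 97.5):.2f})")
```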

Improving analysis of complex datasets, including multi-omics

Multi-omics is all about relational analysis in large datasets, something humans alone are terrible at. We don’t know what we don’t know, so we miss a lot. Machine learning doesn’t have this problem; it can agnostically search for and characterise patterns or relations across any investigational axes, and even contrive combinatorial factors, as in Principal Component Analysis. Many of these techniques are entirely “unsupervised” in the sense that they are broadly applicable with minimal guidance from a human operator. While these kinds of “big data” analyses have historically been the realm of computational biologists, many research groups simply don’t have access to that skill set, or, if they do, that person or group doesn’t have bandwidth for high-risk exploratory work. AI tools, along with machine learning, are beginning to democratise access to these kinds of analyses.
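
As a concrete illustration of the unsupervised techniques mentioned above, the sketch below runs Principal Component Analysis on a simulated samples-by-features matrix using scikit-learn. The simulated data and the hidden two-group structure are assumptions made purely so the example runs end to end.

```python
# Unsupervised sketch: PCA on a simulated samples-by-features "omics" matrix.
# The data are simulated (two hidden groups of samples) purely so the example runs;
# scikit-learn is an assumed toolchain, not one named in the article.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# 40 samples x 500 features, with a subtle shift in 50 features for half the samples.
X = rng.normal(size=(40, 500))
X[20:, :50] += 1.5  # hidden group structure no one told the algorithm about

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=5)
scores = pca.fit_transform(X_scaled)

print("Variance explained by each component:",
      np.round(pca.explained_variance_ratio_, 3))
print("PC1 separates the two hidden groups:",
      np.round(scores[:20, 0].mean(), 2), "vs", np.round(scores[20:, 0].mean(), 2))
```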

Biological context is everything when it comes to understanding disease, and the ugly truth is that experimentation often necessarily strips away elements of context in the pursuit of factor control. Omics approaches represent a general departure from conventional control schemes by allowing more biological context and variability to remain in place, because omics experiments are themselves large-scale characterisations of the factors one would otherwise have needed to control. Of course, that’s a somewhat reductive explanation, but it’s close enough to the general case to be useful.

Now, multi-omics takes this a step further and has the potential to characterise systems either vertically, as a directly correlative stack in the case of genomics > transcriptomics > proteomics, or as complementary technologies, as in the case of genomics and metagenomics informing the speciation, diversity, and taxonomy of gut microbiota for microbiome study and profiling.

Ultimately, multi-omics represents an opportunity to generate high-fidelity, richly informative datasets that are often internally orthogonal. These datasets can then be used to make surgically precise predictions regarding promising pathways, targets, or therapeutic modalities. So why doesn’t everyone do it? A cursory power analysis often reveals that staggeringly large numbers of biological replicates, wells, or conditions are needed to make statistically relevant inferences, and while liquid handling automation can certainly address many of those challenges, you are still left with massive amounts of data that are difficult to resolve directly. However, as machine learning techniques mature and grow alongside our ability to generate large datasets, we’re becoming more adept at untangling the riddles and arriving at answers to questions we didn’t even think to ask, and that’s the real promise of integrated multi-omics.
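
To make the power-analysis point concrete, here is a minimal sketch using statsmodels; the effect sizes, significance level, and target power are illustrative assumptions rather than values from any particular study.

```python
# Sketch of the "cursory power analysis" mentioned above, using statsmodels.
# Effect sizes, alpha, and target power are illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

for effect_size in (0.8, 0.5, 0.2):  # large, medium, small (Cohen's d)
    n = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.8)
    print(f"Cohen's d = {effect_size}: ~{n:.0f} replicates per group")
```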

About the author

Ian Shoemaker

Senior Application Scientist

Beckman Coulter Life Sciences

Ian Shoemaker has nearly 15 years of translational lab automation and instrumentation experience in personalised medicine and clinical molecular diagnostics. At Beckman Coulter Life Sciences, he supports the applications development team in NGS, cell-based assays, and proteomics workflows.