Part one: what can scientists do with LLMs today?


Posted: 24 June 2024 | Dr Raminderpal Singh (Hitchhikers AI and 20/15 Visioneers)

Recently there has been a flurry of announcements from AI-led biotechs around the potential of large language models (LLMs) in early drug discovery. In the first of a three-part series, Dr Raminderpal Singh explores what LLMs are, how early-stage biotechs can take advantage of them, and what challenges they present.

LLMs are a type of deep learning model. Language models, the family to which LLMs belong, have been around since the 1990s, but it wasn’t until the 2010s that their use became more popular. They underpin AI assistants like Apple’s Siri and ChatGPT. Because of their ability to process vast sets of data, there have been attempts to use them to accelerate progress across a range of industries. Early drug discovery is no exception, and many AI-led biotechs, including Recursion¹ and Insilico Medicine², have released LLM-related announcements.

How do LLMs work?

ChatGPT³ says that an LLM is a type of artificial intelligence (AI) model designed to understand and generate human language. These models are based on deep learning architectures, particularly transformers, which enable them to process and generate text with a high degree of coherence and relevance. LLMs, like GPT-4, are trained on vast amounts of text data from the internet, books, articles and other sources. This training allows them to learn the complexities of language, including grammar, context and even some level of reasoning and common sense. The last of these is a large claim, so scientists are right to be wary.
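To make the idea of learning language from text concrete, here is a minimal sketch. It is an assumption-laden toy: it uses a simple bigram word count rather than a transformer, and the function names and the three-sentence corpus are invented for illustration. It shows only the core principle that a language model predicts what comes next from patterns in its training text.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = model.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Invented toy corpus, standing in for the vast text collections used in practice.
corpus = (
    "the kinase binds the ligand . "
    "the kinase binds the inhibitor . "
    "the kinase phosphorylates the substrate ."
)
model = train_bigram_model(corpus)
print(predict_next(model, "kinase"))   # binds
print(predict_next(model, "aspirin"))  # None
```

The toy model can only return what it has seen. A real LLM differs in scale and architecture, and it will usually produce some answer even for unfamiliar input, which is one reason its output needs checking.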

It is important to note that:

LLMs are trained on large text collections, and some providers also learn from user questions and iterations on the system. Essentially, they may use and reuse your prompts, which is one reason some LLMs are cheap to access.
LLMs are very expensive to build and train.
LLMs draw on existing knowledge, so they are not great speculative tools (ie, when looking for insights in ‘dark spaces’).
ChatGPT (which includes several add-ons) can be very useful for scientists. However, to make it effective you must iterate on your prompts and questions, which can be frustrating.

How to reduce LLM hallucinations

The topic of LLM hallucinations, where a model produces plausible but false output, also needs to be addressed. These have made the news⁴ several times. Scientists who have experimented with LLMs will know how frustratingly unreal the answers can be. Hallucinations cannot be avoided when using non-curated knowledge. Where they can, biotechs should curate their paper cohorts and upload them as the sole source for the analysis. Even then, hallucination checks should be built into the workflow.
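One simple form a hallucination check could take is sketched below. It assumes, purely for illustration, that the LLM has been asked to cite its sources in a `[PMID:...]` format and that the curated paper cohort is held as a set of identifiers; the function name, the citation format and the IDs are all hypothetical. The check flags any citation that does not appear in the curated cohort.

```python
import re

# Hypothetical citation format that the LLM is instructed to use in its answers.
CITATION_PATTERN = re.compile(r"\[PMID:(\d+)\]")


def check_citations(answer, curated_pmids):
    """Split the IDs cited in `answer` into those inside and outside the curated cohort."""
    cited = CITATION_PATTERN.findall(answer)
    supported = [pmid for pmid in cited if pmid in curated_pmids]
    unsupported = [pmid for pmid in cited if pmid not in curated_pmids]
    return supported, unsupported


# Made-up identifiers standing in for a curated paper cohort.
curated = {"11111111", "22222222"}
answer = (
    "Compound A inhibits target B [PMID:11111111] "
    "and is orally available [PMID:99999999]."
)
supported, unsupported = check_citations(answer, curated)
print(supported)    # ['11111111']
print(unsupported)  # ['99999999']
```

A check like this only confirms that a cited paper is in the cohort. It does not confirm that the paper supports the claim, so a scientist still needs to review flagged and unflagged statements alike.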

How to use LLMs

Users of LLMs can be divided into two types:

(i) those who use what is available in a simple, and sometimes private, way
(ii) those who build and train their own models.

This series of articles will focus on the former type, mainly because the building of new models is currently too expensive for most.

The simple use of LLMs (point (i) above) can be done in two ways. The first is to go online and use an interactive system like ChatGPT. The second is to use software that includes wrappers around an LLM. The latter approach will be discussed in part three of this series, published on Monday 15 July.
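As a rough illustration of what a wrapper around an LLM might look like, the sketch below builds a prompt that restricts the model to a set of supplied documents and passes it to a backend. Everything here is an assumption: the class and function names are invented, and the backend is a stub function rather than a real model, so that the example runs without any external service.

```python
from typing import Callable, Sequence


def build_prompt(question: str, documents: Sequence[str]) -> str:
    """Combine numbered source documents and a question into one prompt."""
    numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return (
        "Answer using only the numbered sources below. "
        "If the sources do not contain the answer, reply 'not found'.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}"
    )


class LLMWrapper:
    """Hypothetical wrapper: holds curated documents and delegates to any backend."""

    def __init__(self, backend: Callable[[str], str], documents: Sequence[str]):
        self.backend = backend          # any function that maps a prompt to a reply
        self.documents = list(documents)

    def ask(self, question: str) -> str:
        return self.backend(build_prompt(question, self.documents))


def stub_backend(prompt: str) -> str:
    """Stand-in for a real model: only reports the size of the prompt it received."""
    return f"(stub) received a prompt of {len(prompt)} characters"


wrapper = LLMWrapper(stub_backend, ["Paper one abstract.", "Paper two abstract."])
print(wrapper.ask("Which paper mentions kinases?"))
```

Keeping the backend as a plain function means the same wrapper could, in principle, sit in front of different models, with the curated documents and prompt rules staying in one place.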

LLMs have their limitations and are not best-in-class across modelling domains. For example, AlphaFold 3 uses Diffusion Models (DMs), rather than LLMs, to get better results.⁵,⁶ Both DMs and LLMs come under the popular term Generative AI (GenAI). In addition, there is the term Foundation Model, which is similar to, but more powerful than, application-specific LLMs.⁷ As the GenAI field advances there will be more variants, and GenAI will be discussed more broadly in a later article.

In the next article in this series, published Monday 15 July, we will walk through a real science example using ChatGPT and provide the source information for you to replicate the example.

References

1. Proffit A. The LOWE Down on Recursion’s New LLM Orchestration Work Engine from JP Morgan. BioIT World [Internet]. 2024 January 9 [cited 2024 June]. Available from: https://www.bio-itworld.com/news/2024/01/09/the-lowe-down-on-recursion-s-new-llm-orchestration-work-engine-from-jp-morgan
2. Insilico Medicine. Insilico and NVIDIA unveil new LLM transformer for solving biological and chemical tasks. News Medical & Life Sciences [Internet]. 2024 May 20 [cited 2024 June]. Available from: https://www.news-medical.net/news/20240520/Insilico-and-NVIDIA-unveil-new-LLM-transformer-for-solving-biological-and-chemical-tasks.aspx
3. ChatGPT. Available from: https://chatgpt.com/
4. Merken S. New York lawyers sanctioned for using fake ChatGPT cases in legal brief. Reuters [Internet]. 2023 June 26 [cited 2024 June]. Available from: https://www.reuters.com/legal/new-york-lawyers-sanctioned-using-fake-chatgpt-cases-legal-brief-2023-06-22/
5. Devansh. Will Diffusion Models Be The Next Frontier of Deep Learning. Medium [Internet]. 2024 May 12 [cited 2024 June]. Available from: https://medium.datadriveninvestor.com/will-diffusion-models-be-the-next-frontier-of-deep-learning-7172bea88581
6. Sora Creators. What is the difference between a diffusion model and LLM in simple terms. Sora Creators [Internet]. Available from: https://soracreators.ai/blog/What-is-the-difference-between-a-diffusion-model-and-LLM-in-simple-terms
7. Novita AI. Foundational Model vs. LLM: Understanding the Differences. Medium [Internet]. 2024 May 13 [cited 2024 June]. Available from: https://medium.com/@marketing_novita.ai/foundational-model-vs-llm-understanding-the-differences-820a4428dbc3

About the author

Dr Raminderpal Singh

Dr Raminderpal Singh is a recognised key opinion leader in the techbio industry. He has over 30 years of global experience leading and advising teams on building computational modelling systems that are both cost-efficient and have significant IP value. His passion is to help early to mid-stage life sciences companies achieve novel biological breakthroughs through the effective use of computational modelling.

Raminderpal is currently leading the HitchhikersAI.org open-source community, accelerating the adoption of AI technologies in early drug discovery. He is also CEO and co-Founder of Incubate Bio – a techbio providing a service to life sciences companies who are looking to accelerate their research and lower their wet lab costs through in silico modelling.

Raminderpal has extensive experience building businesses in both Europe and the US. As a business executive at IBM Research in New York, Dr Singh led the go-to-market for IBM Watson Genomics Analytics. He was also Vice President and Head of the Microbiome Division at Eagle Genomics Ltd, in Cambridge. Raminderpal earned his PhD in semiconductor modelling in 1997. He has published several papers and two books and has twelve issued patents. In 2003, he was selected by EE Times as one of the top 13 most influential people in the semiconductor industry.

For more: http://raminderpalsingh.com ; http://hitchhikersAI.org ; http://incubate.bio

