
Part one: an introduction to data quality

In this four-part series, Dr Singh discusses the challenges posed by limited data quality, and some pragmatic solutions. This first article sets out the key attributes that define data quality and explains why data quality issues create a need for data scientists.

Early drug discovery involves conducting experiments, generating data, and testing scientific hypotheses. The success of scientists working in this field depends heavily on the quality of the data generated. Two questions must therefore be considered: are we generating the right data from the right experiments, and are we applying the best methods for data analysis and interpretation?

Data quality (sometimes called data integrity) is a frequently used term. It matters at every step of a workflow, from data acquisition to data analysis, because poor data quality can severely compromise the effectiveness of analyses by machine learning (ML) and AI. The characteristics that define data quality are as follows:

  • Completeness
  • Consistency
  • Lack of bias
  • Accuracy

Data quality is a large topic across many industries, where data engineers specialise in “data ops” and build advanced test pipelines to detect issues automatically. In early drug discovery, the data pipelines tend to be more limited (with genomics data being the exception). For example, data quality is often about:

  • a measurement issue from a lab machine – eg missing data in a CSV file
  • a poorly designed experiment – eg too much variability in the data
  • poor data from non-controlled environments – eg observational data from human studies
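
The first two issues in this list lend themselves to simple automated checks. The following is a minimal sketch in Python, using only the standard library, and is not a method taken from this article. The file layout (one row per well, with replicate readings in columns), the column names, and the 0.2 threshold are all assumptions chosen for illustration. Variability is measured here with the coefficient of variation (standard deviation divided by the mean), which is one common choice among several.

```python
import csv
import io
import statistics


def missing_cells(csv_text):
    """Return (data_row_number, column_name) for every empty cell."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = []
    for row_number, row in enumerate(reader, start=1):
        for column, value in row.items():
            if column is None:
                continue  # surplus cells beyond the header; ignored here
            if value is None or value.strip() == "":
                missing.append((row_number, column))
    return missing


def _replicates(row, id_column):
    """Parse the non-empty reading cells of one row as floats."""
    return [
        float(value)
        for column, value in row.items()
        if column not in (id_column, None)
        and value is not None
        and value.strip() != ""
    ]


def high_variability_rows(csv_text, id_column, threshold):
    """Return ids of rows whose coefficient of variation exceeds threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    flagged = []
    for row in reader:
        readings = _replicates(row, id_column)
        if len(readings) < 2:
            continue  # a spread cannot be estimated from one reading
        mean = statistics.mean(readings)
        if mean == 0:
            continue  # coefficient of variation is undefined for a zero mean
        if statistics.stdev(readings) / mean > threshold:
            flagged.append(row[id_column])
    return flagged


# Hypothetical plate-reader export, invented for this example.
EXAMPLE_CSV = """well,rep1,rep2,rep3
A1,1.02,0.98,1.00
A2,0.51,,0.49
A3,0.20,0.80,0.50
"""

print(missing_cells(EXAMPLE_CSV))
print(high_variability_rows(EXAMPLE_CSV, "well", 0.2))
```

In this invented example, well A2 has an empty second replicate and well A3 has widely scattered readings. Checks of this kind only flag a problem. Deciding whether a missing reading or a noisy well matters still requires scientific judgement.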

Data-generation costs are falling steadily, as technology (both hardware and AI) becomes cheaper and more effective. Over the last couple of decades, this has led to ready access to large amounts of data – creating the paradigm of “big data”.

In this new paradigm, data quality issues multiply. An error that was previously easy to detect in a small dataset, saved as a CSV file for example, can be much harder to detect in a large, complex dataset that may need to be held in a structured database. This creates a need for data scientists. Although we are not quite at the point of requiring data ops, that need may arrive soon.

“Data science” and “data scientist” are terms used across industries, and they mean different things to different people, depending primarily on the job to be done. For early drug discovery, we should focus on two skills: data engineering and scientific data understanding.

  • Data engineering is the ability to describe, summarise, format and clean data.  
  • Scientific data understanding is the ability to explain the scientific rationale behind the dataset labels and the implication of, for example, missing data.
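
As a rough illustration of the data engineering skill described above (format, clean, then summarise), here is a minimal Python sketch using only the standard library. The field names, compound identifiers, and values are hypothetical and do not come from a real dataset. Rows with unusable values are set aside rather than silently dropped, on the assumption that a scientist will want to review what the missing data implies.

```python
import math
import statistics


def clean_records(records, numeric_field):
    """Normalise field names and parse one numeric field.

    Returns (cleaned, rejected). Rejected rows are the untouched originals.
    """
    cleaned, rejected = [], []
    for record in records:
        # Format: trim and lower-case field names, trim values.
        normalised = {
            key.strip().lower(): str(value).strip()
            for key, value in record.items()
        }
        try:
            number = float(normalised[numeric_field])
        except (KeyError, ValueError):
            rejected.append(record)  # missing or non-numeric value
            continue
        if not math.isfinite(number):
            rejected.append(record)  # "nan" and "inf" parse but are unusable
            continue
        normalised[numeric_field] = number
        cleaned.append(normalised)
    return cleaned, rejected


def summarise(cleaned, numeric_field):
    """Describe one numeric field; assumes at least one cleaned row."""
    values = [row[numeric_field] for row in cleaned]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }


# Hypothetical assay results with untidy headers and gaps.
RAW = [
    {" Compound ": "cpd-1", "IC50_nM": " 12.5"},
    {" Compound ": "cpd-2", "IC50_nM": ""},
    {" Compound ": "cpd-3", "IC50_nM": "7.5"},
    {" Compound ": "cpd-4", "IC50_nM": "n/a"},
]

cleaned, rejected = clean_records(RAW, "ic50_nm")
summary = summarise(cleaned, "ic50_nm")
print(summary)
print(len(rejected), "rows set aside for review")
```

The second skill, scientific data understanding, begins where this code stops. The code cannot tell whether an empty value means the compound was never tested or was inactive at the highest dose, and that judgement has to come from someone who knows the experiment.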

A data scientist in early drug discovery needs to be able to do both of the above competently, which involves a mix of software, science, and ML skills. The challenge that hiring managers face is that academic courses tend to teach these pillars separately. The multi-disciplinary skills above are often learnt in industry, making good data scientists difficult to find.

In the next article in this series, which will be published on Monday 26 August, we will discuss the problems that can occur when data quality is compromised.

Furthermore, there are several industry and academic efforts to accelerate the sharing and use of data across the life sciences industry, which we will also discuss in a future article.

 

About the author

Dr Raminderpal Singh

Dr Raminderpal Singh is a recognised visionary in the implementation of AI across technology and science-focused industries. He has over 30 years of global experience leading and advising teams, helping early to mid-stage companies achieve breakthroughs through the effective use of computational modelling.

Raminderpal is currently the Global Head of AI and GenAI Practice at 20/15 Visioneers. He also founded and leads the HitchhikersAI.org open-source community, and is a co-founder of Incubate Bio – a techbio providing a service to life sciences companies that are looking to accelerate their research and lower their wet lab costs through in silico modelling.

Raminderpal has extensive experience building businesses in both Europe and the US. As a business executive at IBM Research in New York, Dr Singh led the go-to-market for IBM Watson Genomics Analytics. He was also Vice President and Head of the Microbiome Division at Eagle Genomics Ltd, in Cambridge. Raminderpal earned his PhD in semiconductor modelling in 1997. He has published several papers and two books and has twelve issued patents. In 2003, he was selected by EE Times as one of the top 13 most influential people in the semiconductor industry.

For more: http://raminderpalsingh.com; http://20visioneers15.com; http://hitchhikersAI.org; http://incubate.bio