Why curating phenotypic data is crucial to advancing genomic medicine

In recent years, the rise of high-throughput genome sequencing has created the potential for genomic medicine to revolutionise healthcare by enabling more targeted diagnoses and treatments. Landmark sequencing initiatives, such as the Chinese 1 million genomes project and the Million Veteran Program in the US, offer the tantalising possibility of uncovering new information about a range of chronic illnesses.

But in order to interpret these huge databases of genomic information and translate them into the clinical tests of the future, there is a need to integrate them with phenotypic data, which can be anything from a patient’s medical history to detailed imaging data from MRI scans. In recent years there has been a drive for increasing amounts of phenotypic data to advance genomic medicine. The more information known about patients and their physiology, the more likely it is that connections can be established between a particular disease, the genome and different phenotypes.

Genomic scientists believe that these connections could be used for a range of clinical applications, from determining the people who are most at risk of developing particular disorders to predicting disease progression over time. “If you collect enough genomic and phenotypic data, then you can start developing algorithms which make predictions for patient outcomes,” says Julian Barwell, consultant in clinical genetics at the University of Leicester. “Artificial intelligence will start to have a role where you can map how someone’s likely to develop over time based on their genomic profile and their phenotypic profile.”

But trying to utilise all this phenotypic data comes with a variety of challenges, not least the sheer volume of data involved. In the years to come, the number of phenotypes being collected on each patient is predicted to expand vastly as genomic scientists seek to mine everything from social media data to biometrics collected by smartphones and smart watches.

“It’s not just text but often very storage intensive files,” says Sobia Raza, a genomic scientist at the University of Cambridge. “So there can be lots of detailed images of people’s hospital pathology slides, characterisation of cancers, and these types of phenotypic data can be a challenge in terms of how much memory they consume.”

However, new advances in technology are already helping to deal with a number of these obstacles. MediSapiens’ data curation solution Accurate has been specifically designed to deal with one of the biggest challenges which genomic scientists face regarding phenotypic data, namely the problem of curating and standardising information originating from a range of different sources.

“Right now there’s a real need for better, more standardised phenotypic data,” says Raza. “Without this, you might not be able to discover patterns and correlations in the genomic datasets which may be of interest.”

Because of the sheer number of people involved in genomic studies, phenotypic data is likely to come from many different hospitals and clinics, which presents a variety of difficulties. For example, some of the data will be captured in the form of free text and so will be difficult to extract quickly, while different medical centres will use varying abbreviations or technical terms for referring to the same thing.

“There’s a huge problem with collating data from different sources,” explains Henrik Edgren, Chief Scientific Officer at MediSapiens. “There’s a surprising variety of encodings even for something as simple as sex. And there are lots of alternate names for the same drug, or very creative ways of mis-spelling the names of drugs. For the same medication, you might have ten different ways of writing it wrong, and for a human to go through all of this manually and standardise it all perfectly is an enormous task.”

Accurate tackles this by using machine learning models to learn how a particular scientist would like the data to be standardised, before combing through and harmonising entire databases of phenotypic information in the same way, making the whole process as automated as possible.
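
To make the harmonisation problem concrete, here is a minimal Python sketch of two of the issues Edgren describes: inconsistent encodings for sex, and misspelled drug names. It is not how Accurate works. It swaps the machine learning approach for a plain lookup table and fuzzy string matching from Python's standard library, and every name in it (`SEX_CODES`, `KNOWN_DRUGS`, `standardise_sex`, `standardise_drug`), along with the example codes and drug list, is a hypothetical chosen for illustration.

```python
from difflib import get_close_matches

# Hypothetical lookup table: these raw encodings are illustrative only,
# not taken from any real hospital dataset.
SEX_CODES = {
    "m": "male", "male": "male", "1": "male",
    "f": "female", "female": "female", "2": "female",
}

# Hypothetical reference vocabulary of correctly spelled drug names.
KNOWN_DRUGS = ["tamoxifen", "cisplatin", "methotrexate"]


def standardise_sex(raw):
    """Map a raw sex encoding to a canonical label, or 'unknown' if unrecognised."""
    return SEX_CODES.get(str(raw).strip().lower(), "unknown")


def standardise_drug(raw, cutoff=0.8):
    """Return the closest known drug name, or None if nothing is similar enough."""
    cleaned = str(raw).strip().lower()
    matches = get_close_matches(cleaned, KNOWN_DRUGS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for value in [" M ", "F", 2, "x"]:
        print(repr(value), "->", standardise_sex(value))
    for value in ["Tamoxifin", "cisplatinn", "zzzz"]:
        print(repr(value), "->", standardise_drug(value))
```

A fixed similarity cutoff like this is crude: set it too low and distinct drugs get merged, set it too high and genuine misspellings slip through, which is one reason a tool that learns from a curator's decisions is attractive at the scale of thousands of patients.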

“The idea is to make it as easy as possible to take phenotypic data across formats and curate, harmonise and standardise it for whatever use case you want,” says Edgren. “For example, many of our clients have been working with large amounts of cancer genomic data. These datasets comprise 20-30 different cancer types, and close to 10,000 patients and they use our technology to standardise them. This makes sure that the same ways of encoding information, from diagnosis to how aggressive a cancer is, are used throughout the dataset so everything is comparable.”

Such phenotypic support tools are crucial to the future of genomic medicine, and genomic experts around the world believe these technologies are likely to become even more intelligent and advanced.

“In future, technology is likely to guide phenotypic profiling,” says Raza. “So tools will be able to learn from existing patterns between phenotypes and genomic information, and be able to analyse phenotypic data and then guide clinicians towards the type of disorder that the patient may have.”

Dr David Cox is a freelance health journalist who has written for major newspapers and broadcasters around the world, covering stories ranging from the rise of genomic medicine to artificial intelligence and personalised medicine. He has a PhD from the University of Cambridge in neuroscience.
