The Two Major Challenges with Population-Scale Genomic and Rich Clinical Data, and How MediSapiens Genius and Accurate Software Solutions Enable Real-Time Genomic Medicine

14 Jun 2018
Genomic medicine may facilitate predictive, preventive and personalized (or precision) health care, improve cost efficiency of medical care and promote efficacy, speed and success rates of drug development. The utilization of the world’s increasing genomic and clinical data resources holds the promise of driving new breakthroughs in drug development, disease prevention, diagnostics and patient care.

Genome sequencing has been one of the fastest-developing technologies of recent years. This shows in the dramatic fall in the cost of sequencing a human genome: from $100M at the beginning of the 21st century, to $1M in 2008, to roughly $1,000 today, a hundred-thousand-fold drop in about 15 years. As a result, it is now possible to sequence thousands or even millions of people.

Genomic data can then be linked with rich clinical parameters, such as demographic, laboratory, disease, treatment, drug response, side effect, survival and other follow-up data, as well as other real-world data. Major genomic medicine studies underway include the UK Biobank, the NIH 1 million people ‘All of Us’ program, the Million Veteran Program, Genomics England’s 100,000 Genomes Project, the Chinese 1 million genomes project in Nanjing, Genome Korea in Ulsan, Saudi Arabian genomes, FinnGen, the Estonian Genome Project, major industrial programs such as 23andMe and deCODE, as well as hundreds of other national, hospital-based, cohort and biobank projects.

There are two major challenges in linking genomic and clinical data sets at population scale. First, because these data sets are highly complex and multidimensional, the calculations needed for genome data analysis slow down dramatically as data volumes become massive. Second, genomic data are only as useful as the clinical and phenotypic data associated with them. Unfortunately, these data are rarely of sufficiently high quality, harmonized or standardized. Clinical and phenotypic data are often collected from Electronic Health Records, registries or patient questionnaires, which tends to produce disparate, non-systematic, unharmonized and error-prone data. Huge manual data curation efforts are often needed to make such data useful.

MediSapiens has recently launched two new software solutions – Genius™ and Accurate™ – to tackle both of these major challenges. These software solutions facilitate population-scale genomic medicine, drug development and precision healthcare efforts. Genius™ enables one to analyze genomic data in real time at the level of millions of individuals, without heavy investment in server clusters and with low operating costs. Accurate™ enables one to curate and harmonize massive clinical data series automatically by leveraging AI assets.


Genius is a novel genomic data management and analysis solution that allows you to perform real-time queries of population-scale genomic datasets. With Genius, up to millions of genomes can be queried at once, in real time, speeding up processing by an order of magnitude. Previously, many such queries took hours, or even days, to return a response.

To test the speed, we benchmarked Genius by running analyses on 500,000 samples and 50 trillion genotypes. Here are some of the results:

  • Genome-wide search for genetic associations for any ontology term – 2 seconds
  • Search for samples with a specific mutation – 0.9 seconds
  • Discovery of protein altering mutations across all genes – 2 seconds

Currently one needs to wait for at least tens of minutes of batch-processing time before such results become available at the scale indicated. An alternative is to acquire very expensive hardware or cloud computing resources to handle such analyses, which is not very attractive or even feasible for many organizations. The advantage of cost-effective real-time data processing is significant: one can analyze a result instantaneously, create the next query and undertake discovery and data mining much faster. This is what we call working in real time on population-scale data. We believe that this will fundamentally speed up genomic data mining and analysis for drug development, diagnostics development and genomic medicine.
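To give a feel for the scale of the benchmark dataset, here is a small back-of-envelope calculation. The sample and genotype counts come from the benchmark above; the 2-bit-per-genotype packed encoding is purely an assumption for illustration and says nothing about how Genius actually stores data.

```python
# Back-of-envelope size of the benchmark dataset described in this post.
SAMPLES = 500_000            # from the benchmark
GENOTYPES = 50 * 10**12      # 50 trillion, from the benchmark
BITS_PER_GENOTYPE = 2        # ASSUMPTION: compact 2-bit encoding, not Genius's format


def genotypes_per_sample(genotypes=GENOTYPES, samples=SAMPLES):
    """Average number of genotype calls stored for each sample."""
    return genotypes // samples


def packed_size_terabytes(genotypes=GENOTYPES, bits=BITS_PER_GENOTYPE):
    """Raw size in decimal terabytes if every genotype took `bits` bits."""
    return genotypes * bits / 8 / 10**12


if __name__ == "__main__":
    print(genotypes_per_sample())      # genotype calls per sample
    print(packed_size_terabytes())     # terabytes under the 2-bit assumption
```

Under that assumption the dataset works out to 100 million genotypes per sample and 12.5 TB of raw genotype data, before any indexes or annotations.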


Accurate enables scalable and automated curation of clinical and phenotypic data, including massive healthcare datasets. Accurate allows you to visualize, curate, enrich and harmonize these data, as well as to automatically add ontology encoding for relevant parameters across the entire dataset at once. We are also working diligently to bring other ground-breaking features to Accurate, such as automated analysis of unstructured data, for example patient discharge notes or other free-text entries, and conversion of the relevant data into a structured, computationally accessible format. This will be a major improvement in the richness and utility of any clinical data set that also encompasses unstructured data.
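As a rough illustration of what harmonizing and ontology-encoding clinical data involves, the sketch below maps differently spelled diagnosis labels to a single ICD-10 category. It is not Accurate's implementation: the synonym table, the record layout and the function names are all hypothetical, and a real system would need far more than an exact-match lookup.

```python
import re

# HYPOTHETICAL synonym table: normalized free-text label -> ICD-10 category.
SYNONYMS = {
    "type 2 diabetes": "E11",
    "type ii diabetes": "E11",
    "t2dm": "E11",
    "hypertension": "I10",
    "high blood pressure": "I10",
    "htn": "I10",
}


def normalize(label):
    """Lowercase the label and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def encode(label):
    """Return the ICD-10 category for a label, or None if it is not recognized."""
    return SYNONYMS.get(normalize(label))


def harmonize(records):
    """Split records into ontology-coded ones and ones left for manual review."""
    coded, unresolved = [], []
    for rec in records:
        code = encode(rec["diagnosis"])
        if code is None:
            unresolved.append(rec)
        else:
            coded.append({**rec, "icd10": code})
    return coded, unresolved


if __name__ == "__main__":
    # Made-up example records, as they might arrive from different sources.
    records = [
        {"id": 1, "diagnosis": "Type-2  Diabetes"},
        {"id": 2, "diagnosis": "HTN."},
        {"id": 3, "diagnosis": "chest pain?"},
    ]
    coded, unresolved = harmonize(records)
    print(coded)
    print(unresolved)
```

Records that cannot be matched are kept aside rather than guessed at, which mirrors the point made earlier: whatever cannot be curated automatically still needs manual effort.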

Moving from working on a few thousand genomes to working with hundreds of thousands or millions of genomes has so far meant unbearably high costs and an inability to scale up clinical data annotation accordingly. We believe that combining Genius and Accurate allows our customers to be at the leading edge of combining any genomic and clinical data and using them for the next big breakthroughs. The more data there are, and the higher their quality, the better.

We continue to innovate on our path to better health, working together with our valued customers.


Writer: Henrik Edgren, CSO