With ever-growing amounts of data being produced in the healthcare and life science sectors, new technologies and solutions that leverage AI and machine learning are vital. One of the remaining challenges in unlocking this data is understanding and structuring free text, as extracting the vital information from it tends to be tricky. One of the branches of this field that MediSapiens has been working on is natural language processing (NLP), a set of technologies that help computers understand human language.
Our latest NLP project in brief
At MediSapiens, we’ve built a deep-learning-based solution that analyses free text and automatically identifies the biomedical information considered relevant, so that it can be extracted and organized into a structured format. Examples of the categories of information it can identify are diseases, medications, infected tissues or protein names. The system then converts different types of clinical data into one or more tabular files that contain all the biomedical information found, in an organized way.
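As a rough illustration of what "organized into a tabular format" can mean, the sketch below turns a handful of extracted items into CSV rows. The `Entity` record, its field names and the `entities_to_csv` helper are hypothetical and chosen for this example only; they are not the actual MediSapiens data model, and the entities are written by hand rather than produced by a trained model.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class Entity:
    """One piece of extracted information (hypothetical record layout)."""
    doc_id: str    # which document the text came from
    category: str  # e.g. "disease" or "medication"
    text: str      # the exact words found in the document
    start: int     # character offset where the text starts
    end: int       # character offset just past the end of the text


def entities_to_csv(entities: List[Entity]) -> str:
    """Write the entities as CSV, ordered by document and position."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["doc_id", "category", "text", "start", "end"])
    for e in sorted(entities, key=lambda e: (e.doc_id, e.start)):
        writer.writerow([e.doc_id, e.category, e.text, e.start, e.end])
    return buffer.getvalue()


note = "Patient has asthma and was prescribed salbutamol."

# Hand-written stand-ins for what an extraction model might return.
entities = [
    Entity("note-1", "medication", "salbutamol", 38, 48),
    Entity("note-1", "disease", "asthma", 12, 18),
]

table = entities_to_csv(entities)
print(table)
```

Keeping the character offsets next to each item makes it possible to trace every table row back to the exact place in the original text.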
Cutting-edge software solutions for healthcare powered by machine learning
The application possibilities of such a system in healthcare are vast and exciting. Healthcare entities such as hospitals, clinics or pharmaceutical companies usually have a lot of potentially useful information spread across years of textual data that the informatics systems currently in use do not take into account. They could, if only that data were analysed and organized into, for example, a tabular format.
Challenges with multiple languages
We wanted to build a versatile system that would work on multiple languages, ideally any language. It can, however, be technically challenging to make a language-based system work on any language, given how different languages are in nature and structure. So we decided to use a language-agnostic core algorithm and add an interchangeable module for each language. This allows the system to be used successfully on any language without depending on one specific language.
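The idea of a language-agnostic core with a swappable language piece can be sketched in a few lines. Everything here is an assumption made for illustration: the `Pipeline` class, the `english_module` and `finnish_module` functions and the `toy_model` stand-in are hypothetical names, and the lookup table merely takes the place of a trained model so that the example runs on its own.

```python
from typing import Callable, Dict, List, Tuple

Tokens = List[str]
LanguageModule = Callable[[str], Tokens]  # language-specific preprocessing
Model = Callable[[Tokens], List[str]]     # language-agnostic core


def english_module(text: str) -> Tokens:
    """Placeholder English preprocessing: lowercase and split on spaces."""
    return text.lower().split()


def finnish_module(text: str) -> Tokens:
    """Placeholder Finnish preprocessing; a real module would do more."""
    return [token.strip(".,") for token in text.lower().split()]


def toy_model(tokens: Tokens) -> List[str]:
    """Stand-in for a trained model: one label per token."""
    known = {"asthma": "DISEASE", "astma": "DISEASE"}
    return [known.get(token, "O") for token in tokens]


class Pipeline:
    """Core logic that never needs to know which language it is handling."""

    def __init__(self, modules: Dict[str, LanguageModule], model: Model) -> None:
        self.modules = modules
        self.model = model

    def run(self, text: str, lang: str) -> List[Tuple[str, str]]:
        if lang not in self.modules:
            raise ValueError(f"no language module registered for {lang!r}")
        tokens = self.modules[lang](text)
        return list(zip(tokens, self.model(tokens)))


pipeline = Pipeline({"en": english_module, "fi": finnish_module}, toy_model)
print(pipeline.run("Patient has asthma", "en"))
print(pipeline.run("Potilaalla on astma.", "fi"))
```

Supporting a new language in this sketch means registering one more preprocessing function; the `Pipeline` class itself stays untouched.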
English and Finnish grammar walk into a bar
The data resources available in English and in Finnish are not comparable: Finnish resources are often limited and not very diverse. When developing an NLP system, it is important to test how it performs on diverse data, even within the same context, so that we can confidently assess how well the system generalizes. Limited resources, of course, make this task that much more burdensome.
The majority of published NLP research is done on English. Building a system that aims to be language-independent therefore adds the complexity of having to test those architectures on other languages and check whether the state-of-the-art performance holds.
This is especially true when working with Finnish, which is radically different from English in grammar, word formation and more. For example, Finnish, like German, has a lot of compound words, which means its vocabulary is immense, with many word combinations being essentially new words. The sentence structure is also very different from English, which again challenges system development. Sentence-based algorithms like the one we’ve used may struggle on languages other than English, which has required changes to the architecture and therefore adjustments during development.
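To see why compound words inflate the vocabulary, consider the small sketch below. It combines a few Finnish stems into compounds and compares the number of distinct whole words with the number of distinct building blocks. The stem lists and the greedy `segment` function are illustrative assumptions: not every generated combination is necessarily an idiomatic Finnish word, and splitting words into known pieces is only one common way of coping with the problem, not a description of the changes made to our system.

```python
from itertools import product
from typing import List, Set

modifiers = ["keuhko", "sydän", "rinta"]  # roughly: lung, heart, chest
heads = ["syöpä", "sairaus", "kipu"]      # roughly: cancer, disease, pain

# Every modifier + head pair is a new surface word for a word-level system.
compounds = ["".join(pair) for pair in product(modifiers, heads)]
word_vocab: Set[str] = set(compounds)
subword_vocab: Set[str] = set(modifiers) | set(heads)


def segment(word: str, pieces: Set[str]) -> List[str]:
    """Greedy longest-match split; the word is kept whole if it cannot be covered."""
    result: List[str] = []
    i = 0
    while i < len(word):
        match = next(
            (word[i:j] for j in range(len(word), i, -1) if word[i:j] in pieces),
            None,
        )
        if match is None:
            return [word]
        result.append(match)
        i += len(match)
    return result


print(len(word_vocab), "whole words vs", len(subword_vocab), "building blocks")
print(segment("keuhkosyöpä", subword_vocab))
```

With three modifiers and three heads the gap is small (nine words against six pieces), but the number of whole words grows with the product of the two lists while the number of pieces grows only with their sum.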
How to pick up the important content
Our NLP system was built using a deep-learning architecture. Deep learning is an area of machine learning that has been advancing rapidly in the past couple of years.
To put it simply, the system is entirely data-driven, which means that it contains no hand-crafted, rule-based analysis module whatsoever. The main algorithm was taught how to analyse the data by looking at examples of that data and automatically computing the word and sentence patterns it identified as important for distinguishing the relevant parts of the data from the rest.
The system has been designed to be end-to-end, which means that it is fed textual data as such and directly outputs the result of the analysis: all the biomedically relevant information found, without specifying the “logic” that led to it.
Architecture built to be reusable
The system’s algorithmic skeleton is fully data-driven, so there is no need to adapt it to specific client cases. Nonetheless, it does require the client’s textual data to train their own instance of the system, which is then specialized in finding the information the client wants to extract.
At first glance, the need to train the system on the client’s own data may sound like a limitation, but language-based systems always benefit from being tuned on data similar to what they will be used on. If you build a language-based system on English drama novels, the chances of it working well on French poetry are low.
Moreover, given that the specific purpose of our system is to identify certain information in free text, a client-specific instance allows the system to focus solely on the information that truly matters in that client’s context. It is the client’s choice to decide what that information is.
Our default output is a simple tabular format, but depending on the needs of the client, the extracted data can be delivered in other formats.
Interested in learning more? Don’t hesitate to contact us!
Beatriz Mano, AI Specialist & Data Scientist
Soila Suhonen, Marketing Manager