Clinical Natural Language Processing (NLP)

Effortless data entry: using AI to automate the completion of a national (gastrointestinal) cancer registry

Cancer research is one of the most rapidly advancing fields of medical science, largely driven by the emergence of accessible and high-quality clinical data. LynxCare's NLP algorithm is benefiting gastrointestinal (GI) cancer research by offering new insights into this group of complicated diseases. In this article you’ll learn how this technology has enabled the mapping of unstructured data from over 24,500 health records, with precision and recall rates reaching 94%. We’ll highlight the capabilities of AI in unlocking the potential of real-world evidence (RWE), including examples from our use case for esophageal, pancreatic, and gastric cancers.

Introduction: Gastrointestinal cancers - AI to defeat the silent killer?

Cancer is one of the most significant health challenges in the world, and gastrointestinal (GI) cancers carry particularly high mortality rates. In fact, GI cancers cause 35% of all cancer-related deaths [1]. Good patient outcomes depend on early detection and timely treatment.

Unfortunately, the symptoms of these cancers are hard to spot and current patient screening efforts are insufficient. This makes the group challenging to diagnose and manage.

To support and improve cancer diagnosis and treatment, national cancer registries continue to be established worldwide. In Belgium, the Belgian Cancer Registry (BCR) collects granular information on the patient population.

While setting up registries is important, their usefulness depends on populating them with high-quality data. Alas, many registries are filled in manually by tireless but time-pressed healthcare providers. Manual data input is very costly, and it often results in insufficient data quality and poor data availability and accessibility.

Luckily, technology is starting to make a significant difference in cancer research, including in the diagnosis and management of GI cancers. Natural Language Processing (NLP) algorithms can help researchers quickly and accurately extract valuable information from massive volumes of data hidden in unstructured reports, including electronic health records (EHRs).

This post will explore LynxCare’s approach to using AI technology to advance gastrointestinal cancer research and transform the future of oncology by automating registry data entry.

Goals: Introduction of NLP to cancer registry data collection

Our research set out to demonstrate the power of AI and NLP in extracting valuable insights from vast amounts of unstructured clinical data.

LynxCare researchers aimed to develop an NLP algorithm that could map EHRs for GI cancer data with high precision and recall, meaning that what the model extracts is correct (precision) and that it misses little of the relevant information (recall).

Our goal was to explore the possibility of using NLP to extract and structure clinical data for the pre-population of the BCR. Additionally, we aimed to showcase the benefits of having a unified database for better data access, re-use, and collaboration.

Methods: Streamlining cancer data collection with LynxCare

Anonymized patient records were made available to us through our collaboration with a Belgian university hospital. Our goal was to establish a tailored model for each of the three GI cancer types of interest: pancreatic cancer, esophageal cancer, and neuroendocrine tumors.

To create these models, we identified 50 clinical variables for each cancer type. These variables address the key research questions, covering areas such as patient characteristics, surgery details, and post-operative trajectory. Per variable, we identified three data points on average. After constructing the primary database, we trained the NLP model using 10% of the available dataset and validated its performance internally and externally.
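
To make the 10% training share concrete, here is a minimal sketch of a reproducible split over record identifiers. It is an illustration only, not the LynxCare pipeline: the function name `split_records`, the fixed seed, and the synthetic range of 1,000 IDs are all assumptions made for the example.

```python
import random


def split_records(record_ids, train_fraction=0.10, seed=42):
    """Shuffle record IDs reproducibly and reserve a fraction for training.

    Hypothetical helper for illustration; the real selection procedure
    is not described in the article.
    """
    ids = list(record_ids)
    rng = random.Random(seed)  # local RNG so the split is repeatable
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# Synthetic example: 1,000 made-up record IDs, 10% reserved for training.
train_ids, held_out_ids = split_records(range(1000))
print(len(train_ids), len(held_out_ids))
```

Fixing the seed means the same records land in the training share on every run, which keeps later validation figures comparable.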

In the next step, we used the NLP algorithm to link clinical concepts to the clinical codes of the Belgian Cancer Registry (BCR), with the aim of prepopulating the registry with accurate and up-to-date information. Finally, our researchers built a structured, harmonized database that is compliant with international data standards. We used the same NLP algorithm to transform the unstructured and structured source data into a hospital-specific instance of the OMOP Common Data Model (CDM).
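
The idea of linking an extracted concept to a registry code can be sketched as a simple lookup. This is a simplified assumption, not the actual mapping logic: the codes in `REGISTRY_CODES` are placeholders rather than real BCR codes, and the names `RegistryField`, `normalize`, and `to_registry_field` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup: normalized concept -> PLACEHOLDER code (not real BCR codes).
REGISTRY_CODES = {
    "pancreatic cancer": "REG-PANC",
    "esophageal cancer": "REG-ESOPH",
    "neuroendocrine tumor": "REG-NET",
}


@dataclass
class RegistryField:
    concept: str
    code: Optional[str]
    needs_review: bool  # True when no code was found and a human should check


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so spelling variants match the lookup."""
    return " ".join(text.lower().split())


def to_registry_field(extracted: str) -> RegistryField:
    """Map one extracted concept to a registry code, flagging unknown concepts."""
    concept = normalize(extracted)
    code = REGISTRY_CODES.get(concept)
    return RegistryField(concept, code, needs_review=code is None)


print(to_registry_field("  Pancreatic   Cancer "))
print(to_registry_field("unlisted finding"))
```

Flagging unmapped concepts for review, rather than guessing a code, is one way to keep a prepopulated registry trustworthy.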

Results: Enhancing cancer research with data management

Our team successfully processed 24,540 clinical records with the LynxCare NLP model. This allowed us to confidently map unstructured information from three different cancer types into a structured and harmonized database.

When it comes to data mining with NLP algorithms, precision and recall are crucial indicators of accuracy. The LynxCare NLP algorithm showed a high level of precision and recall in internal validation, with a minimum precision of 90% and a minimum recall of 88%. External validation of the algorithm on data related to pancreatic cancer and esophageal cancer yielded precisions of 94% and 88%, respectively.
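
For readers less familiar with these two metrics, the sketch below computes them on a toy example. Precision is the share of extracted items that are correct, and recall is the share of correct items that were extracted. The record IDs and variable names are made up for illustration and are unrelated to the study data.

```python
def precision_recall(predicted: set, gold: set):
    """Return (precision, recall) for extracted items against a reference set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data: (record, variable) pairs. "gold" is the manually annotated reference.
gold = {
    ("r1", "tumor_site"), ("r1", "surgery"),
    ("r2", "tumor_site"), ("r2", "stage"),
    ("r3", "stage"),
}
predicted = {
    ("r1", "tumor_site"), ("r2", "tumor_site"),
    ("r3", "stage"), ("r3", "surgery"),  # last pair is a false positive
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.75 recall=0.60
```

Here three of the four extracted pairs are correct (precision 0.75), while three of the five reference pairs were found (recall 0.60).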

On top of that, we found that the variables mined from the clinical records were reliable and suitable for prepopulating the BCR. This finding highlights the importance of data quality in ensuring the accuracy of the registry and the potential of technology in improving data accuracy.

A single database that is harmonized and meets international data standards can have numerous benefits. By organizing the clinical data into the OMOP CDM format, we could access patient data quickly and efficiently, which is essential for discovering statistical trends, applying ICD coding, and conducting research.
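
As a rough illustration of why a harmonized structure helps, the sketch below models a few rows shaped like the OMOP CDM `condition_occurrence` table and counts diagnoses per year. The field names follow that table, but the rows are synthetic and the concept IDs are placeholders, not real OMOP concept IDs. The helper `diagnoses_per_year` is hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ConditionOccurrence:
    person_id: int
    condition_concept_id: int  # PLACEHOLDER IDs below, not real OMOP concepts
    condition_start_date: date
    condition_source_value: str  # the term as it appeared in the source data


# Synthetic rows for illustration only.
rows = [
    ConditionOccurrence(1, 1001, date(2020, 3, 1), "pancreatic cancer"),
    ConditionOccurrence(2, 1002, date(2020, 7, 15), "esophageal cancer"),
    ConditionOccurrence(3, 1001, date(2021, 1, 9), "pancreatic cancer"),
]


def diagnoses_per_year(records):
    """Count records per (year, concept) pair, a simple trend query."""
    return Counter(
        (r.condition_start_date.year, r.condition_concept_id) for r in records
    )


counts = diagnoses_per_year(rows)
for (year, concept_id), n in sorted(counts.items()):
    print(year, concept_id, n)
```

Because every source is mapped to the same fields, the same query can run unchanged across hospitals that adopt the model.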

Conclusions: Towards a harmonized approach to cancer data handling

Data is a precious resource that can help us understand complex diseases like cancer. Our study has shown that data mining and NLP technology can be incredibly powerful tools for prepopulating national cancer registries with rich patient data.

This has significant implications for cancer research, as it allows us to gain deeper insights into the disease and develop more effective treatments.

Our study has also highlighted the importance of data quality and harmonization. By ensuring that data is structured in a consistent and standardized way, we can maximize its value and make it more accessible to researchers and clinicians.

This new approach also encourages cooperation and data sharing with other data providers, such as hospital networks, which will ultimately enhance the quality of cancer research. It allows us to build a more comprehensive understanding of the disease and to improve patient outcomes in ways that were not possible before.
