Cancer is one of the most significant health challenges in the world, and gastrointestinal cancers (GI cancers) carry particularly high mortality rates. In fact, GI cancers cause 35% of all cancer-related deaths [1]. For positive patient outcomes, these cancers require early detection and timely treatment.
Unfortunately, their symptoms are hard to spot and current patient screening efforts are insufficient. This makes the group of diseases challenging to diagnose and manage.
To enable and improve cancer diagnosis and treatment, national cancer registries continue to be established worldwide. In Belgium, the Belgian Cancer Registry (BCR) collects granular information on the patient population.
While setting up the registries is important, their usefulness depends on populating them with high-quality data. Alas, many registries are filled in manually by tireless but time-pressed healthcare providers. The costs of manual data input are very high, and the result is often insufficient data quality and poor data availability and accessibility.
Luckily, technology is starting to make a significant difference in cancer research, including in the diagnosis and management of GI cancers. Natural Language Processing (NLP) algorithms can help researchers quickly and accurately extract valuable information from massive volumes of data hidden in unstructured reports, including electronic health records (EHRs).
This post will explore LynxCare’s approach to using AI technology to advance gastrointestinal cancer research and transform the future of oncology by automating registry data entry.
Our research set out to demonstrate the power of AI and NLP in extracting valuable insights from vast amounts of unstructured clinical data.
LynxCare researchers aimed to develop an NLP algorithm that could mine EHRs for GI cancer data with high precision and recall, meaning that the information the model extracts is correct and that it misses little of the relevant information.
Our goal was to explore the possibility of using NLP to extract and structure clinical data for the pre-population of the BCR. Additionally, we wanted to showcase the benefits of a unified database for better data access, re-use, and collaboration.
Anonymized patient records were made available to us through our collaboration with a Belgian university hospital. Our goal was to establish a tailored model for each of the three GI cancer types of interest: pancreatic cancer, esophageal cancer, and neuroendocrine tumors.
To create these models, we identified 50 clinical variables for each cancer type. These variables capture the factors needed to answer the key research questions, such as patient characteristics, surgery details, and postoperative trajectory. On average, we identified three data points per variable. After constructing the primary database, we trained the NLP model using 10% of the available dataset and validated its performance internally and externally.
In the next step, we used the NLP algorithm to link clinical concepts to the clinical codes of the BCR, with the aim of prepopulating the registry with accurate and up-to-date information. Finally, our researchers built a structured, harmonized database that is compliant with international data standards. We used the same NLP algorithm to transform the unstructured and structured source data into a hospital-specific instance of the OMOP Common Data Model (CDM).
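To make the linking step more concrete, here is a minimal sketch of how extracted concepts could be looked up in a code table before prepopulating a registry. It is an illustration only, not the LynxCare pipeline: the variable names, the `REGISTRY_CODES` table, and the `HYPO-...` codes are all made up for this example and are not real BCR codes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedConcept:
    """One clinical concept pulled out of a report (hypothetical structure)."""
    record_id: str
    variable: str
    value: str


# Hypothetical lookup table: (variable, normalized value) -> registry code.
# The codes are placeholders, not real Belgian Cancer Registry codes.
REGISTRY_CODES = {
    ("tumor_site", "pancreas"): "HYPO-001",
    ("tumor_site", "esophagus"): "HYPO-002",
    ("surgery_type", "pancreaticoduodenectomy"): "HYPO-101",
}


def prepopulate(concepts, code_map):
    """Split concepts into registry-ready rows and concepts needing manual review."""
    rows, unmapped = [], []
    for concept in concepts:
        # Normalize the extracted value so casing and stray spaces do not block a match.
        key = (concept.variable, concept.value.strip().lower())
        code = code_map.get(key)
        if code is None:
            unmapped.append(concept)
        else:
            rows.append({
                "record_id": concept.record_id,
                "variable": concept.variable,
                "registry_code": code,
            })
    return rows, unmapped


concepts = [
    ExtractedConcept("rec-1", "tumor_site", "Pancreas "),
    ExtractedConcept("rec-1", "surgery_type", "Pancreaticoduodenectomy"),
    ExtractedConcept("rec-2", "tumor_site", "stomach"),
]
rows, unmapped = prepopulate(concepts, REGISTRY_CODES)
```

Keeping the unmapped concepts in a separate list, rather than dropping them, leaves room for a human reviewer to handle whatever the code table does not cover.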
Our team successfully processed 24,540 clinical records with the LynxCare NLP model. This allowed us to confidently map unstructured information from three different cancer types into a structured and harmonized database.
When it comes to data mining with NLP algorithms, precision and recall are crucial indicators of accuracy. The LynxCare NLP algorithm showed a high level of precision and recall in internal validation, with a minimum precision of 90% and a minimum recall of 88%. External validation of the algorithm on data related to pancreatic cancer and esophageal cancer resulted in a precision of 94% and 88%, respectively.
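For readers less familiar with these two metrics, the sketch below shows how precision and recall can be computed by comparing extracted data points against a manually annotated reference set. The records and values are invented for illustration and the function name is our own; the validation in the study may have been set up differently.

```python
def precision_recall(predicted, gold):
    """Compare extracted items against reference annotations.

    precision = correct extractions / all extractions
    recall    = correct extractions / all reference annotations
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Each item is (record id, variable, value); all values are invented.
gold = {
    ("rec-1", "tumor_site", "pancreas"),
    ("rec-1", "surgery_type", "pancreaticoduodenectomy"),
    ("rec-2", "tumor_site", "esophagus"),
    ("rec-2", "clinical_stage", "T2"),
    ("rec-2", "surgery_type", "esophagectomy"),
}
predicted = {
    ("rec-1", "tumor_site", "pancreas"),
    ("rec-1", "surgery_type", "pancreaticoduodenectomy"),
    ("rec-2", "tumor_site", "esophagus"),
    ("rec-2", "clinical_stage", "T3"),  # wrong value: counts against precision
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this toy case three of the four extracted items are correct, and three of the five reference items were found, so precision is higher than recall.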
On top of that, we found that the variables mined from the clinical records were reliable and suitable for prepopulating the BCR. This finding highlights the importance of data quality in ensuring the accuracy of the registry and the potential of technology in improving data accuracy.
A single database that is harmonized and meets international data standards can have numerous benefits. By organizing the clinical data into the OMOP CDM format, we could access patient data quickly and efficiently, which is essential for discovering statistical trends, applying ICD coding, and conducting research.
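As a rough illustration of what quick access looks like once data sits in a standard layout, the sketch below counts distinct patients per condition in a heavily simplified `condition_occurrence` table, using an in-memory SQLite database. The table and column names follow the OMOP CDM, but only a handful of its columns are included, and the rows and concept IDs (9001, 9002) are placeholders rather than real OMOP concept identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified subset of the OMOP CDM condition_occurrence table.
conn.execute(
    """
    CREATE TABLE condition_occurrence (
        condition_occurrence_id INTEGER PRIMARY KEY,
        person_id               INTEGER NOT NULL,
        condition_concept_id    INTEGER NOT NULL,
        condition_start_date    TEXT    NOT NULL
    )
    """
)

# Invented rows; 9001 and 9002 are placeholder concept IDs.
rows = [
    (1, 101, 9001, "2021-03-02"),
    (2, 102, 9001, "2021-07-19"),
    (3, 103, 9002, "2022-01-11"),
    (4, 101, 9001, "2022-05-30"),  # same patient recorded twice
]
conn.executemany("INSERT INTO condition_occurrence VALUES (?, ?, ?, ?)", rows)

# Count distinct patients per condition concept.
counts = conn.execute(
    """
    SELECT condition_concept_id, COUNT(DISTINCT person_id)
    FROM condition_occurrence
    GROUP BY condition_concept_id
    ORDER BY condition_concept_id
    """
).fetchall()

for concept_id, n_patients in counts:
    print(concept_id, n_patients)
```

Because every site that adopts the same data model exposes the same tables and columns, a query like this can in principle be reused across hospitals without rewriting it for each source system.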
Data is a precious resource that can help us understand complex diseases like cancer. Our study has shown that data mining and NLP technology can be incredibly powerful tools for prepopulating national cancer registries with rich patient data.
This has significant implications for cancer research, as it allows us to gain deeper insights into the disease and develop more effective treatments.
Our study has also highlighted the importance of data quality and harmonization. By ensuring that data is structured in a consistent and standardized way, we can maximize its value and make it more accessible to researchers and clinicians.
This new approach also encourages cooperation and data sharing with other data providers, such as hospital networks, which will ultimately enhance the quality of cancer research. This means we can now obtain a more comprehensive understanding of the disease and improve patient outcomes in ways that were not possible before.